DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011
Lecture 6: Classification and Prediction - Decision Tree and Classification Rules
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
What Is Classification, What Is Prediction?
Decision Tree
Classification Rule: Covering Algorithm
What Is Classification?
Case
A bank loans officer needs analysis of her data in order to learn which loan applicants are “safe” and which are “risky” for the bank
A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer or not
A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive
The data analysis task is classification, where the model or classifier is constructed to predict categorical labels
The model is a classifier
What Is Prediction?
Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at the shop
This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label
This model is a predictor
Regression analysis is a statistical methodology that is most often used for numeric prediction
How does classification work?
Data classification is a two-step process
In the first step (the learning step or training phase), a model is built describing a predetermined set of data classes or concepts
The data tuples used to build the classification model are called the training data set
If the class label is provided, this step is known as supervised learning; otherwise it is called unsupervised learning
The learned model may be represented in the form of classification rules, decision trees, Bayesian models, mathematical formulae, etc.
How does classification work?
In the second step,
The learned model is used for classification
Estimate the predictive accuracy of the model using hold-out data set (a test set of class-labeled samples which are randomly selected and are independent of the training samples)
If the accuracy of the model were estimated on the training data set, the estimate would be optimistic, because the model tends to overfit the training data
If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown
In the experiment, there are three kinds of dataset, training data set, hold-out data set (or validation data set), and test data set
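A minimal sketch of this two-step process in Python (assuming scikit-learn is available; the tiny encoded dataset below is only an illustrative stand-in for real training data):

```python
# Step 1: learn a model from a training split; Step 2: estimate its
# predictive accuracy on an independent hold-out split.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Encoded attribute vectors and class labels (stand-in data, not the lecture's).
X = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [2, 1, 0], [2, 0, 0],
     [2, 0, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [2, 0, 0],
     [0, 0, 1], [1, 1, 1], [1, 0, 0], [2, 1, 1]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)        # learning step
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```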
Issues Regarding Classification/Prediction
Comparing classification methods
The criteria to compare and evaluate classification and prediction methods
Accuracy: an ability of a given classifier to correctly predict the class label of new or unseen data
Speed: the computation costs involved in generating and using the given classifier or predictor
Robustness: an ability of the classifier or predictor to make correct predictions given noisy data or data with missing values
Scalability: an ability to construct the classifier or predictor efficiently given large amounts of data
Interpretability: the level of understanding and insight that is provided by the classifier or predictor – subjective and more difficult to assess
Decision Tree
A decision tree is a flow-chart-like tree structure:
each internal node denotes a test on an attribute
each branch represents an outcome of the test
each leaf node represents a class
the top-most node in the tree is the root node
Instead of using the complete set of features jointly to make a decision, different subsets of features are used at different levels of the tree.
[Figure: a decision tree representing the concept buys_computer. The root tests Age; for Age <=30 the tree tests student? (no -> no, yes -> yes); for Age 31…40 the answer is yes; for Age >40 the tree tests Credit_rating (excellent -> no, fair -> yes).]
Decision Tree Induction
Normal procedure: a greedy algorithm that works top-down in a recursive divide-and-conquer fashion
First: an attribute is selected for the root node and a branch is created for each possible attribute value
Then: the instances are split into subsets (one for each branch extending from the node)
Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch
The process stops if:
All instances for a given node belong to the same class
There is no remaining attribute on which the samples may be further partitioned -> a majority vote is employed
There is no sample for the branch to test the attribute -> a majority vote is employed
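A compact sketch of this recursive procedure in Python (the attribute-selection heuristic is passed in as a function, e.g. the information gain defined later; names such as `majority_class` are illustrative, not from the lecture):

```python
from collections import Counter

def majority_class(instances):
    # Majority vote over the class labels (last element of each instance).
    return Counter(inst[-1] for inst in instances).most_common(1)[0][0]

def build_tree(instances, attributes, choose_attribute):
    classes = {inst[-1] for inst in instances}
    if len(classes) == 1:          # all instances in one class -> pure leaf
        return classes.pop()
    if not attributes:             # no attribute left -> majority vote
        return majority_class(instances)
    best = choose_attribute(instances, attributes)   # e.g. highest info gain
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in {inst[best] for inst in instances}:     # one branch per value
        subset = [inst for inst in instances if inst[best] == v]
        tree[best][v] = build_tree(subset, remaining, choose_attribute)
    return tree
```

Here `attributes` is a list of column indices and each instance is a list whose last element is the class label; a branch value with no samples simply does not appear in the dictionary.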
Decision Tree Representation (An Example)
The decision tree (DT) of the weather example is:
[Figure: the decision tree for the weather example. The root tests outlook: sunny -> test humidity (high -> no, normal -> yes); overcast -> yes; rainy -> test windy (false -> yes, true -> no).]
Outlook Temp. Humid. Windy Play
sunny hot high false N
sunny hot high true N
overcast hot high false Y
rainy mild high false Y
rainy cool normal false Y
rainy cool normal true N
overcast cool normal true Y
sunny mild high false N
sunny cool normal false Y
rainy mild normal false Y
sunny mild normal true Y
overcast mild high true Y
overcast hot normal false Y
rainy mild high true N
An Example (Which attribute is the best?)
There are four possible splits to consider, one for each attribute
Criteria for Attribute Selection
Which is the best attribute? The one which will result in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
Information gain increases with the average purity of the subsets that an attribute produces
Strategy: the attribute with the highest information gain is chosen as the test attribute for the current node
Computing "Information"
Information is measured in bits
Given a probability distribution, the information required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Information gain measures the goodness of a split
Formula for computing expected information:
Let S be a set consisting of s data instances, where the class label attribute has n distinct classes Ci (for i = 1, …, n)
Let si be the number of instances in class Ci
The expected information or entropy is
info([s1,s2,…,sn]) = entropy(s1/s, s2/s, …, sn/s) = -Σi pi log2(pi), where pi is the probability that an instance belongs to class Ci, i.e., pi = si/s
Formula for computing information gain: the information gain of attribute A is
gain(A) = info. before splitting - info. after splitting
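A small Python sketch of these formulas, checked against the weather data (plain lists, no external libraries; function names are illustrative):

```python
from math import log2

def info(counts):
    """Expected information (entropy in bits) of a class distribution,
    e.g. info([9, 5]) for 9 'yes' and 5 'no' instances."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def expected_info(split):
    """Weighted entropy after splitting, e.g. [[2, 3], [4, 0], [3, 2]]."""
    total = sum(sum(part) for part in split)
    return sum(sum(part) / total * info(part) for part in split)

def gain(before, split):
    """Information gain = info before splitting - info after splitting."""
    return info(before) - expected_info(split)

# Splitting on Outlook: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2].
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247 bits
```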
Expected Information for "Outlook"
"Outlook" = "sunny": info([2,3]) = entropy(2/5,3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
"Outlook" = "overcast": info([4,0]) = entropy(1,0) = -(1)log2(1) - (0)log2(0) = 0 bits
"Outlook" = "rainy": info([3,2]) = entropy(3/5,2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
Expected information for attribute "Outlook":
info([2,3],[4,0],[3,2]) = (5/14)info([2,3]) + (4/14)info([4,0]) + (5/14)info([3,2]) = [(5/14)x0.971] + [(4/14)x0] + [(5/14)x0.971] = 0.693 bits
(The weather data table from the earlier slide is repeated here for reference.)
Information Gain for "Outlook"
Information gain: info. before splitting - info. after splitting
gain(”Outlook”) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940-0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(”Outlook”) = 0.247 bits
gain(”Temperature”) = 0.029 bits
gain(“Humidity”) = 0.152 bits
gain(“Windy”) = 0.048 bits
An Example of Gain Criterion (Which attribute is the best?)
Gain(outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247  (the best)
Gain(temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.029
Gain(humidity) = info([9,5]) - info([3,4],[6,1]) = 0.152
Gain(windy) = info([9,5]) - info([6,2],[3,3]) = 0.048
Continuing to Split
If "Outlook" = "sunny":
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
"Temperature" = "hot": info([0,2]) = entropy(0,1) = -(0)log2(0) - (1)log2(1) = 0 bits
"Temperature" = "mild": info([1,1]) = entropy(1/2,1/2) = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1 bit
"Temperature" = "cool": info([1,0]) = 0 bits
Expected information for attribute "Temperature": info([0,2],[1,1],[1,0]) = (2/5)info([0,2]) + (2/5)info([1,1]) + (1/5)info([1,0]) = 0 + [(2/5)x1] + 0 = 0.4 bits
Information gain: gain("Temperature") = info([2,3]) - info([0,2],[1,1],[1,0]) = -(2/5)log2(2/5) - (3/5)log2(3/5) - 0.4 = 0.971 - 0.4 = 0.571 bits
The Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when the data can't be split any further
Properties for a Purity Measure
Properties we require from a purity measure:
When a node is pure, the measure should be zero
When impurity is maximal (i.e., all classes equally likely), the measure should be maximal
The measure should obey the multistage property (i.e., decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
Entropy is the only function that satisfies all
three properties!
Some Properties of the Entropy
The multistage property:
entropy(p,q,r) = entropy(p,q+r) + [(q+r)/(p+q+r)] x entropy(q,r)
Ex.: info([2,3,4]) can be calculated as info([2,7]) + (7/9) x info([3,4])
= [-(2/9)log2(2/9) - (7/9)log2(7/9)] + (7/9)[-(3/7)log2(3/7) - (4/7)log2(4/7)]
= - (2/9)log2(2/9) - (7/9) [ log2 (7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9) + (4/7)log2(7/9) +(3/7)log2(3/7) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9) + (3/7)log2(3/7) +(4/7)log2(7/9) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9 x 3/7) + (4/7)log2(7/9 x 4/7) ]
= - (2/9)log2(2/9) - (7/9) [ (3/7)log2(3/9) + (4/7)log2(4/9) ]
= - (2/9)log2(2/9) - (3/9)log2(3/9) - (4/9)log2(4/9)
A Problem: Highly-Branching Attributes
Problematic: attributes with a large number of values (extreme case: ID code)
Subsets are more likely to be pure if there is a large number of values
Information gain is biased towards choosing attributes with a large number of values
This may result in overfitting (selection of an attribute that is non-optimal for prediction) and fragmentation
Example: Highly-Branching Attributes
Entropy of the split on the ID code:
info(ID) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits
gain(ID) = 0.940 bits (the maximum)
[Figure: splitting on the ID code produces one branch per instance (A, B, …, M, N), each leading to a pure single-instance leaf.]
ID Outlook Temp. Humid. Windy Play
A sunny hot high false N
B sunny hot high true N
C overcast hot high false Y
D rainy mild high false Y
E rainy cool normal false Y
F rainy cool normal true N
G overcast cool normal true Y
H sunny mild high false N
I sunny cool normal false Y
J rainy mild normal false Y
K sunny mild normal true Y
L overcast mild high true Y
M overcast hot normal false Y
N rainy mild high true N
Modification: The Gain Ratio As a Split Info.
Gain ratio: a modification of the information gain that reduces its bias
Gain ratio takes the number and size of branches into account when choosing an attribute
It corrects the information gain by taking the intrinsic information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much information we need to tell which branch an instance belongs to)
Computing the Gain Ratio
Example: intrinsic information (split info) for the ID code
info([1,1,…,1]) = 14 x ( -(1/14)log2(1/14) ) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger
Definition of gain ratio: gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
Example: gain_ratio("ID") = gain("ID") / intrinsic_info("ID") = 0.940 bits / 3.807 bits = 0.246
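A short Python sketch of the gain-ratio calculation for the ID-code split (self-contained; names are illustrative and mirror the earlier entropy sketch):

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(before, split):
    """Information gain corrected by the intrinsic information of the split."""
    total = sum(before)
    after = sum(sum(part) / total * info(part) for part in split)
    intrinsic = info([sum(part) for part in split])   # entropy of branch sizes
    return (info(before) - after) / intrinsic

# ID code split: 14 single-instance branches (9 'yes' leaves, 5 'no' leaves).
id_split = [[1, 0]] * 9 + [[0, 1]] * 5
print(round(info([1] * 14), 3))                # intrinsic information ~3.807 bits
print(round(gain_ratio([9, 5], id_split), 3))  # ~0.247 (0.246 on the slide, rounding aside)
```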
Gain Ratio for Weather Data
Gain Ratio for Weather Data (Discussion)
"Outlook" still comes out top
However: "ID" has an even greater gain ratio
Standard fix: an ad hoc test to prevent splitting on that type of attribute
Problem with gain ratio: it may overcompensate
It may choose an attribute just because its intrinsic information is very low
Standard fix: only consider attributes with greater than average information gain
Avoiding Overfitting the Data
The naive DT algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples.
This algorithm may produce trees that overfit the training examples but do not work well for general cases.
Reason: the training set may contain some noise, or it may be too small to produce a representative sample of the true target tree (function).
Avoid Overfitting: Pruning
Pruning simplifies a decision tree to prevent overfitting to noise in the data
Two main pruning strategies:
1. Prepruning: stop growing a tree when there is no statistically significant association between any attribute and the class at a particular node.
Most popular test: the chi-squared test; only statistically significant attributes were allowed to be selected by the information gain procedure
2. Postpruning: take a fully-grown decision tree and discard unreliable parts by two main pruning operations, i.e., subtree replacement and subtree raising, with some possible strategies, e.g., error estimation, significance testing, the MDL principle.
Postpruning is preferred in practice, because prepruning can stop growing the tree too early.
Subtree Replacement
Bottom-up: tree is considered for replacement once all its
subtrees have been considered
Subtree Raising
Deletes node and redistributes instances
Slower than subtree replacement (Worthwhile?)
Tree to Rule vs. Rule to Tree
[Figure: the weather decision tree from the earlier slide (outlook at the root; sunny -> humidity, overcast -> yes, rainy -> windy).]
Tree -> Rule:
If outlook=sunny & humidity=high then class=no
If outlook=sunny & humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=false then class=yes
If outlook=rainy & windy=true then class=no
Rule -> Tree:
If outlook=sunny & humidity=high then class=no
If humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=true then class=no
Question: how does this rule set classify an instance with outlook=rainy & windy=true & humidity=normal? Or with outlook=rainy & windy=false & humidity=high?
Classification Rules
Classification Rule: Algorithms
Two main algorithms are:
Inferring rudimentary rules - 1R: a 1-level decision tree
Covering algorithms: algorithms to construct the rules
Pruning rules & computing significance: hypergeometric distribution vs. binomial distribution; incremental reduced-error pruning
Inferring Rudimentary Rules (1R)
1R learns a 1-level decision tree, i.e., it generates a set of rules that all test on one particular attribute, focusing on each attribute in turn
Note: "missing" can be treated as a separate attribute value
1R's simple rules performed not much worse than much more complex decision trees
Pseudo-code:
• For each attribute,
• For each value of the attribute, make a rule as
follows:
• count how often each class appears
• find the most frequent class
• make the rule assign that class to this attribute-value
• Calculate the error rate of the rules
• Choose the rules with the smallest error rate
(Holte, 93)
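A minimal Python sketch of this 1R procedure (instances as dicts; helper and variable names are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

def one_r(instances, class_key):
    """Return (best_attribute, rules, error_count): for each attribute build
    one rule per value (predict its most frequent class), then keep the
    attribute whose rule set makes the fewest errors."""
    best = None
    for attr in instances[0]:
        if attr == class_key:
            continue
        counts = defaultdict(Counter)           # value -> class counts
        for inst in instances:
            counts[inst[attr]][inst[class_key]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Example with two rows of weather-style data:
data = [{"outlook": "sunny", "windy": "false", "play": "no"},
        {"outlook": "overcast", "windy": "false", "play": "yes"}]
print(one_r(data, "play"))
```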
An Example: Evaluating the Weather Attributes (Nominal, Ordinal)
Outlook Temp. Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Outlook" or of "Humidity" (both 4/14)
Attribute   Rule                   Errors   Total errors
Outlook     sunny -> no            2/5      4/14
            overcast -> yes        0/4
            rainy -> yes           2/5
Temp.       hot -> no              2/4      5/14
            mild -> yes            2/6
            cool -> yes            1/4
Humidity    high -> no             3/7      4/14
            normal -> yes          1/7
Windy       false -> yes           2/8      5/14
            true -> no             3/6
An Example: Evaluating the Weather Attributes (Numeric)
Attribute   Rule                       Errors   Total errors
Outlook     sunny -> no                2/5      4/14
            overcast -> yes            0/4
            rainy -> yes               2/5
Temp.       <= 77.5 -> yes             3/10     5/14
            > 77.5 -> no               2/4
Humidity    <= 82.5 -> yes             1/7      3/14
            82.5 < H <= 95.5 -> no     2/6
            > 95.5 -> yes              0/1
Windy       false -> yes               2/8      5/14
            true -> no                 3/6
1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Humidity" (3/14)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Dealing with Numeric Attributes
Numeric attributes are discretized: the range of the attribute is divided into a set of intervals
Instances are sorted according to the attribute's values
Breakpoints are placed where the (majority) class changes (so that the total error is minimized)
Example: Temperature from the weather data
Sorted values with their classes, with breakpoints where the class changes:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
Enforcing a minimum partition size (min = 3), working left to right:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y | N N Y Y Y | N Y Y N
Merging adjacent partitions with the same majority class:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y N N Y Y Y | N Y Y N
Covering Algorithm
A separate-and-conquer algorithm: focus on each class in turn and seek a way of covering all instances of that class; more rules can be added to obtain a "perfect" rule set
Comparison to a decision tree (DT):
A decision tree is divide-and-conquer: it focuses on all classes at each step and seeks an attribute to split on that best separates the classes
A DT can be converted into a rule set; the straightforward conversion yields an overly complex rule set, and more effective conversions are not trivial
In multiclass situations, a covering algorithm concentrates on one class at a time, whereas a DT learner takes all classes into account
Separate-and-conquer: selects the test that maximizes the number of covered positive examples and minimizes the number of negative examples that pass the test; it usually does not pay any attention to the examples that do not pass the test.
Divide-and-conquer: optimizes for all outcomes of the test.
Constructing Classification Rules (An Example)
Rule so far, and the rule after adding a new test:
If x <= 1.2 then class = b
If x > 1.2 then class = b
If x > 1.2 & y <= 2.6 then class = b
More rules could be added for a "perfect" rule set
[Figures: the corresponding decision tree (testing x > 1.2 and then y > 2.6) and the instance space of classes a and b, split at x = 1.2 and y = 2.6, shown at each stage as tests are added.]
A Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy, even though each new test reduces the rule's coverage
Similar to the situation in decision trees: the problem of selecting an attribute to split on; but the decision tree inducer maximizes overall purity, whereas the covering algorithm maximizes rule accuracy
Goal: maximize accuracy
t: total number of instances covered by the rule
p: positive examples of the class covered by the rule
t - p: number of errors made by the rule
One option: select the test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances cannot be split any further.
An Example: Contact Lenses Data
First try to find a rule for “hard”
Age             Spectacle prescription  Astigmatism  Tear prod. rate  Recom. lenses
young           myope                   no           reduced          none
young           myope                   no           normal           soft
young           myope                   yes          reduced          none
young           myope                   yes          normal           hard
young           hypermetrope            no           reduced          none
young           hypermetrope            no           normal           soft
young           hypermetrope            yes          reduced          none
young           hypermetrope            yes          normal           hard
pre-presbyopic  myope                   no           reduced          none
pre-presbyopic  myope                   no           normal           soft
pre-presbyopic  myope                   yes          reduced          none
pre-presbyopic  myope                   yes          normal           hard
pre-presbyopic  hypermetrope            no           reduced          none
pre-presbyopic  hypermetrope            no           normal           soft
pre-presbyopic  hypermetrope            yes          reduced          none
pre-presbyopic  hypermetrope            yes          normal           none
presbyopic      myope                   no           reduced          none
presbyopic      myope                   no           normal           none
presbyopic      myope                   yes          reduced          none
presbyopic      myope                   yes          normal           hard
presbyopic      hypermetrope            no           reduced          none
presbyopic      hypermetrope            no           normal           soft
presbyopic      hypermetrope            yes          reduced          none
presbyopic      hypermetrope            yes          normal           none
An Example: Contact Lenses Data (Finding a Good Choice)
Rule we seek: If ? then recommendation = hard
Possible tests:
Age = Young                              2/8
Age = Pre-presbyopic                     1/8
Age = Presbyopic                         1/8
Spectacle prescription = Myope           3/12
Spectacle prescription = Hypermetrope    1/12
Astigmatism = no                         0/12
Astigmatism = yes                        4/12
Tear production rate = Reduced           0/12
Tear production rate = Normal            4/12
(Astigmatism = yes and Tear production rate = Normal tie at 4/12.)
Modified Rule and Resulting Data
Rule with the best test added: If astigmatism = yes then recommendation = hard
(The contact lenses table is shown again on the slide, with the rows where astigmatism = yes underlined.)
• The underlined rows match the rule.
• However, we need to refine the rule, since not all of the instances it covers are classified correctly.
Further Refinement
Current state: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6
Modified Rule and Resulting Data
Rule with the best test added: If astigmatism = yes and tear prod. rate = normal then recommendation = hard
(The contact lenses table is shown again, with the rows where astigmatism = yes and tear production rate = normal underlined.)
• The underlined rows match the rule.
• However, we still need to refine the rule, since not all of the instances it covers are classified correctly.
Further Refinement
Current state: If astigmatism = yes and tear prod. rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/3
There is a tie between the first and the fourth test (both cover only correct instances); we choose the one with greater coverage
Modified Rule and Resulting Data
Final rule with the best test added: If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard
• The rows matching the rule (shown in blue on the slide) are all 'hard'.
• No need to refine the rule further, since the rule is now perfect.
(The contact lenses table is shown again, with the three rows covered by the final rule highlighted.)
Finding More Rules
The second rule for recommending "hard lenses" is built from the instances not covered by the first rule:
If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all "hard lenses":
(1) If astigmatism = yes & tear prod. rate = normal & spectacle prescription = myope then recommendation = hard
(2) If age = young & astigmatism = yes & tear production rate = normal then recommendation = hard
The process is then repeated with the other two classes, "soft lenses" and "none".
Pseudo-code for the PRISM Algorithm
For each class C
• Initialize E to the instance set
• While E contains instances in class C
• Create a rule R with an empty left-hand-side that
predicts class C
• Until R is perfect (or there are no more
attributes to use) do
• For each attribute A not mentioned in R, and
each value v,
• Consider adding the condition A = v to the
left-hand side of R
• Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with
the largest p)
• Add A = v to R
• Remove the instances covered by R from E
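A rough Python sketch of this PRISM loop (instances as dicts with a class key, as in the earlier 1R sketch; helper names are illustrative, not the lecture's):

```python
def covers(rule, inst):
    """A rule is a list of (attribute, value) conditions."""
    return all(inst[a] == v for a, v in rule)

def prism(instances, class_key):
    rules = []
    for c in {inst[class_key] for inst in instances}:   # one class at a time
        E = list(instances)
        while any(inst[class_key] == c for inst in E):
            rule, covered = [], E            # empty left-hand side predicts c
            while True:
                wrong = [i for i in covered if i[class_key] != c]
                attrs_left = [a for a in covered[0]
                              if a != class_key and a not in dict(rule)]
                if not wrong or not attrs_left:
                    break                    # rule is perfect, or nothing left to add
                best, best_key = None, (-1.0, 0)
                for a in attrs_left:
                    for v in {i[a] for i in covered}:
                        subset = [i for i in covered if i[a] == v]
                        p = sum(1 for i in subset if i[class_key] == c)
                        key = (p / len(subset), p)   # maximize p/t, then p
                        if key > best_key:
                            best, best_key = (a, v), key
                rule.append(best)
                covered = [i for i in covered if i[best[0]] == best[1]]
            rules.append((rule, c))
            E = [i for i in E if not covers(rule, i)]   # remove covered instances
    return rules
```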
Order Dependency among Rules
PRISM without the outer loop generates a decision list for one class
Subsequent rules are designed for instances that are not covered by previous rules
Here, order does not matter because all rules predict the same class
The outer loop considers all classes separately, so no order dependence is implied
Two resulting problems are: overlapping rules, and a default rule is required
Separate-and-Conquer
Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
First, a rule is identified
Then, all instances covered by the rule are separated out
Finally, the remaining instances are "conquered"
Difference to divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further
There is variety in the separate-and-conquer approach: search method (e.g., greedy, beam search, ...), test selection criteria (e.g., accuracy, ...), pruning method (e.g., MDL, hold-out set, ...), stopping criterion (e.g., minimum accuracy), post-processing step
Also: a decision list vs. one rule set for each class
Good Rules and Bad Rules (Overview)
Sometimes it is better not to generate perfect rules that guarantee the correct classification of all training instances, in order to avoid overfitting.
How do we decide which rules are worthwhile?
How do we tell when it becomes counterproductive to continue adding terms to a rule in order to exclude a few pesky instances of the wrong type?
Two main strategies for pruning rules:
Global pruning (post-pruning): create all perfect rules, then prune
Incremental pruning (pre-pruning): prune a rule while it is being generated
Three pruning criteria:
MDL principle (Minimum Description Length): rule size + exceptions
Statistical significance (as in INDUCT)
Error on a hold-out set (reduced-error pruning)
Hypergeometric Distribution
The dataset contains T examples
The class contains P examples
The rule selects t examples
p of the t examples selected by the rule are correctly covered
[Diagram: of the T examples, P are in the class and T - P are not; of the t examples selected by the rule, p come from the class and t - p from outside it.]
Computing Significance
We want the probability that a random rule does at least as well (statistical significance of rule):
m(R) = Σ_{i=p}^{min(t,P)} C(P,i) x C(T-P, t-i) / C(T,t)

where C(p,q) = p! / (q! (p-q)!)
Good/Bad Rules by Statistical Significance (An Example)
(1) If astigmatism = yes then recommendation = hard
    success fraction = 4/12; no-information success fraction = 4/24
    probability of 4/24 -> 4/12 = 0.047
(2) If astigmatism = yes and tear production rate = normal then recommendation = hard
    success fraction = 4/6; no-information success fraction = 4/24
    probability of 4/24 -> 4/6 = 0.0014
(3) If astigmatism = yes and tear prod. rate = normal and age = young then recommendation = hard
    success fraction = 2/2; no-information success fraction = 4/24
    probability of 4/24 -> 2/2 = 0.022
A reduced probability means a better rule; an increased probability means a worse rule.
Adding the second test reduces the probability from 0.047 to 0.0014; adding the third raises it from 0.0014 to 0.022, so rule (2) is the best rule.
Example for rule (1): P = p = 4, T = 24, t = 12
m(R) = C(4,4) x C(24-4, 12-4) / C(24,12) = [1 x C(20,8)] / C(24,12)
     = [20! / (8! 12!)] / [24! / (12! 12!)]
     = (20! x 12!) / (8! x 24!) = 0.047
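A quick Python sketch of this significance computation, implementing the m(R) sum directly (self-contained; function name is illustrative):

```python
from math import comb

def rule_significance(T, P, t, p):
    """m(R): probability that a random rule selecting t of the T examples
    covers at least p of the P class members (smaller = more significant)."""
    return sum(comb(P, i) * comb(T - P, t - i)
               for i in range(p, min(t, P) + 1)) / comb(T, t)

# The three 'hard' rules above (T = 24 instances, P = 4 of class 'hard'):
print(round(rule_significance(24, 4, 12, 4), 3))   # rule (1): ~0.047
print(round(rule_significance(24, 4, 6, 4), 4))    # rule (2): ~0.0014
print(round(rule_significance(24, 4, 2, 2), 3))    # rule (3): ~0.022
```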
Good/Bad Rules by Statistical Significance (Another Example)
(4) If astigmatism = yes and tear production rate = normal then recommendation = none
    success fraction = 2/6; no-information success fraction = 15/24
    probability of 15/24 -> 2/6 = 0.985
(5) If astigmatism = no and tear production rate = normal then recommendation = soft
    success fraction = 5/6; no-information success fraction = 5/24
    probability of 5/24 -> 5/6 = 0.0001
(6) If tear production rate = reduced then recommendation = none
    success fraction = 12/12; no-information success fraction = 15/24
    probability of 15/24 -> 12/12 = 0.0017
A good rule has a low probability; a bad rule has a high probability.
The Binomial Distribution
Approximation: we can use sampling with replacement instead of sampling without replacement
The dataset contains T examples; the class contains P examples; the rule selects t examples, of which p are correctly covered
m(R) = Σ_{i=p}^{min(t,P)} C(t,i) (P/T)^i (1 - P/T)^(t-i)
Pruning Strategies
For better estimation, a rule should be evaluated on data not used for training.
This requires a growing set and a pruning set. Two options are:
Reduced-error pruning for rules builds a full unpruned rule set and simplifies it subsequently
Incremental reduced-error pruning simplifies a rule immediately after it has been built.
INDUCT (Incremental Pruning Algorithm)
Initialize E to the instance set
Until E is empty do
  For each class C for which E contains an instance
    Use the basic covering algorithm to create the best perfect rule for C
    Calculate the significance m(R) for the rule and the significance m(R-) for the rule with the final condition omitted
    If m(R-) < m(R), prune the rule and repeat the previous step
  From the rules for the different classes, select the most significant one (i.e., the one with the smallest m(R))
  Print the rule
  Remove the instances covered by the rule from E
Continue
INDUCT's significance computation for a rule:
• The probability that a completely random rule with the same coverage performs at least as well
• A random rule R selects t cases at random from the dataset
• We want to know how likely it is that p of these belong to the correct class
• This probability is given by the hypergeometric distribution
Example:
RID age income student Credit_rating Class:buys_computer
1 youth High No Fair No
2 youth High No Excellent No
3 middle_age High No Fair Yes
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_age Low Yes Excellent Yes
8 youth Medium No Fair No
9 youth Low Yes Fair Yes
10 senior Medium Yes Fair Yes
11 youth Medium Yes Excellent Yes
12 middle_age Medium No Excellent Yes
13 middle_age High Yes Fair Yes
14 senior Medium No Excellent No
The classification task is to predict whether a customer will buy a computer