Data Mining Techniques: Classification
Classification
• What is Classification?
– Classifying tuples in a database
• In a training set E
– Each tuple consists of the same set of attributes as the tuples in the large database W
– Additionally, each tuple has a known class label
• Derive the classification mechanism from the training set E, then use this mechanism to classify general data (in W)
Learning Phase
• Learning
– The class label attribute is credit_rating
– Training data are analyzed by a classification algorithm
– The classifier is represented in the form of classification rules
Testing Phase
• Testing (Classification)
– Test data are used to estimate the accuracy of the classification rules
– If the accuracy is considered acceptable, the rules can be applied to classify new data tuples
Classification by Decision Tree
A top-down decision tree generation algorithm: ID3 and its extended version C4.5 (Quinlan '93). Reference: J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
Decision Tree Generation
• At the start, all the training examples are at the root
• Partition the examples recursively based on selected attributes
• Attribute Selection
– Favor the partitioning that makes the majority of examples belong to a single class
• Tree Pruning (overfitting problem)
– Aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, …)
Another Example

| ID | Eye   | Hair  | Height | Oriental |
|----|-------|-------|--------|----------|
| 1  | Black | Black | Short  | Yes      |
| 2  | Black | White | Tall   | Yes      |
| 3  | Black | White | Short  | Yes      |
| 4  | Black | Black | Tall   | Yes      |
| 5  | Brown | Black | Tall   | Yes      |
| 6  | Brown | White | Short  | Yes      |
| 7  | Blue  | Gold  | Tall   | No       |
| 8  | Blue  | Gold  | Short  | No       |
| 9  | Blue  | White | Tall   | No       |
| 10 | Blue  | Black | Short  | No       |
| 11 | Brown | Gold  | Short  | No       |
Decision Tree
(The decision tree for the example above is shown as a figure on slides 8–9.)
Decision Tree Generation
• Attribute Selection (Split Criterion)
– Information Gain (ID3 / C4.5 / See5)
– Gini Index (CART / IBM Intelligent Miner)
– Inference Power
• These measures, also called goodness functions, are used to select the attribute to split on at a tree node during the tree generation phase
Decision Tree Generation
• Branching Scheme
– Determines the tree branch to which a sample belongs
– Binary vs. k-ary splitting
• Stopping Criterion
– When to stop further splitting of a node: impurity measure
• Labeling Rule
– A node is labeled with the class to which most samples at the node belong
Decision Tree Generation Algorithm: ID3
• ID: Iterative Dichotomiser
• Attribute selection is based on entropy:
Entropy(S) = −Σ_j p_j log2(p_j), where p_j is the relative frequency of class C_j in S
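The entropy measure can be sketched in a few lines of Python (illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p_j * log2(p_j)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A 9-yes / 5-no training set, as in the buys_computer example later on:
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # 0.94
```

A pure node (all labels identical) has entropy 0; a 50/50 binary split has entropy 1.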
Decision Tree Algorithm: ID3
(Slides 13–17 work through the ID3 construction step by step as figures.)
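As a sketch of how ID3 picks the root split, the information gain of each attribute in the Eye/Hair/Height table above can be computed directly (the data are the 11 tuples from the table; the code itself is illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col, classes):
    """Gain(A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [c for r, c in zip(rows, classes) if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(classes) - remainder

# (Eye, Hair, Height) tuples from the "Another Example" table:
rows = [
    ("Black", "Black", "Short"), ("Black", "White", "Tall"),
    ("Black", "White", "Short"), ("Black", "Black", "Tall"),
    ("Brown", "Black", "Tall"), ("Brown", "White", "Short"),
    ("Blue", "Gold", "Tall"), ("Blue", "Gold", "Short"),
    ("Blue", "White", "Tall"), ("Blue", "Black", "Short"),
    ("Brown", "Gold", "Short"),
]
oriental = ["Yes"] * 6 + ["No"] * 5

for i, name in enumerate(["Eye", "Hair", "Height"]):
    print(name, round(info_gain(rows, i, oriental), 3))
# Eye has the largest gain, so ID3 splits on Eye at the root
```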
Exercise 2
Decision Tree Generation Algorithm: ID3
(Slides 19–21 continue the worked example as figures.)
How to Use a Tree
• Directly
– Test the attribute values of an unknown sample against the tree
– A path is traced from the root to a leaf, which holds the class label
• Indirectly
– The decision tree is converted to classification rules
– One rule is created for each path from the root to a leaf
– IF-THEN rules are easier for humans to understand
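The indirect use can be sketched as a walk over the tree: emit one IF-THEN rule per root-to-leaf path. The nested-dict tree below is a hypothetical encoding of the beverage example that follows:

```python
def tree_to_rules(tree, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path.
    An internal node is {attribute: {value: subtree_or_leaf_label}};
    a leaf is just the class label string."""
    if not isinstance(tree, dict):  # leaf: class label
        return [f"IF {' AND '.join(conditions)} THEN {tree}"]
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + (f"{attr} = {value}",)))
    return rules

# Hypothetical tree for the "watch the game" example:
tree = {"home team": {
    "wins": {"location": {"out with friends": "beer", "home": "diet soda"}},
    "loses": {"location": {"out with friends": "beer", "home": "milk"}},
}}
for rule in tree_to_rules(tree):
    print(rule)
```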
Generating Classification Rules
(The example tree is shown as a figure on slides 23–24.)
Generating Classification Rules
• Four decision rules are generated by the tree:
– IF watch the game AND home team wins AND out with friends THEN beer
– IF watch the game AND home team wins AND sitting at home THEN diet soda
– IF watch the game AND home team loses AND out with friends THEN beer
– IF watch the game AND home team loses AND sitting at home THEN milk
• These rules can be optimized (the first and third collapse into one):
– IF watch the game AND out with friends THEN beer
– IF watch the game AND home team wins AND sitting at home THEN diet soda
– IF watch the game AND home team loses AND sitting at home THEN milk
Decision Tree Generation Algorithm: ID3
• All attributes are assumed to be categorical (discretized)
• Can be modified for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
– e.g., A ≥ V | A < V
• Limitations
– Prefers attributes with many values
– Cannot handle missing attribute values
– Attribute dependencies are not considered by this algorithm
Attribute Selection in C4.5
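C4.5 selects attributes by the gain ratio: information gain divided by the split information of the partition, which penalizes attributes with many distinct values (ID3's bias noted above). A minimal sketch; the `rows` data and attribute names are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    total = entropy([r[target] for r in rows])
    remainder = split_info = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        w = len(subset) / len(rows)
        remainder += w * entropy(subset)
        split_info -= w * log2(w)
    return (total - remainder) / split_info if split_info else 0.0

# An all-distinct "id" attribute has perfect gain (1.0), but its split
# information is log2(6) ~ 2.58, so the gain ratio drops sharply:
rows = [{"id": str(i), "play": "yes" if i < 3 else "no"} for i in range(6)]
print(round(gain_ratio(rows, "id", "play"), 2))  # 0.39
```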
Handling Continuous Attributes
(Figure: tuples sorted by attribute value, with the first, second, and third candidate cut points marked.)
Handling Continuous Attributes
Example (stock price data) — the tree is built from three binary cuts on price:
• First cut: Price on date T+1 > 18.02 vs. ≤ 18.02
• Second cut: Price on date T > 17.84 vs. ≤ 17.84
• Third cut: Price on date T+1 > 17.70 vs. ≤ 17.70
The leaves are labeled Buy or Sell.
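The cut-point search can be sketched as follows: sort the cases by the continuous attribute, evaluate the midpoint between each pair of adjacent distinct values, and keep the cut with the lowest weighted entropy. The prices and labels below are invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Candidate cuts are midpoints between adjacent distinct sorted values;
    pick the one minimizing the weighted entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (w, cut))
    return best[1]

# Invented price data: low prices -> Sell, high prices -> Buy
prices = [17.5, 17.6, 17.7, 18.1, 18.2, 18.3]
acts = ["Sell", "Sell", "Sell", "Buy", "Buy", "Buy"]
print(round(best_threshold(prices, acts), 2))  # 17.9
```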
Exercise 3: Analyzing Home Prices

| ID | Location | Type     | Miles | SF   | CM  | Home Price (K) |
|----|----------|----------|-------|------|-----|----------------|
| 1  | Urban    | Detached | 2     | 2000 | 50  | High           |
| 2  | Rural    | Detached | 9     | 2000 | 5   | Low            |
| 3  | Urban    | Attached | 3     | 1500 | 150 | High           |
| 4  | Urban    | Detached | 15    | 2500 | 250 | High           |
| 5  | Rural    | Detached | 30    | 3000 | 1   | Low            |
| 6  | Rural    | Detached | 3     | 2500 | 10  | Medium         |
| 7  | Rural    | Detached | 20    | 1800 | 5   | Medium         |
| 8  | Urban    | Attached | 5     | 1800 | 50  | High           |
| 9  | Rural    | Detached | 30    | 3000 | 1   | Low            |
| 10 | Urban    | Attached | 25    | 1200 | 100 | Medium         |

SF: Square Feet; CM: No. of Homes in Community
Unknown Attribute Values in C4.5
• Unknown values must be handled during both training and testing
• Adjustment of the attribute selection measure
• Fill-in approach
• Probability approach
– Partitioning the training set
– Classifying an unseen case
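C4.5's probability approach treats a case whose attribute value is missing as belonging fractionally to every branch, weighted by the branch frequencies among cases where the value is known. A simplified sketch of the classification side; all data below are invented:

```python
from collections import Counter

def branch_weights(known_values):
    """Weight of each branch = relative frequency of that value
    among training cases where the attribute IS known."""
    counts = Counter(known_values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def classify_with_unknown(branch_class_probs, known_values):
    """When the test case's attribute value is unknown, combine the class
    distributions of all branches, weighted by branch likelihood."""
    weights = branch_weights(known_values)
    combined = Counter()
    for value, w in weights.items():
        for cls, p in branch_class_probs[value].items():
            combined[cls] += w * p
    return max(combined, key=combined.get)

# Invented: 8 of 12 known training cases went to "sunny", 4 to "rain"
known = ["sunny"] * 8 + ["rain"] * 4
probs = {"sunny": {"play": 0.75, "skip": 0.25},
         "rain": {"play": 0.25, "skip": 0.75}}
print(classify_with_unknown(probs, known))  # play
```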
Evaluation – Coincidence Matrix (Decision Tree Model)
Cost = $190 × (closing a good account) + $10 × (keeping a bad account open)
• Accuracy = (36 + 632) / 718 = 93.0%
• Precision for Insolvent = 36 / 58 = 62.07%
• Recall for Insolvent = 36 / 64 = 56.25%
• F measure = 2 × Precision × Recall / (Precision + Recall) = 2 × 62.07% × 56.25% / (62.07% + 56.25%) ≈ 0.59
• Cost = $190 × 22 + $10 × 28 = $4,460
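The figures on this slide can be verified directly (TP = 36, FN = 28, FP = 22, TN = 632, with Insolvent as the positive class):

```python
# Coincidence-matrix counts from the slide:
tp, fn = 36, 28    # actual insolvent: caught / missed
fp, tn = 22, 632   # actual solvent: wrongly flagged / correct

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
cost = 190 * fp + 10 * fn  # $190 per closed good account, $10 per kept bad one

print(f"{accuracy:.1%} {precision:.2%} {recall:.2%} {f_measure:.2f} ${cost}")
# 93.0% 62.07% 56.25% 0.59 $4460
```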
Decision Tree Generation Algorithm: Gini Index
• If a data set S contains examples from n classes, the gini index gini(S) is defined as
gini(S) = 1 − Σ_{j=1}^{n} p_j²
where p_j is the relative frequency of class C_j in S
• If S is split into two subsets S1 and S2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split is defined as
gini_split(S) = (N1 / N) · gini(S1) + (N2 / N) · gini(S2)
Decision Tree Generation Algorithm: Gini Index
• The attribute that yields the smallest gini_split(S) is chosen to split the node
• The computational cost of the gini index is lower than that of information gain
• All attribute splits are binary in IBM Intelligent Miner
– A ≥ V | A < V
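The two gini formulas above can be sketched directly; the class labels in the demo split are invented:

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum(p_j^2); no logarithms, hence cheaper than entropy."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(s1, s2):
    """Weighted gini of a binary split: (N1/N)*gini(S1) + (N2/N)*gini(S2)."""
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

# An A >= V | A < V style binary split on toy class labels:
left, right = ["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]
print(gini_split(left, right))  # 0.375
```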
Decision Tree Generation Algorithm: Inference Power
• A feature that is useful for inferring the group identity of a data tuple is said to have good inference power for that group identity
• In Table 1, given the attributes (features) "Gender", "Beverage", and "State", try to find their inference power with respect to "Group id"
Naive Bayesian Classification
• Each data sample is an n-dimensional feature vector
– X = (x1, x2, …, xn) for attributes A1, A2, …, An
• Suppose there are m classes
– C = {C1, C2, …, Cm}
• The classifier predicts that X belongs to the class Ci with the highest posterior probability conditioned on X
– X belongs to Ci iff P(Ci|X) > P(Cj|X) for all 1 ≤ j ≤ m, j ≠ i
Naive Bayesian Classification
• P(Ci|X) = P(X|Ci) P(Ci) / P(X)
– Since P(Ci|X) = P(Ci ∩ X) / P(X) and P(X|Ci) = P(Ci ∩ X) / P(Ci), it follows that P(Ci|X) P(X) = P(X|Ci) P(Ci)
• P(Ci) = si / s
– si is the number of training samples of class Ci; s is the total number of training samples
• Assumption: attributes are independent given the class
– P(X|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)
• P(X) is the same for every class and can be ignored
Naive Bayesian Classification
Classify X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
– P(buys_computer = yes) = 9/14
– P(buys_computer = no) = 5/14
– P(age <= 30 | buys_computer = yes) = 2/9
– P(age <= 30 | buys_computer = no) = 3/5
– P(income = medium | buys_computer = yes) = 4/9
– P(income = medium | buys_computer = no) = 2/5
– P(student = yes | buys_computer = yes) = 6/9
– P(student = yes | buys_computer = no) = 1/5
– P(credit_rating = fair | buys_computer = yes) = 6/9
– P(credit_rating = fair | buys_computer = no) = 2/5
– P(X | buys_computer = yes) = 0.044
– P(X | buys_computer = no) = 0.019
– P(buys_computer = yes | X) ∝ P(X | buys_computer = yes) P(buys_computer = yes) = 0.028
– P(buys_computer = no | X) ∝ P(X | buys_computer = no) P(buys_computer = no) = 0.007
– Since 0.028 > 0.007, X is classified as buys_computer = yes
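The arithmetic on this slide can be checked directly:

```python
# Priors and the per-attribute conditionals listed above:
p_yes, p_no = 9 / 14, 5 / 14
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)  # P(X | buys_computer = yes)
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # P(X | buys_computer = no)

print(round(px_yes, 3), round(px_no, 3))                 # 0.044 0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))  # 0.028 0.007
# 0.028 > 0.007, so the classifier predicts buys_computer = yes
```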
Homework Assignment