Decision Tree Algorithms
Rule Based
Suitable for automatic generation
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
8-2
Decision Trees
• Logical branching
• Historical:
  – ID3 – early rule-generating system
• Branches:
  – Different possible values
• Nodes:
  – From which branches emanate
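The node-and-branch structure above can be sketched as a small data structure. This is an illustrative sketch (the `Node` class and its fields are my own names, not from any library):

```python
class Node:
    """A decision-tree node: branches emanate from it, one per attribute value."""
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this node
        self.branches = branches or {}  # value -> child Node
        self.label = label              # outcome, if this node is a leaf

    def classify(self, case):
        if self.label is not None:      # leaf: return the stored outcome
            return self.label
        value = case[self.attribute]    # follow the branch for this value
        return self.branches[value].classify(case)

# A one-level tree on the Risk attribute from the loan example
tree = Node(attribute="Risk", branches={
    "low":    Node(label="OT"),
    "medium": Node(label="Late"),
    "high":   Node(label="Late"),
})
print(tree.classify({"Risk": "low"}))   # OT
```

Each internal node tests one attribute; each branch carries one possible value; leaves hold the classification.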
Goal-Driven Data Mining
• Define goal
  – Identify fraudulent cases
• Develop rules identifying attributes attaining that goal
  – IF attorney = Smith, THEN better check
Tree Structure
• Sorts out data
  – IF-THEN rules
  – Loan variables:
    • Age: {young, middle, old}
    • Income: {low, average, high}
    • Risk: {low, medium, high}
• Exhaustive tree enumerates all combinations
  – 81 combinations – classify all
Types of Trees
• Classification tree
  – Variable values are classes
  – Finite conditions
• Regression tree
  – Variable values are continuous numbers
  – Prediction or estimation
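The difference is in what the leaves hold. A hand-rolled one-split "stump" makes it concrete: a classification leaf stores a class, a regression leaf a number. The data and names here are made up for illustration:

```python
from collections import Counter
from statistics import mean

cases = [
    {"Income": "low",  "outcome": "Late", "amount": 3.0},
    {"Income": "low",  "outcome": "Late", "amount": 4.0},
    {"Income": "high", "outcome": "OT",   "amount": 9.0},
    {"Income": "high", "outcome": "OT",   "amount": 11.0},
]

def stump(cases, split_attr, target, leaf):
    """Split on split_attr; summarize each branch's target values with leaf()."""
    groups = {}
    for c in cases:
        groups.setdefault(c[split_attr], []).append(c[target])
    return {value: leaf(ys) for value, ys in groups.items()}

# Classification tree: leaves hold the majority class (finite conditions)
classes = stump(cases, "Income", "outcome",
                lambda ys: Counter(ys).most_common(1)[0][0])
# Regression tree: leaves hold the mean of a continuous number
amounts = stump(cases, "Income", "amount", mean)
print(classes)  # {'low': 'Late', 'high': 'OT'}
print(amounts)  # {'low': 3.5, 'high': 10.0}
```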
Rule Induction
• Automatically process data
  – Classification (logical, easier)
  – Regression (estimation, messier)
• Search through data for patterns & relationships
  – Pure knowledge discovery
• Assumes no prior hypothesis
• Disregards human judgment
Example
• Three variables:
  – Age
  – Income
  – Risk
• Outcomes:
  – On-time
  – Late
Combinations

| Variable | Value   | Cases | OT | Late | Pr(OT) |
|----------|---------|-------|----|------|--------|
| Age      | Young   | 12    | 8  | 4    | 0.67   |
| Age      | Middle  | 5     | 4  | 1    | 0.80   |
| Age      | Old     | 3     | 3  | 0    | 1.00   |
| Income   | Low     | 5     | 3  | 2    | 0.60   |
| Income   | Average | 9     | 7  | 2    | 0.78   |
| Income   | High    | 6     | 5  | 1    | 0.83   |
| Risk     | High    | 9     | 5  | 4    | 0.55   |
| Risk     | Average | 1     | 0  | 1    | 0.00   |
| Risk     | Low     | 10    | 10 | 0    | 1.00   |
Basis for Classification
• If a category has all outcomes of a certain kind, that makes a good rule
  – IF Income = High, they always paid
• ENTROPY: measure of information content
  – Actually a measure of randomness
Entropy Formula

Information = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]

The lower the measure, the greater the information content.

Can be used to automatically select the variable with the most productive rule potential.
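The formula above translates directly into a small helper, where p counts positive (on-time) cases and n counts negative (late) cases in a branch (the function name is my own):

```python
from math import log2

def information(p, n):
    """Entropy of a branch with p positive and n negative cases.

    -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)),
    taking 0*log2(0) as 0 when one side is empty.
    """
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                 # skip empty side: 0*log2(0) -> 0
            frac = count / total
            result -= frac * log2(frac)
    return result

# Age = Young branch from the Combinations table: 8 on-time, 4 late
print(round(information(8, 4), 3))   # 0.918
```

A pure branch (all one outcome) scores 0, the most informative case; a 50/50 split scores 1, the most random.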
Entropy
• Young: -(8/12) log2(8/12) - (4/12) log2(4/12) = 0.390 + 0.528 = 0.918; × 12/20 = 0.551
• Middle: -(4/5) log2(4/5) - (1/5) log2(1/5) = 0.258 + 0.464 = 0.722; × 5/20 = 0.180
• Old: -(3/3) log2(3/3) - (0/3) log2(0/3) = 0; × 3/20 = 0.000
• Age SUM: 0.731
• Income: 0.782
• Risk: 0.446
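The selection step above can be reproduced from the Combinations table: each variable's score is the case-weighted average entropy of its branches, over the 20 cases. A sketch:

```python
from math import log2

def information(p, n):
    """Entropy of a branch with p positive and n negative cases."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

# (on-time, late) counts per value, from the Combinations table
counts = {
    "Age":    [(8, 4), (4, 1), (3, 0)],    # Young, Middle, Old
    "Income": [(3, 2), (7, 2), (5, 1)],    # Low, Average, High
    "Risk":   [(5, 4), (0, 1), (10, 0)],   # High, Average, Low
}

total_cases = 20
weighted = {}
for var, branches in counts.items():
    # weight each branch's entropy by its share of the cases
    weighted[var] = sum((p + n) / total_cases * information(p, n)
                        for p, n in branches)
    print(var, round(weighted[var], 3))
# Age 0.731, Income 0.782, Risk 0.446
```

Risk has the lowest weighted entropy (most information content), so it is chosen for the first split.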
Rule
1. IF (Risk = Low) THEN OT
2. ELSE Late
All Rules
1. IF Risk = Low THEN OT
2. IF Risk NOT Low & Age = Middle THEN Late
3. IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
4. ELSE OT
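Because the rules are mutually exclusive and ordered, they collapse into a chain of conditionals. A minimal sketch (variable and value names follow the slides; this is not a general rule engine):

```python
def classify(age, income, risk):
    """Apply the four induced rules in order; first match wins."""
    if risk == "Low":
        return "OT"        # Rule 1
    if age == "Middle":
        return "Late"      # Rule 2 (Risk NOT Low implied by falling through)
    if income == "High":
        return "Late"      # Rule 3
    return "OT"            # Rule 4 (ELSE)

# A middle-aged applicant with average income and average risk
print(classify("Middle", "Average", "Average"))   # Late (Rule 2)
```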
Sample Case
• Age 36 → Middle
• Income $70K/year → Average
• Risk:
  – Assets $42K
  – Debts $40K
  – Wants $5K → Average
• Rule 2 applies, says Late
Fuzzy Decision Trees
• Have assumed distinct (crisp) outcomes
• Many data points not that clear
• Fuzzy: membership function represents belief (between 0 and 1)
• Fuzzy relationships have been incorporated in decision tree algorithms
Fuzzy Example
• Membership values for the sample case:
  – Age: Young 0.3, Middle 0.9, Old 0.2
  – Income: Low 0.0, Average 0.8, High 0.3
  – Risk: Low 0.1, Average 0.8, High 0.3
• Definitions:
  – Sum will not necessarily equal 1.0
  – If ambiguous, select the alternative with the larger membership value
  – Aggregate with the mean
Fuzzy Model
• IF Risk = Low THEN OT
  – Membership function: 0.1
• IF Risk NOT Low & Age = Middle THEN Late
  – Risk: MAX(0.8, 0.3) = 0.8
  – Age: 0.9
  – Membership function: mean = 0.85
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
  – Risk: MAX(0.8, 0.3) = 0.8
  – Age: MAX(0.3, 0.2) = 0.3
  – Income: 0.3
  – Membership function: mean = 0.467
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income NOT High THEN OT
  – Risk: MAX(0.8, 0.3) = 0.8
  – Age: MAX(0.3, 0.2) = 0.3
  – Income: MAX(0.0, 0.8) = 0.8
  – Membership function: mean = 0.633
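The fuzzy evaluation above can be sketched in a few lines, under the slides' conventions: membership in "variable NOT value" is taken as the MAX over the remaining values, and each rule's conditions are aggregated with the mean. Names here are illustrative:

```python
from statistics import mean

m = {  # membership values for the sample case
    "Age":    {"Young": 0.3, "Middle": 0.9, "Old": 0.2},
    "Income": {"Low": 0.0, "Average": 0.8, "High": 0.3},
    "Risk":   {"Low": 0.1, "Average": 0.8, "High": 0.3},
}

def not_value(var, excluded):
    """Membership in 'var is NOT excluded': max over the other values."""
    return max(v for k, v in m[var].items() if k != excluded)

rules = {
    1: mean([m["Risk"]["Low"]]),                              # Risk = Low
    2: mean([not_value("Risk", "Low"),
             m["Age"]["Middle"]]),                            # Risk NOT Low & Age = Middle
    3: mean([not_value("Risk", "Low"),
             not_value("Age", "Middle"),
             m["Income"]["High"]]),                           # ... & Income = High
    4: mean([not_value("Risk", "Low"),
             not_value("Age", "Middle"),
             not_value("Income", "High")]),                   # ... & Income NOT High
}
for r, v in rules.items():
    print(r, round(v, 3))   # 1: 0.1, 2: 0.85, 3: 0.467, 4: 0.633
```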
Fuzzy Model cont.
• Highest membership function is 0.633, for Rule 4
• Conclusion: On-time
Applications
• Inventory Prediction
• Clinical Databases
• Software Development Quality
Inventory Prediction
• Groceries
  – Maybe over 100,000 SKUs
  – Barcode data input
• Data mining to discover patterns
  – Random sample of over 1.6 million records
  – 30 months
  – 95 outlets
  – Test sample 400,000 records
• Rule induction more workable than regression
  – 28,000 rules
  – Very accurate, up to 27% improvement
Clinical Database
• Headache
  – Over 60 possible causes
• Exclusive reasoning uses negative rules
  – Use when symptom absent
• Inclusive reasoning uses positive rules
• Probabilistic rule induction expert system
  – Headache: training sample over 50,000 cases, 45 classes, 147 attributes
  – Meningitis: 1,200 samples on 41 attributes, 4 outputs
Clinical Database
• Used AQ15, C4.5
  – Average accuracy 82%
• Expert System
  – Average accuracy 92%
• Rough Set Rule System
  – Average accuracy 70%
• Using both positive & negative rules from rough sets
  – Average accuracy over 90%
Software Development Quality
• Telecommunications company
• Goal: find patterns in modules being developed likely to contain faults discovered by customers
  – Typical module several million lines of code
  – Probability of fault averaged 0.074
• Apply greater effort for those
  – Specification, testing, inspection
Software Quality
• Preprocessed data
• Reduced data
• Used CART (Classification & Regression Trees)
  – Could specify prior probabilities
• First model: 9 rules, 6 variables
  – Better at cross-validation
  – But variable values not available until late
• Second model: 4 rules, 2 variables
  – About the same accuracy, data available earlier
Decision Trees
• Very effective & useful
• Automatic machine learning
  – Thus unbiased (but omits judgment)
• Can handle very large data sets
  – Not much affected by missing data
• Lots of software available