Decision Trees and Association Rules, Prof. Sin-Min Lee, Department of Computer Science
TRANSCRIPT
Decision Trees and Association Rules
Prof. Sin-Min Lee
Department of Computer Science
Data Mining: A KDD Process
– Data mining is the core of the knowledge discovery process.
[Flow: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining Process Model (DM)
Search in State Spaces
Decision Trees
•A decision tree is a special case of a state-space graph.
•It is a rooted tree in which each internal node corresponds to a decision, with a subtree at these nodes for each possible outcome of the decision.
•Decision trees can be used to model problems in which a series of decisions leads to a solution.
•The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.
Decision Trees
•Example: the n-queens problem
•How can we place n queens on an n×n chessboard so that no two queens can capture each other?
[Figure: a chessboard with a queen Q; every square the queen attacks is marked x.]
A queen can move any number of squares horizontally, vertically, and diagonally. Here, the possible target squares of the queen Q are marked with an x.
•Let us consider the 4-queens problem.
•Question: How many possible configurations of a 4×4 chessboard containing 4 queens are there?
•Answer: There are 16!/(12!·4!) = (13·14·15·16)/(2·3·4) = 13·5·7·4 = 1820 possible configurations.
•Shall we simply try them out one by one until we encounter a solution?
•No, it is generally useful to think about a search problem more carefully and discover constraints on the problem’s solutions.
•Such constraints can dramatically reduce the size of the relevant state space.
Obviously, in any solution of the n-queens problem, there must be exactly one queen in each column of the board.
Otherwise, the two queens in the same column could capture each other.
Therefore, we can describe the solution of this problem as a sequence of n decisions:
Decision 1: Place a queen in the first column.
Decision 2: Place a queen in the second column.
…
Decision n: Place a queen in the n-th column.
Backtracking in Decision Trees
[Figure: the decision tree for the 4-queens problem: starting from the empty board, place the 1st queen, then the 2nd, 3rd, and 4th, backtracking whenever two queens attack each other.]
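The column-by-column decision process pictured above can be written as a short backtracking search. This is a minimal illustrative sketch, not code from the original slides; the function and variable names are my own.

```python
# Backtracking over the n-queens decision tree: one decision per column
# (which row gets the queen), abandoning any branch where two queens attack.
def solve_n_queens(n):
    solutions = []

    def safe(queens, row):
        col = len(queens)  # index of the column about to be filled
        # conflict if same row, or same diagonal (|row difference| == column gap)
        return all(q != row and abs(q - row) != col - c
                   for c, q in enumerate(queens))

    def place(queens):
        if len(queens) == n:            # leaf: all n decisions made
            solutions.append(tuple(queens))
            return
        for row in range(n):            # one subtree per possible outcome
            if safe(queens, row):
                place(queens + [row])   # dead ends return here: backtracking
    place([])
    return solutions
```

For n = 4 the search finds exactly two solutions, matching the two complete paths in the decision tree above.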
Neural Network
• Many inputs and a single output
• Trained on signal and background samples
• Well understood and mostly accepted in HEP (high-energy physics)

Decision Tree
• Many inputs and a single output
• Trained on signal and background samples
• Used mostly in life sciences and business
Decision Tree: Basic Algorithm
• Initialize the top node to all examples
• While impure leaves remain:
– select the next impure leaf L
– find the splitting attribute A with maximal information gain
– for each value of A, add a child to L
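The loop above can be sketched as a recursive build. This is a hedged illustration in plain Python; the helper names (`info_gain`, `build_tree`), the nested-tuple tree shape, and the toy data are my own choices, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # entropy of the node minus the weighted entropy of its children
    gain = entropy([r[target] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure leaf: stop splitting
        return labels[0]
    if not attrs:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    children = {}
    for value in {r[best] for r in rows}:   # one child per value of the split attribute
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, [a for a in attrs if a != best], target)
    return (best, children)

# Toy data: attribute 'a' separates the classes perfectly, 'b' does not.
data = [
    {'a': 'x', 'b': 'p', 'y': 'yes'},
    {'a': 'x', 'b': 'q', 'y': 'yes'},
    {'a': 'z', 'b': 'p', 'y': 'no'},
    {'a': 'z', 'b': 'q', 'y': 'no'},
]
```

On this toy data the root splits on `'a'` (gain 1 bit) rather than `'b'` (gain 0), and both children are pure leaves.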
Decision Tree: Finding a Good Split
• Sufficient statistics to compute information gain: the count matrix

| outlook | temperature | humidity | windy | play |
|----------|-------------|----------|-------|------|
| sunny | hot | high | FALSE | no |
| sunny | hot | high | TRUE | no |
| overcast | hot | high | FALSE | yes |
| rainy | mild | high | FALSE | yes |
| rainy | cool | normal | FALSE | yes |
| rainy | cool | normal | TRUE | no |
| overcast | cool | normal | TRUE | yes |
| sunny | mild | high | FALSE | no |
| sunny | cool | normal | FALSE | yes |
| rainy | mild | normal | FALSE | yes |
| sunny | mild | normal | TRUE | yes |
| overcast | mild | high | TRUE | yes |
| overcast | hot | normal | FALSE | yes |
| rainy | mild | high | TRUE | no |

Count matrices per attribute (play / don't play), with the resulting information gains:

| outlook | play | don't play |
|----------|------|------------|
| sunny | 2 | 3 |
| overcast | 4 | 0 |
| rainy | 3 | 2 |

gain: 0.25 bits

| humidity | play | don't play |
|----------|------|------------|
| high | 3 | 4 |
| normal | 6 | 1 |

gain: 0.15 bits

| temperature | play | don't play |
|-------------|------|------------|
| hot | 2 | 2 |
| mild | 4 | 2 |
| cool | 3 | 1 |

gain: 0.03 bits

| windy | play | don't play |
|-------|------|------------|
| FALSE | 6 | 2 |
| TRUE | 3 | 3 |

gain: 0.05 bits
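The gains above can be recomputed directly from the count matrices, since they are sufficient statistics. A small sketch; the function name `gain_from_counts` is mine.

```python
from math import log2

def entropy(counts):
    # entropy in bits of a (play, don't-play) count pair; 0-counts contribute 0
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_from_counts(matrix, prior):
    # matrix rows: (play, don't-play) counts for one value of the attribute
    n = sum(prior)
    remainder = sum(sum(row) / n * entropy(row) for row in matrix)
    return entropy(prior) - remainder

prior = (9, 5)                          # 9 play, 5 don't play overall
outlook     = [(2, 3), (4, 0), (3, 2)]  # sunny, overcast, rainy
humidity    = [(3, 4), (6, 1)]          # high, normal
temperature = [(2, 2), (4, 2), (3, 1)]  # hot, mild, cool
windy       = [(6, 2), (3, 3)]          # FALSE, TRUE
```

Outlook has the largest gain, so it becomes the first split.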
Decision trees
• Simple depth-first construction
• Needs entire data to fit in memory
• Unsuitable for large data sets
• Need to “scale up”
Decision Trees
Planning Tool
Decision Trees
• Enable a business to quantify decision making
• Useful when the outcomes are uncertain
• Places a numerical value on likely or potential outcomes
• Allows different possible decisions to be compared
Decision Trees
• Limitations:
– How accurate is the data used in the construction of the tree?
– How reliable are the estimates of the probabilities?
– Data may be historical: does it relate to real time?
– Qualitative factors must be factored in: human resources, motivation, reaction, relations with suppliers and other stakeholders
Process
Advantages
Disadvantages
Trained Decision Tree
[Figure panels: Binned Likelihood Fit; Limit]
Decision Trees from a Database

| Ex Num | Size | Colour | Shape | Concept Satisfied |
|--------|-------|--------|--------|-------------------|
| 1 | med | blue | brick | yes |
| 2 | small | red | wedge | no |
| 3 | small | red | sphere | yes |
| 4 | large | red | wedge | no |
| 5 | large | green | pillar | yes |
| 6 | large | red | pillar | no |
| 7 | large | green | sphere | yes |

Choose target: Concept Satisfied. Use all attributes except Ex Num.
Rules from the Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES
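The two rules can be checked against all seven database rows. A small sketch; the function name `satisfied` is mine, and only the NO branches are tested explicitly (everything else falls through to YES).

```python
def satisfied(size, shape, colour):
    # NO branches read off the tree; any other combination is YES
    if size == 'large' and (shape == 'wedge' or
                            (shape == 'pillar' and colour == 'red')):
        return 'no'
    if size == 'small' and shape == 'wedge':
        return 'no'
    return 'yes'

# The seven rows of the database slide: (size, colour, shape, satisfied)
examples = [
    ('med',   'blue',  'brick',  'yes'),
    ('small', 'red',   'wedge',  'no'),
    ('small', 'red',   'sphere', 'yes'),
    ('large', 'red',   'wedge',  'no'),
    ('large', 'green', 'pillar', 'yes'),
    ('large', 'red',   'pillar', 'no'),
    ('large', 'green', 'sphere', 'yes'),
]
```

Running the check confirms the rules reproduce the Concept Satisfied column exactly.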
Association Rule
• Used to find all rules in basket data
• Basket data is also called transaction data
• Analyzes how items purchased by customers in a shop are related
• Discovers all rules that have:
– support greater than a user-specified minsup
– confidence greater than a user-specified minconf
• Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
![Page 40: Decision Trees and Association Rules Prof. Sin-Min Lee Department of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032606/56649eb35503460f94bba0fe/html5/thumbnails/40.jpg)
Association Rule
• Let I = {i1, i2, …, im} be the total set of items and D a set of transactions, where each transaction d consists of a set of items, d ⊆ I.
• Association rule: X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
![Page 41: Decision Trees and Association Rules Prof. Sin-Min Lee Department of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032606/56649eb35503460f94bba0fe/html5/thumbnails/41.jpg)
Association Rule
• Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
• I = {CD player, music CD, music book}
• |D| = 4
• # of transactions containing both CD player and music CD = 2
• # of transactions containing CD player = 3
• CD player ⇒ music CD (sup = 2/4, conf = 2/3)
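The support and confidence of this rule can be computed directly from the two definitions. A minimal sketch; the function names are mine.

```python
# The four example transactions from the slide
transactions = [
    {'CD player', 'music CD', 'music book'},
    {'CD player', 'music CD'},
    {'music CD', 'music book'},
    {'CD player'},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # confidence of the rule x => y
    return support(x | y) / support(x)
```

For the rule CD player ⇒ music CD this yields sup = 2/4 and conf = 2/3, as on the slide.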
Association Rule
• How are association rules mined from large databases?
• Two-step process:
– find all frequent itemsets
– generate strong association rules from the frequent itemsets
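The first step, finding all frequent itemsets, is usually done level-wise in the spirit of the Apriori algorithm. A minimal sketch under that assumption (the slides do not name Apriori here); it reuses the CD-player transactions from the earlier slide.

```python
def frequent_itemsets(transactions, minsup):
    """Level-wise search: build size-k candidates only from frequent
    size-(k-1) itemsets, since every subset of a frequent itemset
    must itself be frequent."""
    def sup(s):
        return sum(s <= t for t in transactions) / len(transactions)

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        # candidate k-itemsets: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if sup(c) >= minsup]
        frequent += level
        k += 1
    return frequent

# The CD-player transactions used as the running example above
tx = [
    {'CD player', 'music CD', 'music book'},
    {'CD player', 'music CD'},
    {'music CD', 'music book'},
    {'CD player'},
]
```

With minsup = 50% this finds three frequent single items and two frequent pairs; the only 3-itemset candidate falls below the threshold.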
Association Rules
• antecedent ⇒ consequent
– if ⟨antecedent⟩ then ⟨consequent⟩
– beer ⇒ diapers (Walmart)
– bad economy ⇒ higher unemployment
– higher unemployment ⇒ higher unemployment-benefit costs
• Rules are associated with a population, a support, and a confidence
Association Rules
• Population: instances such as grocery store purchases
• Support
– % of the population satisfying both antecedent and consequent
• Confidence
– % of cases in which the consequent is true when the antecedent is true
2. Association Rules: Support

Every association rule has a support and a confidence.
“The support is the percentage of transactions that demonstrate the rule.”
Example: Database with transactions ( customer_# : item_a1, item_a2, … )
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8, 12} = 2 (or 50%: 2 of 4 customers)
support {1, 5} = 1 (or 25%: 1 of 4 customers)
support {1} = 3 (or 75%: 3 of 4 customers)
2. Association Rules: Support

An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold.

Adding to the previous example:
if the threshold is 50%,
then the itemsets {8, 12} and {1} are called frequent.
2. Association Rules: Confidence

Every association rule has a support and a confidence.
An association rule is of the form X ⇒ Y.
• X ⇒ Y: if someone buys X, they also buy Y.
The confidence is the conditional probability that, given X present in a transaction, Y will also be present.
Confidence measure, by definition:
confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
2. Association Rules: Confidence

We should only consider rules derived from itemsets with high support that also have high confidence.
“A rule with low confidence is not meaningful.”
Rules don't explain anything; they just point out hard facts in data volumes.
3. Example

Database with transactions ( customer_# : item_a1, item_a2, … )

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.

conf( {5} ⇒ {8} )?
supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4, so conf( {5} ⇒ {8} ) = 4/5 = 0.8, or 80%.
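The computation can be reproduced over the ten transactions. A small sketch; `supp` and `conf` are shorthand names of mine, with `supp` returning the absolute count as on the slide.

```python
# The ten transactions from the slide, keyed by customer number
transactions = {
    1: {3, 5, 8},    2: {2, 6, 8},  3: {1, 4, 7, 10}, 4: {3, 8, 10},
    5: {2, 5, 8},    6: {1, 5, 6},  7: {4, 5, 6, 8},  8: {2, 3, 4},
    9: {1, 5, 7, 8}, 10: {3, 8, 9, 10},
}

def supp(itemset):
    # absolute support: number of transactions containing the itemset
    return sum(itemset <= t for t in transactions.values())

def conf(x, y):
    # confidence of the rule x => y
    return supp(x | y) / supp(x)
```

This gives conf( {5} ⇒ {8} ) = 4/5 = 80% and, for the reverse rule on the next slide, conf( {8} ⇒ {5} ) = 4/7 ≈ 57%.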
3. Example

conf( {5} ⇒ {8} )? 80%. Done.
conf( {8} ⇒ {5} )?
supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4, so conf( {8} ⇒ {5} ) = 4/7 ≈ 0.57, or 57%.
3. Example

conf( {5} ⇒ {8} )? 80%. Done.
conf( {8} ⇒ {5} )? 57%. Done.
Rule {5} ⇒ {8} is more meaningful than rule {8} ⇒ {5}.
3. Example

conf( {9} ⇒ {3} )?
supp({9}) = 1, supp({3}) = 4, supp({3, 9}) = 1, so conf( {9} ⇒ {3} ) = 1/1 = 1.0, or 100%. OK?
3. Example

conf( {9} ⇒ {3} ) = 100%. Done.
Notice: high confidence, low support.
⇒ Rule {9} ⇒ {3} is not meaningful.
Association Rules

• Population
– MS, MSA, MSB, MA, MB, BA
– M = Milk, S = Soda, A = Apple, B = Beer
• Support(M ⇒ S) = 3/6
– (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
• Confidence(M ⇒ S) = 3/5
– (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)