TRANSCRIPT
Classification trees
Hi
What should you get out of this?
• Obtain an introduction to classification tree modelling.
• Be able to follow our steps and reproduce them at home.
• Understand the usefulness of classification trees.
• A "manual" explaining our steps will be posted online.
Schedule
• Part 1 – Introduction to Classification Trees
• Part 2 – Hands-on example of a classification tree
• Part 3 – Examples of research using classification trees
• Part 4 – CHAID and Gini Index models
• Part 5 – Asymmetric payoff analysis
Let’s Begin!
• Part 1 – Michel van Dijck
• Part 2 – Timo van Dockum
• Part 3 & 4 – Stanisław Guner
• Part 5 – Ruud Moers
Part 1
Introduction to classification trees - Theory
Decision tree
• Root Node
• Internal Node
• Leaf Node
Binary vs. Multiway split
Binary attributes
Nominal attributes
Ordinal Attributes
Continuous attributes
Measures for selecting best split
Criteria for growing a tree
• Maximize Information gain
• Maximize Gain Ratio
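The two growing criteria above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names and the toy split are my own, using base-2 entropy as defined later in Part 2.

```python
# Sketch: information gain and gain ratio for a candidate split.
# The toy data below (3 "yes" / 7 "no", split into two branches) is
# hypothetical, chosen to mirror the borrower example later on.
from math import log2
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def gain_ratio(parent, children):
    """Information gain normalised by the split information, which
    penalises splits that produce many small branches."""
    n = len(parent)
    split_info = sum((len(ch) / n) * log2(n / len(ch)) for ch in children)
    return information_gain(parent, children) / split_info

parent = ["yes"] * 3 + ["no"] * 7
children = [["yes"] * 3 + ["no"] * 3, ["no"] * 4]  # a binary split
print(round(information_gain(parent, children), 2))  # prints: 0.28
print(round(gain_ratio(parent, children), 2))        # prints: 0.29
```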
Problems with decision trees
• Underfitting
– Not enough data
– Algorithm is "clueless"
• Overfitting
– Too many nodes
– Tree pruning
Part 2
How does it work in practice?
# Home Owner Marital Status Annual Income Defaulted Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Node 0: 10 people (Defaulted Borrower: Yes 3, No 7)

Candidate split on home ownership:
• No home owner: 7 people (Yes 3, No 4)
• Home owner: 3 people (Yes 0, No 3)

Candidate split on marital status:
• Single or divorced: 6 people (Yes 3, No 3)
• Married: 4 people (Yes 0, No 4)

Candidate split on income:
• Income < 100: 6 people (Yes 3, No 3)
• Income ≥ 100: 4 people (Yes 0, No 4)
Selecting the best split
Impurity:
• When all people belong to the same class, the node has zero impurity.
• When all people are equally split between the two classes, the node has the highest impurity.
Goal: minimize impurity.
Node 0: 10 people (Yes 3, No 7)
• No home owner: 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity
• Home owner: 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity
Degree of impurity
• Self-information is the amount of information you receive when an event occurs.
• The smaller the probability of an event, the greater its self-information.
• You want to minimize the expected amount of self-information to make more accurate predictions.
• Definition: log2(1/p) (as p increases, log2(1/p) decreases).
• Entropy is the expected value of the self-information.
• The entropy of Node 1 is: 3/7 · log2(1/(3/7)) + 4/7 · log2(1/(4/7)) ≈ 0.99
Node 1 (no home owner): 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity
Similarly, the entropy of Node 2 is: 0 · log2(1/0) + 1 · log2(1/1) = 0 (taking 0 · log2(1/0) = 0 by convention).

Node 2 (home owner): 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity
Node 0: 10 people (Yes 3, No 7), impurity = 0.88
• No home owner: 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity = 0.99
• Home owner: 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity = 0
• A split is an improvement if the weighted average of the impurities of the subnodes is smaller than the impurity of the root node.
Weighted impurity = 7/10 · 0.99 + 3/10 · 0 = 0.69 < 0.88 → improvement
Split based on weighted impurity:
• No split: 0.88
• Home owner yes/no: 0.69
• Marital status: 0.60
• Annual income below/above 100: 0.60
Let’s go for the split based on income.
Node 0: 10 people (Yes 3, No 7), impurity = 0.88
• Income ≥ 100: 4 people (Yes 0, No 4), impurity = 0 → no further split
• Income < 100: 6 people (Yes 3, No 3), impurity = 1.00 → split further on marital status:
– Married: 2 people (Yes 0, No 2), impurity = 0 → no further split
– Single or divorced: 4 people (Yes 3, No 1), impurity = 0.81
Weighted impurity of the marital split = 4/6 · 0.81 + 2/6 · 0 = 0.54 < 1.00 → improvement
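The entropy figures in this worked example can be reproduced in a few lines. This is a minimal sketch, assuming base-2 entropy as defined above; the function name is my own.

```python
# Reproducing the Part 2 impurity numbers for the ten borrower records.
from math import log2

def entropy(yes, no):
    """Base-2 entropy of a node with `yes`/`no` class counts."""
    total = yes + no
    h = 0.0
    for c in (yes, no):
        if c:  # 0 * log2(1/0) is taken as 0 by convention
            p = c / total
            h += p * log2(1 / p)
    return h

root = entropy(3, 7)                                     # Node 0
home = 7 / 10 * entropy(3, 4) + 3 / 10 * entropy(0, 3)   # home-owner split
income = 6 / 10 * entropy(3, 3) + 4 / 10 * entropy(0, 4) # income (or marital) split
print(round(root, 2), round(home, 2), round(income, 2))  # prints: 0.88 0.69 0.6
```

Any split whose weighted child entropy falls below the parent's entropy counts as an improvement.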
Part 3
Case Studies
Case 1 – Classifying an Iris
Setosa, Versicolor, Virginica
• Not enough knowledge to distinguish them by looking at them?
• Use the petal's width and its length.
• Follow the decision tree.
• The iris data set was introduced in 1936 by the British statistician Fisher.
• Different trees have been built over the last 80 years.
• Accuracy > 95%
Case 2 – Classification of osteoarthritis
List of characteristics of idiopathic and non-idiopathic OA.
M.D.s were surveyed to determine the classification attributes from the list of potential characteristics.
• X-ray as a first exam
• Physical examination as a first step
• Repeated synovial fluid test
Conclusions
Case 3 – Crash vs. non-crash unit-independent classification
Non-crash-specific factors.
They used the Gini impurity measure.
This equation finds the gain (in terms of impurity) from creating a new node.
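The equation itself was on the slide image; as a sketch, the Gini impurity and the gain from a split can be written as follows. The function names and the toy class counts are my own, for illustration only.

```python
# Sketch: Gini impurity of a node and the impurity gain from splitting it.
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, children):
    """Parent impurity minus the weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

parent = [3, 7]                 # hypothetical class counts
children = [[3, 3], [0, 4]]     # a candidate binary split
print(round(gini_gain(parent, children), 2))  # prints: 0.12
```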
VIM assesses the performance of a variable in producing splits.
VIM performance: the speed limit seems to be crucial in producing purer nodes.
Lift chart – Generic and baseline model
Lift chart – Generic and crash type models
Conclusions
Part 4
Modelling
Data Transformation
• Labels: "not donated" and "donated".
• Attributes must characterize donors in ways other than their label, so DONAMT is left out.
• SPSS: aggregate the test and train sheets into one data set; distinguish between the two with a new 0/1 attribute "train".
• Create standardized values of the attributes (for PCA).
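The standardization step can be sketched as a z-score computation; this is not the SPSS procedure itself, and the sample values below are made up for illustration.

```python
# Sketch: z-score standardisation of one attribute's values.
from math import sqrt

def zscores(values):
    """Standardise values to mean 0, standard deviation 1,
    using the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

donations = [10.0, 12.5, 7.0, 15.0, 9.0]  # hypothetical AVGDON-like values
z = zscores(donations)
print([round(v, 2) for v in z])  # prints: [-0.22, 0.58, -1.19, 1.38, -0.55]
```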
DATA REDUCTION
Correlations
Pearson correlations (N = 8137 for every pair):

          TIMELR   TIMECL   FRQRES   MEDTOR   AVGDON   LSTDON   ANNDON
TIMELR     1       -.063**  -.736**  -.109**  -.333**  -.048**  -.293**
TIMECL    -.063**   1        .056**   .083**   .057**   .031**  -.168**
FRQRES    -.736**   .056**   1       -.025*    .412**  -.025*    .353**
MEDTOR    -.109**   .083**  -.025*    1        .030**   .091**   .014
AVGDON    -.333**   .057**   .412**   .030**   1        .642**   .878**
LSTDON    -.048**   .031**  -.025*    .091**   .642**   1        .618**
ANNDON    -.293**  -.168**   .353**   .014     .878**   .618**   1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Correlation matrix
• We observe high correlations (absolute value above 0.5) between:
– FRQRES and TIMELR
– ANNDON and AVGDON
– LSTDON and AVGDON
– ANNDON and LSTDON
Principal Component Analysis – Scree plot
PCA – Cumulative variance explained

Total Variance Explained
Factor   Initial Eigenvalues              Extraction Sums of Squared Loadings
         Total   % of Var.  Cumul. %      Total   % of Var.  Cumul. %
1        1.959   39.172     39.172        1.631   32.619     32.619
2        1.153   23.063     62.235         .532   10.642     43.261
3         .965   19.306     81.542
4         .674   13.480     95.021
5         .249    4.979    100.000
Extraction Method: Principal Axis Factoring.
Communalities
                 Initial   Extraction
Zscore(TIMELR)    .559      .675
Zscore(TIMECL)    .052      .443
Zscore(FRQRES)    .575      .804
Zscore(MEDTOR)    .043      .015
Zscore(ANNDON)    .164      .225
Extraction Method: Principal Axis Factoring.
Factor loadings – 2 factors found

Factor Matrix(a)
                 Factor 1   Factor 2
Zscore(TIMELR)    -.818      -.074
Zscore(TIMECL)     .034       .665
Zscore(FRQRES)     .896       .018
Zscore(MEDTOR)     .047       .114
Zscore(ANNDON)     .393      -.267
Extraction Method: Principal Axis Factoring.
a. Attempted to extract 2 factors. More than 100 iterations required (convergence = .001). Extraction was terminated.
PCA - Conclusions
• Helps us understand the dimensionality of the data.
• Informs us about the uniqueness of attributes.
• Tells us what to expect in terms of the complexity of a decision tree.
MODEL 1 – SPSS CHAID
SPSS CHAID model
Notice:
1) 2–3 levels
2) Attributes can be split over a range
SPSS – Right side
SPSS – Left side
Confusion matrix (Classification)

Sample     Observed             Predicted                      Percent correct
                                donated      not donated
Training   donated                899            510               63.8%
           not donated            523           2125               80.2%
           Overall percentage    35.1%          64.9%              74.5%
Test       donated                837            569               59.5%
           not donated            574           2100               78.5%
           Overall percentage    34.6%          65.4%              72.0%

Growing method: CHAID
Dependent variable: label
Ratios
• Overall error rate = (574 + 569) / 4080 = 28.0%
• Overall accuracy = (837 + 2100) / 4080 = 72.0%
• Sensitivity = 837 / (837 + 569) = 59.5%
• Specificity = 2100 / (2100 + 574) = 78.5%: the ability to rule out non-donors correctly.
• False positive rate = 574 / (574 + 2100) = 21.5% (1 − specificity)
• False negative rate = 569 / (569 + 837) = 40.5% (1 − sensitivity)
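The test-set ratios can be recomputed directly from the confusion matrix. A minimal sketch; the variable names are my own.

```python
# Recomputing the test-set ratios from the confusion matrix above.
tp, fn = 837, 569    # donated:     predicted donated / predicted not donated
fp, tn = 574, 2100   # not donated: predicted donated / predicted not donated
total = tp + fn + fp + tn  # 4080 test records

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on actual donors
specificity = tn / (tn + fp)   # ability to rule out non-donors
print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
# prints: 0.72 0.595 0.785
```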
Lift chart
[Figure: model response vs. naive response as a function of the number of mailings; y-axis 0–1600 responses, x-axis 0–4000 mailings]
Why build this model?
• Pros:
– Intuitive
– Easy to read
– Pruned
– Provides a good knowledge overview
MODEL 2 – RAPIDMINER GINI INDEX
Criteria: based on the Gini criterion
RM Gini index model
Performance
The overall accuracy of the different models was very similar.
Lift chart
[Figure: expected number of donors (y-axis, 0–1600) vs. cumulative number of letters sent (x-axis, 0–4000); number of responses from the model vs. the naive classifier]
Part 5 – Payoff analysis
Asymmetric payoff analysis – different approaches – a challenge
• Method 1: Focus on probabilities and select records based on the chance of being a donor.
• Method 2: Include in the cut-off calculation the amount each client group generates.
Method 1
• Select all records with p > 0.5 – classified as donors.
• Add records with p between 0.3 and 0.5 – captures more donors.
• Or select a random sub-sample from the non-donors (20% of records).
– SIMPLE!
Method 2: Asymmetric payoff analysis
• The previous model has high accuracy.
• However, should accuracy be your goal?
Asymmetric payoff analysis
• No: you should aim to get as much profit as possible out of your actions.
• Send a letter whenever its expected revenue is positive ⇒ a letter only costs 50 cents.
• E.g., in one of the groups only one in ten people is expected to donate, but when that person is willing to donate 10 euros, the expected revenue still exceeds the cost.
Solution: Expected Revenue
• Calculated as: expected average donation per person × probability that somebody donates − cost of a letter.
• Take the expected donation from the overall average donation ⇒ €7.01.
Expected Revenues
• Expected revenue of the group at node 10: 0.068 × 7.01 − 0.5 = −0.0235 ⇒ negative, so do not send a letter.
• Expected revenue of the group at node 9: 0.205 × 7.01 − 0.5 = 0.9366 ⇒ even though only a low percentage donates, you should send them a letter.
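The decision rule above is easy to sketch in code. The constants come from the slides; the function and variable names are my own, and the per-node probabilities are the rounded slide values.

```python
# Sketch: mail a node only if p(donate) * expected donation - letter cost > 0.
AVG_DONATION = 7.01  # overall average donation (euros)
LETTER_COST = 0.50   # cost per letter (euros)

def expected_revenue(p_donate):
    """Expected revenue (euros) of mailing one person in a node."""
    return p_donate * AVG_DONATION - LETTER_COST

# Node 10 (p ≈ 0.068) is negative, node 9 (p ≈ 0.205) is positive.
for node, p in [(10, 0.068), (9, 0.205)]:
    rev = expected_revenue(p)
    print(node, round(rev, 2), "send" if rev > 0 else "skip")
```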
Expected Revenues
• Expected Revenues for every group at every node
• Conclusion: if the cost of sending a letter were to increase by more than the amount given in the table, you would stop sending letters to the group at that node.
• Consequences for the classification tree:
Node Number Expected Revenue
1 0.0676
9 0.9366
10 -0.0235
11 1.6444
12 0.5652
13 0.3900
14 1.1749
16 1.4272
17 2.4083
18 3.2353
19 2.4644
20 0.7825
21 3.2913
22 1.6655
23 3.4035
24 4.5107
25 2.8708
26 5.3096
27 1.4692
28 3.1372
Old Classification Tree
Change in Classification Tree
• Tremendous simplification of the tree
• However, (0.182, 0.234] is a strange interval
Further analyses
• Now take into account that every group will probably donate a different average amount.
• Calculate this amount for every specific group and determine the expected revenue per person per group.
• Now the group at node 1 will also not receive a letter.
Node Number   Expected Revenue
1                 -0.3089
9                  0.2981
10                -0.2733
11                 1.3297
12                 0.0796
13                 0.2141
14                 0.7444
16                 1.4505
17                 2.6719
18                 3.7164
19                 2.3553
20                 0.4315
21                 5.1068
22                 2.6946
23                 5.5672
24                 7.9259
25                 6.7790
26                11.4221
27                 1.4036
28                 3.7041
New Classification Tree
• Still, there are some remarks.
Analysis of the new tree
• Consider whether you are satisfied with forecasting using only the frequency of response and the time since the last response.
• The tree does make sense → if there has been no response for a long time, most likely no donation.
• Results were obtained with out-of-sample forecasting using the other data set.
Out-of-sample forecasting* (error)
• If we apply the last model, we would in the end have a total revenue of €22,373 and send out 3,046 letters.
• Compared to the model in which we only send letters to those we expect to donate: that would have a total revenue of €15,544 and send 1,477 letters, which is indeed lower.
• The model in which we just use the average donation of all (ex-)donors gives us a total revenue of €22,460, and we would have to send 3,270 letters.
• Even better would be sending a letter to everybody, which would lead to a total revenue of €23,257.50.
Discussion
• Unfortunately, 'our' model has been beaten by the most simplistic model of sending a letter to everybody, at least for the data we forecasted.
• However, we did deliver some insight into whom you might not want to send letters if the cost per letter were to increase.