decision trees 7.1 cot5230 data mining week 7 decision trees m o n a s h a u s t r a l i a ’ s i n...

Decision Trees 7.1

COT5230 Data Mining

Week 7

Decision Trees

M O N A S HA U S T R A L I A ’ S I N T E R N A T I O N A L U N I V E R S I T Y

Decision Trees 7.2

Outline

Why use Decision Trees?

What is a Decision Tree?

Examples

Use as a data mining technique

Popular Models– CART

– CHAID

– ID3 & C4.5

Decision Trees 7.3

Why use Decision Trees? - 1

Whereas neural networks compute a mathematical function of their inputs to generate their outputs, decision trees use logical rules

IFthe vehicle has a 2-door FRAME ANDthe vehicle has at least six CYLINDERS ANDthe BUYER is less than 50 years old ANDthe COST of the vehicle is >$30,000 ANDthe vehicle COLOUR is red

THENthe BUYER is likely to be a MALE

Decision Trees 7.4


For some applications accuracy of classification or prediction is sufficient, e.g.:

– Direct mail firm needing to find a model for identifying customers who will respond to mail

– Predicting the stock market using past data

In other applications it is better (sometimes essential) that the decision be explained, e.g.:

– Rejection of a credit application

– Medical diagnosis

Humans generally require explanations for most decisions

Decision Trees 7.5


Example: When a bank rejects a credit card application, it is better to explain to the customer

that it was due to the fact that:– He is not a permanent resident of Australia AND

– He has been residing in Australia for < 6 months AND

– He does not have a permanent job.

This is better than saying:

“We are very sorry, but our neural network thinks that you are not a credit-worthy customer !!!” (In which case the customer might become angry and move to another bank)

Decision Trees 7.6

What is a Decision Tree? - 1

Decision Trees 7.7

What is a Decision Tree? - 2

Built from root node (top) to leaf nodes (bottom)

A record first enters the root node

A test is applied to determine to which child node it should go next

– A variety of algorithms for choosing the initial test exist. The aim is to discriminate best between the target classes

The process is repeated until a record arrives at a leaf node

The path from the root to a leaf node provides an expression of a rule

Decision Trees 7.8

Building a Decision Tree - 1

Algorithms for building decision trees (DTs) begin by trying to find the test which does the “best job” of splitting the data into the desired classes

– The desired classes have to be identified at the start

Example: we need to describe the car purchasing profiles of males and females. The DT building algorithm examines the car purchase details database to find the best splitting criterion:

The DT algorithm may discover out that theNo_of _doors variable is best for separating males and females

Engine type No_of_doors Age_of_customer Colour Cost Type

Decision Trees 7.9

Building a Decision Tree - 2

The process is repeated to discover the best splitting criterion for the records assigned to each node

Once built, the effectiveness of a decision tree can be measured by applying it to a collection of previously unseen records and observing the percentage of correctly classified records

Engine type

No_of_doors

Engine type

42

V6 V4

Decision Trees 7.10

Example - 1

Requirement: Classifycustomers who churn(do not renew theirphone contracts)

new

Phone Technology

50 Churners50 Non-churners

old

20 Churners 0 Non-churners

Time a Customer



> 2.3 years<= 2.3 years

Age



20 Churners 0 Non-churners

<= 35 > 35

Decision Trees 7.11

Example - 2

The number of records in a given parent node equals the sum of the records contained in the child nodes

Quite easy to understand how the model is being built (unlike NNs)

Easy use the model– say for a target marketing campaign

Provides intuitive ideas about the customer base– e.g: “Customers who have been with the company for a

couple of years and have new phones are pretty loyal”

Decision Trees 7.12

Use as a data mining technique - 1

Exploration – Analyzing the and splitting criteria selected by the

algorithm may provide interesting insights which can be acted upon

– e.g. if the following rule was identified:IF

time a customer < 1.1 years ANDsales channel = telesales

THEN chance of churn is 65%

– It might be worthwhile conducting a study on the way the telesales operators are making their calls

Decision Trees 7.13


Exploration (continued)– Gleaning information from rules that fail

– e.g. from the car example we obtained the rule:

IFNo_of_doors = 2 ANDEngine_type = (>= V6)AND Customer_Age > 50

THEN there are almost no customers

– Can this rule be useful?» A) Perhaps we can attempt to build up this non-existing

market. If this is possible then we have the edge over competitors since we have a head start in this knowledge

» B) We can remove these customers from our direct marketing campaign

Decision Trees 7.14


Exploration (continued)– Again from the car example we noticed that:

» There was no combination of rules to reliably discriminate between males and females for 4-door V4 engine vehicles

» Do we consider this as an occasion where it was not possible to achieve our objective?

– From this failure we have learnt that it is not important whether the customer is male or female for this category of vehicles

– Perhaps we were asking the wrong question all along - this warrants further analysis

Decision Trees 7.15


Data Pre-processing– Decision trees are very robust at handling different

predictor types (number/categorical), and run quickly. Therefore the can be good for a first pass over the data in a data mining operation

– This will create a subset of the possibly useful predictors which can then be fed into another model, say a neural network

Prediction – Once the decision tree is built it can be then be used

as a prediction tool, by using on a set of new data

Decision Trees 7.16

Popular Decision Tree Models: CART - 1

CART: Classification & Regression Trees, developed in 1984 by a team of researchers (Leo Breiman et al.) from Stanford University

– Used in the DM software Darwin - from Thinking Machines Corporation (recently bought by Oracle)

Uses an entropy metric to determine the split point (Information theory - Shannon 1949)

Measure of disorder =

where p is is the probability of that prediction value occurring in a particular node of the tree

- CART produces a binary tree

)log(pp

Decision Trees 7.17

Popular Decision Tree Models: CART - 2

Consider the “Churn” problem from slide 7.10 – At a certain node there are 100 customers to split

– The algorithm will try each predictor using the measure of disorder (MOD)

– For each predictor the algorithm will calculate the MOD for several values to identify the optimum

– e.g: take age as the predictor and split the records according to age <=32 and age >32

» 100 total customers» 30 who churn» 70 who don’t churn

– CART will select the predictor with the lowest MOD as the split point

0.4512 ))0.7log(0.7 3)(0.3log(0.- MOD

Decision Trees 7.18

Node splitting

A good split

Name Churned? Name Churned?

Jim Yes Bob No

Sally Yes Betty No

Steve Yes Sue No

Joe Yes Alex No

A bad split

Name Churned? Name Churned?

Jim Yes Bob No

Sally Yes Betty No

Steve No Sue Yes

Joe No Alex Yes

Decision Trees 7.19

Popular Decision Tree Models: CHAID

CHAID: Chi-squared Automatic Interaction Detector, developed by J. A. Hartigan in 1975.

– Widely used since it is distributed as part of the popular statistical packages SAS and SPSS

Differs from CART in the way it identifies the split points. Instead of the information measure, it uses chi-squared test to identify the split points (a statistical measure used for identifying independent variables)

All predictors must be categorical or put into categorical form by binning

The accuracy of the two methods CHAID and CART have been found to be similar

Decision Trees 7.20

Popular Decision Tree Models: ID3 & C4.5

ID3: Iterative Dichtomiser, developed by the Australian researcher Ross Quinlan in 1979

– Used in the data mining software Clementine of Integral Solutions Ltd. (taken over by SPSS)

-ID3 picks predictors and their splitting values on the basis of the gain in information provided

– Gain is the difference between the amount of information that is needed to make a correct prediction both before and after the split has been made

– If the amount of information required is much lower after the split is made, then the split is said to have decreased the disorder of the original data

Decision Trees 7.21

ID3 & C4.5 continued - 2

By using the entropy

left right left entropy right entropy start entropy

A ++++- +---- -4/5log(4/5)+ -4/5log(4/5)+ -5/10log(5/10)+ -1/5log(1/5)=.72 -1/5log(1/5)=.72 -5/10(log(5/10)

= 1

B +++++ - -5/9log(5/9)+ -1/1log(1/1)---- -4/9log(4/9)=.99 =0

A B

++++ - + ---- +++++---- -

)log(pp

Decision Trees 7.22

ID3 & C4.5 continued - 3

Split A will be selected

C4.5 introduces a number of extensions to ID3:– Handles unknown field values in training set

– Tree pruning method

– Automated rule generation

Weighted Entropy Gain

A (5/10)*0.72+(5/10)*0.72 0.28 = 0.72

B (9/10)*0.99+(1/10)*0 0.11= 0.89

Decision Trees 7.23

Strengths and Weaknesses

Strengths of decision trees– Able to generate understandable rules

– Classify with very little computation

– Handle both continuous and categorical data

– Provides a clear indication of which variables are most important for prediction or classification

Weaknesses– Not appropriate for estimation or prediction tasks

(income, interest rates, etc.)

– Problematic with time series data (much pre-processing required), can be computationally expensive

decision trees 7.1 cot5230 data mining week 7 decision trees m o n a s h a u s t r a l i a ’ s i n...

Documents

use decision trees

building decision trees

bank slide

decisions slide

male slide

rule slide

customercolourcosttype

root node