TRANSCRIPT
7/24/2019 DWDM2015 Classi
-
DATA WAREHOUSING and DATA
MINING
Classification, Trees
Saji K Mathew, PhD
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Chennai, India
-
Scenario-I
Imagine you are pursuing a direct marketing program:
Direct mail marketing budget = 12 mn
Cost per mailing = 50/-
You need to target the customer base cost-effectively to maximize profit.
Whom do I include? How many do I include?
How do you go about it?
-
Scenario-II
Suppose a catalog company has a database of 20 mn names.
Suppose they choose to send 2 million copies of the Summer Bonanza catalogue.
Further, suppose the avg. order size is 1500.
Suppose somehow you could increase the response rate from 5% to 6%.
That is 20,000 more orders = 30 mn. A 1% increase means a great lift!
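The arithmetic behind this lift can be checked with a quick sketch (all figures are taken from the scenario above):

```python
# Revenue gained from a 1-point lift in response rate on a 2 mn mailing,
# with an average order size of 1,500 (figures from the scenario above).
mailings = 2_000_000
avg_order_size = 1500
base_rate, improved_rate = 0.05, 0.06

extra_orders = round(mailings * (improved_rate - base_rate))
extra_revenue = extra_orders * avg_order_size

print(extra_orders)    # 20000 additional orders
print(extra_revenue)   # 30000000, i.e. 30 mn
```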
-
Scoring models: supervised data mining for classification

Simple linear regression:
Response = 0.1*frequency - 0.2*recency
Response = -0.1*age + 0.2*income
Logistic regression:
Score (probability) = the same linear combination, e.g. 0.1*frequency - 0.2*recency, passed through the logistic function so it lies in (0, 1)
Classification And Regression Trees (CART)
Rules
ANN
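As a minimal sketch of the logistic variant (the coefficients are the illustrative ones from the slide, not fitted values, and the input figures are hypothetical):

```python
import math

def linear_score(frequency, recency):
    # Linear scoring model from the slide: Response = 0.1*frequency - 0.2*recency
    return 0.1 * frequency - 0.2 * recency

def logistic_score(frequency, recency):
    # Logistic regression passes the same linear combination through the
    # sigmoid, so the score can be read as a probability in (0, 1).
    z = linear_score(frequency, recency)
    return 1.0 / (1.0 + math.exp(-z))

p = logistic_score(frequency=12, recency=2)  # a frequent, recent buyer
print(round(p, 2))  # 0.69
```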
-
Evaluating a scoring model
Fit: how well the model fits the data (R2)
Validation: checks generalizability
Performance: how useful the model is for action (Lift)
-
Lift
Suppose we look at a random 10% of the potential customers and expect to get an average response rate of R% (without doing any data mining).
If instead we select 10% of the most likely customers using data mining and get a higher response rate of G%, then we realize a lift (= G/R).
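The definition is a one-liner; the 15%/5% figures below are hypothetical:

```python
def lift(model_rate, random_rate):
    # Lift = G / R: response rate among the model-selected customers (G)
    # relative to the response rate of a random selection (R).
    return model_rate / random_rate

# Hypothetical example: a random 10% responds at 5%, while the
# model-scored top 10% responds at 15% -> a lift of about 3.
print(lift(0.15, 0.05))
```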
-
Gains table
-
The cumulative gains chart
[Chart: cumulative gains curves for the proposed model vs. a random model; the gap between the two curves is the lift]
-
Classifier performance

Measure                                  Formula
Accuracy, recognition rate               (TP + TN) / (P + N)
Error rate, misclassification rate       (FP + FN) / (P + N)
Sensitivity, true positive rate, recall  TP / P
Specificity, true negative rate          TN / N
Precision                                TP / (TP + FP)

Confusion matrix (actual class in rows, predicted class in columns):

actual \ predicted   yes   no    total
yes                  TP    FN    P
no                   FP    TN    N
total                P'    N'    P + N
-
Confusion (classification) matrix

                Predicted Yes   Predicted No
Actual Yes          800             50
Actual No            50            100

Accuracy = 900/1000
Error = 100/1000
Sensitivity (accuracy on Yes) = 800/850
Specificity (accuracy on No) = 100/150
Precision = 800/850
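These figures can be reproduced directly from the matrix:

```python
# Confusion matrix from the slide:
#               Predicted Yes   Predicted No
# Actual Yes        800              50
# Actual No          50             100
TP, FN = 800, 50
FP, TN = 50, 100
P, N = TP + FN, FP + TN   # actual positives and negatives

accuracy    = (TP + TN) / (P + N)   # 900/1000
error_rate  = (FP + FN) / (P + N)   # 100/1000
sensitivity = TP / P                # 800/850
specificity = TN / N                # 100/150
precision   = TP / (TP + FP)        # 800/850 (equal to sensitivity here)

print(accuracy, sensitivity, specificity)
```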
-
Decision Tree
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a target variable.
This structure divides a large collection of records into successively smaller sets of records using simple decision rules; the resulting sets become more homogeneous.
The target variable is generally categorical; input variables could be any combination of categorical or metric variables.
-
Rules (if-then) in fraud detection at HSBC

3 claims in the last 2 years
Credit card used in different locations
Credit card used at a petrol station and then in a high-value store!
-
A case of internal fraud
A bank auditor found that the credit card balances written off as uncollectible had an excessive number of amounts with first two digits 24.
The investigation found that $2,500 was an internal write-off limit.
One employee was responsible for most of the 24s: working with friends, he had them apply for a card and run up a balance to just below $2,500. The employee then wrote the debt off.
The systematic nature of the fraud was evident from the first two digits.
-
Growing decision trees
-
How to grow a tree?
Purity measures: Gini, Entropy, etc.
Lift: measures how improved a class formed by a decision tree rule is compared to the original class
-
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains items from n classes, the gini index gini(D) is defined as

    gini(D) = 1 - sum_{j=1..n} p_j^2

where p_j is the relative frequency of class j in D.

If a data set D is split on attribute A into two subsets D1 and D2, the gini index gini_A(D) is defined as

    gini_A(D) = (|D1|/|D|) * gini(D1) + (|D2|/|D|) * gini(D2)

Reduction in impurity:

    Δgini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (i.e., the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
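The two formulas can be sketched directly from class counts (the counts below are illustrative):

```python
def gini(counts):
    # gini(D) = 1 - sum_j p_j^2, with p_j the relative frequency of class j.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

parent = [5, 5]      # a 50/50 node: maximally impure for two classes
print(gini(parent))  # 0.5
# Reduction in impurity, gini(D) - gini_A(D), for a fairly clean split:
print(gini(parent) - gini_split([4, 1], [1, 4]))
```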
-
Calculate Gini for each node and the split
-
Entropy

Generically,

    Entropy(D) = - sum_i p_i * log2(p_i)

where i indexes the classes and p_i is the relative frequency of class i in D.
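A sketch of the computation, parallel to the Gini one (class counts are illustrative):

```python
import math

def entropy(counts):
    # Entropy(D) = -sum_i p_i * log2(p_i), summed over the classes i
    # present in the node (classes with zero count contribute nothing).
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([5, 5]))    # 1.0: a 50/50 node has maximal entropy
print(entropy([10, 0]))   # a pure node has zero entropy
```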
-
CART in R
Classification and Regression Trees (CART): developed by the statisticians Breiman et al. in the 1980s
Uses the Gini index as the splitting criterion
Package rpart implements CART in R
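rpart itself is R; as a language-neutral sketch of the split search it automates (all data and names below are illustrative assumptions, not rpart's API), here is a one-variable Gini split chooser:

```python
def gini(labels):
    # Node impurity: 1 - sum of squared class frequencies.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    # Enumerate midpoints between sorted values and keep the threshold
    # with the smallest weighted Gini index -- the CART splitting rule.
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

income = [25, 30, 60, 90, 100, 120]   # hypothetical input variable
responded = [0, 0, 1, 1, 1, 1]        # hypothetical target
print(best_split(income, responded))  # (45.0, 0.0): a perfectly pure split
```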
-
In September, the company awarded a $1 million prize to a team of engineers, statisticians and researchers that improved the accuracy of its movie recommendation system by 10%. At the same time, the company launched another $1 million competition with the aim of predicting movie enjoyment among members who don't often rate what they watch.