topics in business intelligence k-nn & naive bayes – group 1 isabel van der lijke nathan bok...
Post on 29-Jan-2016
TOPICS IN BUSINESS INTELLIGENCE
K-NN & Naive Bayes – GROUP 1
Isabel van der Lijke, Nathan Bok, Gökhan Korkmaz
INTRODUCTION K-NN
k-NN Classifier (Categorical Outcome)
Determining Neighbors
Classification Rule
Example: Riding Mowers
Choosing k
Setting the Cutoff Value
Advantages and shortcomings of k-NN algorithms
INTRODUCTION NAIVE BAYES
Basic Classification Procedure
Cutoff Probability Method
Conditional Probability
Naive Bayes
Advantages and shortcomings of the naive Bayes classifier
SIMPLE CASE APPLICATION
Depression
SIMPLE CASE APPLICATION
Fruits
Example: P(Banana) = 500 / 1000 = 0,5
P(Not banana) = 1 - 0,5 = 0,5
For a new fruit, compute the probability of every class and pick the most likely one.

               Sweet   Not sweet   Total
Banana           350         150     500
Orange           150         150     300
Other fruit      150          50     200
Total            650         350    1000
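
The fruit counts are enough to work through the classification of a new fruit by hand. The following is a minimal sketch in Python, assuming sweetness is the only predictor used; the names `counts` and `posterior` are illustrative and do not come from the slides.

```python
# Counts from the fruit table: (sweet, not sweet) per class.
counts = {
    "Banana": (350, 150),
    "Orange": (150, 150),
    "Other fruit": (150, 50),
}
total = sum(s + n for s, n in counts.values())  # 1000 fruits


def posterior(sweet: bool) -> dict:
    """Return P(class | sweetness) via Bayes' rule: prior * likelihood / evidence."""
    idx = 0 if sweet else 1
    joint = {}
    for fruit, row in counts.items():
        prior = sum(row) / total          # e.g. P(Banana) = 500 / 1000
        likelihood = row[idx] / sum(row)  # e.g. P(Sweet | Banana) = 350 / 500
        joint[fruit] = prior * likelihood
    evidence = sum(joint.values())        # e.g. P(Sweet) = 650 / 1000
    return {fruit: p / evidence for fruit, p in joint.items()}


probs = posterior(sweet=True)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With several predictors, naive Bayes would multiply one likelihood per predictor in the same way, treating the predictors as independent given the class.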
REAL-LIFE APPLICATION NAIVE BAYES
Medical Data Classification with Naive Bayes Approach
Introduction
Requirements for systems dealing with medical data
An empirical comparison
Tables
Conclusion
TABLE 2: COMPARATIVE ANALYSIS BASED ON PREDICTIVE ACCURACY
TABLE 3: COMPARATIVE ANALYSIS BASED ON AREA UNDER ROC CURVE (AUC)
REAL-LIFE APPLICATION K-NN
Used to help health care professionals in diagnosing heart disease.
Useful for pattern recognition and classification.
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
Data are often normalized because the variables are measured on different scales.
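
The combination of normalization, Euclidean distance and a majority vote among the k nearest neighbors can be sketched as follows. This is a minimal illustration, not the method used in the cited application: the toy records (age, cholesterol) and the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    scaled = [tuple((v - m) / s for v, m, s in zip(r, mus, sigmas)) for r in rows]
    return scaled, mus, sigmas


def euclidean(a, b):
    """Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    scaled, mus, sigmas = zscore_columns(train_x)
    q = tuple((v - m) / s for v, m, s in zip(query, mus, sigmas))
    nearest = sorted(zip(scaled, train_y), key=lambda p: euclidean(p[0], q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy data: (age, cholesterol) -> 1 = heart disease, 0 = healthy.
X = [(35, 180), (42, 190), (50, 240), (61, 280), (58, 260), (39, 200)]
y = [0, 0, 1, 1, 1, 0]
print(knn_predict(X, y, (55, 250), k=3))
```

Without the normalization step, the variable with the largest numeric range (here cholesterol) would dominate the distance.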
CASE STUDY
“Our customer is a Dutch charity organization that wants to be able to classify its supporters as donators and non-donators. The non-donators are sent a single marketing mail a year, whereas the donators receive multiple ones (up to 4).”
Who are the donators? Who are the non-donators?
Application of K-NN & Naive Bayes to training and test datasets.
4000 customers.
Tools: SPSS, Excel, XLMiner
CLEAN-UP
No missing values
1-dimensional outliers removed through sorting (regarding annual & average donation)
2-dimensional outliers removed through scatterplot
Variables Kept
Average donation
Frequency of Response
Median Time of Response
Time as client
Variables removed
Annual donation
Last donation
Time since last response
Normalization of scores into z-scores.
Nominal categorization of data.
Classification through percentiles of z-score & by manually processing values within the variables.
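
The percentile-based part of this preprocessing can be sketched as below. The sketch assumes four categories with quartile cut points, which the slides do not state, and the donation amounts are hypothetical; the manual adjustments mentioned above are not reproduced.

```python
from statistics import mean, pstdev, quantiles


def zscores(values):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def to_categories(z, n_bins=4):
    """Map each z-score to a category 1..n_bins using percentile cut points."""
    cuts = quantiles(z, n=n_bins)  # n_bins - 1 cut points
    return [1 + sum(v > c for c in cuts) for v in z]


# Hypothetical average donations in euros (not taken from the case data).
avg_donation = [5, 7.5, 10, 10, 12.5, 15, 20, 25, 40, 60]
z = zscores(avg_donation)
cats = to_categories(z)
print(list(zip(avg_donation, cats)))
```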
ANALYSIS OF CASE STUDY – K-NN
Validation Data Scoring - Summary Report (for k = 13)
Error Report
Class      # Cases   # Errors   % Error
0             1083        180   16,62049861
1              536        260   48,50746269
Overall       1619        440   27,17726992

Classification Confusion Matrix
                Predicted Class
Actual Class       0       1
0                903     180
1                260     276
CHOOSING MODEL FOR K-NN
Accuracy: the proportion of correctly classified instances.
Error rate: 1 – accuracy.
Sensitivity: the proportion of actual positives that the classifier correctly identifies as positives.
Specificity: the proportion of actual negatives that the classifier correctly identifies as negatives.
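
These four measures can be computed directly from a confusion matrix. The sketch below uses the validation matrix reported for k = 13 and assumes that class 1 (donator) is the positive class; the function name is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the four measures from a 2x2 confusion matrix (class 1 = positive)."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Validation confusion matrix for k = 13: rows = actual class, columns = predicted class.
m = classification_metrics(tn=903, fp=180, fn=260, tp=276)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The resulting error rate matches the overall error of about 27,18 % in the error report.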
                                                             M1        M2
Selecting everyone in validation data                        €711.20   €662.80
Selecting while correcting for sensitivity and specificity   €583.60   €530.80
APPLICATION OF MODEL ON TEST DATA
Classification Confusion Matrix
                Predicted Class
Actual Class       0       1
0               2300     344
1                654     750

Error Report
Class      # Cases   # Errors   % Error
0             2644        344   13,01059
1             1404        654   46,5812
Overall       4048        998   24,65415
ANALYSIS OF THE CASE STUDY – NAIVE BAYES
Classification Confusion Matrix
                Predicted Class
Actual Class       0       1
0                856     229
1                225     309

Error Report
Class      # Cases   # Errors   % Error
0             1085        229   21,10599
1              534        225   42,13483
Overall       1619        454   28,042
M1 = Cfrqres & Cavgdon
M2 = Cfrqres, Cavgdon & Cmedtor
Conditional probabilities by class
                  Class 0              Class 1
Input variable    Value   Prob         Value   Prob
CFRQRES           1       0,71977      1       0,297483
                  2       0,171465     2       0,255149
                  3       0,06398      3       0,19222
                  4       0,044786     4       0,255149
CAVGDON           1       0,632758     1       0,297483
                  2       0,272553     2       0,471396
                  3       0,076136     3       0,16476
                  4       0,018554     4       0,066362
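
To show how these conditional probabilities turn into a classification, here is a minimal sketch of the naive Bayes calculation for model M1. The class priors are an assumption: the slides do not state them, so the sketch uses the class frequencies of the validation data (1085 and 534 of 1619 cases). The cutoff of 0.5 and the function names are likewise illustrative.

```python
# Conditional probabilities P(value | class) read from the table above.
cond = {
    "CFRQRES": {
        0: {1: 0.71977, 2: 0.171465, 3: 0.06398, 4: 0.044786},
        1: {1: 0.297483, 2: 0.255149, 3: 0.19222, 4: 0.255149},
    },
    "CAVGDON": {
        0: {1: 0.632758, 2: 0.272553, 3: 0.076136, 4: 0.018554},
        1: {1: 0.297483, 2: 0.471396, 3: 0.16476, 4: 0.066362},
    },
}

# Assumption: priors taken from validation class counts (1085 vs 534 cases).
prior = {0: 1085 / 1619, 1: 534 / 1619}


def posterior(record):
    """Naive Bayes: multiply the prior by each P(value | class), then normalize."""
    score = {}
    for cls, p in prior.items():
        for var, value in record.items():
            p *= cond[var][cls][value]
        score[cls] = p
    total = sum(score.values())
    return {cls: s / total for cls, s in score.items()}


def classify(record, cutoff=0.5):
    """Predict class 1 (donator) when its posterior reaches the cutoff."""
    return 1 if posterior(record)[1] >= cutoff else 0


print(classify({"CFRQRES": 4, "CAVGDON": 2}))
print(classify({"CFRQRES": 1, "CAVGDON": 1}))
```

Lowering the cutoff would classify more supporters as donators, trading specificity for sensitivity.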
                      Model 1     Model 2
Selecting everyone    €1072       €1006
Selecting by class    €2460.82    €2378.01
APPLICATION OF MODEL ON TEST DATA
Conditional probabilities by class
                  Class 0              Class 1
Input variable    Value   Prob         Value   Prob
CFRQRES           1       0,714502     1       0,306108
                  2       0,173338     2       0,254261
                  3       0,066465     3       0,18608
                  4       0,045695     4       0,253551
CAVGDON           1       0,630287     1       0,3125
                  2       0,280589     2       0,461648
                  3       0,068353     3       0,168324
                  4       0,02077      4       0,057528
Classification Confusion Matrix
                Predicted Class
Actual Class       0       1
0               2096     548
1                570     834

Error Report
Class      # Cases   # Errors   % Error
0             2644        548   20,72617
1             1404        570   40,59829
Overall       4048       1118   27,61858
LOOKING AT BOTH MODELS
QUESTIONS?