Post on 29-Jan-2016

TOPICS IN BUSINESS INTELLIGENCE: K-NN & Naive Bayes – GROUP 1

Isabel van der Lijke, Nathan Bok, Gökhan Korkmaz

INTRODUCTION K-NN

k-NN Classifier (Categorical Outcome)
Determining Neighbors
Classification Rule
Example: Riding Mowers
Choosing k
Setting the Cutoff Value
Advantages and shortcomings of k-NN algorithms

INTRODUCTION NAIVE BAYES

Basic Classification Procedure
Cutoff Probability Method
Conditional Probability
Naive Bayes
Advantages and shortcomings of the naive Bayes classifier

SIMPLE CASE APPLICATION

Depression

SIMPLE CASE APPLICATION

Fruits

Example: P(Banana) = 500 / 1000 = 0.5

P(Not banana) = 1 - 0.5 = 0.5

For a new fruit, compute the probability of each class.

              Sweet   Not sweet   Total
Banana        350     150         500
Orange        150     150         300
Other fruit   150     50          200
Total         650     350         1000
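
The counts in the fruit table are enough to work through Bayes' rule by hand. The sketch below uses only the sweet / not sweet attribute from the table; the dictionary layout and the function name are illustrative assumptions, not something taken from the slides. It computes the prior, the likelihood, and the resulting posterior for each class when a new fruit is known to be sweet.

```python
# Counts from the fruit table: (sweet, not sweet) per class.
counts = {
    "Banana": (350, 150),
    "Orange": (150, 150),
    "Other fruit": (150, 50),
}
total = sum(s + n for s, n in counts.values())  # 1000 fruits in total


def posterior_given_sweet(counts, total):
    """Return P(class | sweet) for every class via Bayes' rule."""
    scores = {}
    for fruit, (sweet, not_sweet) in counts.items():
        prior = (sweet + not_sweet) / total       # P(class)
        likelihood = sweet / (sweet + not_sweet)  # P(sweet | class)
        scores[fruit] = prior * likelihood        # unnormalized posterior
    evidence = sum(scores.values())               # P(sweet)
    return {fruit: score / evidence for fruit, score in scores.items()}


posteriors = posterior_given_sweet(counts, total)
for fruit, p in posteriors.items():
    print(f"P({fruit} | sweet) = {p:.3f}")
```

With a single attribute this is plain Bayes' rule; the naive Bayes classifier extends it by multiplying one such likelihood per attribute, treating the attributes as independent given the class.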

REAL-LIFE APPLICATION NAIVE BAYES

Medical Data Classification with Naive Bayes Approach
Introduction
Requirements for systems dealing with medical data
An empirical comparison
Tables
Conclusion

TABLE 2: COMPARATIVE ANALYSIS BASED ON PREDICTIVE ACCURACY

TABLE 3: COMPARATIVE ANALYSIS BASED ON AREA UNDER ROC CURVE (AUC)

REAL-LIFE APPLICATION K-NN

Used to help health care professionals in diagnosing heart disease.

Useful for pattern recognition and classification.

Euclidean distance: d(x, y) = sqrt((x_1 - y_1)^2 + ... + (x_p - y_p)^2)

Data are often normalized because the variables are measured on different scales.
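
To make the distance and normalization steps concrete, here is a minimal sketch of a k-NN classifier using only the Python standard library. The patient records, the two variables (age and resting blood pressure), and all function names are hypothetical and chosen purely for illustration; they do not come from the study the slide refers to.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    scaled = [tuple((v - m) / s for v, m, s in zip(r, mus, sigmas)) for r in rows]
    return scaled, mus, sigmas


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, resting blood pressure); label 1 = heart disease.
records = [(35, 118), (42, 125), (39, 120), (61, 150), (58, 160), (66, 155)]
labels = [0, 0, 0, 1, 1, 1]

scaled, mus, sigmas = zscore_columns(records)
new_patient = (60, 148)
# Scale the new record with the training means and standard deviations.
query = tuple((v - m) / s for v, m, s in zip(new_patient, mus, sigmas))
print(knn_predict(scaled, labels, query, k=3))
```

Without the z-score step, the variable with the largest numeric range would dominate the distance, which is why normalization comes first.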

CASE STUDY

“Our customer is a Dutch charity organization that wants to be able to classify its supporters into donators and non-donators. The non-donators are sent a single marketing mail a year, whereas the donators receive multiple ones (up to 4).”

Who are the donators? Who are the non-donators?

Application of K-NN & Naive Bayes to training and test datasets.
4000 customers.
Tools: SPSS, Excel, XLMiner.

CLEAN-UP

No missing values
1-dimensional outliers removed through sorting (regarding annual & average donation)
2-dimensional outliers removed through scatterplot

Variables kept:
Average donation
Frequency of response
Median time of response
Time as client

Variables removed:
Annual donation
Last donation
Time since last response

Normalization of scores into z-scores.
Nominal categorization of data.
Classification through percentiles of z-score & by manually processing values within the variables.
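
The preprocessing above can be sketched as follows. The slides do not state which percentiles were used as cut points, so the quartile split into four categories here is an assumption, as are the sample donation amounts and the function names.

```python
from statistics import mean, pstdev, quantiles


def to_zscores(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def categorize(zscores, n_bins=4):
    """Map each z-score to a category 1..n_bins using percentile cut points."""
    cuts = quantiles(zscores, n=n_bins)  # returns n_bins - 1 cut points
    # The category is 1 plus the number of cut points the value exceeds.
    return [1 + sum(z > c for c in cuts) for z in zscores]


# Hypothetical average donations in euros (not from the case data).
avg_donation = [5, 8, 10, 12, 15, 20, 25, 40]
z = to_zscores(avg_donation)
print(categorize(z))
```

Turning each variable into a small number of categories is what allows naive Bayes to work with simple conditional probability tables, one probability per category and class.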

ANALYSIS OF CASE STUDY – K-NN

Validation Data Scoring - Summary Report (for k = 13)

Error Report

Class     # Cases   # Errors   % Error
0         1083      180        16.62
1         536       260        48.51
Overall   1619      440        27.18

Classification Confusion Matrix

               Predicted Class
Actual Class   0      1
0              903    180
1              260    276

CHOOSING MODEL FOR K-NN

Accuracy: proportion of correctly classified instances.
Error rate: 1 - Accuracy.
Sensitivity: proportion of actual positives that the classifier correctly identifies as positive.
Specificity: proportion of actual negatives that the classifier correctly identifies as negative.
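
As a worked example of these four measures, the sketch below applies them to the k-NN validation confusion matrix for k = 13 (903, 180, 260, 276). Treating class 1 as the positive class is an assumption, and the function name is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, error rate, sensitivity and specificity from counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
    }


# k-NN validation confusion matrix (k = 13); ASSUMPTION: class 1 is "positive".
metrics = classification_metrics(tn=903, fp=180, fn=260, tp=276)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The error rate computed this way matches the overall figure in the error report (440 errors out of 1619 cases).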

                                                             M1        M2
Selecting everyone in validation data                        €711.20   €662.80
Selecting while correcting for sensitivity and specificity   €583.60   €530.80

APPLICATION OF MODEL ON TEST DATA

Classification Confusion Matrix

               Predicted Class
Actual Class   0       1
0              2300    344
1              654     750

Error Report

Class     # Cases   # Errors   % Error
0         2644      344        13.01
1         1404      654        46.58
Overall   4048      998        24.65

ANALYSIS OF THE CASE STUDY – NAIVE BAYES

Classification Confusion Matrix

               Predicted Class
Actual Class   0      1
0              856    229
1              225    309

Error Report

Class     # Cases   # Errors   % Error
0         1085      229        21.11
1         534       225        42.13
Overall   1619      454        28.04

M1 = Cfrqres & Cavgdon
M2 = Cfrqres, Cavgdon & Cmedtor

Conditional probabilities by class

                 Class 0            Class 1
Input variable   Value   Prob       Value   Prob
CFRQRES          1       0.71977    1       0.297483
                 2       0.171465   2       0.255149
                 3       0.06398    3       0.19222
                 4       0.044786   4       0.255149
CAVGDON          1       0.632758   1       0.297483
                 2       0.272553   2       0.471396
                 3       0.076136   3       0.16476
                 4       0.018554   4       0.066362
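
The conditional probabilities above are what naive Bayes multiplies together to score a record under model M1. The sketch below shows that calculation. The slides do not give the class priors, so the priors here are an assumption taken from the validation class counts (1085 and 534 of 1619); the example record and function name are also illustrative.

```python
# Conditional probabilities P(value | class) as read from the table above.
cond_prob = {
    "CFRQRES": {
        0: {1: 0.71977, 2: 0.171465, 3: 0.06398, 4: 0.044786},
        1: {1: 0.297483, 2: 0.255149, 3: 0.19222, 4: 0.255149},
    },
    "CAVGDON": {
        0: {1: 0.632758, 2: 0.272553, 3: 0.076136, 4: 0.018554},
        1: {1: 0.297483, 2: 0.471396, 3: 0.16476, 4: 0.066362},
    },
}

# ASSUMPTION: priors estimated from the validation class counts, not from the slides.
priors = {0: 1085 / 1619, 1: 534 / 1619}


def posterior(record, cond_prob, priors):
    """Return P(class | record), assuming independent inputs given the class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for var, value in record.items():
            score *= cond_prob[var][cls][value]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Hypothetical supporter: highest response-frequency category, second donation category.
example = {"CFRQRES": 4, "CAVGDON": 2}
post = posterior(example, cond_prob, priors)
predicted = max(post, key=post.get)
print(post, predicted)
```

A record is assigned to the class with the larger posterior, or equivalently to class 1 whenever its posterior exceeds a chosen cutoff probability.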

                     Model 1    Model 2
Selecting everyone   €1072      €1006
Selecting by class   €2460.82   €2378.01

APPLICATION OF MODEL ON TEST DATA

Conditional probabilities by class

                 Class 0            Class 1
Input variable   Value   Prob       Value   Prob
CFRQRES          1       0.714502   1       0.306108
                 2       0.173338   2       0.254261
                 3       0.066465   3       0.18608
                 4       0.045695   4       0.253551
CAVGDON          1       0.630287   1       0.3125
                 2       0.280589   2       0.461648
                 3       0.068353   3       0.168324
                 4       0.02077    4       0.057528

Classification Confusion Matrix

               Predicted Class
Actual Class   0       1
0              2096    548
1              570     834

Error Report

Class     # Cases   # Errors   % Error
0         2644      548        20.73
1         1404      570        40.60
Overall   4048      1118       27.62

LOOKING AT BOTH MODELS
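
One way to put the two models side by side is to recompute the same measures from their test-set confusion matrices. The sketch below does this with the counts reported earlier; treating class 1 as the positive class and the function name are assumptions.

```python
# Test-set confusion matrices from the slides as (tn, fp, fn, tp).
# ASSUMPTION: class 1 is treated as the positive class.
models = {
    "k-NN": (2300, 344, 654, 750),
    "Naive Bayes": (2096, 548, 570, 834),
}


def summarize(tn, fp, fn, tp):
    """Return overall error, sensitivity and specificity for one model."""
    total = tn + fp + fn + tp
    return {
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


summary = {name: summarize(*cm) for name, cm in models.items()}
print(f"{'Model':<12}{'Error':>8}{'Sens.':>8}{'Spec.':>8}")
for name, s in summary.items():
    print(f"{name:<12}{s['error']:>8.3f}{s['sensitivity']:>8.3f}{s['specificity']:>8.3f}")
```

On these counts k-NN has the lower overall error (about 24.7% against 27.6%), while naive Bayes finds a larger share of the class 1 cases (sensitivity of about 59.4% against 53.4%).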

QUESTIONS?
