7. Simple Classification
K723 Data Mining – Transcript
Classification methods
Three Simple Classification Methods
Methods & Characteristics
The three methods:
- Naïve rule
- Naïve Bayes
- K-nearest-neighbor
Common characteristics:
- Data-driven, not model-driven
- Make no assumptions about the data
Naïve Rule
- Classify all records as the majority class
- Not a “real” method
- Introduced to serve as a benchmark against which to measure other results
[Figure: firms plotted by Charge (Y/N) and Size (S/L); 40% fraud, 60% truthful. Classifying everything as the majority class “truthful” gives an error rate of 40%.]
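As a minimal sketch (in Python, which the slides themselves do not use), the naïve-rule benchmark on the 10-firm fraud dataset from this example:

```python
# Minimal sketch of the naive rule: always predict the majority class.
# The 10-record fraud dataset is from the slides; variable names are mine.
from collections import Counter

outcomes = ["truthful"] * 6 + ["fraud"] * 4  # 60% truthful, 40% fraud

majority_class, _ = Counter(outcomes).most_common(1)[0]
predictions = [majority_class] * len(outcomes)

error_rate = sum(p != y for p, y in zip(predictions, outcomes)) / len(outcomes)
print(majority_class, error_rate)  # truthful 0.4
```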
Naïve Bayes
Idea of Naïve Bayes: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)
[Figure: firms plotted by Charge (Y/N) and Size (S/L). Classify based on the majority in each cell (conditional probability). Error rate: 20%.]
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., same values for the predictors)
What is the prevalent class among those records?
Assign that class to your new record
Usage
- Requires categorical variables
- Numerical variables must be binned and converted to categorical (a sketch follows below)
- Can be used with very large data sets
- Example: spell check – the computer attempts to assign your misspelled word to an established “class” (i.e., a correctly spelled word)
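A minimal sketch of the binning step, assuming pandas (my choice here; the slides use XLMiner), with example income values borrowed from the riding-mower data later in the deck:

```python
# Bin a numerical predictor into categories for naive Bayes.
# Bin count and labels are illustrative choices, not from the slides.
import pandas as pd

income = pd.Series([33.0, 51.0, 64.8, 87.0, 110.1])  # income in $000s
income_binned = pd.cut(income, bins=3, labels=["low", "medium", "high"])
print(income_binned.tolist())  # each value mapped to a categorical bin
```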
Exact Bayes Classifier
- Relies on finding other records that share the same predictor values as the record to be classified
- Want to find the “probability of belonging to class C, given specified values of the predictors”
- Conditional probability: P(Y = C | X = (x1, …, xp))
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)
[Figure: firms plotted by Charge (Y/N) and Size (S/L). Classify based on the majority in each cell. Error rate: 20%.]
Exact Bayes Calculations

Charges?  Size   Outcome
y         small  truthful
n         small  truthful
n         large  truthful
n         large  truthful
n         small  truthful
n         small  truthful
y         small  fraud
y         large  fraud
n         large  fraud
y         large  fraud

Counts (truthful, fraud):
              Small    Large
Charges Yes   (1, 1)   (0, 2)
Charges No    (3, 0)   (2, 1)

P(F | C, S):
     Small   Large
Y    0.5     1
N    0       0.33

Rule:
     Small      Large
Y    ?          Fraud
N    Truthful   Truthful
Exact Bayes Calculations
- Goal: classify (as “fraudulent” or as “truthful”) a small firm with charges filed
- There are 2 firms like that: one fraudulent and the other truthful
- P(fraud | charges = y, size = small) = 1/2 = 0.50
- Note: the calculation is limited to the two firms matching those characteristics
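A minimal sketch of this exact Bayes calculation on the 10-firm dataset from the slides (Python; data layout and names are mine):

```python
# Exact Bayes: condition only on records that exactly match the new firm.
records = [  # (charges, size, outcome), copied from the slides
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),
    ("n", "large", "fraud"),    ("y", "large", "fraud"),
]

# Keep only the records sharing the new firm's predictor values.
matches = [r for r in records if r[0] == "y" and r[1] == "small"]
p_fraud = sum(r[2] == "fraud" for r in matches) / len(matches)
print(p_fraud)  # 0.5, from the two matching firms
```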
Problem
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.
Solution – Naïve Bayes
- Assume independence of the predictor variables (within each class)
- Use the multiplication rule
- Find the same probability that a record belongs to class C, given its predictor values, without limiting the calculation to records that share all those same values
Refining the “primitive” idea: Naïve Bayes
Main idea: instead of looking at combinations of predictors (a crossed pivot table), look at each predictor separately.
How can this be done? A probability trick!
- Based on Bayes’ rule
- Then make a simplifying assumption
- And get a powerful classifier!
Conditional Probability
A = the event “X = A”; B = the event “Y = B”.
P(A | B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred):

P(A | B) = P(A ∩ B) / P(B),   if P(B) > 0

[Venn diagram: events A and B overlapping in A ∩ B]

P(Fraud | Charge) = P(Charge and Fraud) / P(Charge)
Bayes’ Rule (reverse conditioning)
What if I only know the opposite direction? Bayes’ rule gives a neat way to reverse the conditioning!

P(B | A) = P(A | B) P(B) / P(A)

since P(A ∩ B) = P(B | A) P(A) = P(A | B) P(B)

[Venn diagram: events A and B overlapping in A ∩ B]

P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud)
P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
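A quick numeric check (my own, not on the slides) of Bayes’ rule against the 10-firm dataset:

```python
# Reverse P(Charge | Fraud) into P(Fraud | Charge) via Bayes' rule.
p_charge_given_fraud = 3 / 4   # 3 of the 4 fraudulent firms have charges
p_fraud = 4 / 10               # 4 of the 10 firms are fraudulent
p_charge = 4 / 10              # 4 of the 10 firms have charges

p_fraud_given_charge = p_charge_given_fraud * p_fraud / p_charge
print(p_fraud_given_charge)    # 0.75, matching the direct count: 3 of 4 charged firms are fraudulent
```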
Using Bayes’ Rule
Flipping the condition:

P(Y=1 | X1, …, Xp) = P(X1, …, Xp | Y=1) P(Y=1) / P(X1, …, Xp)

where

P(X1, …, Xp) = P(X1, …, Xp | Y=1) P(Y=1) + P(X1, …, Xp | Y=0) P(Y=0)
How is this used to solve our problem?
- We want to estimate P(Y=1 | X1, …, Xp)
- But we don’t have enough examples of each possible profile X1, …, Xp in the training set
- If we had instead P(X1, …, Xp | Y=1), we could separate it into P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1)
- True if we can assume independence between X1, …, Xp within each class
- That means we could use single pivot tables!
- If the dependence is not extreme, it will work reasonably well
Independence Assumption
With the independence assumption (within each class), P(A ∩ B) = P(A) · P(B). We can thus calculate:

P(X1, …, Xp | Y=1) = P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1)
P(X1, …, Xp | Y=0) = P(X1|Y=0) · P(X2|Y=0) ··· P(Xp|Y=0)
P(X1, …, Xp) = P(X1, …, Xp | Y=1) P(Y=1) + P(X1, …, Xp | Y=0) P(Y=0)

[Venn diagram: events A and B overlapping in A ∩ B]
Putting it all together: How it works (a from-scratch sketch follows below)
1. All predictors must be categorical.
2. From the training set, create a pivot table of Y on each separate X. We can thus obtain P(X), P(X|Y=1), P(X|Y=0).
3. For a to-be-predicted observation with predictors X1, X2, …, Xp, the software computes the probability of belonging to Y=1 using the formula

   P(Y=1 | X1, …, Xp) = P(X1|Y=1) P(X2|Y=1) ··· P(Xp|Y=1) P(Y=1) / P(X1, …, Xp)

   Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1’s in the training set.
4. Use the cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
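A from-scratch sketch of these steps applied to the 10-firm fraud dataset from the slides (Python; the function and variable names are mine):

```python
# Naive Bayes by hand: per-predictor conditional probabilities times the prior.
records = [  # (charges, size, outcome), copied from the slides
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),
    ("n", "large", "fraud"),    ("y", "large", "fraud"),
]

def naive_bayes_score(charges, size):
    """Return the estimated P(fraud | charges, size)."""
    joint = {}
    for cls in ("fraud", "truthful"):
        in_class = [r for r in records if r[2] == cls]
        prior = len(in_class) / len(records)                                 # P(Y)
        p_charges = sum(r[0] == charges for r in in_class) / len(in_class)   # P(X1|Y)
        p_size = sum(r[1] == size for r in in_class) / len(in_class)         # P(X2|Y)
        joint[cls] = p_charges * p_size * prior
    return joint["fraud"] / (joint["fraud"] + joint["truthful"])

print(round(naive_bayes_score("y", "small"), 3))  # ~0.529 (slides show 0.528 from rounded intermediates)
```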
Naïve Bayes, cont.
- Note that the probability estimate does not differ greatly from the exact one
- All records are used in the calculations, not just those matching the predictor values
- This makes the calculations practical in most circumstances
- Relies on the assumption of independence between the predictor variables within each class
Independence Assumption
Not strictly justified (variables often correlated with one another)
Often “good enough”
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)
[Figure: firms plotted by Charge (Y/N) and Size (S/L). Classify based on the estimated conditional probability.]
[Figure: three Charge (Y/N) × Size (S/L) panels showing the single-predictor counts used below. Truthful firms: 1 with charges, 5 without; 4 small, 2 large. Fraudulent firms: 3 with charges, 1 without; 1 small, 3 large.]

P(S,Y|T) P(T) = P(S|T) · P(Y|T) · P(T) = (4/6) · (1/6) · (6/10) = 0.067
P(S,Y|F) P(F) = P(S|F) · P(Y|F) · P(F) = (1/4) · (3/4) · (4/10) = 0.075

P(F|S,Y) = P(S,Y|F) P(F) / P(S,Y)
         = P(S,Y|F) P(F) / (P(S,Y|F) P(F) + P(S,Y|T) P(T))
         = 0.075 / (0.075 + 0.067) = 0.528
Naïve Bayes Calculations

Charges?  Size   Outcome
y         small  truthful
n         small  truthful
n         large  truthful
n         large  truthful
n         small  truthful
n         small  truthful
y         small  fraud
y         large  fraud
n         large  fraud
y         large  fraud

Counts (truthful, fraud):
       Small    Large    Sum
Y      (1, 1)   (0, 2)   (1, 3)
N      (3, 0)   (2, 1)   (5, 1)
Sum    (4, 1)   (2, 3)   (6, 4)

P(C,S|F) P(F):
         Small   Large   P(C|F)
Y        0.075   0.225   0.75
N        0.025   0.075   0.25
P(S|F)   0.25    0.75    P(F) = 0.40

e.g., 0.25 · 0.75 · 0.40 = 0.075

P(C,S|T) P(T):
         Small   Large   P(C|T)
Y        0.067   0.034   0.17
N        0.334   0.164   0.83
P(S|T)   0.67    0.33    P(T) = 0.60

P(F|C,S) = P(C,S|F) P(F) / P(C,S) = P(C|F) P(S|F) P(F) / P(C,S),
where P(C,S) = P(C,S|F) P(F) + P(C,S|T) P(T)

Naïve Bayes estimate of P(F|C,S):
     Small   Large
Y    0.528   0.869
N    0.070   0.316

e.g., 0.075 / (0.075 + 0.067) = 0.528

Exact Bayes P(F|C,S), for comparison:
     Small   Large
Y    0.5     1
N    0       0.33
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)
[Figure: firms plotted by Charge (Y/N) and Size (S/L), annotated with the estimated conditional probabilities.]

Estimated conditional probability P(F|C,S):
     Small   Large
Y    0.528   0.869
N    0.070   0.316
Advantages and Disadvantages
The good:
- Simple
- Can handle a large number of predictors
- High performance accuracy when the goal is ranking
- Pretty robust to the independence assumption!
The bad:
- Need to categorize continuous predictors
- Predictors with “rare” categories -> zero probability (if such a category is important, this is a problem)
- Gives biased probabilities of class membership
- No insight into the importance/role of each predictor
Naïve Bayes in XLMiner: Classification > Naïve Bayes
(Sheet: NNB-Output1)

Prior class probabilities (according to relative occurrences in training data):
Class   Prob.
1       0.095333333   <-- success class
0       0.904666667

Conditional probabilities:
                 Class 1               Class 0
Input Variable   Value   Prob          Value   Prob
Online           0       0.374125874   0       0.401621223
                 1       0.625874126   1       0.598378777
CreditCard       0       0.699300699   0       0.711864407
                 1       0.300699301   1       0.288135593

P(accept=1) = 0.095
P(CC=1 | accept=1) = 0.301
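XLMiner is Excel-based; a rough Python analogue of the same fit, assuming scikit-learn and synthetic 0/1 data standing in for the UniversalBank Online/CreditCard columns (my sketch, not the original workflow):

```python
# Bernoulli naive Bayes on two 0/1 predictors, analogous to the XLMiner run.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))   # columns stand in for Online, CreditCard
y = rng.integers(0, 2, size=1000)        # 1 = accepted the loan offer (hypothetical)

model = BernoulliNB()                    # Bernoulli NB suits binary predictors
model.fit(X, y)

print(np.exp(model.class_log_prior_))          # prior class probabilities
print(model.predict_proba([[1, 0]])[:, 1])     # P(accept=1 | Online=1, CreditCard=0)
```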
Naïve Bayes in XLMiner: scoring the validation data
(Sheet: NNB-ValidScore1; data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018)

XLMiner: Naive Bayes - Classification of Validation Data
Cutoff prob. value for success (updatable): 0.5 (updating the value here will NOT update the value in the summary report)

Row Id.  Predicted Class  Actual Class  Prob. for 1 (success)  Online  CreditCard
2        0                0             0.08795125             0       0
3        0                0             0.08795125             0       0
7        0                0             0.097697987            1       0
8        0                0             0.092925663            0       1
11       0                0             0.08795125             0       0
13       0                0             0.08795125             0       0
14       0                0             0.097697987            1       0
15       0                0             0.08795125             0       0
16       0                0             0.10316131             1       1
K-Nearest Neighbors
Basic Idea
- For a given record to be classified, identify nearby records
- “Near” means records with similar predictor values X1, X2, …, Xp
- Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)
How to Measure “nearby”?
The most popular distance measure is Euclidean distance
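A minimal sketch of Euclidean distance between two records (my example values, in the spirit of the riding-mower data later in the deck; in practice predictors are often standardized first so that a large-scale variable such as income does not dominate):

```python
# Euclidean distance between two records with numeric predictors X1..Xp.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((60.0, 18.4), (85.5, 16.8)))  # ~25.55
```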
Choosing k
- k is the number of nearby neighbors used to classify the new record
- k = 1 means use the single nearest record
- k = 5 means use the 5 nearest records
- Typically choose the value of k that has the lowest error rate on the validation data (see the sketch below)
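A sketch of choosing k by validation error, assuming scikit-learn (my choice; the slides use XLMiner). The X_train/y_train/X_valid/y_valid arrays are assumed prepared elsewhere, e.g. from the riding-mower data:

```python
# Try k = 1..max_k and pick the k with the lowest validation error rate.
from sklearn.neighbors import KNeighborsClassifier

def best_k(X_train, y_train, X_valid, y_valid, max_k):
    errors = {}
    for k in range(1, max_k + 1):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        errors[k] = 1 - model.score(X_valid, y_valid)  # validation error rate
    return min(errors, key=errors.get), errors

# k, errors = best_k(X_train, y_train, X_valid, y_valid, max_k=18)
```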
[Figure: scatter plot of two classes on predictors X1 and X2; a new record is classified by the majority vote of its k = 3 nearest neighbors.]
Low k vs. High k
- Low values of k (1, 3, …) capture local structure in the data (but also noise)
- High values of k provide more smoothing and less noise, but may miss local structure
- Note: the extreme case of k = n (i.e., the entire data set) is the same thing as the “naïve rule” (classify all records according to the majority class)
Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers
Predictors = Income, Lot Size
Income  Lot_Size  Ownership
60.0    18.4      owner
85.5    16.8      owner
64.8    21.6      owner
61.5    20.8      owner
87.0    23.6      owner
110.1   19.2      owner
108.0   17.6      owner
82.8    22.4      owner
69.0    20.0      owner
93.0    20.8      owner
51.0    22.0      owner
81.0    20.0      owner
75.0    19.6      non-owner
52.8    20.8      non-owner
64.8    17.2      non-owner
43.2    20.4      non-owner
84.0    17.6      non-owner
49.2    17.6      non-owner
59.4    16.0      non-owner
66.0    18.4      non-owner
47.4    16.4      non-owner
33.0    18.8      non-owner
51.0    14.0      non-owner
63.0    14.8      non-owner
XLMiner Output
- For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records).
- Each record is scored for k = 1, 2, …, 18.
- The best k seems to be k = 8.
- k = 9, k = 10, and k = 14 also share the low error rate, but it is best to choose the lowest such k.
Value of k   % Error Training   % Error Validation
1 0.00 33.33
2 16.67 33.33
3 11.11 33.33
4 22.22 33.33
5 11.11 33.33
6 27.78 33.33
7 22.22 33.33
8 22.22 16.67 <--- Best k
9 22.22 16.67
10 22.22 16.67
11 16.67 33.33
12 16.67 16.67
13 11.11 33.33
14 11.11 16.67
15 5.56 33.33
16 16.67 33.33
17 11.11 33.33
18 50.00 50.00
Using k-NN for Prediction (Numerical Outcome)
- Instead of “majority vote determines class,” use the average of the neighbors’ response values
- May be a weighted average, with weight decreasing with distance (a sketch follows below)
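A minimal sketch assuming scikit-learn, whose KNeighborsRegressor with distance weighting implements exactly this distance-weighted average (the training data and response values here are hypothetical):

```python
# k-NN prediction for a numerical outcome: average of nearest responses,
# weighted inversely by distance via weights="distance".
from sklearn.neighbors import KNeighborsRegressor

X_train = [[60.0, 18.4], [85.5, 16.8], [64.8, 21.6], [51.0, 14.0], [63.0, 14.8]]
y_train = [200.0, 310.0, 240.0, 150.0, 180.0]   # hypothetical numeric response

model = KNeighborsRegressor(n_neighbors=3, weights="distance")
model.fit(X_train, y_train)
print(model.predict([[62.0, 18.0]]))  # weighted average of the 3 nearest responses
```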
Advantages
- Simple
- No assumptions required about normal distributions, etc.
- Effective at capturing complex interactions among variables without having to define a statistical model
Shortcomings
- The required size of the training set increases exponentially with the number of predictors, p
- This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up “far away” from each other)
- In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s)
- These constitute the “curse of dimensionality”
Dealing with the Curse
- Reduce the dimension of the predictors (e.g., with PCA), as sketched below
- Computational shortcuts that settle for “almost nearest neighbors”
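One way this might look in scikit-learn (my combination; the slides only name PCA as an option, and the scaling step is a common companion choice, not from the slides):

```python
# Reduce predictor dimension with PCA before k-NN; a pipeline keeps the
# scaler and PCA fitted on the training data only.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(
    StandardScaler(),                    # put predictors on a common scale
    PCA(n_components=2),                 # project p predictors down to 2 components
    KNeighborsClassifier(n_neighbors=8),
)
# model.fit(X_train, y_train); model.predict(X_valid)
```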
Summary
- Naïve rule: a benchmark
- Naïve Bayes and k-NN are two variations on the same theme: “classify a new record according to the class of similar records”
- No statistical models involved
- These methods pay attention to complex interactions and local structure
- Computational challenges remain