Chapter 7 – K-Nearest-Neighbor
© Galit Shmueli and Peter Bruce 2010
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
Characteristics
Data-driven, not model-driven
Makes no assumptions about the data
Basic Idea
For a given record to be classified, identify nearby records
“Near” means records with similar predictor values X1, X2, … Xp
Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)
How to measure “nearby”?
The most popular distance measure is Euclidean distance
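As a minimal sketch (not part of the original slides), Euclidean distance between two records over their predictor values; in practice predictors are usually standardized first so variables on large scales do not dominate the distance:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two records, given as equal-length lists of predictor values."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Distance between two hypothetical (Income, Lot_Size) records
euclidean([60.0, 18.4], [61.5, 20.8])  # sqrt(1.5^2 + 2.4^2) ≈ 2.83
```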
Choosing k
K is the number of nearby neighbors to be used to classify the new record
K=1 means use the single nearest record
K=5 means use the 5 nearest records
Typically choose the value of k with the lowest error rate in the validation data
Low k vs. High k
Low values of k (1, 3, …) capture local structure in data (but also noise)
High values of k provide more smoothing, less noise, but may miss local structure
Note: the extreme case of k = n (i.e., the entire data set) is the same as the “naïve rule” (classify all records according to majority class)
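The k = n case can be seen directly: when every record votes, the prediction is the overall majority class no matter where the new record lies. A toy sketch (data made up for illustration):

```python
from collections import Counter
import math

def knn_classify(x, train_X, train_y, k):
    """Majority vote among the k training records nearest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

X = [[1.0], [2.0], [3.0], [10.0], [11.0]]
y = ["a", "a", "a", "b", "b"]
knn_classify([10.5], X, y, k=1)       # "b": local structure near the record
knn_classify([10.5], X, y, k=len(X))  # "a": overall majority class, the naive rule
```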
Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers
Predictors: Income, Lot Size
Income   Lot_Size  Ownership
60.0     18.4      owner
85.5     16.8      owner
64.8     21.6      owner
61.5     20.8      owner
87.0     23.6      owner
110.1    19.2      owner
108.0    17.6      owner
82.8     22.4      owner
69.0     20.0      owner
93.0     20.8      owner
51.0     22.0      owner
81.0     20.0      owner
75.0     19.6      non-owner
52.8     20.8      non-owner
64.8     17.2      non-owner
43.2     20.4      non-owner
84.0     17.6      non-owner
49.2     17.6      non-owner
59.4     16.0      non-owner
66.0     18.4      non-owner
47.4     16.4      non-owner
33.0     18.8      non-owner
51.0     14.0      non-owner
63.0     14.8      non-owner
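The table can be scored directly. A minimal sketch classifying a hypothetical new household (Income 60, Lot Size 20) with k = 3, using raw (unstandardized) predictor values:

```python
from collections import Counter
import math

mowers = [  # (Income, Lot_Size, Ownership) rows from the table above
    (60.0, 18.4, "owner"), (85.5, 16.8, "owner"), (64.8, 21.6, "owner"),
    (61.5, 20.8, "owner"), (87.0, 23.6, "owner"), (110.1, 19.2, "owner"),
    (108.0, 17.6, "owner"), (82.8, 22.4, "owner"), (69.0, 20.0, "owner"),
    (93.0, 20.8, "owner"), (51.0, 22.0, "owner"), (81.0, 20.0, "owner"),
    (75.0, 19.6, "non-owner"), (52.8, 20.8, "non-owner"), (64.8, 17.2, "non-owner"),
    (43.2, 20.4, "non-owner"), (84.0, 17.6, "non-owner"), (49.2, 17.6, "non-owner"),
    (59.4, 16.0, "non-owner"), (66.0, 18.4, "non-owner"), (47.4, 16.4, "non-owner"),
    (33.0, 18.8, "non-owner"), (51.0, 14.0, "non-owner"), (63.0, 14.8, "non-owner"),
]

new = (60.0, 20.0)  # hypothetical household to classify
nearest = sorted(mowers, key=lambda r: math.dist(new, r[:2]))[:3]
predicted = Counter(r[2] for r in nearest).most_common(1)[0][0]
print(predicted)  # owner: 2 of the 3 nearest households are owners
```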
XLMiner Output
For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records).
The record is scored for k=1, k=2, … k=18.
Best k appears to be k=8.
k = 9, k = 10, and k = 14 also share a low error rate, but it is best to choose the lowest k.
Value of k   % Error Training   % Error Validation
1 0.00 33.33
2 16.67 33.33
3 11.11 33.33
4 22.22 33.33
5 11.11 33.33
6 27.78 33.33
7 22.22 33.33
8 22.22 16.67 <--- Best k
9 22.22 16.67
10 22.22 16.67
11 16.67 33.33
12 16.67 16.67
13 11.11 33.33
14 11.11 16.67
15 5.56 33.33
16 16.67 33.33
17 11.11 33.33
18 50.00 50.00
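The search behind a table like this can be sketched in code. A hypothetical outline (the book's actual 18/6 partition is not given here, so the numbers above will not reproduce exactly):

```python
from collections import Counter
import math

def knn_classify(x, train_X, train_y, k):
    """Majority vote among the k nearest training records (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def error_rate(X, y, train_X, train_y, k):
    """Fraction of records in (X, y) misclassified by k-NN on the training data."""
    wrong = sum(knn_classify(x, train_X, train_y, k) != yi for x, yi in zip(X, y))
    return wrong / len(y)

def best_k(train_X, train_y, valid_X, valid_y):
    """Return the k with the lowest validation error; ties go to the lowest k."""
    return min(range(1, len(train_X) + 1),
               key=lambda k: (error_rate(valid_X, valid_y, train_X, train_y, k), k))
```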
Using K-NN for Prediction (for Numerical Outcome)
Instead of “majority vote determines class” use average of response values
May be a weighted average, weight decreasing with distance
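A sketch of the prediction variant with inverse-distance weights; the exact weighting scheme is a design choice, not specified in the slides:

```python
import math

def knn_predict(x, train_X, train_y, k, eps=1e-9):
    """Predict a numerical outcome as the inverse-distance-weighted
    average of the k nearest responses (eps avoids division by zero)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))[:k]
    weights = [1.0 / (math.dist(x, train_X[i]) + eps) for i in order]
    return sum(w * train_y[i] for w, i in zip(weights, order)) / sum(weights)

X = [[1.0], [2.0], [4.0]]
y = [10.0, 20.0, 40.0]
knn_predict([1.5], X, y, k=2)  # 15.0: equidistant from the first two records
```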
Advantages
Simple
No assumptions required about Normal distribution, etc.
Effective at capturing complex interactions among variables without having to define a statistical model
Shortcomings
Required size of training set increases exponentially with # of predictors, p
This is because expected distance to nearest neighbor increases with p (with a large vector of predictors, all records end up “far away” from each other)
In a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)
These constitute the “curse of dimensionality”
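The growth of nearest-neighbor distance with p can be seen in a small simulation (illustrative only): with records uniform on the unit hypercube, the distance from a query point to the nearest of n random records rises sharply as dimensions are added.

```python
import math
import random

random.seed(0)

def nearest_distance(p, n=100):
    """Distance from the center of [0,1]^p to the nearest of n uniform random records."""
    query = [0.5] * p
    return min(math.dist(query, [random.random() for _ in range(p)]) for _ in range(n))

for p in (1, 2, 10, 100):
    print(p, round(nearest_distance(p), 3))
# the nearest neighbor gets farther away as p grows: everything ends up "far away"
```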
Dealing with the Curse
Reduce dimension of predictors (e.g., with PCA)
Computational shortcuts that settle for “almost nearest neighbors”
Summary
Find distance between record-to-be-classified and all other records
Select the k nearest records
Classify it according to majority vote of the nearest neighbors
Or, for prediction, take the average of the nearest neighbors
“Curse of dimensionality” – need to limit # of predictors
Chapter 8 – Naïve Bayes
Characteristics
Data-driven, not model-driven
Makes no assumptions about the data
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., same values for the predictors)
What is the prevalent class among those records?
Assign that class to your new record
Usage
Requires categorical variables
Numerical variables must be binned and converted to categorical
Can be used with very large data sets
Example: Spell check programs assign your misspelled word to an established “class” (i.e., correctly spelled word)
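The binning step can be sketched with the stdlib bisect module; the cutoffs and labels below are hypothetical, chosen only for illustration:

```python
import bisect

cutoffs = [50, 80]                    # hypothetical income cutoffs (in $000s)
labels = ["low", "medium", "high"]    # bins: < 50, 50-80, >= 80

def bin_value(x):
    """Map a numerical value to its categorical bin."""
    return labels[bisect.bisect_right(cutoffs, x)]

[bin_value(v) for v in (33.0, 60.0, 110.1)]  # ['low', 'medium', 'high']
```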
Exact Bayes Classifier
Relies on finding other records that share same predictor values as record-to-be-classified.
Want to find “probability of belonging to class C, given specified values of predictors.”
Even with large data sets, may be hard to find other records that exactly match your record, in terms of predictor values.
Solution – Naïve Bayes
Assume independence of predictor variables (within each class)
Use multiplication rule
Find the same probability that a record belongs to class C, given predictor values, without limiting the calculation to records that share all those same values
Calculations
1. Take a record, and note its predictor values
2. Find the probabilities those predictor values occur across all records in C1
3. Multiply them together, then by the proportion of records belonging to C1
4. Same for C2, C3, etc.
5. Prob. of belonging to C1 is the value from step (3) divided by the sum of all such values for C1 … Cn
6. Establish & adjust a “cutoff” prob. for the class of interest
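Steps 1 through 5 can be sketched in a few lines of pure Python (a minimal version; the function and variable names are mine, not the book's):

```python
from collections import Counter

def naive_bayes_probs(record, train_X, train_y):
    """Estimated probability of each class for `record`,
    a dict mapping predictor name -> category (steps 1-5)."""
    n = len(train_y)
    scores = {}
    for c, nc in Counter(train_y).items():
        rows = [x for x, yc in zip(train_X, train_y) if yc == c]
        score = nc / n  # proportion of records in class c (step 3)
        for pred, val in record.items():
            # proportion of class-c records sharing this predictor value (step 2)
            score *= sum(r[pred] == val for r in rows) / nc
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # step 5: normalize
```

Step 6 is then a cutoff applied to the resulting probability for the class of interest.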
Example: Financial Fraud
Target variable: Audit finds fraud, no fraud
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)
Charges?  Size   Outcome
y         small  truthful
n         small  truthful
n         large  truthful
n         large  truthful
n         small  truthful
n         small  truthful
y         small  fraud
y         large  fraud
n         large  fraud
y         large  fraud
Exact Bayes Calculations
Goal: classify (as “fraudulent” or as “truthful”) a small firm with charges filed
There are 2 firms like that, one fraudulent and the other truthful
P(fraud | charges=y, size=small) = ½ = 0.50
Note: calculation is limited to the two firms matching those characteristics
Naïve Bayes Calculations
Same goal as before
Compute 2 quantities:
Proportion of “charges = y” among frauds, times proportion of “small” among frauds, times proportion of frauds = 3/4 * 1/4 * 4/10 = 0.075
Proportion of “charges = y” among truthfuls, times proportion of “small” among truthfuls, times proportion of truthfuls = 1/6 * 4/6 * 6/10 = 0.067
P(fraud | charges = y, small) = 0.075/(0.075 + 0.067) = 0.53
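Both calculations can be checked mechanically against the 10-record table (a sketch reproducing the slide's arithmetic, including the exact-Bayes 0.50):

```python
data = [  # (charges, size, outcome) rows from the example table
    ("y", "small", "truthful"), ("n", "small", "truthful"), ("n", "large", "truthful"),
    ("n", "large", "truthful"), ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"), ("y", "large", "fraud"), ("n", "large", "fraud"),
    ("y", "large", "fraud"),
]

# Exact Bayes: restrict to records matching charges=y, size=small
matches = [r for r in data if r[0] == "y" and r[1] == "small"]
p_exact = sum(r[2] == "fraud" for r in matches) / len(matches)   # 0.50

# Naive Bayes: per-class product of marginal proportions and the class proportion
def score(cls):
    rows = [r for r in data if r[2] == cls]
    return (len(rows) / len(data)
            * sum(r[0] == "y" for r in rows) / len(rows)
            * sum(r[1] == "small" for r in rows) / len(rows))

p_naive = score("fraud") / (score("fraud") + score("truthful"))  # ≈ 0.53
```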
Naïve Bayes, cont.
Note that the probability estimate does not differ greatly from the exact one
All records are used in calculations, not just those matching predictor values
This makes calculations practical in most circumstances
Relies on assumption of independence between predictor variables within each class
Independence Assumption
Not strictly justified (variables often correlated with one another)
Often “good enough”
Advantages
Handles purely categorical data well
Works well with very large data sets
Simple & computationally efficient
Shortcomings
Requires a large number of records
Problematic when a predictor category is not present in the training data
In that case, it assigns 0 probability to the response, ignoring information in the other variables
On the other hand…
Probability rankings are more accurate than the actual probability estimates
Good for applications using lift (e.g. response to mailing), less so for applications requiring probabilities (e.g. credit scoring)
Summary
No statistical models involved
Naïve Bayes (like KNN) pays attention to complex interactions and local structure
Computational challenges remain