k Nearest Neighbor
Dessy Amirudin
May 2016
Data Science Indonesia Bootcamp
Introduction
Other Names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning
History of kNN
• Has been used in statistical estimation and pattern recognition since the beginning of the 1970s, as a non-parametric technique.
• The prediction for a new observation is based on its k nearest neighbors among the known examples.
• The nearest neighbors are determined by a distance measure.
Application
• Text mining
• Agriculture
• Financial
• Healthcare
Source: http://personalexcellence.co/
Distance
• Numerical Data
• Categorical Data
$$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
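The Euclidean distance for numerical data can be sketched in a few lines. This is a minimal illustration in plain Python (the exercise scripts are in R and are not reproduced here); the function name `euclidean_distance` is an assumption made for this example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```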
Distance – Text Mining
Hamming Distance
• "karolin" and "kathrin" is 3.
• "karolin" and "kerstin" is 3.
• 1011101 and 1001001 is 2.
• 2173896 and 2233796 is 3.
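The Hamming distance counts the positions at which two equal-length sequences differ. A minimal Python sketch that reproduces the four examples on this slide is given here; the function name `hamming_distance` is chosen for illustration only.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("karolin", "kerstin"))  # 3
print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2173896", "2233796"))  # 3
```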
Regression Formulation
kNN Regression
kNN Regression
$$y' = \frac{1}{K} \sum_{i=1}^{K} y_i$$
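In kNN regression the prediction is the average response of the K nearest training points. The sketch below shows one way to write this in plain Python, assuming Euclidean distance; the names `knn_regress`, `train_x`, and `train_y` and the toy data are assumptions for illustration, not taken from the exercise files.

```python
import math


def knn_regress(train_x, train_y, query, k):
    """Predict y for `query` as the mean y of its k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")

    def dist(p, q):
        # Euclidean distance between two points given as tuples.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Indices of the training points, ordered from nearest to farthest.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Toy data (hypothetical): one input variable, four observations.
train_x = [(1,), (2,), (3,), (10,)]
train_y = [2.0, 4.0, 6.0, 20.0]

print(knn_regress(train_x, train_y, (2.5,), k=2))  # 5.0
print(knn_regress(train_x, train_y, (2.5,), k=4))  # 8.0
```

With K equal to the number of training points the prediction is simply the overall mean, which is why very large K values give a flat, underfitted curve.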
Simple Linear Regression
Exercise 1
• Open “simple_regression.R”
• Create the simulated data
• Follow the instructions
Simulated Data 1
MSE Plot Simple Regression
Plot with K=1
Plot with K=10
Plot with K=100
Simple Linear Regression: Introduce Non-Linearity
Introducing Non Linear Component
MSE Plot Non Linear Problem
Curse of Dimensionality
Exercise 2
• Open “boston_knn_class.R”
• Load the MASS library
• Load the “Boston” data
• Follow the steps in the file
kNN Tips
• Normalize the input variables
• Find the optimal value of K using cross-validation
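Both tips can be sketched together. The example below is a plain-Python illustration under stated assumptions: min-max scaling is used as the normalization, and leave-one-out cross-validation with mean squared error is used to compare values of K. The slide does not say which normalization or which cross-validation scheme is intended, and all function names and data are hypothetical.

```python
import math


def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has span 0; use 1.0 to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans))
            for row in rows]


def knn_predict(train_x, train_y, query, k):
    """kNN regression: mean response of the k nearest training points."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k


def loocv_mse(xs, ys, k):
    """Leave-one-out cross-validated mean squared error for a given k."""
    total = 0.0
    for i in range(len(xs)):
        # Hold out observation i and predict it from all the others.
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        total += (knn_predict(rest_x, rest_y, xs[i], k) - ys[i]) ** 2
    return total / len(xs)


def best_k(xs, ys, candidates):
    """Candidate k with the lowest leave-one-out MSE."""
    return min(candidates, key=lambda k: loocv_mse(xs, ys, k))


# Hypothetical data: two inputs on very different scales.
rows = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500), (6, 600)]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

scaled = min_max_scale(rows)
print(best_k(scaled, ys, candidates=[1, 2, 3]))
```

Without scaling, the second column would dominate the Euclidean distance simply because its values are larger, which is the reason for the first tip.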
Other experiment
Classification Formulation
kNN Classification
$$y' = \operatorname*{argmax}_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$$
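In kNN classification the predicted class is the one that occurs most often among the K nearest neighbours (a majority vote). A minimal Python sketch follows, assuming Euclidean distance; the name `knn_classify` and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k):
    """Predict the majority class among the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    # Count the class labels of the k nearest neighbours.
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, Counter returns the label that was counted first,
    # i.e. the tied label whose neighbour is nearest to the query.
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_x, train_y, (2, 2), k=3))  # a
print(knn_classify(train_x, train_y, (6, 5), k=3))  # b
```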
Binary Classification
Exercise 3
• Open “logistic vs knn v2.R”
• Follow the steps
Recall on Confusion Table
Source: Wikipedia
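As a reminder of how a confusion table for a binary classifier is filled in, the sketch below tallies the four cells from actual and predicted labels and derives the accuracy. The function name, the cell labels, and the example vectors are assumptions for illustration.

```python
def confusion_table(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["TP"] += 1  # predicted positive, actually positive
        elif p == positive:
            counts["FP"] += 1  # predicted positive, actually negative
        elif a == positive:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


# Hypothetical labels for eight observations.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

table = confusion_table(actual, predicted)
accuracy = (table["TP"] + table["TN"]) / len(actual)

print(table)     # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(accuracy)  # 0.75
```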
Multi-class Classification
Exercise 4
• Open “multiclass.R”
• Follow the steps
Assignment
Assignment – Due Next Week
• Increase the accuracy on the multi-class problem by 10%.
• In a Word document, describe the improvement you obtained, the method you used, and why it works or does not work.
• Submit your code and Word document to [email protected] before 23 May 2016 23:59:59.
Hint: You can increase the sample size
References
• James G., Witten D., Hastie T. and Tibshirani R. An Introduction to Statistical Learning. Springer. 2014.