![Page 1: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/1.jpg)
k-Nearest Neighbor Classification on Spatial
Data Streams Using P-trees
Maleq Khan, Qin Ding, William Perrizo; NDSU
![Page 2: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/2.jpg)
Introduction
We explored distance metric based computation using P-trees
Defined a new distance metric, called HOB distance
Revealed some useful properties of P-trees
A new method of nearest neighbor classification using P-tree
- called Closed-KNN
A new algorithm for k-clustering using P-trees
- efficient statistical computation from the P-trees
![Page 3: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/3.jpg)
Overview
1. Data Mining
- classification and clustering
2. Various distance metrics
- Minkowski, Manhattan, Euclidian, Max, Canberra, Cord, and HOB distance
- Neighborhoods and decision boundaries
3. P-trees and their properties
4. k-nearest neighbor classification
- Closed-KNN using Max and HOB distance
5. k-clustering
- overview of existing algorithms
- our new algorithm
- computation of mean and variance from the P-trees
![Page 4: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/4.jpg)
Data Mining
extracting knowledge from a large amount of data
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis, evolution analysis
Information Pyramid (figure): Raw data at the base, Useful Information at the top, Data Mining in between (more data, less information)
![Page 5: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/5.jpg)
Classification
Predicting the class of a data object
Training data: Class labels are known
Feature1 | Feature2 | Feature3 | Class
a1 | b1 | c1 | A
a2 | b2 | c2 | A
a3 | b3 | c3 | B
Sample with unknown class: (a, b, c) -> Classifier -> Predicted class of the sample
also called Supervised learning
![Page 6: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/6.jpg)
Types of Classifier
Eager classifier: Builds a classifier model in advance
e.g. decision tree induction, neural network
Lazy classifier: Uses the raw training data
e.g. k-nearest neighbor
![Page 7: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/7.jpg)
Clustering
The process of grouping objects into classes, with the objective that the data objects are
• similar to the objects in the same cluster
• dissimilar to the objects in the other clusters.
(Figure: a two-dimensional space showing 3 clusters)
Clustering is often called unsupervised learning or unsupervised classification:
the class labels of the data objects are unknown
![Page 8: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/8.jpg)
Distance Metric
Measures the dissimilarity between two data points.
A distance metric is a function, d, of two n-dimensional points X and Y, such that
d(X, Y) is positive definite: if (X ≠ Y), d(X, Y) > 0
if (X = Y), d(X, Y) = 0
d(X, Y) is symmetric: d(X, Y) = d(Y, X)
d(X, Y) holds triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)
![Page 9: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/9.jpg)
Various Distance Metrics
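Let $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$.
Minkowski distance or $L_p$ distance:
$$d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
Manhattan distance ($p = 1$):
$$d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$
Euclidian distance ($p = 2$):
$$d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Max distance ($p = \infty$):
$$d_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|$$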
![Page 10: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/10.jpg)
An Example
A two-dimensional space:
(Figure: X (2,1) and Y (6,4), with Z at the right angle of the triangle XZY)
Manhattan, d1(X,Y) = XZ + ZY = 4 + 3 = 7
Euclidian, d2(X,Y) = XY = 5
Max, d∞(X,Y) = Max(XZ, ZY) = XZ = 4
d1 ≥ d2 ≥ d∞
For any positive integer p, $d_p \ge d_{p+1}$
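A minimal sketch of the Minkowski family, applied to the example points X (2,1) and Y (6,4). The function name `minkowski` is an illustrative choice, not something from the slides.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points; p may be math.inf (Max distance)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)           # Manhattan: 4 + 3
d2 = minkowski(X, Y, 2)           # Euclidian: sqrt(16 + 9)
dmax = minkowski(X, Y, math.inf)  # Max: max(4, 3)
print(d1, d2, dmax)
```

The three values reproduce 7, 5 and 4 from the slide and illustrate d1 ≥ d2 ≥ d∞.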
![Page 11: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/11.jpg)
Some Other Distances
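Canberra distance:
$$d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$$
Squared cord distance:
$$d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$$
Squared chi-squared distance:
$$d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$$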
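A short sketch of these three distances in their usual textbook forms, assuming non-negative coordinates. Skipping terms where x_i + y_i = 0 is an assumption made here to avoid division by zero; the slides do not say how that case is handled.

```python
import math

def canberra(x, y):
    # Terms with x_i + y_i == 0 are skipped (assumption, see lead-in).
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_chord(x, y):
    # Requires non-negative coordinates because of the square roots.
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)

X, Y = (2, 1), (6, 4)
print(canberra(X, Y), squared_chord(X, Y), squared_chi_squared(X, Y))
```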
![Page 12: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/12.jpg)
HOB Similarity
Higher Order Bit (HOB) similarity:
$$\mathrm{HOBS}(A, B) = \max_{0 \le s \le m} \{\, s : a_i = b_i \text{ for all } 1 \le i \le s \,\}$$
Bit position: 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
x1: 0 1 1 0 1 0 0 1 x2: 0 1 0 1 1 1 0 1
y1: 0 1 1 1 1 1 0 1 y2: 0 1 0 1 0 0 0 0
HOBS(x1, y1) = 3 HOBS(x2, y2) = 4
A, B: two scalars (integer)
ai, bi : ith bit of A and B (left to right)
m : number of bits
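HOBS can be sketched with an XOR: the highest set bit of A XOR B marks the first position, counted from the left, where the two values differ. The function name `hobs` and the default m = 8 are illustrative assumptions; both inputs are assumed to fit in m bits.

```python
def hobs(a, b, m=8):
    """Number of leading (most significant) bits on which a and b agree."""
    diff = a ^ b
    # bit_length() is the position of the highest differing bit, counted from the right.
    return m - diff.bit_length()

x1, y1 = 0b01101001, 0b01111101
x2, y2 = 0b01011101, 0b01010000
print(hobs(x1, y1), hobs(x2, y2))
```

For the two pairs on the slide this gives 3 and 4, and identical values give m.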
![Page 13: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/13.jpg)
HOB Distance
The HOB distance between two scalar values A and B:
dv(A, B) = m – HOBS(A, B)
The previous example:
Bit position: 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
x1: 0 1 1 0 1 0 0 1 x2: 0 1 0 1 1 1 0 1
y1: 0 1 1 1 1 1 0 1 y2: 0 1 0 1 0 0 0 0
HOBS(x1, y1) = 3 HOBS(x2, y2) = 4
dv(x1, y1) = 8 – 3 = 5 dv(x2, y2) = 8 – 4 = 4
The HOB distance between two points X and Y:
$$d_H(X, Y) = \max_{i=1}^{n} d_v(x_i, y_i) = \max_{i=1}^{n} \{\, m - \mathrm{HOBS}(x_i, y_i) \,\}$$
In our example (considering 2-dimensional data):
dH(X, Y) = max (5, 4) = 5
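A sketch of the HOB distance for scalars and for points, reusing the two bit patterns of the example as the two coordinates of X and Y. The function names are illustrative, and values are assumed to fit in m bits.

```python
def hobs(a, b, m=8):
    """Number of leading bits on which a and b agree."""
    return m - (a ^ b).bit_length()

def hob_distance_scalar(a, b, m=8):
    """d_v(A, B) = m - HOBS(A, B)."""
    return m - hobs(a, b, m)

def hob_distance(X, Y, m=8):
    """d_H(X, Y): the largest per-dimension scalar HOB distance."""
    return max(hob_distance_scalar(x, y, m) for x, y in zip(X, Y))

X = (0b01101001, 0b01011101)   # (x1, x2)
Y = (0b01111101, 0b01010000)   # (y1, y2)
print(hob_distance(X, Y))
```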
![Page 14: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/14.jpg)
HOB Distance Is a Metric
HOB distance is positive definite
if (X = Y), $d_H(X, Y) = 0$
if (X ≠ Y), $d_H(X, Y) > 0$
HOB distance is symmetric
$d_H(X, Y) = d_H(Y, X)$
HOB distance holds triangle inequality
$d_H(X, Y) + d_H(Y, Z) \ge d_H(X, Z)$
![Page 15: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/15.jpg)
Neighborhood of a Point
Neighborhood of a target point, T, is a set of points, S,
such that X ∈ S if and only if d(T, X) ≤ r
If X is a point on the boundary, d(T, X) = r
(Figure: four neighborhoods of width 2r around T, each with a boundary point X, for the Manhattan, Euclidian, Max and HOB distances)
![Page 16: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/16.jpg)
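Decision Boundary
The decision boundary between points A and B is the locus of the point X satisfying the condition d(A, X) = d(B, X)
(Figure: points A and B, a point X on the boundary D with d(A,X) and d(B,X) marked, and the regions R1 and R2 on either side)
(Figure: decision boundaries between A and B for the Manhattan, Euclidian and Max distances, in two panels labeled > 45° and < 45°)
Decision boundary for HOB distance: perpendicular to the axis that makes maximum distance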
![Page 17: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/17.jpg)
Remotely Sensed Imagery Data
An image is a collection of pixels
Each pixel represents a square area on the ground
Several attributes or bands are associated with each pixel
e.g. red, green, blue reflectance values, soil moisture, nitrate
Band Sequential (BSQ) file: one file for each band
Bit Sequential (bSQ) file: one file for each bit of each band
Bi,j is the bSQ file for jth bit of ith band
![Page 18: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/18.jpg)
Peano count-Tree or P-tree
We form one P-tree from each bSQ file
Pi,j is the basic P-tree for bit j of band i
• Root of the P-tree is the count of 1 bits in the entire image (root count)
• Root has 4 children with the counts of the 4 quadrants
• Recursively divide the quadrants until there is only one bit in the quadrant, unless the node is pure0 or pure1
Pure1 node: all bits are 1
Example 8 × 8 bSQ file:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
Its P-tree (quadrants in the order upper-left, upper-right, lower-left, lower-right):
Root count: 55
Children of the root: 16 | 8 | 15 | 16
Children of 8: 3 | 0 | 4 | 1
Children of 15: 4 | 4 | 3 | 4
Bits under the mixed nodes 3, 1, 3: 1110 | 0010 | 1101
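The construction can be sketched recursively. This is a simplified in-memory version, assuming a square bit matrix whose side is a power of 2 and the quadrant order upper-left, upper-right, lower-left, lower-right. The tuple layout `(count, children)` is an illustrative choice, not the authors' storage format.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return (count, children); children is None for pure or single-bit nodes."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if size == 1 or count == 0 or count == size * size:
        return (count, None)          # pure0, pure1 or a single bit: stop dividing
    h = size // 2
    offsets = ((0, 0), (0, h), (h, 0), (h, h))   # UL, UR, LL, LR
    children = [build_ptree(bits, r0 + dr, c0 + dc, h) for dr, dc in offsets]
    return (count, children)

IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]
root = build_ptree(IMAGE)
print(root[0], [child[0] for child in root[1]])
```

On the example image this yields the root count 55 and the quadrant counts 16, 8, 15, 16 shown on the slide.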
![Page 19: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/19.jpg)
Peano Mask Tree (PMT)
P-tree:
Root: 55
Children of the root: 16 | 8 | 15 | 16
Children of 8: 3 | 0 | 4 | 1
Children of 15: 4 | 4 | 3 | 4
Bits under the mixed nodes: 1110 | 0010 | 1101
PMT:
Root: m
Children of the root: 1 | m | m | 1
Children of the first m: m | 0 | 1 | m
Children of the second m: 1 | 1 | m | 1
Bits under the mixed nodes: 1110 | 0010 | 1101
0 represents pure0 node
1 represents pure1 node
m represents mixed node
![Page 20: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/20.jpg)
P-tree ANDing
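(Figure: ANDing two mask trees)
First operand: root m with children 1 | m (Subtree1) | 0 | m (Subtree2)
Second operand: root m with children m (Subtree3) | 0 | 0 | m (Subtree4)
Result of AND: root m with children m (Subtree3) | 0 | 0 | m (Subtree5)
ORing and COMPLEMENT operations are performed in a similar way
There are also some other P-tree structures (such as PVT) and ANDing algorithms that are beyond the scope of this presentation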
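A minimal sketch of node-wise ANDing on mask trees. The representation is an assumption made for illustration: a node is 0 (pure0), 1 (pure1), or a list of four child nodes (mixed). The concrete contents of the subtrees below are hypothetical, since the figure only names them.

```python
def ptree_and(a, b):
    """AND two mask trees given as 0, 1, or a list of four children."""
    if a == 0 or b == 0:
        return 0              # pure0 AND anything is pure0
    if a == 1:
        return b              # pure1 AND X is X
    if b == 1:
        return a
    children = [ptree_and(x, y) for x, y in zip(a, b)]
    if all(c == 0 for c in children):
        return 0              # collapse a node whose children all became pure0
    if all(c == 1 for c in children):
        return 1
    return children

# Hypothetical leaf-level subtrees, chosen only to make the example concrete.
subtree1, subtree2 = [1, 0, 0, 1], [1, 1, 0, 0]
subtree3, subtree4 = [0, 1, 1, 0], [1, 0, 1, 0]

op1 = [1, subtree1, 0, subtree2]
op2 = [subtree3, 0, 0, subtree4]
result = ptree_and(op1, op2)
print(result)
```

The result has the shape from the figure: the first child is Subtree3 unchanged, the middle two are pure0, and the last child is the AND of Subtree2 and Subtree4.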
![Page 21: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/21.jpg)
Value & Interval P-tree
The value P-tree Pi(v) represents the pixels that have value v for band i:
there is a 1 in Pi(v) at a pixel location if that pixel has the value v for band i,
otherwise there is a 0 in Pi(v).
Let bj = jth bit of the value v
and Pi,j = the basic P-tree for band i bit j.
Define Pti,j = Pi,j if bj = 1
= P´i,j if bj = 0
Then Pi(v) = Pti,1 AND Pti,2 AND Pti,3 AND … AND Pti,m
The interval P-tree, Pi(v1, v2) = Pi(v1) OR Pi(v1+1) OR Pi(v1+2) OR … OR Pi(v2)
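The definitions can be tried out with a flat stand-in for P-trees: here each P-tree is modeled as the set of pixel indices where it holds a 1, so AND, OR and COMPLEMENT become set operations. This ignores the quadrant structure and compression of real P-trees; all names are illustrative.

```python
def basic_ptrees(values, m=8):
    """P[j] for j = 1..m (1 = most significant): pixels whose j-th bit is 1."""
    return {j: {p for p, v in enumerate(values) if (v >> (m - j)) & 1}
            for j in range(1, m + 1)}

def value_ptree(P, v, n_pixels, m=8):
    """Pi(v): AND of P[j] where bit j of v is 1 and of its complement where it is 0."""
    all_pixels = set(range(n_pixels))
    result = all_pixels
    for j in range(1, m + 1):
        bit = (v >> (m - j)) & 1
        result = result & (P[j] if bit else all_pixels - P[j])
    return result

def interval_ptree(P, v1, v2, n_pixels, m=8):
    """Pi(v1, v2): OR of the value P-trees for v1, v1+1, ..., v2."""
    result = set()
    for v in range(v1, v2 + 1):
        result |= value_ptree(P, v, n_pixels, m)
    return result

band = [105, 104, 107, 105, 200, 12]      # one band value per pixel
P = basic_ptrees(band)
print(value_ptree(P, 105, len(band)))
print(interval_ptree(P, 104, 107, len(band)))
```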
![Page 22: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/22.jpg)
Notations
P1 & P2 : P1 AND P2
P1 | P2 : P1 OR P2
P´ : COMPLEMENT of P
Pi, j : basic P-tree for band i bit j.
Pi(v) : value P-tree for value v of band i.
Pi(v1, v2) : interval P-tree for interval [v1, v2] of band i.
P0 : is pure0-tree, a P-tree having the root node which is pure0.
P1 : is pure1-tree, a P-tree having the root node which is pure1.
rc(P) : root count of P-tree P
N : number of pixels
n : number of bands
m : number of bits
![Page 23: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/23.jpg)
Properties of P-trees
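1. a) rc(P) = 0 ⇔ P = P0
b) rc(P) = N ⇔ P = P1
2. a) P & P0 = P0
b) P & P1 = P
c) P & P = P
d) P & P´ = P0
3. a) P | P0 = P
b) P | P1 = P1
c) P | P = P
d) P | P´ = P1
(In 1 to 3, P0 and P1 are the pure0-tree and the pure1-tree; in 4 and 6, P1 and P2 are any two P-trees.)
4. rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2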
![Page 24: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/24.jpg)
P-tree Header
Header of a P-tree file, to make a generalized P-tree structure:
Field | Size
Format code | 1 word
Fan-out | 2 words
# of levels | 2 words
Root count | 4 words
Length of the body in bytes | 4 words
The body of the P-tree follows the header
![Page 25: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/25.jpg)
k-Nearest Neighbor Classification
1) Select a suitable value for k
2) Determine a suitable distance metric
3) Find k nearest neighbors of the sample using the
selected metric
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs
5) Assign the plurality class to the sample to be classified.
![Page 26: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/26.jpg)
Closed-KNN
(Figure: the target T and its neighborhood, with several points on the boundary)
T is the target pixel.
With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood
Closed-KNN includes all points on the boundary
Closed-KNN yields higher classification accuracy than traditional KNN
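A brute-force sketch of the closed neighborhood idea, without P-trees: every training point whose distance does not exceed that of the k-th nearest neighbor gets a vote. The sample data and function names are made up for illustration.

```python
from collections import Counter

def max_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def closed_knn_classify(train, target, k, dist):
    """train: list of (point, label). Returns (plurality label, neighbors used)."""
    ranked = sorted(train, key=lambda pl: dist(pl[0], target))
    if len(ranked) <= k:
        neighbors = ranked
    else:
        radius = dist(ranked[k - 1][0], target)   # distance of the k-th nearest neighbor
        # Closed neighborhood: keep every point on or inside that boundary.
        neighbors = [pl for pl in ranked if dist(pl[0], target) <= radius]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

train = [((5, 5), "A"), ((6, 5), "A"), ((5, 7), "B"),
         ((7, 5), "B"), ((3, 5), "B"), ((9, 9), "A")]
label, neighbors = closed_knn_classify(train, (5, 5), 3, max_distance)
print(label, len(neighbors))
```

In this toy data three points tie at distance 2 for the third place. Plain KNN with k = 3 would keep only one of them and vote A; the closed neighborhood keeps all three and votes B.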
![Page 27: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/27.jpg)
Searching Nearest Neighbors
We begin searching by finding the exact matches.
Let the target sample, T = <v1, v2, v3, …, vn>
The initial neighborhood is the point T.
We expand the neighborhood along each dimension:
along dimension i, [vi] is expanded to the interval [vi – ai , vi+bi],
for some positive integers ai and bi.
Continue expansion until there are at least k points in the neighborhood.
![Page 28: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/28.jpg)
HOB Similarity Method for KNN
In this method, we match bits of the target to the training data
First we find matching in all 8 bits of each band (exact matching)
Let bi,j = jth bit of the ith band of the target pixel.
Define Pti,j = Pi,j, if bi,j = 1
= P´i,j, otherwise
And Pvi,1-j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
Pnn = Pv1,1-8 & Pv2,1-8 & Pv3,1-8 & … & Pvn,1-8
If rc(Pnn) < k, update Pnn = Pv1,1-7 & Pv2,1-7 & Pv3,1-7 & … & Pvn,1-7
and so on, matching one bit fewer in each step until rc(Pnn) ≥ k
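A sketch of this search with flat bit sets standing in for P-trees (a set of pixel indices plays the role of a P-tree, its size the role of the root count). The data and names are illustrative assumptions.

```python
def hobs_neighbors(pixels, target, k, m=8):
    """Return (pixel indices in the neighborhood, number of leading bits matched)."""
    everyone = set(range(len(pixels)))
    n_bands = len(target)

    def bit(v, j):                       # j-th bit of v, j = 1 is the most significant
        return (v >> (m - j)) & 1

    # pt[i][j-1]: pixels whose j-th bit in band i equals the target's j-th bit (Pt_i,j)
    pt = [[{p for p in everyone if bit(pixels[p][i], j) == bit(target[i], j)}
           for j in range(1, m + 1)] for i in range(n_bands)]

    for bits in range(m, -1, -1):        # match 8 bits, then 7, then 6, ...
        pnn = set(everyone)
        for i in range(n_bands):
            for j in range(bits):
                pnn &= pt[i][j]          # Pv_i,1-bits ANDed across all bands
        if len(pnn) >= k:                # rc(Pnn) >= k
            return pnn, bits
    return everyone, 0                   # fewer than k training pixels in total

pixels = [(105, 40), (104, 41), (107, 43), (110, 47), (200, 40)]
pnn, bits = hobs_neighbors(pixels, (105, 40), k=3)
print(pnn, bits)
```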
![Page 29: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/29.jpg)
An Analysis of HOB Method
Let the ith band value of the target T be vi = 105 = 01101001b
[01101001] = [105, 105]
1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
Does not expand evenly on both sides: Target = 105 and center of [104, 107] = (104+107) / 2 = 105.5
And expands by powers of 2.
Computationally very cheap
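The intervals above follow from masking off the low-order bits. A small sketch (the function name is an illustrative choice):

```python
def hob_interval(v, bits, m=8):
    """Interval of m-bit values that share their top `bits` bits with v."""
    free = m - bits                      # number of low-order bits left unmatched
    low = (v >> free) << free            # free bits set to 0
    high = low | ((1 << free) - 1)       # free bits set to 1
    return low, high

for bits in (8, 7, 6, 5):
    low, high = hob_interval(105, bits)
    print(bits, (low, high), "center:", (low + high) / 2)
```

Each step doubles the width of the interval, and the target 105 is generally not at its center.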
![Page 30: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/30.jpg)
Perfect Centering Method
Max distance metric provides a better neighborhood by
- keeping the target in the center
- and expanding by 1 on both sides
Initial neighborhood P-tree (exact matching):
Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
If rc(Pnn) < k, Pnn = P1(v1-1, v1+1) & P2(v2-1, v2+1) & … & Pn(vn-1, vn+1)
If rc(Pnn) < k, Pnn = P1(v1-2, v1+2) & P2(v2-2, v2+2) & … & Pn(vn-2, vn+2)
Computationally costlier than HOB Similarity method
But a little better classification accuracy
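A sketch of the same expansion with flat sets in place of interval P-trees. The helper `interval_set` stands in for Pi(v1, v2), and the data is made up for illustration.

```python
def interval_set(pixels, band, low, high):
    """Stand-in for the interval P-tree P_band(low, high): indices of matching pixels."""
    return {p for p, px in enumerate(pixels) if low <= px[band] <= high}

def perfect_centering_neighbors(pixels, target, k, max_value=255):
    """Grow [v_i - r, v_i + r] in every band until at least k pixels fall inside."""
    r = 0
    while True:
        pnn = set(range(len(pixels)))
        for i, v in enumerate(target):
            pnn &= interval_set(pixels, i, v - r, v + r)
        if len(pnn) >= k or r >= max_value:   # stop once the whole value range is covered
            return pnn, r
        r += 1

pixels = [(105, 40), (104, 41), (107, 43), (110, 47), (200, 40)]
pnn, r = perfect_centering_neighbors(pixels, (105, 40), k=3)
print(pnn, r)
```

Unlike the HOBS intervals, the interval here grows by 1 on each side per step, so the target stays centered at the cost of more steps.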
![Page 31: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/31.jpg)
Finding the Plurality Class
Let Pc(i) be the value P-tree for the class i
Plurality class = $\arg\max_i \, \mathrm{rc}(P_c(i) \,\&\, P_{nn})$
![Page 32: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/32.jpg)
Performance
Experimented on two sets of aerial photographs of the Best Management Plot (BMP) of Oakes Irrigation Test Area (OITA), ND
Data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label).
Band values range from 0 to 255 (8 bits)
Considering 8 classes or levels of yield values: 0 to 7
![Page 33: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/33.jpg)
Performance – Accuracy
1997 Dataset:
(Figure: Accuracy (%), axis from 40 to 80, versus Training Set Size (no. of pixels), from 256 to 262144. Series: KNN-Manhattan, KNN-Euclidian, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN))
![Page 34: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/34.jpg)
Performance - Accuracy (cont.)
1998 Dataset:
(Figure: Accuracy (%), axis from 20 to 65, versus Training Set Size (no. of pixels), from 256 to 262144. Series: KNN-Manhattan, KNN-Euclidian, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN))
![Page 35: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/35.jpg)
Performance - Time
1997 Dataset: both axes in logarithmic scale
(Figure: Per Sample Classification Time (sec), axis from 0.00001 to 1, versus Training Set Size (no. of pixels), from 256 to 262144. Series: KNN-Manhattan, KNN-Euclidian, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN))
![Page 36: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/36.jpg)
Performance - Time (cont.)
1998 Dataset: both axes in logarithmic scale
(Figure: Per Sample Classification Time (sec), axis from 0.00001 to 1, versus Training Set Size (no. of pixels), from 256 to 262144. Series: KNN-Manhattan, KNN-Euclidian, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN))
![Page 37: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/37.jpg)
k-Clustering
Partitioning data into k clusters, C1, C2, …, Ck, so as to minimize some criterion function,
such as the sum of squared Euclidian distances measured from the centroid of the cluster, or total variance,
$$\sum_{i=1}^{k} \sum_{p \in C_i} d_2^2(c_i, p)$$
where ci is the centroid or mean of Ci,
or the sum of the pair-wise weights,
$$\sum_{i=1}^{k} \sum_{p, q \in C_i} c(p, q)$$
where c is the weight function, usually the distance between p and q
![Page 38: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/38.jpg)
k-Means Algorithm
1. Arbitrarily select k initial cluster centers
2. Assign each data point to its nearest center
3. Update the centers by the means of the clusters
4. Repeat step 2 & 3 until no change
Good optimization, very slow
Complexity O(nNkt), n = # of dimension, N = # of data points
k = # of clusters, t = # of iterations
To solve speed issues,
some other algorithms have been proposed sacrificing quality
![Page 39: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/39.jpg)
Divisive Approach
1. Initially consider the whole space as one hyperbox
2. Select a hyperbox to split
3. Select an axis and cut-point
4. Split the selected hyperbox by a hyperplane perpendicular to the selected axis through the selected cut-point
5. Repeat steps 2-4 until there are k hyperboxes; each hyperbox is a cluster
Mean-split algorithm, variance-based algorithm and our proposed new algorithm follow the divisive approach
They differ in the strategies for selecting the hyperbox, axis and cut-point.
![Page 40: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/40.jpg)
Mean-Split Algorithm
The initial hyperbox (the whole space) is assigned a number k,
that is, k clusters will be formed from this hyperbox
Let L = number of clusters assigned to a hyperbox
Li clusters are assigned to the i th sub-hyperbox, where i = 1, 2
(Formula: Li is given in terms of L, the point counts n1, n2 and the volumes V1, V2 of the two sub-hyperboxes, with a parameter between 0 and 1; n = # of points, V = volume)
1. Select a hyperbox with L > 1
2. Select the axis with largest spread of projected data
3. Mean of the projected data is the cut-point
Fast but poor optimization
![Page 41: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/41.jpg)
Variance-Based Algorithm
1. Select the hyperbox with largest variance
2. By checking each point on each dimension of the selected hyperbox, find the optimal cut-point, topt, that gives maximum variance reduction on the projected data:
$$t_{opt} = \arg\max_t \left[ \sigma^2 - w_1(t)\,\sigma_1^2(t) - w_2(t)\,\sigma_2^2(t) \right]$$
where $w_i$ and $\sigma_i^2(t)$ are the weight and variance of the i th interval (i = 1, 2) and $\sigma^2$ is the variance before the cut
Still computationally costly, but optimization is closer to k-means
![Page 42: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/42.jpg)
Our Algorithm
When a new hyperbox is formed find two means m1 and m2
for each dimension using the projected data:
a. Arbitrarily select two values for m1 and m2 (m1 < m2)
b. Update m1 = mean of the interval [0, (m1+m2)/2]
c. Update m2 = mean of the interval [(m1+m2)/2, upper_limit]
d. Repeat step b & c until no change in m1 and m2.
1. Select the hyperbox and axis for which (m2 – m1) is
largest
2. Cut-point = (m1 + m2) / 2
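Steps a to d amount to a one-dimensional 2-means iteration on the projected data. The sketch below works on a plain list of values rather than on P-tree root counts, starts from the minimum and maximum instead of arbitrary values, and updates both means from the same cut in each pass. These are simplifying assumptions, not necessarily the authors' exact procedure.

```python
def two_means(values):
    """Iterate m1, m2 on 1-D data until stable; return (m1, m2, cut_point)."""
    m1, m2 = min(values), max(values)           # initial guesses (assumption)
    while True:
        cut = (m1 + m2) / 2
        left = [v for v in values if v <= cut]  # points in [0, cut]
        right = [v for v in values if v > cut]  # points in (cut, upper_limit]
        new_m1 = sum(left) / len(left) if left else m1
        new_m2 = sum(right) / len(right) if right else m2
        if (new_m1, new_m2) == (m1, m2):        # no change: converged
            return m1, m2, (m1 + m2) / 2
        m1, m2 = new_m1, new_m2

projected = [10, 12, 14, 11, 200, 210, 205, 13]
m1, m2, cut_point = two_means(projected)
print(m1, m2, cut_point)
```

The gap m2 - m1 is the quantity compared across hyperboxes and axes in step 1, and the returned cut point is the one used in step 2.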
![Page 43: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/43.jpg)
Our Algorithm (cont.)
We represent each cluster by a P-tree
the initial cluster is the pure1-tree, P1
Let Pci is the P-tree for cluster ci
the P-trees for the two new clusters after splitting along axis j:
PCi1 = PCi & Pj(0, (m1+m2)/2)
PCi2 = PCi & Pj((m1+m2)/2, upper_limit)
Note: Pj((m1+m2)/2, upper_limit) = complement of Pj(0, (m1+m2)/2)
![Page 44: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/44.jpg)
Computing Sum & Mean from P-trees
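For all points and for dimension or band i (here n is the number of bits per value and N the number of points):
$$\text{sum} = \sum_{j=0}^{n-1} 2^{\,n-1-j}\, \mathrm{rc}(P_{i,j}) \qquad \text{mean} = \frac{1}{N} \sum_{j=0}^{n-1} 2^{\,n-1-j}\, \mathrm{rc}(P_{i,j})$$
For the points in a cluster:
$$\text{sum} = \sum_{j=0}^{n-1} 2^{\,n-1-j}\, \mathrm{rc}(P_{i,j} \,\&\, P_t) \qquad \text{mean} = \frac{\sum_{j=0}^{n-1} 2^{\,n-1-j}\, \mathrm{rc}(P_{i,j} \,\&\, P_t)}{\mathrm{rc}(P_t)}$$
Here the template P-tree, Pt = P-tree representing the cluster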
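A sketch of computing a sum and mean from bit planes alone. Each basic P-tree is modeled as the set of point indices whose bit is 1, so its size plays the role of the root count. The indexing (j = 0 for the most significant bit, weight 2^(n-1-j)) is an assumption of this sketch, as are all names.

```python
def bit_planes(values, n_bits=8):
    """planes[j]: indices of the points whose bit j is 1 (j = 0 is the most significant bit)."""
    return [{p for p, v in enumerate(values) if (v >> (n_bits - 1 - j)) & 1}
            for j in range(n_bits)]

def ptree_sum(planes, template=None):
    """Sum of the values, weighted by bit position, using only root counts."""
    n_bits = len(planes)
    total = 0
    for j, plane in enumerate(planes):
        rc = len(plane if template is None else plane & template)
        total += 2 ** (n_bits - 1 - j) * rc
    return total

values = [105, 104, 107, 110, 200]
planes = bit_planes(values)

total = ptree_sum(planes)
mean = total / len(values)

cluster = {0, 1, 2}                      # template: points belonging to the cluster
cluster_sum = ptree_sum(planes, cluster)
cluster_mean = cluster_sum / len(cluster)
print(total, mean, cluster_sum, cluster_mean)
```

The original values are never read after the bit planes are built, which is the point of the method: no scan of the data is needed per query.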
![Page 45: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/45.jpg)
Computing Variance from P-trees
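$$\text{Variance} = \sigma^2 = \frac{1}{N} \sum x^2 - \left( \frac{1}{N} \sum x \right)^2$$
For all points in the space (n is the number of bits per value):
$$\sum x^2 = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} 2^{\,2n-j-k-2}\, \mathrm{rc}(P_{i,j} \,\&\, P_{i,k})$$
For the points in a cluster:
$$\sum x^2 = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} 2^{\,2n-j-k-2}\, \mathrm{rc}(P_{i,j} \,\&\, P_{i,k} \,\&\, P_t)$$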
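A sketch of the variance computation from pairwise ANDs of bit planes, again with sets of point indices standing in for P-trees and j = 0 taken as the most significant bit (assumptions of this sketch).

```python
def bit_planes(values, n_bits=8):
    """planes[j]: indices of the points whose bit j is 1 (j = 0 is the most significant bit)."""
    return [{p for p, v in enumerate(values) if (v >> (n_bits - 1 - j)) & 1}
            for j in range(n_bits)]

def root_count(plane, template=None):
    return len(plane if template is None else plane & template)

def ptree_variance(planes, n_points, template=None):
    """Variance = (1/N) * sum(x^2) - ((1/N) * sum(x))^2, from root counts only."""
    n_bits = len(planes)
    total = sum(2 ** (n_bits - 1 - j) * root_count(planes[j], template)
                for j in range(n_bits))
    total_sq = sum(2 ** (2 * n_bits - j - k - 2)
                   * root_count(planes[j] & planes[k], template)
                   for j in range(n_bits) for k in range(n_bits))
    return total_sq / n_points - (total / n_points) ** 2

values = [105, 104, 107, 110, 200]
planes = bit_planes(values)
variance_all = ptree_variance(planes, len(values))

cluster = {0, 1, 2}
variance_cluster = ptree_variance(planes, len(cluster), cluster)
print(variance_all, variance_cluster)
```

The double sum works because x squared expands into products of pairs of bits, and the AND of two bit planes counts the points where both bits are 1.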
![Page 46: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/46.jpg)
Performance
Unlike variance based method, instead of checking each
point on the axis, our method rapidly converges to the
optimal cut point, topt .
avoids scanning database by computing sum and mean
from the root count of the P-trees
very much faster than variance-based method while
optimization as good as variance-based method
![Page 47: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/47.jpg)
Conclusion
Analyzed the effect of various distance metric
Used a new metric, HOB Distance for fast P-tree-based computation
Revealed useful properties of P-trees
using P-trees, a fast new method of KNN, called Closed-KNN, giving higher classification accuracy
Designed a new FAST k-clustering algorithm: computing sum, mean, variance from P-tree without scanning databases
![Page 48: K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees Maleq Khan, Qin Ding, William Perrizo; NDSU](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f295503460f94c4345d/html5/thumbnails/48.jpg)
Thank You