A Fast and Scalable Nearest Neighbor Based Classification
Taufik Abidin and William Perrizo
Department of Computer Science, North Dakota State University


Page 1

A Fast and Scalable Nearest Neighbor Based Classification

Taufik Abidin and William Perrizo
Department of Computer Science, North Dakota State University

Page 2

Given a (large) TRAINING SET, R(A1,…,An, C), with C=CLASSES and {A1…An}=FEATURES

Classification is: labeling unclassified objects based on the training set

kNN classification goes as follows:

Classification

1. Search the Training Set for the k-Nearest Neighbors of the Unclassified Object.
2. Vote the class among those neighbors.
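To make the two steps concrete, here is a minimal sketch of plain horizontal kNN (illustrative only, not the P-tree method developed in these slides; the function names and toy data are made up): scan the training set once, keep the k closest records, and vote the class.

```python
import numpy as np

def knn_classify(training_X, training_y, sample, k=5):
    """Plain horizontal kNN: one scan of the training set, then a majority vote."""
    # Squared Euclidean distance from the unclassified sample to every training object
    dists = np.sum((training_X - sample) ** 2, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Vote the class: the most frequent label among the k neighbors wins
    labels, counts = np.unique(training_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny usage example with made-up data
X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y = np.array(["A", "A", "B", "B"])
print(knn_classify(X, y, np.array([1.1, 1.0]), k=3))   # -> "A"
```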

Page 3

Machine Learning usually begins by identifying Near Neighbor Set(s), NNS.

In Isotropic Clustering, one identifies round sets (disk shaped NNSs about a center).

In Density Clustering, one identifies cores (dense round NNSs) then pieces them together.

In any Classification based on continuity,

we classify a sample based on its NNS class histogram (aka kNN), or

we identify isotropic NNSs of centroids (k-means), or

we build decision trees with training leaf-sets and use them to classify samples that fall to that leaf, or

we find class boundaries (e.g., SVM) which distinguish NNSs in one class from NNSs in another.

The basic definition of continuity from elementary calculus proves NNSs are fundamental:

∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε; or, ∀ NNS about f(a), ∃ a NNS about a that maps inside it.

So NNS Search is a fundamental problem to be solved. We discuss NNS Search from a vertical data point of view. With vertically structured data, the only neighborhoods that are easily determined are the cubic or Max neighborhoods (L∞ disks), yet usually we want Euclidean disks. We develop techniques to circumscribe Euclidean disks using intersections of contour sets; the main ones are coordinate projection contours, whose intersections form L∞ disks. (A small sketch of this circumscription follows at the end of this page.)

Database analysis can be broken down into 2 areas, Querying and Data Mining.

Data Mining can be broken down into 2 areas, Machine Learning and Assoc. Rule Mining

Machine Learning can be broken down into 2 areas, Clustering and Classification.

Clustering can be broken down into 2 areas, Isotropic (round clusters) and Density-based
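Here is the circumscription sketch mentioned above, under the assumption of ordinary Euclidean and L∞ (Max) metrics on plain arrays (my own illustration, not the authors' P-tree code): intersect the per-coordinate interval contours to get the L∞ disk about a, a superset of the Euclidean ε-disk, and scan only that superset.

```python
import numpy as np

def euclidean_disk_via_linf(X, a, eps):
    """Circumscribe the Euclidean eps-disk about a with the L-infinity (cubic) disk,
    then scan only the circumscribed candidates for the exact Euclidean disk."""
    # Coordinate projection contours: |x_d - a_d| <= eps for every dimension d.
    # Their intersection is the L-infinity disk, a superset of the Euclidean disk.
    linf_mask = np.all(np.abs(X - a) <= eps, axis=1)
    candidates = np.where(linf_mask)[0]
    # Final (much smaller) scan applies the exact Euclidean test.
    keep = np.sum((X[candidates] - a) ** 2, axis=1) <= eps ** 2
    return candidates[keep]

X = np.random.rand(1000, 3)
print(euclidean_disk_via_linf(X, np.array([0.5, 0.5, 0.5]), 0.1))
```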

Page 4

SOME useful NNSs

Given a similarity, s: R×R → Reals (e.g., s(x,y) = s(y,x) and s(x,x) ≥ s(x,y) ∀x,y∈R), an extension of s to disjoint subsets of R (e.g., single/complete/average link...), and C ⊆ R, an r-disk about C is:

disk(C,r) ≡ {x∈R | s(x,C) ≥ r},

skin(C,r) ≡ disk(C,r) - C

ring(C,r2,r1) ≡ disk(C,r2) - disk(C,r1) = skin(C,r2) - skin(C,r1).

Given a [pseudo] distance, d, rather than a similarity, just reverse all the inequalities.

For C = {a}:

(Diagrams: nested disks and rings of radii r1 and r2 about the single point a, and about a set C.)

disk(C,k) ⊇ C is defined by: |disk(C,k) ∩ C'| = k and s(x,C) ≥ s(y,C) ∀x∈disk(C,k), y∉disk(C,k). Define its

skin(C,k) ≡ disk(C,k) - C ("skin" stands for "s k immediate neighbors"; it is a kNNS of C),

cskin(C,k) ≡ the union of all skin(C,k)'s (the closed skin), and

ring(C,k) = cskin(C,k) - cskin(C,k-1).
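To pin the definitions down, a small sketch for the single-point case C = {a}, using a Euclidean pseudo-distance d (so the inequalities are reversed, as noted above); the function names are illustrative:

```python
import numpy as np

def disk(X, a, r):
    """disk({a}, r) = {x in R | d(x, a) <= r} under Euclidean distance d."""
    return set(np.where(np.linalg.norm(X - a, axis=1) <= r)[0])

def skin(X, a, r, C_idx):
    """skin(C, r) = disk(C, r) - C (here C = {a}, given by its row index)."""
    return disk(X, a, r) - set(C_idx)

def ring(X, a, r2, r1, C_idx):
    """ring(C, r2, r1) = disk(C, r2) - disk(C, r1) = skin(C, r2) - skin(C, r1)."""
    return skin(X, a, r2, C_idx) - skin(X, a, r1, C_idx)

X = np.random.rand(200, 2)
a_idx = 0
print(ring(X, X[a_idx], r2=0.3, r1=0.1, C_idx=[a_idx]))
```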

Page 5

A definition of Predicate trees (P-trees) based on functionals?

Given f: R(A1..An) → Y and S ⊆ Y, define the uncompressed Functional-P-tree as

Pf,S ≡ a bit map given by Pf,S(x) = 1 iff f(x) ∈ S.

The predicate for Pf,S is the set containment predicate, f(x) ∈ S.

Pf,S is a Contour bit map (it bitmaps, rather than lists, the contour points).

If f is a local density (a la OPTICS) and {Sk} is a partition of Y, then {f⁻¹(Sk)} is a clustering! What partition {Sk} of Y should be used? (a binary partition, given by a threshold value?)

In OPTICS the Sk's are the intervals between crossing points of graph(f) and a threshold line; points below the threshold line are agglomerated into one noise cluster. Weather reporters use equi-width interval partitions (of barometric pressure or temperature).
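A minimal sketch of the uncompressed Functional-P-tree defined above: Pf,S is simply the bitmap of the contour f⁻¹(S). The functional f and the cell S used below are illustrative choices, not ones from the slides.

```python
import numpy as np

def functional_ptree(R, f, S_test):
    """Uncompressed Functional-P-tree P_{f,S}: bit x is 1 iff f(x) is in S.
    R is the relation as a 2-D array (one row per tuple); S_test decides membership in S."""
    return np.array([1 if S_test(f(x)) else 0 for x in R], dtype=np.uint8)

R = np.random.rand(16, 2)
# Example functional f(x) = x_1 + x_2, and example cell S = [0.8, 1.2) of a threshold-style partition
P = functional_ptree(R, f=lambda x: x[0] + x[1], S_test=lambda y: 0.8 <= y < 1.2)
print(P, "contour size (root count):", int(P.sum()))
```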

Page 6

(ls)Pf,S is a compression of Pf,S obtained by doing the following:

1. order or walk R (converts the bit map to a bit vector)

2. equi-width partition R into segments of size ls (ls = leafsize; the last one can be short)

3. eliminate, and mask to 0, all pure-zero segments (via a Leaf Mask or LM)

4. eliminate, and mask to 1, all pure-one segments (via a Pure1 Mask or PM)

Notes:
1. LM is an existential aggregation of R (1 iff that leaf has a 1-bit). Others? (default = existential)
2. There are partitionings other than equi-width (but equi-width will be the default).

(A small code sketch of this compression appears at the end of this page.)

Compressed Functional-P-trees (with equi-width leaf size, ls)

Doubly Compressed Functional-P-trees with equi-width leaf sizes, (ls1,ls2)

Each leaf of (ls)Pf,S is an uncompressed bit vector and can be compressed the same way:

(ls1,ls2)Pf,S (ls2 is the 2nd equi-width segment size, and ls2 << ls1)

Recursive compression can continue ad infinitum: (ls1,ls2,ls3)Pf,S, (ls1,ls2,ls3,ls4)Pf,S, ...
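Here is the promised sketch of one level of this compression (under the stated defaults: equi-width leaves and an existential Leaf Mask); leaves are kept as plain lists purely for illustration.

```python
def compress_ptree(bits, ls):
    """One level of P-tree compression with equi-width leaf size ls.
    Returns (LM, PM, mixed_leaves): LM[i]=1 iff leaf i contains a 1-bit (existential),
    PM[i]=1 iff leaf i is pure-1; only mixed leaves are stored explicitly."""
    leaves = [bits[i:i + ls] for i in range(0, len(bits), ls)]  # last leaf may be short
    LM = [1 if any(leaf) else 0 for leaf in leaves]             # pure-0 leaves masked to 0
    PM = [1 if all(leaf) else 0 for leaf in leaves]             # pure-1 leaves masked to 1
    mixed = {i: leaf for i, leaf in enumerate(leaves) if LM[i] and not PM[i]}
    return LM, PM, mixed

bits = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
LM, PM, mixed = compress_ptree(bits, ls=4)
print(LM, PM, mixed)   # LM=[0,1,1,0], PM=[0,0,1,0], mixed={1: [1,0,1,1]}
```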

Page 7

For Ai Real and fi,j(x) ≡ the jth bit of the ith component, xi:

{ (*)Pfi,j,{1} ≡ (*)Pi,j } j=b..0 are the basic (*)P-trees of Ai  (* = ls1,...,lsk, k=0,...).

For Ai Categorical, and fi,a(x) = 1 if xi = a ∈ R[Ai], else 0:

{ (*)Pfi,a,{1} ≡ (*)Pi,a } a∈R[Ai] are the basic (*)P-trees of Ai.

For Ai real, the basic P-trees result from the binary encoding of the individual real numbers (or categories). Encodings can be used for any attribute. Note that it is the binary encoding of real attributes which turns an n-tuple scan into a log2(n)-column AND (making P-tree technology scalable).

BASIC P-trees
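A small sketch of deriving the basic P-trees of a numeric attribute by binary encoding: bit slice j of attribute Ai becomes the bit vector Pi,j. The 3-bit width is an assumption for the example; the sample column is the A1 column of the Page 9 example.

```python
import numpy as np

def basic_ptrees(column, nbits):
    """Basic P-trees of one numeric attribute: P[j] is the bit vector of bit j
    (j = nbits-1 .. 0, high-order first) over all tuples of the column."""
    column = np.asarray(column, dtype=np.int64)
    return {j: ((column >> j) & 1).astype(np.uint8) for j in range(nbits - 1, -1, -1)}

A1 = [2, 3, 2, 2, 5, 2, 7, 7]          # the A1 column (010, 011, 010, ...) of the Page 9 example
P1 = basic_ptrees(A1, nbits=3)
print(P1[2], P1[1], P1[0])             # P11, P12, P13 as bit vectors; P11 = 00001011
```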

Page 8

Problems with kNN

Finding the k-Nearest Neighbor Set from horizontally structured data (record-oriented data) can be expensive for a large training set (containing millions or trillions of tuples):
– linear in the size of the training set (1 scan)
– Closed kNN is much more accurate but requires 2 scans

Vertically structuring the data can help.

Page 9

Vertical Predicate-tree (P-tree) structuring: vertically partition the table; compress each vertical bit slice into a basic P-tree; process P-trees using multi-operand logical ANDs.

A data table, R(A1..An), containing horizontal structures (records) is processed vertically (vertical scans). For example:

R(A1 A2 A3 A4)

R[A1] R[A2] R[A3] R[A4]
 010   111   110   001
 011   111   110   000
 010   110   101   001
 010   111   101   111
 101   010   001   100
 010   010   001   101
 111   000   001   100
 111   000   001   100

The bit slices (columns) R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 of this table feed the basic P-trees P11 ... P43; e.g., R11 = 00001011.

The basic (1-D) P-tree for R11 is built by recording the truth of the predicate "pure 1" recursively on halves, until purity is reached:
1. Whole file is not pure1: 0
2. 1st half is not pure1: 0
3. 2nd half is not pure1: 0
4. 1st half of 2nd half is not pure0, so expand it
5. 2nd half of 2nd half is pure1: 1
6. 1st half of the 1st quarter of the 2nd half is 1
7. 2nd half of the 1st quarter of the 2nd half is 0

E.g., to count occurrences of the tuple 111 000 001 100, use the predicate "pure 111000001100":

P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43

Accumulating the root count level by level: 0 at the 2³ level, 0 at the 2² level, and one pure1 node at the 2¹ level, giving a root count of 2; the remaining branch is pure (pure0), so it ends.
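The same count can be sketched with uncompressed bit vectors (the real structure ANDs compressed P-trees level by level, which this illustration does not do): AND the basic P-trees, complementing each slice where the target bit is 0, and take the root count.

```python
import numpy as np

# R(A1 A2 A3 A4) from this page, one 3-bit value per attribute
R = [(0b010, 0b111, 0b110, 0b001), (0b011, 0b111, 0b110, 0b000),
     (0b010, 0b110, 0b101, 0b001), (0b010, 0b111, 0b101, 0b111),
     (0b101, 0b010, 0b001, 0b100), (0b010, 0b010, 0b001, 0b101),
     (0b111, 0b000, 0b001, 0b100), (0b111, 0b000, 0b001, 0b100)]

def bit_slice(R, attr, bit):
    """Basic P-tree (as an uncompressed bit vector) for bit `bit` of attribute `attr`."""
    return np.array([(row[attr] >> bit) & 1 for row in R], dtype=np.uint8)

def count_tuple(R, target, nbits=3):
    """Root count of the AND of basic P-trees matching `target` bit for bit
    (use the complement P' wherever the target bit is 0)."""
    mask = np.ones(len(R), dtype=np.uint8)
    for attr, value in enumerate(target):
        for bit in range(nbits):
            p = bit_slice(R, attr, bit)
            mask &= p if (value >> bit) & 1 else (1 - p)
    return int(mask.sum())

print(count_tuple(R, (0b111, 0b000, 0b001, 0b100)))   # -> 2, matching the slide's root count
```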

Page 10

Total Variation

The Total Variation of a set X about a, TV(a), is the sum of the squared separations of the objects in X about a, defined as follows: TV(a) = Σ_{x∈X} (x-a)∘(x-a)

We will use the concept of functional contours (in particular, the TV contours) in this presentation to identify a well-pruned, small superset of the Nearest Neighbor Set of an unclassified sample (which can then be efficiently scanned)

First we will discuss functional contours in general then consider the specific TV contours.

Page 11

Given f: R(A1..An) → Y and S ⊆ Y, define contour(f,S) ≡ f⁻¹(S).

From the derived attribute point of view, Contour(f,S) = SELECT A1..An FROM R* WHERE R*.Af ∈ S.

If S = {a}, then f⁻¹({a}) is Isobar(f, a).

There is a DUALITY between functions, f: R(A1..An) → Y, and derived attributes, Af of R, given by x.Af ≡ f(x), where Dom(Af) = Y.

(Diagram: the duality between R(A1..An), where x = (x1,...,xn) is mapped by f to f(x) ∈ Y, and R* = R extended with the derived attribute Af holding f(x). graph(f) = { (a1,...,an, f(a1,...,an)) | (a1,...,an) ∈ R }; contour(f,S) is the region of A1..An space that f maps into S ⊆ Y.)

Page 12

TV(a) = Σ_{x∈R} (x-a)∘(x-a). Using d as an index variable over the dimensions and i, j, k as bit-slice indexes:

TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² - 2 a_d x_d + a_d²)

      = Σ_x Σ_d (Σ_k 2^k x_{d,k})² - 2 Σ_x Σ_d a_d (Σ_k 2^k x_{d,k}) + |R| |a|²

      = Σ_x Σ_d (Σ_i 2^i x_{d,i})(Σ_j 2^j x_{d,j}) - 2 Σ_x Σ_d a_d (Σ_k 2^k x_{d,k}) + |R| |a|²

      = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} - 2 Σ_d a_d Σ_{x,k} 2^k x_{d,k} + |R| |a|²

so, in terms of P-tree root counts,

TV(a) = Σ_{d,i,j} 2^{i+j} |P_{d,i} ∧ P_{d,j}| - Σ_{d,k} 2^{k+1} a_d |P_{d,k}| + |R| |a|²

Equivalently, writing μ for the mean vector of R,

TV(a) = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} + |R| ( -2 Σ_d a_d μ_d + Σ_d a_d a_d )

      = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} - 2 |R| Σ_d a_d μ_d + |R| Σ_d a_d a_d

The first term does not depend upon a. Thus the simpler derived attribute, TV(a) - TV(μ) (which does not have that first term at all), has contours identical to those of TV (just a lowered graph). We also find it useful to post-compose a log to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).
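The root-count identity above can be checked numerically: compute TV(a) directly, and again from nothing but the counts |P_{d,i} ∧ P_{d,j}| and |P_{d,k}| of the basic bit slices. A small illustrative sketch (not the authors' code):

```python
import numpy as np

def tv_direct(X, a):
    """TV(a) = sum over x in R of (x - a) . (x - a)."""
    return float(np.sum((X - a) ** 2))

def tv_from_ptrees(X, a, nbits):
    """TV(a) = sum_{d,i,j} 2^(i+j) |P_di ^ P_dj| - sum_{d,k} 2^(k+1) a_d |P_dk| + |R| |a|^2."""
    n_rows, n_dims = X.shape
    # Basic P-trees: slices[d][k] is the bit-k vector of dimension d
    slices = [[((X[:, d].astype(int) >> k) & 1) for k in range(nbits)] for d in range(n_dims)]
    term1 = sum(2 ** (i + j) * int(np.sum(slices[d][i] & slices[d][j]))
                for d in range(n_dims) for i in range(nbits) for j in range(nbits))
    term2 = sum(2 ** (k + 1) * a[d] * int(np.sum(slices[d][k]))
                for d in range(n_dims) for k in range(nbits))
    return term1 - term2 + n_rows * float(np.dot(a, a))

X = np.random.randint(0, 8, size=(100, 4))     # 3-bit integer data, 4 dimensions
a = np.array([3.0, 1.0, 6.0, 2.0])
print(tv_direct(X, a), tv_from_ptrees(X, a, nbits=3))   # the two values agree
```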

Page 13

From equation 7, TV(a) = Σ_{x,d,i,j} 2^{i+j} x_{d,i} x_{d,j} + |R| ( -2 Σ_d a_d μ_d + Σ_d a_d a_d ), so

f(a) ≡ TV(a) - TV(μ) = |R| ( Σ_d (a_d a_d - μ_d μ_d) - 2 Σ_d (a_d μ_d - μ_d μ_d) )

     = |R| ( Σ_d a_d² - 2 Σ_d μ_d a_d + Σ_d μ_d² )

     = |R| |a-μ|²,  and f(μ) = 0.

Letting g(a) ≡ HDTV(a) = ln( f(a) ) = ln|R| + ln|a-μ|².

Taking the partial derivative, ∂g/∂a_d (a) = 2(a-μ)_d / |a-μ|², so the gradient of g at a is ∇g(a) = (2 / |a-μ|²) (a-μ).

The length of ∇g(a) depends only on the length of a-μ, so isobars are hyper-circles centered at μ. The graph of g is a log-shaped hyper-funnel.

For an ε-contour ring (radius ε about a), go inward and outward along a-μ by ε to the points:

inner point, b = μ + (1 - ε/|a-μ|)(a-μ), and

outer point, c = μ + (1 + ε/|a-μ|)(a-μ).

Then take g(b) and g(c) as lower and upper endpoints of a vertical interval. Then we use EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).

(Diagram: the hyper-funnel graph of g(x) over x1, x2, with the interval [g(b), g(c)] capturing the ε-contour about a.)

Page 14

If more pruning is needed (i.e., the HDTV(a) contour is still too big to scan), use a dimension projection contour (the Dim-i projection P-trees are already computed: they are the basic P-trees of R.Ai). Form that contour_mask_P-tree; AND it with the HDTV contour P-tree. The result is a mask for the intersection.

(Diagram: the ε-contour of radius ε about a, with HDTV(b) and HDTV(c) as the interval endpoints at b and c.)

As pre-processing, calculate basic P-trees for the HDTV derived attribute. To classify a:
1. Calculate b and c (which depend upon a and ε).
2. Form the mask P-tree for training points with HDTV-values in [HDTV(b), HDTV(c)]. (Note: when the camera-ready paper was submitted we were still doing this step by sorting TV(a) values. Now we use the contour approach, which speeds up this step considerably. The performance evaluation graphs in this paper are still based on the old method, however.)
3. Use that P-tree to prune out the candidate NNS.
4. If the root count of the candidate set is small enough, proceed to scan and assign class votes using, e.g., a Gaussian vote function; else prune further using a dimension projection.
(A runnable sketch of these steps appears at the end of this page.)

(Diagram: the contour of the dimension projection f(a) = a1 over x1, x2, intersected with the HDTV(x) contour.)


We can also note that HDTV can be further simplified (retaining the same contour structure) by using h(a) = |a-μ|. Since we create the derived attribute by scanning the training set anyway, why not just use this very simple function? Then other functionals leap to mind, e.g., hb(a) = |a-b|.
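Here is the sketch referred to in the numbered steps above: a plain-array illustration of the pruning pipeline in which the HDTV interval test and the dimension-projection test are done with boolean masks rather than EIN formulas on P-trees, and where the Gaussian vote, the ε value, and the candidate-count threshold are assumptions made for the example.

```python
import numpy as np

def classify_with_contour_pruning(X, y, a, eps, k=5, max_candidates=1000):
    """Steps 1-4 above, sketched with boolean masks instead of P-tree masks."""
    mu = X.mean(axis=0)                               # the mean vector mu
    dist = np.linalg.norm(a - mu)
    # 1. inner and outer points b and c along a - mu
    b = mu + (1 - eps / dist) * (a - mu)
    c = mu + (1 + eps / dist) * (a - mu)
    # 2. HDTV(x) = ln|R| + ln|x - mu|^2; keep points whose HDTV lies in [HDTV(b), HDTV(c)]
    def hdtv(x):
        return np.log(len(X)) + np.log(np.sum((x - mu) ** 2, axis=-1))
    mask = (hdtv(X) >= hdtv(b)) & (hdtv(X) <= hdtv(c))
    # 4. if the candidate set is still too big, prune further with a dimension projection contour
    if mask.sum() > max_candidates:
        mask &= np.abs(X[:, 0] - a[0]) <= eps         # contour of the projection f(x) = x_1
    cand = np.where(mask)[0]
    if len(cand) == 0:
        return None                                   # nothing in the contour; widen eps
    # 3./4. final scan of the pruned candidates: Gaussian-weighted class votes of the k nearest
    d2 = np.sum((X[cand] - a) ** 2, axis=1)
    order = np.argsort(d2)[:k]
    votes = {}
    for idx, dd in zip(cand[order], d2[order]):
        votes[y[idx]] = votes.get(y[idx], 0.0) + np.exp(-dd / (2 * eps ** 2))
    return max(votes, key=votes.get)

X = np.random.rand(50000, 4)
y = np.random.randint(0, 3, size=50000)
print(classify_with_contour_pruning(X, y, a=np.random.rand(4), eps=0.05))
```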

Page 15

Graphs

(Plots over the X-Y plane of the functionals TV (with TV(x15) and TV(μ) = TV(x33) marked), TV - TV(μ) (with TV(x15) - TV(μ) marked), HDTV, h(a) = |a-μ|, and hb(a) = |a-b| about a point b.)

Page 16

A principle: A job is not done until the Mathematics is completed (and, of course, until all the paper work is also completed). The Mathematics of a research project always includes:
1. proofs of killer-ness,
2. simplifications (everything is simple once fully understood),
3. generalizations (to the widest possible application scope), and
4. insights (teasing out the main issues and underlying mega-truths with full drill down).
Therefore, we need to ask the following questions at this point:

Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the point closest to the mean is also influenced by skewness, just like the mean)?

We will denote the vector of medians as μ̃.

h(a) = |a-μ̃| is an important functional (better than h(a) = |a-μ|?). If we compute the median of an even number of values as the count-weighted average of the middle two values, then in binary columns, μ and μ̃ coincide.

What about the vector of standard deviations, σ? (computable with P-trees!) Do we have an improvement of BIRCH here, generating similar comprehensive statistical measures, but much faster and more focused?

We can do the same for any rank statistic (or order statistic), e.g., vector of 1st or 3rd quartiles, Q1 or Q3 ; the vector of kth rank values (kth ordinal values).

If we preprocessed to get the basic P-trees of μ̃ and of each mixed quartile vector (e.g., in 2-D add 5 new derived attributes: μ̃, Q1,1, Q1,2, Q2,1, Q2,2, where Qi,j is the ith quartile of the jth column), what does this tell us (e.g., what can we conclude about the location of core clusters)? Maybe all we really need is the basic P-trees of the column quartiles, Q1, ..., Qn?
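These rank statistics are easy to state as column-wise computations; a small plain-numpy sketch just to pin down the quantities being discussed (this is not a P-tree computation):

```python
import numpy as np

X = np.random.rand(10000, 2)                 # a 2-D training set

mu       = X.mean(axis=0)                    # vector of means
mu_tilde = np.median(X, axis=0)              # vector of medians (middle-two average for even counts)
sigma    = X.std(axis=0)                     # vector of standard deviations
Q1       = np.quantile(X, 0.25, axis=0)      # vector of 1st quartiles per column
Q3       = np.quantile(X, 0.75, axis=0)      # vector of 3rd quartiles per column

# The functionals being compared: h(a) = |a - mu_tilde| vs. h(a) = |a - mu|
def h_median(a): return np.linalg.norm(a - mu_tilde)
def h_mean(a):   return np.linalg.norm(a - mu)

a = np.array([0.9, 0.1])
print(mu, mu_tilde, sigma, Q1, Q3, h_median(a), h_mean(a))
```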

Page 17

Additional Mathematics to enjoy:

Study the Vector Ordinal Disks (VODs) as alternatives to distance and ordinal disks (kNN disks), where VOD(a,k) = {x | xd is one of the [closed] k-Nearest Neighbors of ad for every column, d}. Are they easy to compute from P-trees? Do they offer advantages? When? What? Why?
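A small sketch of what this definition asks for, on plain arrays (whether and how VODs can be computed efficiently from P-trees is exactly the open question posed above):

```python
import numpy as np

def vod(X, a, k):
    """Vector Ordinal Disk: x is in VOD(a, k) iff, for every column d,
    x_d is among the k nearest values to a_d in that column (closed: ties kept)."""
    n, dims = X.shape
    mask = np.ones(n, dtype=bool)
    for d in range(dims):
        gaps = np.abs(X[:, d] - a[d])
        kth = np.sort(gaps)[k - 1]            # k-th smallest per-column gap
        mask &= gaps <= kth                   # keep everything tied with the k-th gap
    return np.where(mask)[0]

X = np.random.rand(1000, 3)
print(vod(X, X[0], k=25))
```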

Page 18

Dataset

1. KDDCUP-99 Dataset (Network Intrusion Dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each contains >10,000 records
– Class distribution:

  Normal        972,780
  IP sweep       12,481
  Neptune     1,072,017
  Port sweep     10,413
  Satan          15,892
  Smurf       2,807,886

– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated):
  - 10,000 records (SS-I)
  - 100,000 records (SS-II)
  - 1,000,000 records (SS-III)
  - 2,000,000 records (SS-IV)

Page 19

Speed (Scalability) Comparison (k=5, hs=25)

Running time in seconds vs. training-set cardinality (x 1000):

Algorithm     10     100    1000    2000    4891
SMART-TV      0.14   0.33    2.01    3.88    9.27
P-KNN         0.89   1.06    3.94   12.44   30.79
KNN           0.39   2.34   23.47   49.28     NA

Speed and Scalability

(Chart: Running Time Against Varying Cardinality; Time in Seconds vs. Training Set Cardinality (x1000) for SMART-TV, PKNN, and KNN.)

Machine used:

Intel Pentium 4 CPU 2.6 GHz machine, 3.8GB RAM, running Red Hat Linux

Page 20

Dataset (Cont.)

2. OPTICS dataset– 8,000 points, 8 classes (CL-1, CL-2,…,CL-8) – 2 numerical attributes

– Training set: 7,920 points – Testing set: 80 points, 10 per class

(Scatter plot of the OPTICS dataset showing the eight clusters CL-1 through CL-8.)

Page 21

3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class

Dataset (Cont.)

Page 22

Overall Classification Accuracy Comparison

(Chart: Comparison of the Algorithms' Overall Classification Accuracy; average F-score on each dataset (IRIS, OPTICS, SS-I, SS-II, SS-III, SS-IV, NI) for SMART-TV, PKNN, and KNN.)

Dataset    SMART-TV   PKNN   KNN
IRIS          0.97    0.71   0.97
OPTICS        0.96    0.99   0.97
SS-I          0.96    0.72   0.89
SS-II         0.92    0.91   0.97
SS-III        0.94    0.91   0.96
SS-IV         0.92    0.91   0.97
NI            0.93    0.91   NA

Overall Accuracy

Page 23

A nearest-neighbor-based classification algorithm that starts its classification steps by approximating the Nearest Neighbor Set.

The total variation functional is used to prune down the NNS candidate set.

It finishes classification in the traditional way. The algorithm is fast, and it scales well to very large datasets. The classification accuracy is very comparable to that of Closed kNN (which is better than kNN).

Summary