Performance Improvement for Bayesian Classification on Spatial Data with P-Trees
Amal S. Perera, Masum H. Serazi, William Perrizo
Dept. of Computer Science, North Dakota State University
Fargo, ND 58105
These notes contain NDSU confidential and proprietary material. Patents pending on the P-Tree technology.
Outline
• Introduction
• P-Tree
• P-Tree Algebra
• Bayesian Classifier
• Calculating Probabilities using P-Trees
• Band-based vs. Bit-based Approach
• Sample Data
• Classification Accuracy
• Classification Time
• Conclusion
Introduction
• Classification is a form of data analysis and data mining that can be used to extract models describing important data classes or to predict future data trends.
• Some data classification techniques are: decision tree induction, Bayesian classification, neural networks, k-nearest neighbor, case-based reasoning, genetic algorithms, rough sets, and fuzzy logic techniques.
• A Bayesian classifier is a statistical classifier, which uses Bayes’ theorem to predict class membership as a conditional probability that a given data sample falls into a particular class.
Introduction Cont..
• The P-Tree data structure allows us to compute the Bayesian probability values efficiently, without resorting to the naïve Bayesian assumption.
• Bayesian classification with P-Trees has been used successfully in precision agriculture, to predict yield from remotely sensed imagery, and in genomics (yeast two-hybrid classification), to place in the ACM KDD-Cup 2002 competition. http://www.biostata.wisc.edu/~craven/kddcup/winners.html
• To completely eliminate the naïve assumption, a bit-based Bayesian classification is used instead of a band-based approach.
P-Tree
• Most spatial data comes in a band format called BSQ (Band Sequential).
• Each BSQ band is divided into several files, one for each bit position of the data values. This format is called ‘bit Sequential’ or bSQ.
• Each bSQ bit file, Bij (the file constructed from the jth bits of the ith band), is converted into a tree structure called a Peano Tree (P-Tree).
• P-Trees represent tabular data in a lossless, compressed, bit-by-bit, recursive, datamining-ready arrangement.
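The BSQ-to-bSQ decomposition can be sketched as follows (a hypothetical helper for illustration, not the authors' implementation):

```python
def to_bsq(band, nbits=8):
    """Split a list of band values into nbits bit files (bSQ format).

    bit_files[j][k] is the j-th most significant bit of the k-th value,
    so bit_files[0] is the file of most significant bits.
    """
    return [[(v >> (nbits - 1 - j)) & 1 for v in band]
            for j in range(nbits)]

# A tiny 4-pixel band of 8-bit reflectance values.
band = [255, 128, 7, 64]
bit_files = to_bsq(band)
print(bit_files[0])   # [1, 1, 0, 0] -- most significant bit file
print(bit_files[7])   # [1, 0, 1, 0] -- least significant bit file
```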
A bSQ file, its raster spatial file and P-Tree
[Figure: an 8x8 bSQ bit file shown as a raster grid and as a P-Tree. The 64-bit file is 1111110011111000111111001111111011111111111111111111111101111111; its P-Tree has root count 55 with level-1 quadrant counts 16, 8, 15, 16.]
Key terms: Peano or Z-ordering; pure (Pure-1/Pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).
P-Tree Algebra
• Logical operators: AND, OR, complement, and others (XOR, etc.)
• Applying these operators, we calculate value P-Trees, interval P-Trees, and slice P-Trees.
Ptree: 55
        ____________/ / \ \___________
       /        ___ /     \___        \
      /        /              \        \
    16     ___8___          __15__     16
          / / | \           / | \ \
         3  0 4  1         4  4 3  4
       //|\    //|\             //|\
       1110    0010             1101

Complement: 9
        ____________/ / \ \___________
       /        ___ /     \___        \
      /        /              \        \
     0     ___8___          ___1__     0
          / / | \           / | \ \
         1  4 0  3         0  0 1  0
       //|\    //|\             //|\
       0001    1101             0010

(The ' mark indicates the COMPLEMENT operation.)
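The tree above can be reproduced with a small sketch (illustrative only; a real implementation would store the tree compactly and never materialize children of pure quadrants):

```python
def ptree(grid):
    """Build a P-Tree from a square bit grid (side a power of 2).

    Quadrants are visited in Peano (Z) order: NW, NE, SW, SE.  Each node
    is (count, children); pure-0/pure-1 quadrants and single bits are leaves.
    """
    n = len(grid)
    count = sum(sum(row) for row in grid)
    if count in (0, n * n) or n == 1:
        return (count, None)
    h = n // 2
    quads = [[row[c:c + h] for row in grid[r:r + h]]
             for r in (0, h) for c in (0, h)]
    return (count, [ptree(q) for q in quads])

def complement(node, size):
    """Complement a P-Tree: a count c over an n-bit quadrant becomes n - c."""
    count, children = node
    if children is None:
        return (size - count, None)
    return (size - count, [complement(c, size // 4) for c in children])

# The 8x8 example from the slides (64-bit bSQ file, root count 55).
bits = ("11111100" "11111000" "11111100" "11111110"
        "11111111" "11111111" "11111111" "01111111")
grid = [[int(b) for b in bits[r * 8:(r + 1) * 8]] for r in range(8)]
root = ptree(grid)
print(root[0], [c[0] for c in root[1]])   # 55 [16, 8, 15, 16]
comp = complement(root, 64)
print(comp[0], [c[0] for c in comp[1]])   # 9 [0, 8, 1, 0]
```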
P-Tree Algebra Cont..
• Basic P-Trees can be combined using logical operations to produce P-Trees for the original values at any level of bit precision. Using 8-bit precision for values, Pb11010011, which counts the number of occurrences of 11010011 in each quadrant, can be constructed from the basic P-Trees as:
Pb11010011 = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8
(The AND operation is simply the pixel-wise AND of the bits.)
• Similarly, any data set in the relational format can be represented as P-Trees. For any combination of values, (v1,v2,…,vn), where vi is from band-i, the quadrant-wise count of occurrences of this combination of values is given by:
P(v1,v2,…,vn) = P1v1 ^ P2v2 ^ … ^ Pnvn
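The value-pattern construction can be sketched on flat bit vectors (a simplification: a real P-Tree AND operates on tree nodes and skips pure quadrants; the helper name is illustrative):

```python
def value_ptree(bit_files, value, nbits=8):
    """Bit vector marking pixels equal to `value`: AND together each
    basic bit file, complementing it wherever the target bit is 0."""
    result = [1] * len(bit_files[0])
    for j in range(nbits):
        bit = (value >> (nbits - 1 - j)) & 1
        result = [r & (b if bit else 1 - b)
                  for r, b in zip(result, bit_files[j])]
    return result

# Three pixels; two carry the slide's example value 11010011.
band = [0b11010011, 0b11010011, 0b01100001]
bit_files = [[(v >> (7 - j)) & 1 for v in band] for j in range(8)]
p = value_ptree(bit_files, 0b11010011)
print(p, sum(p))   # [1, 1, 0] 2  -- the root count RC is 2
```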
Bayesian Classifier
Pr(Ci | X) is the posterior probability
Pr(Ci) is the prior probability
Can find conditional probabilities, Pr(X|Ci).
Classify X with Max Pr(Ci | X)
Since Pr(X) is constant for all classes, it suffices to maximize Pr(X|Ci) * Pr(Ci) instead.
Based on Bayes' Theorem:
Pr(Ci | X) = Pr(X | Ci) * Pr(Ci) / Pr(X)
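A tiny numeric illustration of the decision rule (class names and probabilities are made up):

```python
# Posterior is proportional to Pr(X | Ci) * Pr(Ci); Pr(X) is the same
# for every class, so it can be ignored when picking the maximum.
likelihood = {"high_yield": 0.30, "low_yield": 0.05}   # Pr(X | Ci)
prior      = {"high_yield": 0.40, "low_yield": 0.60}   # Pr(Ci)
score = {c: likelihood[c] * prior[c] for c in prior}
predicted = max(score, key=score.get)
print(predicted)   # high_yield  (0.12 beats 0.03)
```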
Calculating Probabilities Pr(X|Ci)
Using the naïve assumption, Pr(X | Ci) = Pr(X1 | Ci) × Pr(X2 | Ci) × … × Pr(Xn | Ci). Scan the data and calculate Pr(X | Ci) for the given X.
Using P-Trees:
Pr(X|Ci) = # training samples in Ci having pattern X / # samples in class Ci
= RC[ P1(X1) ^ P2(X2) ^ … ^Pn(Xn) ^ PC(Ci) ] / RC[ PC(Ci) ]
Problem: what if RC[ P1(X1) ^ P2(X2) ^ … ^ Pn(Xn) ^ PC(Ci) ] = 0 for all i, i.e., the unclassified pattern does not exist in the training set?
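On flat bit vectors (one bit per training sample), the root-count formula above can be sketched as (illustrative names, not the authors' code):

```python
def pr_x_given_c(attr_ptrees, class_ptree):
    """Pr(X | Ci) = RC[P1(x1) ^ ... ^ Pn(xn) ^ PC(Ci)] / RC[PC(Ci)],
    where the root count RC is simply the number of 1 bits here."""
    match = class_ptree[:]
    for p in attr_ptrees:
        match = [m & b for m, b in zip(match, p)]
    rc_class = sum(class_ptree)
    return sum(match) / rc_class if rc_class else 0.0

# Six training samples; three belong to the class, two of those match X.
class_bits = [1, 1, 1, 0, 0, 0]                  # PC(Ci)
attrs = [[1, 0, 1, 1, 0, 0],                     # P1(x1)
         [1, 1, 1, 0, 1, 0]]                     # P2(x2)
print(pr_x_given_c(attrs, class_bits))   # 0.666... (= 2/3)
```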
Band-based P-Tree Approach
• When RC = 0 for the given pattern in every class:
– Reduce the restrictiveness of the pattern by removing the attribute with the least information gain.
– Recalculate (assuming attribute 2 has the least IG):
Pr( X | Ci ) = RC[ P1X1 ^ P3X3 ^ … ^ PnXn ^ PCCi ] / RC[ PCCi ]
• The information gain is calculated using P-Trees, as a one-time calculation over the entire training data.
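The band-based fallback can be sketched as follows (flat bit vectors again; `ig_order` lists attribute indices with least information gain first, and all names are illustrative):

```python
def band_fallback(attr_ptrees, class_ptree, ig_order):
    """If the full pattern has root count 0, drop the attribute with the
    least information gain and retry, until some training sample matches."""
    active = set(range(len(attr_ptrees)))
    for drop in [None] + list(ig_order):
        if drop is not None:
            active.discard(drop)
        match = class_ptree[:]
        for i in sorted(active):
            match = [m & b for m, b in zip(match, attr_ptrees[i])]
        if sum(match):
            return sum(match) / sum(class_ptree)
    return 0.0

# The exact 2-attribute pattern matches nothing, so attribute 1
# (assumed least IG) is dropped and Pr(X | Ci) is recomputed.
class_bits = [1, 1, 1, 1]
attrs = [[0, 0, 1, 1], [1, 1, 0, 0]]
print(band_fallback(attrs, class_bits, ig_order=[1]))   # 0.5
```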
Bit-based Approach
• Search for similar patterns by removing the least significant bits in the attribute space.
• The order of the bits to be removed is selected by calculating the info gain (IG).
[Figure: panels (a)-(d) showing the 2-attribute (G, R) search space, with each axis divided into 00, 01, 10, 11. The search region grows from a single cell in (a) to progressively larger regions in (b), (c), and (d) as pattern bits are removed.]
E.g., Calculate the Bayesian conditional probability value for the pattern [G,R] = [10,01] in 2-attribute space.
Assume IG for 1st significant bit of R < that of G.
Assume IG for 2nd significant bit of G < that of R.
Initially, search for the pattern, [10,01] (a).
If not found, search for [1_,01], removing the 2nd significant bit of G. The search space increases (b).
If not found, search for [1_,0_], removing the 2nd significant bit of R. The search space increases (c).
If not found, search for [1_,_ _], removing the 1st significant bit of R. The search space increases (d).
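The bit-removal search in (a)-(d) can be sketched as follows (plain tuples stand in for P-Trees; `drop_order` lists (attribute, bit) pairs with least information gain first, bit 0 being the least significant, and all names are illustrative):

```python
def bit_fallback(train, labels, cls, pattern, nbits, drop_order):
    """Widen the search one bit at a time: mask out the next pattern bit
    in increasing-information-gain order until some sample of class
    `cls` matches, then return the matching fraction of that class."""
    masks = [(1 << nbits) - 1] * len(pattern)        # all bits compared
    for d in [None] + list(drop_order):
        if d is not None:
            attr, bit = d
            masks[attr] &= ~(1 << bit)               # stop comparing this bit
        hits = sum(1 for x, y in zip(train, labels)
                   if y == cls and all((xi & m) == (pi & m)
                                       for xi, pi, m in zip(x, pattern, masks)))
        if hits:
            return hits / labels.count(cls)
    return 0.0

# Pattern [G, R] = [10, 01]; the slide's removal order is G's 2nd
# significant bit, then R's 2nd, then R's 1st.
train = [(0b11, 0b00), (0b10, 0b11)]
labels = ["hi", "hi"]
p = bit_fallback(train, labels, "hi", (0b10, 0b01), 2,
                 [(0, 0), (1, 0), (1, 1)])
print(p)   # 0.5
```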
Experiments
• The experimental data was extracted from two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota.
• The images were taken in 1997 and 1998.
• Each image contains 3 bands: red, green and blue reflectance values.
• Three other files contain synchronized soil moisture, nitrate and yield values.
Classification Accuracy
• The accuracy of the proposed bit-based approach is compared with the band-based approach and with KNN using Euclidean distance.
• It is clear that our approach outperforms the others.
[Chart: classification accuracy (%) for '97 data vs. training data size (1K, 4K, 16K, 65K, 260K pixels), comparing Band-Ptree, KNN-Euc., and Bit; the accuracy axis runs from 0 to 90.]
Classification Accuracy Cont..
• The accuracy of the approach was also compared to an existing Bayesian belief network classifier. The classifier is J Cheng's Bayesian Belief Network available at
http://www.cs.ualberta.ca/~jcheng/ .
– This classifier was the winning entry for the KDD Cup 2001 data mining competition. The developer claims that the classifier can perform with or without domain knowledge.
• For the comparison, smaller training data sets ranging from 4K to 16K pixels were used, due to the inability of the belief network implementation to handle larger data sets.
Accuracy:
Training Size (pixels) | Bit-Ptree Based | Bayesian Belief
4000                   | 66 %            | 26 %
16000                  | 67 %            | 51 %
The belief network was built without using any domain knowledge, to make it comparable with the P-Tree approach.
Classification Time
• P-Tree approach requires no build time (lazy classifier).
• In most lazy classifiers, the classification time per tuple varies with the number of items in the training set, because the training data must be scanned.
• P-Tree approach does not require a traditional data scan.
• The data in the figure was collected using 5 significant bits and a threshold probability of 0.85.
• The time is given for scalability comparisons.
[Chart: variation of classification time with training size for the bit-P-tree algorithm; training sample size (pixels) on the x-axis, classification time on the y-axis, both running from 0 to 300.]
Conclusion
• Naïve assumption reduces the accuracy of the classification in this particular application domain.
• Our approach increases the accuracy of a P-Tree Bayesian classifier by completely eliminating the naïve assumption.
– The new approach has better accuracy than the existing P-Tree-based Bayesian classifier.
– It was also shown to be better than a Bayesian belief network implementation and a Euclidean-distance-based KNN approach.
• It has the same computational cost with respect to the use of P-Tree operations as the previous P-tree approach, and is scalable with respect to the size of the data set.