JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, ISSN 2043-9091, VOLUME 14, ISSUE 1, JULY 2012
http://www.journalcse.co.uk
A New Decision Tree Learning Approach for Novel Class Detection in Concept Drifting Data Stream Classification

Amit Biswas, Dewan Md. Farid and Chowdhury Mofizur Rahman
Abstract—Novel class detection in concept drifting data stream classification addresses learning problems where the data distribution changes over time, as in weather prediction, economic modeling, astronomy, and intrusion detection. A novel class arrives under concept drift when new data introduce new concept classes or remove old ones. Existing data mining classifiers cannot detect and classify novel class instances until the classifiers are trained with labeled instances of the novel class. In this paper, we propose a new approach for detecting novel classes in concept drifting data stream classification using a decision tree classifier that can determine whether a new data instance belongs to a novel class. The proposed approach builds a decision tree from training data points and continuously updates it with recent data points, so that the tree represents the most recent concept in the data stream. Experimental analysis on benchmark datasets from the UCI machine learning repository shows that the proposed approach can detect novel classes in concept drifting data stream classification problems.
Index Terms—Concept Drifting, Data Stream Classification, Decision Tree, Novel Class.
1 INTRODUCTION
Data stream classification is the process of extracting knowledge and information from continuous data instances. A data stream is an ordered sequence of data points that include attribute values and class values. The goal of data mining classifiers is to predict the class value of a new or unseen instance whose attribute values are known but whose class value is unknown. Existing data mining classifiers (or classification models) are trained on instances of a dataset with a fixed number of class values, but in real-world data stream classification problems a new data instance with a new class value may appear, and the classification model will misclassify the new instance. Most existing data mining classifiers cannot detect and classify novel class instances until the classifiers are trained with labeled instances of the novel class. In real-life data stream mining problems the data distributions change over time, as in weather prediction, astronomy, and intrusion detection.
Novel class detection in concept drifting data stream mining causes problems because classification models become less accurate as time passes. Concept drift means that the statistical properties of the target class, which the data mining classifiers are trying to classify, change over time in unforeseen ways. Novel class detection in concept drifting data stream classification refers to a change in the data stream when the underlying concept of the data changes over time. Recently, research on novel class detection in concept drifting data stream classification has received much attention from intelligent computational researchers [1], [2], [3]. Data mining classifiers should update continuously so that they reflect the most recent concept in the data stream. Data stream classifiers are divided into two categories: single model and ensemble model. Single-model approaches incrementally update a single classifier and effectively respond to concept drift [9], [13]. Ensemble-model approaches, on the other hand, use a combination of classifiers, combining a series of classifiers with the aim of creating an improved composite model, and also handle concept drift efficiently [1], [5], [10], [12].
In this paper, we provide a solution for handling the novel class detection problem using a decision tree. Our approach builds a decision tree from the data stream and continuously updates it with new data points, so that the latest tree represents the most recent concept in the data stream. We calculate a threshold value for each leaf node based on the ratio of the percentage of data points classified by that leaf node to the data points in the training dataset, and also cluster the data points of the training dataset based on the similarity of attribute values. If the number of data points classified by a leaf node of the tree exceeds the previously calculated threshold value, a novel class may have arrived. We then compare the new data point with existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from the existing data points and the new data point does not belong to any cluster, this confirms that a novel class has arrived. We then add the new data point into the training dataset and rebuild the decision tree.

————————————————
• Amit Biswas is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
• Dewan Md. Farid is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
• Chowdhury Mofizur Rahman is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

© 2012 JCSE www.Journalcse.co.uk

We organize this paper as follows. Section 2 discusses
related work. Section 3 provides an overview of learning algorithms. Our approach is introduced in section 4. Section 5 discusses the datasets and experimental analysis. Finally, conclusions and future work are drawn in section 6.
2 RELATED WORK
Novelty detection and data stream classification, where data distributions inherently change over time, have received much attention from intelligent computational researchers in many practical real-world applications, such as spam filtering, climate change, and intrusion detection. In 2011, Masud et al. proposed a novelty detection and data stream classification technique that integrates a novel class detection mechanism into traditional mining classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive [1]. In order to determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. In the same year, R. Elwell and R. Polikar introduced an ensemble-of-classifiers-based approach named Learn++.NSE for incremental learning of concept drift, characterized by nonstationary environments [2]. Learn++.NSE trains one new classifier for each batch of data it receives and combines these classifiers using dynamically weighted majority voting. The novelty of the approach lies in determining the voting weights, based on each classifier's time-adjusted accuracy on current and past environments.
In 2007, Kolter and Maloof proposed an ensemble approach for concept drifting data stream classification that dynamically creates and removes weighted experts in response to changes in performance using dynamic weighted majority (DWM) [5]. It trains online learners of the ensemble and adds or removes experts based on the global performance of the ensemble. In 2006, Gaber and Yu [8] proposed a novel class detection approach termed STREAM-DETECT to identify changes in data streams, which detects changes by measuring online clustering result deviation over time. In 2005, Yang et al. [9] proposed an approach that incorporates proactive and reactive predictions. In proactive mode, it anticipates what the new concept will be if a future concept change takes place, and prepares prediction strategies in advance. If the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change. If not, it promptly resorts to a reactive mode: adapting a prediction model to the new data. Widmer and Kubat presented a single classifier named FLORA, which uses a sliding window to choose a block of new instances to train a new classifier [14]. FLORA has a built-in forgetting mechanism with the implicit assumption that instances falling outside the window are no longer relevant, and the information carried by them can be forgotten.
3 LEARNING ALGORITHMS
Data mining is the process of finding hidden information and patterns in a huge database. Data mining algorithms have two major functions: classification and clustering. Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Classification creates a function from training data. Clustering, on the other hand, is similar to classification except that the groups are not predefined, but rather defined by the data alone. It is alternatively referred to as unsupervised learning.
3.1 Decision Tree Learning
Decision tree (DT) learning is a very popular mining tool for classification and prediction. It is easy to implement and requires little prior knowledge. A DT can be built from a large dataset with many attributes. In a DT, the successive division of the set of training instances proceeds until each subset consists of instances of a single class. There are 3 main components in a DT: nodes, leaves, and edges. Each node is labeled with an attribute by which the data is to be partitioned. Each node has a number of edges, which are labeled according to possible values of the attribute. An edge connects either two nodes or a node and a leaf. Leaves are labeled with a decision value for categorization of the data. To make a decision using a DT, start at the root node and follow the tree down the branches until a leaf node representing the class is reached. Each DT represents a rule set, which categorizes data according to the attributes of the dataset.
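The decision procedure just described — start at the root and follow matching edges until a leaf is reached — can be sketched in a few lines of Python. The nested-dictionary tree encoding and the attribute names below are illustrative, not taken from the paper:

```python
# A minimal sketch of classifying one instance by walking a decision tree.
# Internal nodes are dicts of the form {attribute: {value: subtree}};
# leaves are plain class labels.

def classify(tree, instance):
    """Follow edges matching the instance's attribute values until a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree  # leaf: the predicted class label

# Hypothetical two-level tree over nominal attributes.
tree = {"outlook": {"sunny": {"humidity": {"high": "no", "normal": "yes"}},
                    "overcast": "yes",
                    "rain": "no"}}

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Each root-to-leaf path in this representation corresponds to one rule of the rule set the tree encodes.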
The ID3 (Iterative Dichotomiser) technique builds a DT using information theory [16]. The basic strategy used by ID3 is to choose the splitting attribute with the highest information gain. The amount of information associated with an attribute value is related to its probability of occurrence. The concept used to quantify information is called entropy, which measures the amount of randomness in a dataset. When all data in a set belong to a single class, there is no uncertainty and the entropy is zero. The objective of decision tree classification is to iteratively partition the given dataset into subsets where all elements in each final subset belong to the same class. The entropy calculation is shown in equation 1. Given probabilities p1, p2, …, ps, where Σ_{i=1}^{s} pi = 1,
H(p1, p2, …, ps) = Σ_{i=1}^{s} pi log(1/pi)    (1)
Given a dataset D, H(D) measures the entropy of the dataset. When that dataset is split into s new subsets S = {D1, D2, …, Ds}, we can again look at the entropy of those subsets. A subset of a dataset is completely ordered if all examples in it belong to the same class. ID3 chooses the splitting attribute with the highest gain, calculated by equation 2.
Gain(D, S) = H(D) − Σ_{i=1}^{s} P(Di) H(Di)    (2)
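A minimal Python sketch of equations 1 and 2, computing entropy over a class distribution and the information gain of a candidate split; the labels and the split below are made-up examples:

```python
import math
from collections import Counter

def entropy(labels):
    """Equation 1: H = sum_i p_i * log2(1 / p_i) over the class distribution."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(labels, split):
    """Equation 2: Gain(D, S) = H(D) - sum_i P(D_i) * H(D_i).
    `split` is a list of label subsets partitioning `labels`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in split)

labels = ["yes"] * 9 + ["no"] * 5
# Hypothetical binary split of the 14 labels into two subsets.
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(entropy(labels), 3))      # 0.94
print(round(gain(labels, split), 3))  # 0.048
```

When all labels in a subset are identical, `entropy` returns 0, matching the text's observation that a pure set has zero uncertainty.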
C4.5 is a successor of ID3 that uses GainRatio [15].
For splitting purposes, C4.5 uses the attribute with the largest GainRatio, which ensures a larger-than-average information gain.
GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)    (3)
The C5.0 algorithm improves the performance of building trees using boosting, which is an approach to combining different classifiers. But boosting does not always help when the training data contains a lot of noise. When C5.0 performs a classification, each classifier is assigned a vote, voting is performed, and the example is assigned to the class with the largest number of votes. CART (Classification and Regression Trees) generates a binary tree for decision making [17]. CART handles missing data and contains a pruning strategy. The SPRINT (Scalable Parallelizable Induction of Decision Trees) algorithm uses an impurity function called the gini index to find the best split [18]. Equation 4 defines the gini index for a dataset, D.
gini(D) = 1 − Σ_j pj²    (4)
where pj is the frequency of class Cj in D. The goodness of a split of D into subsets D1 and D2 is defined by

gini_split(D) = (n1/n) gini(D1) + (n2/n) gini(D2)    (5)
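Equations 4 and 5 can be sketched directly in Python; the class labels below are illustrative:

```python
from collections import Counter

def gini(labels):
    """Equation 4: gini(D) = 1 - sum_j p_j^2, p_j the frequency of class C_j."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Equation 5: weighted gini of a binary split into D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["a"] * 6 + ["b"] * 4
print(round(gini(labels), 3))                      # 0.48
print(round(gini_split(["a"] * 6, ["b"] * 4), 3))  # 0.0 (both subsets are pure)
```

A split that yields pure subsets has gini_split of 0, which is why the split with the lowest (best) gini value is chosen.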
The split with the best gini value is chosen. A number of research projects on optimal feature selection and classification have been done which adopt hybrid strategies involving evolutionary algorithms and inductive decision tree learning [19], [20], [21], [22], [23].
3.2 Clustering
Clustering can be considered the most important unsupervised learning problem, and has been used in many real-world application domains, including biology, medicine, anthropology, and marketing. It is the process of organizing objects into groups whose members are similar in some way. A data point within one cluster is more similar to the other data points within that cluster than to data points outside it. A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Given a dataset D = {t1, t2, …, tn} of data points, a similarity measure sim(ti, tl) defined between any two data points ti, tl ∈ D, and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. Clustering algorithms can be categorized based on their cluster model, such as k-means clustering, distribution-based clustering, and density-based clustering.
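As a small illustration of the mapping f: D → {1, …, k}, the following sketch assigns each point to its nearest centroid under squared Euclidean distance. The points and centroids are hypothetical; the paper does not commit to a particular clustering algorithm:

```python
# Assign each data point to the cluster (1..k) of its nearest centroid.
# This is the assignment step of k-means-style clustering.

def assign(points, centroids):
    def dist2(p, c):
        # Squared Euclidean distance between a point and a centroid.
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: dist2(p, centroids[j])) + 1
            for p in points]

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0)]
centroids = [(1.0, 1.0), (8.0, 8.0)]
print(assign(points, centroids))  # [1, 1, 2]
```

The returned list is exactly a mapping f: D → {1, …, k}: position i gives the cluster index of point ti.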
4 PROPOSED APPROACH
The data stream is a continuous sequence of data points {x1, x2, …, xnow}, where x1 is the very first data point in the stream and xnow is the latest data point, which has just arrived. Each data point xi is an n-dimensional feature vector consisting of a number of attributes {A1, A2, …, An} with a class label from {C1, C2, …, Cn}. Each attribute Ai takes one of a number of attribute values {Ai1, Ai2, …, Aip}. Algorithm 1 outlines the overview of our approach. We build a decision tree from the training data points, calculate a threshold value for each leaf node based on the ratio of the percentage of data points classified by that leaf node to the data points in the training dataset, and also cluster the training data points based on the similarity of attribute values. When classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the tree exceeds the previously calculated threshold value, a novel class may have arrived. We then compare the new data point with existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from the existing data points and the new data point does not belong to any cluster, this confirms that a novel class has arrived. We then add the new data point into the training dataset and rebuild the decision tree. The decision tree classifier is continuously updated so that it represents the most recent concept in the data stream.
Algorithm 1: Novel Class Detection using Decision Tree
1. Find the best splitting attribute, with the highest information gain value, in the training dataset.
2. Create a node and label it with the splitting attribute. [The first node is the root node, T, of the decision tree.]
3. For each branch of the node, partition the data points into sub training datasets Di by applying the splitting predicate to training dataset D.
4. For each sub training dataset Di, if the data points in Di all have the same class value Ci, create a leaf node labeled with Ci. Otherwise, repeat steps 1 to 4 until each final subset belongs to a single class value, or until a leaf node is created.
5. When the decision tree construction is complete, calculate the threshold value for each leaf node based on the ratio of the percentage of data points classified by that leaf node to the data points in the training dataset.
6. Cluster the training data points based on the similarity of attribute values.
7. While classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the decision tree exceeds the previously calculated threshold value, a novel class may have arrived.
8. If the attribute values of the new data point differ from the existing data points of that leaf node, and the new data point does not belong to any existing cluster, this confirms that a novel class has arrived.
9. If a novel class is detected, add the new data point to the existing training data points to generate a new training dataset, Dnew.
10. Rebuild a new decision tree using the new/updated training dataset, Dnew.
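Steps 5 to 8 above can be condensed into the following sketch. It makes simplifying assumptions the paper leaves open — leaves are identified by keys, a leaf's threshold is its share of the training data, and "similarity" is exact attribute match — so all names and counts here are illustrative, not the authors' implementation:

```python
def leaf_thresholds(train_leaf_counts, n_train):
    """Step 5: each leaf's threshold is its fraction of the training data."""
    return {leaf: count / n_train for leaf, count in train_leaf_counts.items()}

def novel_class_suspected(stream_leaf_counts, n_stream, thresholds):
    """Step 7: flag leaves whose share of the stream exceeds their threshold."""
    return [leaf for leaf, count in stream_leaf_counts.items()
            if count / n_stream > thresholds.get(leaf, 0.0)]

def confirms_novel(new_point, leaf_points, clusters):
    """Step 8: attribute values unseen at the leaf AND outside every cluster."""
    unseen = new_point not in leaf_points
    outside = all(new_point not in cluster for cluster in clusters)
    return unseen and outside

# Hypothetical counts: leaf1 held 20% of training data but 50% of the stream.
thresholds = leaf_thresholds({"leaf1": 20, "leaf2": 80}, 100)
print(novel_class_suspected({"leaf1": 15, "leaf2": 15}, 30, thresholds))
# ['leaf1']
```

Once `confirms_novel` returns True for a suspected point, steps 9 and 10 add the point to the training set and rebuild the tree.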
5 EXPERIMENTAL ANALYSIS
In this section, we describe the datasets and the experimental results.
5.1 Datasets
Data stream mining is the process of analyzing online data to discover patterns, using sophisticated mathematical algorithms to segment the continuous data and evaluate the probability of future events. A set of data items, called a dataset, is the very basic concept of data mining and machine learning research. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. Table 1 describes the datasets from the UCI machine learning repository used in the experimental analysis [26].
1. Iris Plants Database: This is one of the best-known datasets in the pattern recognition literature. It contains 3 class values (Iris Setosa, Iris Versicolor, and Iris Virginica), where each class refers to a type of iris plant. There are 150 instances (50 in each of the three classes) and 4 attributes in this dataset. One class is linearly separable from the other 2 classes.
2. Image Segmentation Data: The goal of this dataset is to provide an empirical basis for research on image segmentation and boundary detection. There are 1500 data instances in this dataset with 19 attributes, all real-valued. There are 7 class attribute values: brickface, sky, foliage, cement, window, path, and grass.
3. Large Soybean Database: There are 35 attributes in this dataset and all attributes are nominal. There are 683 data instances and 19 class values in this dataset.
4. Fitting Contact Lenses Database: This is a very small dataset with only 24 data instances, 4 attributes, and 3 class attribute values (soft, hard, and none). All the attribute values are nominal. The instances are complete and noise free, and 9 rules cover the training set.
5. NSL-KDD Dataset: The Knowledge Discovery and Data Mining 1999 (KDD99) competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate the handling of concept drift. The NSL-KDD dataset is a new version of the KDD99 dataset, which solved some of the inherent problems of the KDD99 dataset [25], although it still suffers from some of the problems discussed by McHugh [24]. The main advantage of the NSL-KDD dataset is that the numbers of training and testing data points are reasonable, so it becomes affordable to run experiments on the complete set of training and testing data without the need to randomly select a small portion. Each record in the NSL-KDD dataset consists of 41 attributes and 1 class attribute. The NSL-KDD training dataset does not include redundant or duplicate examples.
TABLE 1
Data Set Descriptions

Dataset                          No. of Attributes  Attribute Types  No. of Instances  No. of Class Values
Iris Plants Database             4                  Real             150               3
Image Segmentation Data          19                 Real             1500              7
Large Soybean Database           35                 Nominal          683               19
Fitting Contact Lenses Database  4                  Nominal          24                3
NSL-KDD Dataset                  41                 Real & Nominal   25192             23
5.2 Results
We implemented our algorithm in Java. The code for the decision tree has been adapted from the Weka machine learning open source repository (http://www.cs.waikato.ac.nz/ml/weka). Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The experiments were run on an Intel Core 2 Duo 2.0 GHz processor (2 MB Cache, 800 MHz FSB) with 1 GB of RAM. There are various approaches to determining the performance of data stream classifiers. The performance can most simply be measured by counting the proportion of correctly classified instances in an unseen test dataset. Table 2 summarizes the symbols and terms used throughout equations 6 to 8.
TABLE 2
Used Symbols and Terms

Symbol  Term
N       Total instances in the data stream
Nc      Total novel class instances in the data stream
Fp      Total existing class instances misclassified as novel class
Fn      Total novel class instances misclassified as existing class
Fe      Total existing class instances misclassified
Mnew    % of novel class instances misclassified as existing class
Fnew    % of existing class instances falsely identified as novel class
ERR     Total misclassification error
Mnew = (Fn × 100) / Nc    (6)

Fnew = (Fp × 100) / (N − Nc)    (7)

ERR = (Fp + Fn + Fe) × 100 / N    (8)

Equations 6, 7, and 8 are used to evaluate our approach. Tables 3 and 4 tabulate the results of the performance comparison between our approach and the traditional decision tree classifier.
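Equations 6, 7, and 8 follow directly from the symbol definitions in Table 2; a small sketch with hypothetical counts:

```python
def m_new(f_n, n_c):
    """Equation 6: % of novel class instances misclassified as existing class."""
    return f_n * 100.0 / n_c

def f_new(f_p, n, n_c):
    """Equation 7: % of existing class instances falsely flagged as novel."""
    return f_p * 100.0 / (n - n_c)

def err(f_p, f_n, f_e, n):
    """Equation 8: total misclassification error over the whole stream."""
    return (f_p + f_n + f_e) * 100.0 / n

# Hypothetical counts: 1000 stream instances, 100 of them from a novel class.
print(m_new(f_n=8, n_c=100))               # 8.0
print(f_new(f_p=18, n=1000, n_c=100))      # 2.0
print(err(f_p=18, f_n=8, f_e=14, n=1000))  # 4.0
```

Note that Fnew is normalized by the existing class instances (N − Nc), while ERR is normalized by the full stream size N.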
TABLE 3
Performance of Proposed Approach

Dataset                          ERR   Mnew  Fnew
Iris Plants Database             4     4     3
Image Segmentation Data          2.9   1.3   2.9
Large Soybean Database           9.2   2.8   1.9
Fitting Contact Lenses Database  16.6  0     5.2
NSL-KDD Dataset                  4.0   8.4   1.2

TABLE 4
Performance of Traditional Decision Tree

Dataset                          ERR   Mnew  Fnew
Iris Plants Database             5.3   4     5
Image Segmentation Data          5.2   3.7   5.2
Large Soybean Database           10.8  6.5   2.8
Fitting Contact Lenses Database  50    100   5.2
NSL-KDD Dataset                  5.3   10.0  1.5
6 CONCLUSION
In this paper, we introduce a decision tree classifier based approach to novel class detection in concept drifting data stream classification, which builds a decision tree from the data stream. The decision tree is continuously updated with new data points so that the most recent tree represents the most recent concept in the data stream. The main purpose of this paper is to improve the performance of the decision tree classifier in concept drifting data stream classification problems. The decision tree classifier is a very popular supervised learning algorithm with several advantages: it is easy to implement and requires little prior knowledge. We tested the performance of the proposed approach on several benchmark datasets, which showed that the proposed approach efficiently detects novel classes and improves classification accuracy. Future work will focus on addressing this problem under dynamic attribute sets.
7 APPENDIX: AN ILLUSTRATIVE EXAMPLE
In the large soybean database from the UCI machine learning repository [26], there are 35 attributes in total and all the attribute values are nominal. There are 683 data points in this dataset, which are categorized into 19 class attribute values. We split the dataset into 3 sub-datasets: sub-dataset A contains 356 instances with 10 class attribute values, sub-dataset B contains 107 instances with 5 class attribute values, and sub-dataset C contains 220 instances with 4 class attribute values. We built a decision tree, DTA, using sub-dataset A, which is shown in figure 1.
[Figure omitted: a decision tree whose internal nodes test soybean attributes such as leafspot-size, seed, fruit-pods, int-discolor, precip, plant-growth, stem-cankers, and area-damaged, and whose leaves carry disease classes such as bacterial-blight, purple-seed-stain, alternarialeaf-spot, brown-stem-rot, frog-eye-leaf-spot, phyllosticta-leaf-spot, diaporthe-stem-canker, powdery-mildew, cyst-nematode, and 2-4-d-injury.]

Fig. 1. Decision Tree DTA using Sub-Dataset A.
Then we classified the 356 instances of sub-dataset A by applying the decision tree DTA, which correctly classified 323 instances and misclassified 33 instances. After that, we classified the 107 instances of sub-dataset B [sub-dataset B contains 5 novel classes] by applying the decision tree DTA, which detected that a novel class had arrived. For example, the leaf node leafspot-size = lt-1/8 and seed = norm: bacterial-blight satisfied 20 instances from sub-dataset A and 10 instances from sub-dataset B. The other attribute values of the 10 instances from sub-dataset B are quite dissimilar from those of the 20 instances from sub-dataset A, which confirms that a novel class arrived. We then merged sub-dataset A and sub-dataset B to generate a new dataset, XA+B, and rebuilt the decision tree DTX, which is shown in figure 2. Similarly, we merged dataset XA+B with sub-dataset C [sub-dataset C contains 220 instances with 4 novel classes] and generated a new dataset, XA+B+C. Finally, we again rebuilt the decision tree, DTY, which classified the 683 instances of dataset XA+B+C with 91.5081% accuracy and the 220 instances of sub-dataset C with 98.6364% accuracy. Decision tree DTY is shown in figure 3.
ACKNOWLEDGMENT
This research work was supported by the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
REFERENCES
[1] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Classification and Novel Class Detection in Concept Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, pp. 859-874, June 2011.
[2] R. Elwell and R. Polikar, "Incremental Learning of Concept Drift in Nonstationary Environments," IEEE Transactions on Neural Networks, Vol. 22, No. 10, pp. 1517-1531, October 2011.
[3] A. Zhou, F. Cao, W. Qian, and C. Jin, "Tracking Clusters in Evolving Data Streams over Sliding Windows," Knowledge and Information Systems, Vol. 15, No. 2, pp. 181-214, May 2008.
[4] E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama, "Cluster-Based Novel Concept Detection in Data Streams Applied to Intrusion Detection in Computer Networks," Proc. 2008 ACM Symp. Applied Computing, pp. 976-980, 2008.
[5] J. Z. Kolter and M. A. Maloof, "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts," Journal of Machine Learning Research, Vol. 8, pp. 2755-2790, 2007.
[6] B. R. Dai, J. W. Huang, M. Y. Yeh, and M. S. Chen, "Adaptive Clustering for Multiple Evolving Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 9, pp. 1166-1180, September 2006.
[7] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A Framework for On-Demand Classification of Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, pp. 577-589, May 2006.
[Figure omitted: a decision tree whose internal nodes test attributes such as leafspot-size, canker-lesion, leafspots-marg, seed-size, int-discolor, stem-cankers, fruit-spots, and leaf-malf, and whose leaves carry classes such as bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, diaporthe-stem-canker, charcoal-rot, herbicide-injury, and 2-4-d-injury.]

Fig. 2. Decision Tree DTX using Dataset XA+B.
[8] M. M. Gaber and P. S. Yu, "Detection and Classification of Changes in Evolving Data Streams," Int'l Journal of Information Technology & Decision Making, Vol. 5, No. 4, pp. 659-670, 2006.
[9] Y. Yang, X. Wu, and X. Zhu, "Combining Proactive and Reactive Predictions for Data Streams," Proc. ACM SIGKDD, pp. 710-715, 2005.
[10] W. Fan, "Mining Concept-Drifting Data Streams using Ensemble Classifiers," Proc. 10th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp. 128-137, 2004.
[11] M. Markou and S. Singh, "Novelty Detection: A Review, Part 2: Neural Network Based Approaches," Signal Processing, Vol. 83, Issue 12, pp. 2499-2521, December 2003.
[12] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers," IBM T. J. Watson Research, Hawthorne, NY 10532, Association for Computing Machinery, Aug. 24, 2003.
[13] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. 7th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, ACM New York, NY, USA, pp. 97-106, 2001.
[14] G. Widmer and M. Kubat, "Learning in the Presence of Concept Drift and Hidden Contexts," Machine Learning, Vol. 23, No. 1, pp. 69-101, April 1996.
[15] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[16] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, pp. 81-106, 1986.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Statistics/Probability Series, Wadsworth, Belmont, 1984.
[18] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Morgan Kaufmann, pp. 544-555, 1996.
[19] D. Turney, "Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm," Journal of Artificial Intelligence Research, pp. 369-409, 1995.
[20] J. Bala, J. Huang, H. Vafaie, K. DeJong, and H. Wechsler, "Hybrid Learning using Genetic Algorithms and Decision Trees for Pattern Classification," Proc. 14th Int'l Conf. on Artificial Intelligence, Montreal, pp. 1-6, 19-25 August 1995.
[21] C. G. Salcedo, S. Chen, D. Whitley, and S. Smith, "Fast and Accurate Feature Selection using Hybrid Genetic Strategies," Proc. Genetic and Evolutionary Computation Conference, pp. 1-8, 1999.
[22] S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.
[Fig. 3 appears here: the decision tree DTY built from sub dataset XA+B+C of the soybean dataset. Internal nodes test attributes such as date, precip, leafspot-size, canker-lesion, stem-cankers, plant-stand, leaves, roots, seed-size, and fruiting-bodies; leaf nodes predict disease classes including brown-spot, frog-eye-leaf-spot, phyllosticta-leaf-spot, alternarialeaf-spot, bacterial-blight, bacterial-pustule, phytophthora-rot, purple-seed-stain, anthracnose, brown-stem-rot, cyst-nematode, herbicide-injury, 2-4-d-injury, powdery-mildew, and charcoal-rot.]

Fig. 3. Decision Tree DTY using Sub Dataset XA+B+C.
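A decision tree such as DTY in Fig. 3 can be grown by repeatedly splitting on the attribute with the highest information gain, as in ID3 [16] and C4.5 [15]. The following is a minimal sketch of that selection criterion; the toy rows, attribute values, and class labels echo Fig. 3 but are illustrative stand-ins, not actual records from the soybean dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting the instances on attribute index attr."""
    n = len(labels)
    # Group the class labels by the attribute's value.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy instances: (precip, leafspot-size) -> disease class.
rows = [("lt-norm", "lt-1/8"), ("norm", "lt-1/8"),
        ("gt-norm", "gt-1/8"), ("gt-norm", "gt-1/8")]
labels = ["phyllosticta-leaf-spot", "brown-spot",
          "frog-eye-leaf-spot", "frog-eye-leaf-spot"]

# The attribute with the highest gain becomes the next tree node.
best = max(range(2), key=lambda a: information_gain(rows, labels, a))
```

On these toy instances, splitting on precip (index 0) separates all three classes perfectly, so it would be chosen as the root test.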
[23] W. Y. Loh, and Y. S. Shih, "Split Selection Methods for Classification Trees," Statistica Sinica, Vol. 7, pp. 815-840, 1997.
[24] J. McHugh, "Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection Evaluations as Performed by Lincoln Laboratory," ACM Transactions on Information and System Security, Vol. 3, No. 4, pp. 262-294, 2000.
[25] The KDD Archive. KDD99 cup dataset, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[26] A. Frank, and A. Asuncion, "UCI Machine Learning Repository," University of California, Irvine, School of Information and Computer Sciences, 2010, http://archive.ics.uci.edu/ml
Amit Biswas is currently completing a Master of Science in Computer Science and Engineering at United International University, Bangladesh. He obtained a Bachelor of Computer Application (BCA) from Bangalore University, India in 2004. He is an IT professional working as a Team Leader of the Software department in a reputed IT company named BASE Limited. He has also worked for the Access to Information Programme (A2I), Prime Minister's Office, supported by UNDP Bangladesh. He has extensive experience and knowledge of software development and databases. Software he developed is successfully used by PLAN Bangladesh, CARE Bangladesh, Bangladesh Small and Cottage Industries Corporation (BSCIC), Habib Bank Limited, Dutch Bangla Bank, Rahimafrooz, etc.
Dr. Dewan Md. Farid received a B.Sc. in Computer Science and Engineering from Asian University of Bangladesh in 2003, an M.Sc. in Computer Science and Engineering from United International University, Bangladesh in 2004, and a Ph.D. in Computer Science and Engineering from Jahangirnagar University, Bangladesh in 2012. He is a part-time faculty member in the Department of Computer Science and Engineering, United International University, Bangladesh and Daffodil International University, Bangladesh. He has published 1 book chapter, 8 journal papers, and 10 conference papers on machine learning, data mining, and intrusion detection. He has participated in and presented his papers at international conferences in Malaysia, Portugal, Italy, and France. Dr. Farid is a member of IEEE and the IEEE Computer Society. He worked as a visiting researcher at ERIC Laboratory, University Lumière Lyon 2, France from 01-09-2009 to 30-06-2010. He received Senior Fellowships I & II awarded by the National Science & Information and Communication Technology (NSICT), Ministry of Science & Information and Communication Technology, Government of Bangladesh, in 2008 and 2011 respectively.
Professor Dr. Chowdhury Mofizur Rahman received his B.Sc. (EEE) and M.Sc. (CSE) from Bangladesh University of Engineering and Technology (BUET) in 1989 and 1992 respectively. He earned his Ph.D. from the Tokyo Institute of Technology in 1996 under the auspices of a Japanese Government scholarship. Prof. Chowdhury is presently working as the Pro Vice Chancellor and acting treasurer of United International University (UIU), Dhaka, Bangladesh. He is also one of the founder trustees of UIU. Before joining UIU he worked as the head of the Computer Science & Engineering department of Bangladesh University of Engineering & Technology, which is the number one technical public university in Bangladesh. His research areas cover data mining, machine learning, AI, and pattern recognition. He is active in research and has published around 100 technical papers in international journals and conferences. He was the Editor of the IEB journal and worked as the moderator of NCC accredited centers in Bangladesh. He served as the organizing chair and as a program committee member of a number of international conferences held in Bangladesh and abroad. At present he is acting as the coordinator from Bangladesh for the EU-sponsored eLINK project. Prof. Chowdhury has been working as an external expert member for the Computer Science departments of a number of renowned public and private universities in Bangladesh. He is actively contributing towards the national goal of converting the country towards Digital Bangladesh.