

Page 1: Data Stream Classification: Training with Limited Amount of Labeled Data

Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham
University of Texas at Dallas

Jing Gao, Jiawei Han
University of Illinois at Urbana-Champaign

To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008. Funded by: Air Force.

Page 2: Data Stream Classification Techniques

Data stream classification is a challenging task for two reasons:
◦ Infinite length – we cannot use all historical data for training
◦ Concept drift – old models become outdated

Solutions:
◦ Single-model classification with incremental learning
◦ Ensemble classification

Ensemble techniques can be updated more efficiently and handle concept drift more effectively.

Our solution: an ensemble approach.

Page 3: Ensemble Classification

Ensemble techniques build an ensemble of M classifiers:
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one old classifier (if required)

[Figure: the last labeled data chunk trains a new classifier, which updates the ensemble; the ensemble then classifies the last unlabeled data chunk.]
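To make the chunk-based maintenance concrete, here is a minimal Java sketch of this style of ensemble. The `Classifier` interface, the majority vote, and the accuracy-based replacement policy are illustrative assumptions, not details taken from these slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of chunk-based ensemble maintenance (illustrative only). */
public class ChunkEnsemble {
    /** Hypothetical classifier abstraction; any trainable model would fit here. */
    interface Classifier {
        int classify(double[] x);
        double accuracyOn(List<double[]> data, List<Integer> labels);
    }

    private final int maxSize;                        // M: maximum number of ensemble members
    private final List<Classifier> ensemble = new ArrayList<>();

    ChunkEnsemble(int maxSize) { this.maxSize = maxSize; }

    /** A classifier trained on the newest labeled chunk joins the ensemble;
     *  if the ensemble is full, the member weakest on that chunk is dropped. */
    void update(Classifier newModel, List<double[]> chunk, List<Integer> labels) {
        ensemble.add(newModel);
        if (ensemble.size() > maxSize) {
            Classifier worst = ensemble.get(0);
            for (Classifier c : ensemble)
                if (c.accuracyOn(chunk, labels) < worst.accuracyOn(chunk, labels))
                    worst = c;
            ensemble.remove(worst);
        }
    }

    /** Classify a point by majority vote over all ensemble members. */
    int classify(double[] x, int numClasses) {
        int[] votes = new int[numClasses];
        for (Classifier c : ensemble) votes[c.classify(x)]++;
        int best = 0;
        for (int j = 1; j < numClasses; j++)
            if (votes[j] > votes[best]) best = j;
        return best;
    }
}
```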

Page 4: Limited Labeled Data Problem

Ensemble techniques assume that the entire data chunk is labeled during training.

This assumption is impractical because:
◦ Data labeling is both time consuming and costly
◦ It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed

Our solution:
◦ Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
◦ We obtain better results compared to other techniques that train classifiers with fully labeled data chunks

Page 5: Training with a Partially Labeled Chunk

1. Train a new classifier using semi-supervised clustering
2. If a new class has arrived in the stream, refine the existing classifiers
3. Update the ensemble

A sketch tying these steps together follows below.

[Figure: the last partially labeled chunk trains a classifier via semi-supervised clustering; existing classifiers are refined if a new class appears, the ensemble is updated, and the ensemble classifies the last unlabeled data chunk.]
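As a tie-together, a hypothetical per-chunk driver in Java might read as follows. The interfaces, the new-class check against a set of known labels, and the method names are all illustrative; the refinement step's internals are not detailed on this slide.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical per-chunk driver for the three steps on this slide. */
public class StreamTrainer {
    interface Model { /* produced by the semi-supervised clustering (see the Page 7 sketch) */ }
    interface Ensemble {
        void refine(List<double[]> chunk, int[] partialLabels); // step 2: refine existing classifiers
        void add(Model m);                                      // step 3: update the ensemble
    }

    /** partialLabels[p] = class of point p, or -1 if unlabeled. */
    static void onNewChunk(Ensemble ensemble, Set<Integer> knownClasses,
                           List<double[]> chunk, int[] partialLabels, Model newModel) {
        boolean newClass = false;                 // step 2's trigger: a label not seen before
        for (int y : partialLabels)
            if (y >= 0 && knownClasses.add(y)) newClass = true;
        if (newClass) ensemble.refine(chunk, partialLabels);
        ensemble.add(newModel);                   // newModel comes from step 1's training
    }
}
```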

Page 6: Semi-Supervised Clustering

Overview:
◦ Only a few instances in the training data are labeled
◦ We apply impurity-based clustering
◦ The goal is to minimize cluster impurity
◦ A cluster is completely pure if all the labeled data in that cluster is from the same class

Objective function (a hedged reconstruction follows below):
◦ K: number of clusters
◦ X_i: data points belonging to cluster i
◦ L_i: labeled data points belonging to cluster i
◦ Imp_i: impurity of cluster i = entropy × dissimilarity count
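The formula itself did not survive extraction. Based on the definitions above, and assuming a standard squared-distance dispersion term (only Imp_i = entropy × dissimilarity count is actually stated on the slide), the minimized objective plausibly looks like:

```latex
% Hedged reconstruction; the exact combination may differ from the paper's.
\min \; \sum_{i=1}^{K} \left( \sum_{x \in X_i} \lVert x - \mu_i \rVert^{2} \;+\; \mathrm{Imp}_i \right),
\qquad
\mathrm{Imp}_i \;=\; \Bigl(-\sum_{c} p_{i,c}\,\log p_{i,c}\Bigr) \times \mathrm{DC}_i
```

Here μ_i, p_{i,c}, and DC_i are introduced notation: μ_i is the centroid of cluster i, p_{i,c} is the fraction of labeled points in L_i carrying class label c, and DC_i is the dissimilarity count, read here as the number of labeled points in cluster i whose label disagrees with the cluster's majority label.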

Page 7: Semi-Supervised Clustering Using E-M

Constrained initialization:
◦ Initialize K seeds
◦ For each class C_j, select k_j seeds from the labeled instances of C_j using the farthest-first traversal heuristic, where k_j = (N_j / N) × K, N_j = number of labeled instances of class C_j, and N = total number of labeled instances

Repeat the E-step and M-step until convergence:
◦ E-step: assign each instance to a cluster so that the objective function is minimized, applying Iterated Conditional Modes (ICM) until convergence
◦ M-step: re-compute the cluster centroids
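A simplified Java sketch of this E-M loop follows. The squared Euclidean distance, the per-point impurity penalty (entropy × count of disagreeing labels, echoing the previous slide), and plain reassignment sweeps in place of full ICM are all simplifying assumptions.

```java
/**
 * Simplified sketch of impurity-based semi-supervised clustering via E-M.
 * Simplifying assumptions (not from the slides): squared Euclidean distance,
 * a per-point impurity penalty of entropy x disagreeing-label count, and
 * plain reassignment sweeps in place of full Iterated Conditional Modes.
 */
public class SemiSupervisedKMeans {
    final double[][] centroids; // K x d centroids, seeded by the constrained initialization
    final int K;

    SemiSupervisedKMeans(double[][] seeds) { centroids = seeds; K = seeds.length; }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int r = 0; r < a.length; r++) { double t = a[r] - b[r]; s += t * t; }
        return s;
    }

    /** Entropy of the cluster's label counts times the number of labels differing from c. */
    static double impurity(int[] lt, int c) {
        int total = 0, differ = 0;
        for (int j = 0; j < lt.length; j++) { total += lt[j]; if (j != c) differ += lt[j]; }
        if (total == 0) return 0;
        double ent = 0;
        for (int count : lt)
            if (count > 0) { double p = (double) count / total; ent -= p * Math.log(p); }
        return ent * differ;
    }

    /** labels[p] = class of point p, or -1 if unlabeled; returns cluster assignments. */
    int[] fit(double[][] X, int[] labels, int numClasses, int iters) {
        int n = X.length, d = X[0].length;
        int[] assign = new int[n];
        for (int p = 0; p < n; p++)                    // initial assignment: nearest seed
            for (int i = 1; i < K; i++)
                if (dist2(X[p], centroids[i]) < dist2(X[p], centroids[assign[p]])) assign[p] = i;
        for (int it = 0; it < iters; it++) {
            int[][] lt = new int[K][numClasses];       // lt[i][c]: labeled points of class c in cluster i
            for (int p = 0; p < n; p++) if (labels[p] >= 0) lt[assign[p]][labels[p]]++;
            // E-step: move each point to the cluster with the lowest distance-plus-impurity cost
            for (int p = 0; p < n; p++) {
                int best = 0; double bestCost = Double.MAX_VALUE;
                for (int i = 0; i < K; i++) {
                    double cost = dist2(X[p], centroids[i]);
                    if (labels[p] >= 0) cost += impurity(lt[i], labels[p]);
                    if (cost < bestCost) { bestCost = cost; best = i; }
                }
                assign[p] = best;
            }
            // M-step: recompute each centroid as the mean of its assigned points
            double[][] sum = new double[K][d];
            int[] cnt = new int[K];
            for (int p = 0; p < n; p++) {
                cnt[assign[p]]++;
                for (int r = 0; r < d; r++) sum[assign[p]][r] += X[p][r];
            }
            for (int i = 0; i < K; i++)
                if (cnt[i] > 0)
                    for (int r = 0; r < d; r++) centroids[i][r] = sum[i][r] / cnt[i];
        }
        return assign;
    }
}
```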

Page 8: Saving Cluster Summary as Micro-Cluster

For each of the K clusters created by the semi-supervised clustering, save the following:
◦ Centroid
◦ n: total number of points
◦ L: total number of labeled points
◦ Lt[ ]: array containing the number of labeled points belonging to each class, e.g. Lt[j] = number of labeled points belonging to class C_j
◦ Sum[ ]: array containing, for each dimension, the sum of that attribute's values over all the data points, e.g. Sum[r] = sum of the r-th dimension of all data points in the cluster
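Rendered directly in Java, the saved summary might look like this; the field names follow the slide, while the constructor and the derived centroid are illustrative details.

```java
/** Micro-cluster summary kept per cluster; the raw points are discarded afterwards. */
public class MicroCluster {
    final double[] centroid; // mean of each dimension, derived from sum and n
    final int n;             // total number of points in the cluster
    final int L;             // total number of labeled points
    final int[] Lt;          // Lt[j] = labeled points belonging to class Cj
    final double[] sum;      // sum[r] = sum of the r-th attribute over all points

    MicroCluster(int n, int L, int[] Lt, double[] sum) {
        this.n = n; this.L = L; this.Lt = Lt; this.sum = sum;
        centroid = new double[sum.length];
        for (int r = 0; r < sum.length; r++) centroid[r] = sum[r] / n;
    }
}
```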

Page 9: Using the Micro-Clusters as a Classification Model

We remove all the raw data points after saving the micro-clusters.

The set of K such micro-clusters built from a data chunk serves as a classification model.

To classify a test data point x using the model:
◦ Find the Q nearest micro-clusters (by computing the distance between x and their centroids)
◦ For each class C_j, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j] / L over all Q micro-clusters
◦ Output the class C_j with the highest CNFrq[j]
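Putting the classification rule into code, reusing the MicroCluster sketch above (the Euclidean distance and the sort-then-take-Q selection of nearest centroids are assumptions):

```java
import java.util.Arrays;
import java.util.Comparator;

/** Classify a point with a model made of micro-clusters, per the slide's CNFrq rule. */
public class MicroClusterModel {
    final MicroCluster[] clusters;  // the K micro-clusters built from one chunk
    final int numClasses;

    MicroClusterModel(MicroCluster[] clusters, int numClasses) {
        this.clusters = clusters; this.numClasses = numClasses;
    }

    int classify(double[] x, int Q) {
        // Sort micro-clusters by distance from x to their centroids, then keep the Q nearest
        MicroCluster[] nearest = clusters.clone();
        Arrays.sort(nearest, Comparator.comparingDouble((MicroCluster mc) -> dist2(x, mc.centroid)));
        // CNFrq[j] = sum over the Q nearest micro-clusters of Lt[j] / L
        double[] cnfrq = new double[numClasses];
        for (int q = 0; q < Math.min(Q, nearest.length); q++) {
            MicroCluster mc = nearest[q];
            if (mc.L == 0) continue;               // no labeled points: contributes nothing
            for (int j = 0; j < numClasses; j++)
                cnfrq[j] += (double) mc.Lt[j] / mc.L;
        }
        // Output the class with the highest cumulative normalized frequency
        int best = 0;
        for (int j = 1; j < numClasses; j++)
            if (cnfrq[j] > cnfrq[best]) best = j;
        return best;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int r = 0; r < a.length; r++) { double t = a[r] - b[r]; s += t * t; }
        return s;
    }
}
```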

Page 10: Experiments

Data sets:

◦ Synthetic data – simulates evolving data streams
◦ Real data – botnet data, collected from real botnet traffic generated in a controlled environment

Baseline: On-Demand Stream
C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.

For training, we use 5% labeled data in each chunk. So, if there are 100 instances in a chunk:
◦ Our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training
◦ On-Demand Stream uses all 100 labeled instances for training

Environment:
S/W: Java; OS: Windows XP; H/W: Pentium IV, 3 GHz dual core, 2 GB RAM

Page 11: Results

Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)

Page 12: Conclusion

We address a more practical approach to classifying evolving data streams: training with a limited amount of labeled data.

Our technique applies semi-supervised clustering to train classification models.

Our technique outperforms other state-of-the-art stream classification techniques, even though those techniques use 20 times more labeled data for training than ours.