An Ensemble-based Approach to Fast Classification of Multi-label Data Streams

DESCRIPTION

Slides for "An Ensemble-based Approach to Fast Classification of Multi-label Data Streams" by Xiangnan Kong and Philip S. Yu, Dept. of Computer Science, University of Illinois at Chicago. The talk introduces data streams (high-speed data flows that arrive continuously and change over time) and applications such as online message classification, network traffic monitoring, and credit card transaction classification.

TRANSCRIPT


An Ensemble-based Approach to Fast Classification of Multi-label Data Streams
Xiangnan Kong, Philip S. Yu
Dept. of Computer Science, University of Illinois at Chicago

Introduction: Data Stream
Data stream: a high-speed data flow that is continuously arriving and changing, e.g. network traffic, credit card transactions, online messages.
Applications:
- online message classification
- network traffic monitoring
- credit card transaction classification


Introduction: Stream Classification
Stream classification:
- Construct a classification model on past stream data
- Use the model to predict the class labels of incoming data
[Figure: training data drawn from the stream are used to train a classification model, which then classifies incoming data]

Multi-Label Stream Data
In many real applications, one stream object can have multiple labels.
Example: a news article carrying the labels "Legendary", "Sad", and "Company".

Example: emails, each tagged with a set of labels.
Conventional stream classification uses a single-label setting: it assumes one stream object can have only one label.

Multi-Label Stream Classification
- Traditional stream classification: each object (instance) is associated with a single label.
- Multi-label stream classification: each object (instance) can be associated with multiple labels (a sketch of the resulting test-then-train loop follows below).
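For concreteness, here is a minimal Python sketch of the stream-classification loop in the multi-label setting, where each object carries a set of labels. The `model` interface (`predict`/`update`) and the `metric` callable are hypothetical placeholders for illustration, not the paper's API.

```python
def run_stream(model, stream, metric):
    """Test-then-train loop over a multi-label data stream.

    `stream` yields (x, true_labels) pairs, where x is a feature vector and
    true_labels is the *set* of labels attached to the object.
    """
    scores = []
    for x, true_labels in stream:
        predicted = model.predict(x)               # predict a label set for the incoming object
        scores.append(metric(predicted, true_labels))
        model.update(x, true_labels)               # one-pass update; the object is then discarded
    return scores
```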

The Problem
Stream data:
- Huge data volume + limited memory: the entire dataset cannot be stored for training; a one-pass algorithm over the stream is required
- High speed: incoming data must be processed promptly
- Concept drifts: old data become outdated
Multi-label classification:
- The number of possible label sets is large (exponential in the number of labels)
- Conventional multi-label classification approaches focus on offline settings and cannot be applied here

Our Solution: Random Trees
- Random trees: very fast in training and testing
- Ensemble of multiple trees: effective, and reduces the prediction variance
- Statistics of multiple labels on the tree nodes: effective training/testing with multiple labels
- Fading function: reduces the influence of old data

Multi-label Random Tree
- Single pass over the data
- Each node splits on a randomly chosen variable with a random threshold (see the sketch after the comparison below)
- Ensemble of multiple trees
- Multi-label predictions
- Old data are faded out

Conventional decision trees, for contrast:
- Multiple passes over the dataset
- Variable selection at each node split
- Single-label prediction
- Static updates: use the entire dataset, including outdated data
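A minimal sketch of how such a randomized tree might look. The fixed depth, the pre-generated split structure, and the [0, 1] feature range are illustrative assumptions; the paper's exact construction may differ.

```python
import random

class Node:
    """Node of a multi-label random tree: a random split plus (later) label statistics."""

    def __init__(self, n_features, depth=0, max_depth=8, feature_range=(0.0, 1.0)):
        self.stats = None                      # NodeStats, attached during streaming updates (sketch below)
        if depth < max_depth:
            self.feature = random.randrange(n_features)         # split variable chosen at random
            self.threshold = random.uniform(*feature_range)     # split threshold chosen at random
            self.left = Node(n_features, depth + 1, max_depth, feature_range)
            self.right = Node(n_features, depth + 1, max_depth, feature_range)
        else:
            self.feature = None                                 # leaf node

    def route(self, x):
        """Follow the random splits from this node down to a leaf for feature vector x."""
        node = self
        while node.feature is not None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node
```

An ensemble is then simply a list of such roots, e.g. `trees = [Node(n_features) for _ in range(n_trees)]`. Because no variable selection or data scan is needed, the split structure can be generated independently of the stream, which is what makes training and testing fast.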

Training: Update Trees
Each incoming training instance is routed down every tree in the ensemble (Tree 1 ... Tree Nt), and the statistics of the visited nodes are updated.
[Figure: an instance traversing the nodes of two trees while the node statistics along its path are updated]

Tree Node Statistics
Statistics kept on each node:
- Aggregated label relevance vector
- Aggregated number of instances
- Aggregated label set cardinalities
- Time stamp of the latest update

Fading Function
The statistics are rescaled with a time fading function, to reduce the effect of old data on the node statistics.
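Putting the last two slides together, here is a sketch of the per-node statistics with a time-fading rescale, building on the `Node` sketch above. The slides do not give the exact fading function; the exponential form `2 ** (-lam * dt)` and the decay rate `lam` are assumptions, and for brevity only the leaf reached by an instance is updated (the actual method may also maintain statistics on internal nodes along the path).

```python
class NodeStats:
    """Faded statistics stored on a tree node."""

    def __init__(self, n_labels, lam=0.01):
        self.relevance = [0.0] * n_labels   # aggregated label relevance vector
        self.count = 0.0                    # aggregated number of instances
        self.cardinality = 0.0              # aggregated label set cardinalities
        self.last_update = None             # time stamp of the latest update
        self.lam = lam                      # fading rate (assumed exponential fading)

    def fade(self, t):
        """Rescale the statistics by how much time has passed since the latest update."""
        if self.last_update is not None:
            w = 2.0 ** (-self.lam * (t - self.last_update))
            self.relevance = [r * w for r in self.relevance]
            self.count *= w
            self.cardinality *= w
        self.last_update = t

    def update(self, label_set, t):
        """Fold one training instance (arriving at time t) into the statistics."""
        self.fade(t)                        # old data are down-weighted first
        for label in label_set:
            self.relevance[label] += 1.0
        self.count += 1.0
        self.cardinality += len(label_set)


def train_on_instance(trees, x, label_set, t, n_labels):
    """Single-pass training step: route the instance down every tree and update its leaf statistics."""
    for root in trees:
        leaf = root.route(x)
        if leaf.stats is None:
            leaf.stats = NodeStats(n_labels)
        leaf.stats.update(label_set, t)
```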

Prediction
[Figure: the test instance is routed down every tree in the ensemble (Tree 1 ... Tree Nt), and the resulting predictions are aggregated]

Aggregate predictions:
- Use the aggregated label relevance to rank all possible labels
- Use the aggregated label set cardinality to decide how many labels are included in the predicted label set
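Continuing the sketch above, one way to turn the aggregated statistics into a prediction: average the label relevance and the label-set cardinality over the trees, rank labels by relevance, and keep the top-k labels with k given by the rounded average cardinality. The normalization by the instance count and the rounding rule are assumptions for illustration.

```python
def predict(trees, x, n_labels):
    """Aggregate leaf statistics across the ensemble into a ranked multi-label prediction."""
    relevance = [0.0] * n_labels
    expected_size, used = 0.0, 0
    for root in trees:
        stats = root.route(x).stats
        if stats is None or stats.count == 0:
            continue                                    # this tree has seen no data at the reached leaf
        for label in range(n_labels):
            relevance[label] += stats.relevance[label] / stats.count
        expected_size += stats.cardinality / stats.count
        used += 1
    if used:
        expected_size /= used
    k = int(round(expected_size))                       # how many labels to include
    ranked = sorted(range(n_labels), key=lambda l: relevance[l], reverse=True)
    return set(ranked[:k]), relevance                   # label set + scores (scores feed the Ranking Loss)
```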

Experiment Setup
Three methods are compared:
- SMART (Stream Multi-lAbel Random Tree): multi-label stream classification with random trees [this paper]
- SMART (static): SMART without the fading function, i.e. the trees keep being updated without fading
- Multi-label kNN: a state-of-the-art multi-label classification method combined with a sliding window

Data Sets
Three multi-label stream classification datasets:
- MediaMill: video annotation task, from the MediaMill Challenge
- TMC2007: text classification task, from the SDM text mining competition
- RCV1-v2: large-scale text classification task, from the Reuters dataset
[Table: number of instances, number of features, number of labels, and label density for each dataset]

Evaluation
Multi-label metrics [Elisseeff & Weston, NIPS 2002]:
- Ranking Loss: evaluates the probability outputs; the average number of label pairs that are ranked incorrectly; the smaller, the better
- Micro F1: evaluates the predicted label sets; combines the micro-averages of precision and recall; the larger, the better
Sequential evaluation with concept drifts, simulated by mixing two streams.
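The two metrics can be computed as follows. These are the standard definitions, shown here only to make the "smaller/larger is better" directions concrete; in this sketch the ranking loss is normalized by the number of label pairs and ties are counted as errors.

```python
def ranking_loss(scores, true_labels, n_labels):
    """Fraction of (relevant, irrelevant) label pairs ranked incorrectly; lower is better."""
    relevant = [l for l in range(n_labels) if l in true_labels]
    irrelevant = [l for l in range(n_labels) if l not in true_labels]
    if not relevant or not irrelevant:
        return 0.0
    wrong = sum(1 for r in relevant for i in irrelevant if scores[r] <= scores[i])
    return wrong / (len(relevant) * len(irrelevant))


def micro_f1(predicted_sets, true_sets):
    """Micro-averaged F1 over all labels of all instances; higher is better."""
    tp = fp = fn = 0
    for pred, true in zip(predicted_sets, true_sets):
        tp += len(pred & true)   # correctly predicted labels
        fp += len(pred - true)   # predicted but not true
        fn += len(true - pred)   # true but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```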

Throughput / Efficiency

Effectiveness: Ranking Loss (lower is better), MediaMill dataset
[Figure: Ranking Loss over the stream (x 4,300 instances) for SMART, SMART (static) without the fading function, and Multi-Label kNN with window sizes w = 100, 200, 400]
Our approach with multi-label streaming random trees performed best on the MediaMill dataset.

Effectiveness: Micro F1 (higher is better), MediaMill dataset
[Figure: Micro F1 over the stream (x 4,300 instances) for the same methods: SMART, SMART (static) without the fading function, and Multi-Label kNN with w = 100, 200, 400]

Experiment Results
[Figure: Micro F1 and Ranking Loss on the MediaMill, RCV1-v2, and TMC2007 datasets]


Conclusions
An ensemble-based approach to fast classification of multi-label data streams:
- Ensemble-based approach: effective
- Predicts multiple labels
- Very fast in training (updating node statistics) and in prediction using random trees: efficient

Thank you!