Novelty Detection in Data Streams
Prof. Elaine Faria, UFU, 2018
• Slides based on the papers
  – FARIA, ELAINE R.; GONÇALVES, ISABEL J. C. R.; DE CARVALHO, ANDRÉ C. P. L. F.; GAMA, JOÃO. Novelty detection in data streams. Artificial Intelligence Review, v. 45, p. 235-269, 2016.
  – FARIA, ELAINE RIBEIRO; PONCE DE LEON FERREIRA CARVALHO, ANDRÉ CARLOS; GAMA, JOÃO. MINAS: multiclass learning algorithm for novelty detection in data streams. Data Mining and Knowledge Discovery, v. 30, p. 640-680, 2016.
  – FARIA, ELAINE; GONCALVES, ISABEL; GAMA, JOAO; PONCE DE LEON FERREIRA CARVALHO, ANDRE. Evaluation of Multiclass Novelty Detection Algorithms for Data Streams. IEEE Transactions on Knowledge and Data Engineering, v. 27, p. 2961-2973, 2015.
Introduction
• Novelty Detection (ND) - Definitions
  – Novelty detection is concerned with identifying abnormal system behaviours and abrupt changes from one regime to another (Lee and Roberts 2008)
  – The recognition that an input differs in some respect from previous inputs (Perner 2008)
  – Novelty detection makes it possible to recognize novel concepts, which may indicate the appearance of a new concept, a change in known concepts, or the presence of noise (Gama 2010)
Introduction
• Novelty detection
  – is useful in cases where an important class is under-represented in the training set
  – is an important task, since, for many problems, we never know whether the currently available training data include all possible object classes
  – allows the recognition of novel profiles (concepts) in unlabeled data
Introduction
• Novelty Detection - Challenges
  – Concept drift
  – Noise and outliers
  – Recurring concepts
  – Concept evolution
    • Number of problem classes increases over time
Introduction
• Data stream applications for ND
  – Intrusion detection
  – Fraud detection
  – Medical diagnosis
  – Detection of regions of interest in images
  – Fault detection
  – Spam filtering
  – Text classification
  – ...
Introduction
• It is important to distinguish
  – Anomaly detection
  – Outlier detection
  – Novelty detection
Introduction
• Novelty, anomaly and outlier detection all aim to find patterns that differ from the normal (usual) ones
  – Anomaly and outlier detection convey the idea of an undesired pattern
  – Novelty indicates an emergent or new concept that needs to be incorporated into the normal pattern
Novelty detection - Formalization of the Problem
Training set (Offline Phase): $D_{tr} = \{(X_1, y_1), (X_2, y_2), \ldots, (X_m, y_m)\}$
  – $X_i$: vector of input attributes for the $i$-th example
  – $y_i$: target attribute, with $y_i \in Y_{tr}$ and $Y_{tr} = \{c_1, c_2, \ldots, c_L\}$
When new data arrive (Online Phase): $Y_{all} = \{c_1, c_2, \ldots, c_L, \ldots, c_K\}$, $K > L$
Goal: classify $X_{new}$ in $Y_{all}$
Novelty detection - Phases
• Offline Phase
  – Induces a classifier from a set of labeled examples → known concept about the problem
• Online Phase
  – Classifies new unlabeled examples
  – Identifies novelty patterns
  – Updates the decision model
Offline Phase - Taxonomy
Offline Phase
• Learning task
  – Unsupervised approaches
    • Assume that all the examples in the training set belong to the normal concept
  – Supervised approaches
    • Use the labels of the examples to build the decision model
    • The normal concept is composed of a set of different classes
Online Phase
• Tasks
  – Classification of new examples
  – Detection of novelty patterns
  – Adaptation of the decision model
    • Some algorithms update the decision model in an offline fashion
Online Phase
[Figure: Classification]
Online Phase
• Classification
  – Verify whether a new example can be explained by the current decision model
  – Approach 1
    • Classify new examples only as normal or novelty
  – Approach 2
    • Treat the problem as a multiclass classification task
Online Phase
[Figure: Classification - Taxonomy]
Online Phase
• Classification with unknown label option
  – Examples not explained by the current decision model are not immediately classified
    • They are assigned an unknown profile
  – They are put in a short-term memory for future analysis
    • Used to update the decision model: extensions and novelty patterns
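The unknown label option can be sketched as follows. This is a minimal illustration, assuming the decision model is a list of hyperspheres given as `(center, radius, label)` tuples; that representation, and the names `classify`, `model` and `short_term_memory`, are assumptions made for the example rather than the definition of any specific algorithm.

```python
import math

def classify(example, model, short_term_memory):
    """Return the label of the closest hypersphere that contains the example.

    If no hypersphere explains the example, it is stored in the short-term
    memory and the string "unknown" is returned.
    """
    best_label, best_dist = None, math.inf
    for center, radius, label in model:
        dist = math.dist(example, center)
        if dist <= radius and dist < best_dist:
            best_label, best_dist = label, dist
    if best_label is None:
        short_term_memory.append(example)
        return "unknown"
    return best_label

# Hypothetical two-class model and three incoming examples.
model = [((0.0, 0.0), 1.0, "c1"), ((5.0, 5.0), 1.5, "c2")]
memory = []
labels = [classify(x, model, memory) for x in [(0.2, 0.1), (5.5, 4.8), (10.0, 10.0)]]
```

The third example falls outside both hyperspheres, so it is kept in `memory` for later analysis instead of being forced into a known class.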
Online Phase
[Figure: Detection of novelty patterns]
Online Phase
• Detection of novelty patterns
  – Uses unlabeled examples not explained by the current decision model to identify novelty patterns
  – Anomaly detection
    • The presence of a single example not explained by the model identifies an anomalous behavior
  – Novelty
    • Composed of a cohesive and representative set of examples not explained by the decision model
Online Phase
[Figure: Detection of novelty patterns: Taxonomy]
Online Phase
[Figure: Update of the decision model]
Online Phase
• Update of the decision model
  – Necessary to address concept drift and concept evolution
  – Can be carried out with or without feedback
  – Forgetting mechanisms
    • Important strategy used to remove outdated concepts
Online Phase
• Update of the decision model: External Feedback
  – Approach 1: external feedback
    • Assumes that the true labels of all the examples will be available after a delay
    • Unrealistic assumption for data streams
  – Approach 2: active learning
    • Asks the user for the labels of a subset of the examples in the stream
  – Approach 3: without feedback
    • The decision model is updated without information about the true labels of the examples
Online Phase
• Update of the decision model: Forgetting mechanism → Important to forget previous, outdated concepts
  – Approach 1: Based on an ensemble of classifiers
    • Train a new classifier and replace an old one
  – Approach 2: Based on clusters
    • Clusters that have not received new examples for a long time are removed
  – Approach 3: Based on weights
    • Reduce the weight of the old examples
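The cluster-based forgetting mechanism (Approach 2) can be sketched as below. It assumes each cluster records the timestamp of the last example it received; the dictionary layout and the `max_idle` parameter are hypothetical names chosen for this illustration.

```python
def forget_old_clusters(clusters, now, max_idle):
    """Split clusters into (kept, removed) according to their idle time.

    A cluster is removed when it has not received an example for more
    than max_idle timestamps.
    """
    kept = [c for c in clusters if now - c["last_update"] <= max_idle]
    removed = [c for c in clusters if now - c["last_update"] > max_idle]
    return kept, removed

# Hypothetical model: c1 was updated recently, c2 has been idle for 900 timestamps.
clusters = [
    {"label": "c1", "last_update": 950},
    {"label": "c2", "last_update": 100},
]
kept, removed = forget_old_clusters(clusters, now=1000, max_idle=200)
```

Returning the removed clusters, rather than silently dropping them, leaves room for storing them elsewhere when recurring concepts matter.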
Detection of recurring concepts
• Recurring concepts: definition
  – The class definitions may change when previous situations recur, in a periodic or random way, after some period of time (Elwell and Polikar 2011)
  – Special type of concept drift where concepts that appeared in the past may recur in the future (Katakis et al. 2010)
Detection of recurring concepts
• Recurring contexts: Examples
  – Climate change
  – Electricity demand
  – Buyer habits
  – ...
Detection of recurring concepts
• It would be a waste of effort to relearn an old concept from scratch for each recurrence (Widmer and Kubat 1996)
  – In recurring contexts
    • Instead of forgetting outdated concepts, these concepts should be saved and reexamined at some later time, when they can improve the prediction performance in a cost-effective way
Detection of recurring concepts
• Systems that do not address recurring concepts treat them as novelties
  – Undesirable effects
    • Increase in the false alarm rate
    • Increase in the human effort spent analyzing the false alarms
    • Computational effort spent executing a novelty detection task and learning a class that was already learned
Detection of recurring concepts
• Approaches
  – Approach 1: Use an auxiliary ensemble of classifiers that detects recurring classes
  – Approach 2: Use c ensembles, one per class
    • Each ensemble is never deleted, only updated
    • c is the number of classes seen so far in the stream
  – Approach 3: Use a sleep memory to store clusters that have not been used to classify new examples for a long time
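The sleep memory of Approach 3 can be illustrated with the sketch below. It assumes clusters are hyperspheres stored as dictionaries with `center`, `radius` and `label` keys, and that a candidate novelty is checked against the sleeping clusters before being declared new; the function name and data layout are hypothetical.

```python
import math

def check_recurrence(new_center, sleep_memory, model):
    """Return the label of a sleeping cluster that covers new_center, if any.

    A matching cluster is woken up: it is moved from the sleep memory back
    to the active model. Returns None when no sleeping cluster matches,
    which means the candidate may be a genuinely new concept.
    """
    for cluster in sleep_memory:
        if math.dist(new_center, cluster["center"]) <= cluster["radius"]:
            sleep_memory.remove(cluster)  # safe: we return right after mutating
            model.append(cluster)
            return cluster["label"]
    return None

model = [{"center": (0.0, 0.0), "radius": 1.0, "label": "c1"}]
sleep_memory = [{"center": (8.0, 8.0), "radius": 1.0, "label": "c2"}]
recurring = check_recurrence((8.3, 7.9), sleep_memory, model)
novel = check_recurrence((20.0, 20.0), sleep_memory, model)
```

Here the first candidate revives class `c2` instead of triggering a false novelty alarm, while the second one matches nothing and remains a novelty candidate.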
Treatment of Outliers
• Outliers
  – Data that are isolated, sparse and not present in a representative number
• Novelty detection algorithms
  – Look for a cohesive and representative set of examples
  – Must handle noise and outliers, which can be confused with the appearance of a new concept or a change in the known concepts
Treatment of Outliers
• Approach for outlier treatment (used by the MCM, ECSMiner, MINAS and OLINDDA algorithms)
  – Store the examples not explained by the current model in a temporary memory
  – Cluster these examples
  – Apply validation criteria to the clusters
    • Examples of validation criteria: cohesiveness, representativeness, separability
    • Invalid clusters are potential outliers
• MINAS also proposes removing old examples that have stayed in the temporary memory for a long time
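A cluster validation step along these lines can be sketched as below. The two checks are simplified stand-ins: representativeness is taken as a minimum number of examples and cohesiveness as a bound on the mean distance to the centroid. The thresholds `min_examples` and `max_mean_dist` are assumptions for the example; the cited algorithms each define their own criteria.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def is_valid_cluster(points, min_examples, max_mean_dist):
    """Check representativeness (size) and cohesiveness (mean distance to centroid)."""
    if len(points) < min_examples:
        return False
    c = centroid(points)
    mean_dist = sum(math.dist(p, c) for p in points) / len(points)
    return mean_dist <= max_mean_dist

dense = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2)]
sparse = [(0.0, 0.0), (9.0, 9.0), (0.0, 9.0), (9.0, 0.0)]
tiny = [(5.0, 5.0)]
results = [is_valid_cluster(c, min_examples=3, max_mean_dist=0.5) for c in (dense, sparse, tiny)]
```

Only the dense group passes; the sparse and the single-example groups would be treated as potential outliers.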
Examples of Novelty Detection Algorithms for Data Streams
• ECSMiner (Masud et al. 2011)
• OLINDDA (Spinosa et al. 2009)
• MINAS (Faria et al. 2016)
• MCM (Masud et al. 2010)
• CLAM (Al-Khateeb et al. 2012)
ECSMiner
• Supervised algorithm for concept drift and concept evolution
• The decision model is composed of an ensemble of classifiers
  – It assumes that all examples will be labeled after a delay
  – Each classification model is trained from a chunk of data
  – The ensemble is composed of M models
  – The ensemble is continuously updated
    • The model with the highest prediction error is replaced by a new model
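The ensemble update can be sketched as follows. This is only an illustration, assuming each model is a callable that maps an example to a label and that errors are measured on the most recent labeled chunk; `update_ensemble` and its parameters are hypothetical names, not ECSMiner's actual interface.

```python
def update_ensemble(ensemble, new_model, labeled_chunk, max_models):
    """Add new_model and, if the ensemble is too large, drop the worst model.

    The worst model is the one with the highest error rate on labeled_chunk,
    a list of (example, true_label) pairs.
    """
    def error(model):
        wrong = sum(1 for x, y in labeled_chunk if model(x) != y)
        return wrong / len(labeled_chunk)

    candidates = ensemble + [new_model]
    if len(candidates) <= max_models:
        return candidates
    worst = max(candidates, key=error)
    return [m for m in candidates if m is not worst]

# Toy one-dimensional models for illustration only.
always_a = lambda x: "a"
always_b = lambda x: "b"
threshold = lambda x: "a" if x < 5 else "b"
chunk = [(1, "a"), (2, "a"), (7, "b"), (9, "b"), (3, "a")]
ensemble = update_ensemble([always_a, always_b], threshold, chunk, max_models=2)
```

On this chunk `always_b` makes the most mistakes, so it is the model that leaves the ensemble.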
ECSMiner
• Assumptions
  – After Tl timestamps, the true label of an example will be available
  – It is possible to wait up to Tc timestamps before deciding on the classification of an example
  – Tc < Tl
ECSMiner
• Offline Phase
  – Supervised
  – Ensemble of classifiers
    • Decision tree or KNN
• Online Phase
  – Use the ensemble to classify new examples
  – Store the examples not explained by the ensemble (f-outliers)
  – Build clusters from the f-outliers using K-Means
  – Calculate the q-NSC measure (q-neighbourhood silhouette coefficient)
  – If the q-NSC is positive for most of the classifiers → a novelty is detected
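The q-NSC idea can be sketched as below, under the assumption that it compares the mean distance from an f-outlier to its q nearest f-outlier neighbours (`d_out`) with the smallest mean distance to its q nearest neighbours in any existing class (`d_min`), as (d_min - d_out) / max(d_min, d_out). The formula is not given on the slide, so treat this as a reading of Masud et al. (2011) rather than a definitive statement; the sketch also works on raw points, whereas the slide computes the measure on clusters.

```python
import math

def mean_q_nearest(x, points, q):
    """Mean distance from x to its q nearest points, excluding x itself (by identity)."""
    dists = sorted(math.dist(x, p) for p in points if p is not x)
    nearest = dists[:q]
    return sum(nearest) / len(nearest)

def q_nsc(x, f_outliers, classes, q):
    """Positive when x is closer to other f-outliers than to any existing class."""
    d_out = mean_q_nearest(x, f_outliers, q)
    d_min = min(mean_q_nearest(x, pts, q) for pts in classes.values())
    return (d_min - d_out) / max(d_min, d_out)

classes = {"c1": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]}

# A tight group of f-outliers far from c1: cohesive and well separated.
far_group = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
score_far = q_nsc(far_group[0], far_group, classes, q=2)

# An f-outlier that sits next to c1 while the other f-outliers are far away.
mixed_group = [(1.0, 1.0), (9.0, 9.0), (9.0, 8.0)]
score_near = q_nsc(mixed_group[0], mixed_group, classes, q=2)
```

The value lies in [-1, 1]; values close to 1 suggest a cohesive group that is well separated from the known classes.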
OLINDDA
• Offline Phase
  – Unsupervised
  – Learns a decision model of the normal class
  – The decision model is a set of clusters (k hyperspheres)
    • Clustering algorithm: K-Means
OLINDDA
• Online Phase
  – Unsupervised
  – Uses the decision model created in the offline phase to classify new examples as normal
  – Examples not explained by the decision model are put in a short-term memory (unknown)
  – Valid clusters of unknown examples are used to create the extension and novelty models
OLINDDA
[Figure: decision model composed of Normal, Extension and Novelty hyperspheres]
OLINDDA
• If a new example is inside the radius of one of the hyperspheres, classify it with the label of that hypersphere
[Figure: new examples compared against the Normal, Extension and Novelty hyperspheres]
OLINDDA
• If the example is labeled as unknown (not explained by any of the hyperspheres), it is stored in a short-term memory
[Figure: example outside the Normal and Extension hyperspheres placed in the short-term memory]
OLINDDA
• If the number of examples in the short-term memory exceeds a threshold, cluster the examples using K-Means
• Only valid clusters (cohesive and representative) are considered
[Figure: short-term memory → # examples > threshold → K-Means]
OLINDDA
• A new cluster is
  – Extension
    • In the neighbourhood of the normal model
  – Novelty
    • Distant from the normal model
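One way to make "neighbourhood" and "distant" concrete is sketched below. It assumes the normal model is summarised by a single enclosing hypersphere, centred on the mean of the normal cluster centres and reaching the farthest of them; a new valid cluster whose centre falls inside is an extension, otherwise a novelty. This rule is an assumption for the illustration and may differ in detail from the one used by OLINDDA.

```python
import math

def label_new_cluster(new_center, normal_centers):
    """Label a new valid cluster as 'extension' or 'novelty'."""
    dims = len(new_center)
    n = len(normal_centers)
    # Centre of the region occupied by the normal model.
    global_center = tuple(sum(c[d] for c in normal_centers) / n for d in range(dims))
    # Radius reaching the farthest normal cluster centre.
    global_radius = max(math.dist(global_center, c) for c in normal_centers)
    if math.dist(new_center, global_center) <= global_radius:
        return "extension"
    return "novelty"

normal_centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
near = label_new_cluster((2.5, 3.0), normal_centers)
far = label_new_cluster((10.0, 10.0), normal_centers)
```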
MINAS: MultIclass learNing Algorithm for data Streams
• Offline Phase
  – Learns a decision model based on the known concept about the problem
  – Executed once
  – Each class is represented by a set of clusters (hyperspheres)
• Online Phase
  – Receives new examples and classifies them either as one of the known classes or as unknown
  – Cohesive groups of unknown examples are used to detect new classes or extensions
MINAS - Offline Phase
MINAS - Offline Phase
• Micro-clusters: statistical summary (incremental)
  – N: number of examples
  – LS: linear sum of the examples
  – SS: squared sum of the examples
  – t: timestamp of the arrival of the last example classified by the micro-cluster
• Examples of clustering algorithms used in the training phase
  – K-Means
  – CluStream
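The incremental nature of the summary can be sketched as below: adding an example only updates N, LS, SS and t, and the centroid and a radius can be derived from them at any time. The class name, and the choice of the root-mean-square deviation as the radius, are assumptions made for this example; MINAS may scale the radius differently.

```python
import math

class MicroCluster:
    """Incremental summary (N, LS, SS, t) of the examples absorbed so far."""

    def __init__(self, dims):
        self.n = 0                # N: number of examples
        self.ls = [0.0] * dims    # LS: per-dimension linear sum
        self.ss = [0.0] * dims    # SS: per-dimension squared sum
        self.t = None             # t: timestamp of the last absorbed example

    def add(self, example, timestamp):
        self.n += 1
        for d, v in enumerate(example):
            self.ls[d] += v
            self.ss[d] += v * v
        self.t = timestamp

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square deviation: sqrt(sum_d(SS_d / N - (LS_d / N)^2)).
        var = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2
                  for d in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))  # guard against tiny negative round-off

mc = MicroCluster(dims=2)
mc.add((1.0, 2.0), timestamp=1)
mc.add((3.0, 4.0), timestamp=2)
```

Because only the sums are stored, the original examples never need to be kept in memory, which is what makes the summary suitable for streams.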
MINAS
• Online Phase
  – Classify new examples
  – Detect novelty patterns
  – Update the decision model
MINAS - Classification
MINAS - Classification
• Classifying an example as unknown means one of the following
  – The example is noise or an outlier and cannot be explained by any of the micro-clusters
    • The example must be discarded
  – The example represents a concept drift
    • The example must be used to update the decision model
  – The example represents a novelty pattern
    • The example must be used to update the decision model
MINAS – Novelty detection and update
MINAS - Online Phase
MINAS-Active Learning
• Used when labels are available for only a small set of examples
• Uses active learning techniques to select a representative set of examples to be labeled and used to update the decision model
• Main idea
  – From time to time, select the centroids of the newly created micro-clusters as the examples to be labeled by the specialist
  – Update the decision model with the new labels
Evaluation in Novelty Detection
Multiclass novelty detection algorithms for data streams use binary evaluation measures
  – % of examples misclassified in the normal class
  – % of normal class examples wrongly classified as novelty
  – % of incorrect classifications
where
  – FP: # of examples from the known classes wrongly classified as novelty
  – FN: # of examples from the novel classes wrongly classified as known classes
  – FE: # of examples from known classes misclassified (other than FP)
  – N: # of examples in the stream
  – Nc: # of examples from the novel classes
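The formulas for these three measures did not survive on the slide. The sketch below assumes the definitions usually given with these counts in Masud et al. (2011), under the names Mnew = 100·FN/Nc, Fnew = 100·FP/(N − Nc) and Err = 100·(FP + FN + FE)/N; the function name and the counts in the example are hypothetical.

```python
def binary_nd_measures(fp, fn, fe, n, nc):
    """Return (Mnew, Fnew, Err) as percentages, under the assumed definitions."""
    m_new = 100.0 * fn / nc              # novel-class examples missed
    f_new = 100.0 * fp / (n - nc)        # known-class examples flagged as novelty
    err = 100.0 * (fp + fn + fe) / n     # all incorrect classifications
    return m_new, f_new, err

# Hypothetical stream: 1000 examples, 200 of them from novel classes.
m_new, f_new, err = binary_nd_measures(fp=30, fn=20, fe=50, n=1000, nc=200)
```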
Evaluation in Novelty Detection
• Binary classification evaluation measures: Problems
  – They treat novelty detection as a binary classification task
    • It is a multiclass classification task
  – They do not consider the unknown examples separately
  – They do not consider that different novelty patterns can appear
  – They evaluate only the final confusion matrix
Evaluation in Novelty Detection (Faria et al. 2013)
• Confusion matrix
  – Not square (rectangular)
  – Number of columns increases over time
  – Novelty patterns do not have a direct match with the problem classes
  – Presence of unknown examples
Evaluation in Novelty Detection (Faria et al. 2013)
• Rectangular Confusion Matrix
  – Problem
    • Difficult to define hits and errors
    • The matrix is not square
    • Each novelty pattern needs to be assigned to only one class
    • One class may be associated with one or more novelty patterns
  – Solution
    • Representation using a bipartite graph
    • Based on the Hungarian method
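The association constraint (each novelty pattern goes to exactly one class, while a class may receive several patterns) can be illustrated with the simplified sketch below. It maps each novelty pattern to the class that contributes most of its examples. This greedy rule is a stand-in chosen for brevity, not the Hungarian method itself, and the confusion-matrix layout is an assumption.

```python
def assign_novelty_patterns(confusion):
    """Map each novelty pattern to the true class with the most examples in it.

    confusion[true_class][pattern] holds the number of examples of
    true_class that the algorithm placed in that novelty pattern.
    """
    patterns = {p for row in confusion.values() for p in row}
    mapping = {}
    for p in sorted(patterns):
        mapping[p] = max(confusion, key=lambda c: confusion[c].get(p, 0))
    return mapping

# Hypothetical rectangular matrix: 3 true classes, 3 novelty patterns.
confusion = {
    "c1": {"NP1": 2, "NP2": 0, "NP3": 1},
    "c2": {"NP1": 40, "NP2": 3, "NP3": 35},
    "c3": {"NP1": 1, "NP2": 50, "NP3": 0},
}
mapping = assign_novelty_patterns(confusion)
```

Both NP1 and NP3 end up associated with `c2`, which shows why the mapping is many-to-one rather than a one-to-one matching.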
Evaluation in Novelty Detection (Faria et al. 2013)
[Figure: confusion matrix, corresponding bipartite graph, and resulting bipartite subgraph]
Evaluation in Novelty Detection (Faria et al. 2013)
• Unknown examples
  – Problem
    • How should the unknown examples be counted: as hits or as errors?
  – Solution
    • Neither hits nor errors
    • Unknown examples should be computed separately
Evaluation in Novelty Detection (Faria et al. 2013)
• Unknown examples
  – Unk_i: # of examples from class C_i classified as unknown
  – ExC_i: # of examples from class C_i
  – M: # of classes
  – AccExp + ErrExp = 1, where AccExp / ErrExp are the accuracy / error considering only the examples explained by the model
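The unknown-rate formula itself is missing from the slide. The sketch below assumes the per-class average UnkR = (1/M) · Σ_i (Unk_i / ExC_i), which is consistent with the symbols listed for unknown examples; the function and variable names are hypothetical.

```python
def unknown_rate(unk, ex):
    """Average, over the M classes, of the fraction of each class left unknown.

    unk[c]: number of examples of class c classified as unknown.
    ex[c]:  total number of examples of class c.
    """
    return sum(unk[c] / ex[c] for c in ex) / len(ex)

# Hypothetical counts for three classes.
unk = {"c1": 10, "c2": 0, "c3": 30}
ex = {"c1": 100, "c2": 200, "c3": 60}
unk_r = unknown_rate(unk, ex)  # (0.1 + 0.0 + 0.5) / 3
```

Averaging per class, rather than over all examples, keeps a small class with many unknowns from being hidden by a large, well-explained class.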
Evaluation in Novelty Detection (Faria et al. 2013)
• Uses the evaluation measure CER (Combined Error Rate) to calculate the classification error rate
• Considers only the examples not classified as unknown
  – #Ex′Ci: number of examples from class Ci
  – #Ex′: number of examples
  – FPRi: false positive rate of class Ci
  – FNRi: false negative rate of class Ci
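The CER formula is not legible on the slide. The sketch below assumes the class-weighted form CER = (1/2) Σ_i (#Ex′Ci / #Ex′) FPRi + (1/2) Σ_i (#Ex′Ci / #Ex′) FNRi, computed over the examples that were not classified as unknown; the function name and the rates in the example are hypothetical.

```python
def combined_error_rate(per_class):
    """Class-weighted average of the false positive and false negative rates.

    per_class maps a class to (n_examples, fpr, fnr), where n_examples counts
    only the examples of that class that were not classified as unknown.
    """
    total = sum(n for n, _, _ in per_class.values())
    return sum((n / total) * (fpr + fnr) / 2 for n, fpr, fnr in per_class.values())

# Hypothetical rates for two classes with 600 and 400 explained examples.
per_class = {"c1": (600, 0.10, 0.20), "c2": (400, 0.05, 0.15)}
cer = combined_error_rate(per_class)  # 0.6 * 0.15 + 0.4 * 0.10
```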
Evaluation in Novelty Detection (Faria et al. 2013)
• Evaluation over time: Problem
  – In evolving data streams, it is not sufficient to extract information only from the final confusion matrix
• Solution
  – Plot a 2D graph
    • X represents the data timestamps
    • Y represents the evaluation measure values
  – Plot the information about errors and unknown examples
  – Identify the timestamps at which a new concept was detected
References
• Masud M, Gao J, Khan L, Han J, Thuraisingham BM (2011) Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering 23(6):859–874
• Spinosa EJ, Carvalho ACPLF, Gama J (2009) Novelty detection with application to data streams. Intelligent Data Analysis 13(3):405–422
• Faria ER, Carvalho ACPLF, Gama J (2016) MINAS: multiclass learning algorithm for novelty detection in data streams. Data Mining and Knowledge Discovery 30:640–680
• Masud MM, Chen Q, Khan L, Aggarwal CC, Gao J, Han J, Thuraisingham BM (2010) Addressing concept evolution in concept-drifting data streams. In: Proceedings of the 10th IEEE international conference on data mining (ICDM’10), pp 929–934
• Al-Khateeb TM, Masud MM, Khan L, Thuraisingham B (2012) Cloud guided stream classification using class-based ensemble. In: Proceedings of the 2012 IEEE 5th international conference on cloud computing (CLOUD’12). IEEE Computer Society, Washington, DC, USA, pp 694–701
• Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks 22(10):1517–1531
• Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems 22(3):371–391
• Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Machine Learning 23(1):69–101