SUBMITTED TO IEEE TKDE - SPECIAL ISSUE ON INTELLIGENT DATA PREPARATION 1
An Evaluation of Neural Network Methods and
Data Preparation Strategies for Novelty
Detection
Rewbenio A. Frota, Guilherme A. Barreto, Member, IEEE,
and Joao C. M. Mota, Member, IEEE
The authors are with the Department of Teleinformatics Engineering, Federal University of Ceara (UFC), Fortaleza-CE, Brazil.
E-mails: {rewbenio, guilherme, mota}@deti.ufc.br.
November 24, 2004 DRAFT
Abstract
An important issue in the design of a model for a particular data set is the quality of the data
concerning the presence of anomalous observations (outliers) and their influence on the performance of
pattern classifiers. Common approaches to deal with outliers remove them from data or improve the
robustness of the machine learning method by handling outliers directly. We explore these two views
by introducing a systematic methodology to compare the performance of neural methods applied to
novelty detection. Firstly, we describe in a tutorial-like fashion the most common neural-based novelty
detection techniques. Then, in order to compute reliable decision thresholds, we generalize the recent
application of the bootstrap resampling technique to unsupervised novelty detection to the supervised
case, and propose an outlier removal procedure based on it. Finally, we evaluate the performance of the
neural network methods through simulations on a breast cancer data set, assessing their robustness to
outliers and their sensitivity to training parameters, such as data scaling, number of neurons, training
epochs and size of the training set. We conclude the paper by discussing the obtained results.
Index Terms
Novelty detection, self-organizing maps, multilayer neural networks, bootstrap, prediction intervals.
I. INTRODUCTION
Novelty detection is the problem of reporting the occurrence of novel events or data. As such,
it has been the focus of increasing attention in many pattern recognition applications whose
success depends on building a reliable model for the data, such as machine monitoring [1], [2],
[3], image processing [4], radar target detection [5], detection of masses in mammograms [6],
mobile robotics [7], [8], handwritten digit recognition [9], computer network security [10], [11],
[12], statistical process control [13], fault management [14], among others.
This interest is due in part to the crucial importance, for some problems, of a model that is
able to detect patterns that do not match the stored data representation well. Due to the
wide range of applicability across disciplines in engineering and science, novelty detection can
also be called anomaly detection, intruder detection, fault detection or outlier detection.
Several neural, system-theoretic, statistical and hybrid approaches to novelty detection have
been proposed over the years, but it has become usual to formulate novelty detection
as one of the following pattern classification problems:
• Single-Class Classification: The data available for learning a representation of the expected
behavior of the system of interest is composed only of one type of data vectors, usually
representing normal activity of the system. This type of data is also referred to as positive
examples. The goal is to indicate if a given input vector corresponds to normal or abnormal
behavior.
• Multi-Class Classification: The training set contains data vectors of different classes. The
data should be representative of positive (normal) and negative (abnormal) behavior, in order
to build an overall representation of the known system behavior, even (and especially) in
faulty operation [15]. The goal is to classify the input vector into one or none
of the existing classes.
Thus, the design of novelty detectors can be generally stated as the task in which a description
of what is already known about the system is learned by fitting a set of normal and/or abnormal
data vectors, so that subsequent unseen patterns are evaluated by comparing a measure of novelty
against some threshold. The main challenges are then the collection of reliable data, the definition
of an appropriate learning machine (i.e. the classifier) and the computation of decision thresholds
with which novel patterns can be detected for a given application.
As can be verified in good survey papers recently published [16], [17], [18], [19], considerable
efforts have been devoted to the design of powerful classifiers and threshold computation tech-
niques, while much less attention has been paid to the data-related issues, such as the occurrence
of outliers and data scaling methods, and their influence on the performance of the classifiers.
Concerning the quality of the collected data, most of the works in novelty detection
assume, implicitly or explicitly, that the training data is outlier-free or the outliers are known
in advance. Since outliers may arise due to a number of reasons, such as measurement error,
mechanical/electrical faults, unexpected behavior of the system (e.g. fraudulent behavior), or
simply by natural statistical deviations within the data set, those assumptions are unrealistic. It is
worth mentioning that the data labelling process, even if performed by an expert, is also error-prone.
Even if we assume that the data is outlier-free, it is very difficult, if not impossible, to know in
advance if the sampled data, concerning the number of positive and/or negative examples, suffices
to give a reliable description of the underlying statistical nature of the system. For example, for
some applications, the number of negative (abnormal) examples can be very small, since they are
rare or difficult (expensive) to collect. It is well-known that for a good classification performance,
the number of examples per class should be ideally balanced [20]. This is particularly true for
powerful classifiers, such as feedforward neural networks [21], where overfitting of the data
is common.
In this case, it is recommended to consider the few negative examples available as outliers,
treating the novelty detection task as a single-class classification problem, in which training
the classifier is carried out with positive (normal) examples only. The outliers are then used
to test the performance of the novelty detection system. Some authors, however, argue that the
inclusion of outliers during training can be beneficial for the novelty detection system, improving
its robustness as a whole [4], [22], [23]. If known outliers are unavailable, these authors suggest
to generate artificial outliers for that purpose.
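One simple way to generate such artificial outliers, sketched below under our own assumptions (the paper does not prescribe a specific generator), is to sample uniformly from a box slightly larger than the range of the normal training data, so that the negative examples fall around and outside the normal region:

```python
import numpy as np

# Hypothetical illustration: positive (normal) examples and artificial outliers.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 4))   # positive examples

# Sample outliers uniformly from a box 50% wider than the data range per feature.
lo, hi = normal.min(axis=0), normal.max(axis=0)
margin = 0.5 * (hi - lo)
outliers = rng.uniform(lo - margin, hi + margin, size=(50, 4))
```

The margin factor and the uniform-box strategy are illustrative choices; any generator that places negatives near the boundary of the normal class serves the same purpose.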
Bearing in mind the aforementioned issues concerning the design of a robust novelty detection
system, the contributions of this paper are manifold:
1) Proposal of a generic methodology to compute thresholds that is equally applicable
to supervised and unsupervised networks.
2) Proposal of a data cleaning strategy for outlier removal based on the proposed methodology.
3) Comparison of the proposed methodology with existing threshold determination techniques
using different neural network paradigms.
4) Evaluation of the proposed methodology in the presence of known and unknown outliers
and for different data scaling strategies.
The remainder of the paper is organized as follows. In Section II, we briefly present the
novelty detection task as a hypothesis testing procedure. Then, in Section III, we describe in
a tutorial-like fashion the most common neural-based novelty detection techniques. In Sections
IV and V, in order to compute reliable decision thresholds, we generalize the recent application
of the bootstrap resampling technique to unsupervised novelty detection to the supervised case,
and propose an outlier removal procedure based on it. Finally, in Section VI we evaluate the
performance of the neural network methods through simulations on a breast cancer data set and
discuss the obtained results. We conclude the paper in Section VII.
II. NOVELTY DETECTION AS HYPOTHESIS TESTING
Before starting the presentation of techniques for novelty detection, it is worth understanding
the novelty detection task under the formalism of statistical hypothesis testing, in order to
establish criteria to measure the performance of the neural models. First of all, it is necessary
to define a null hypothesis, i.e., the hypothesis to be tested. For our purposes, H 0 is stated as
follows:
• H 0: The input vector xnew reflects KNOWN activity.
where by the adjective known we mean the vector xnew represents normal behavior, if we are
dealing with single-class classification problems. If we have a multi-class classification problem,
the adjective known means that the input vector belongs to one of the already learned classes.
The so-called alternative hypothesis, denoted as H 1, is obviously given by:
• H 1: The input vector reflects UNKNOWN activity.
so that, in this case, the input vector carries novel information, which in general is indicative of
abnormal behavior of the system being analyzed.
Thus, when formulating a conclusion regarding the condition of the system based on the
definitions of H 0 and H 1, two types of errors are possible:
• Type I error: This error occurs when H 0 is rejected when it is, in fact, true. The probability
of making a type I error is denoted by the so-called significance level, α, whose value is
set by the investigator in relation to the consequences of such an error. That is, we want to
make the significance level as small as possible in order to protect the null hypothesis and
to prevent, as far as possible, the investigator from inadvertently making false claims. Type
I error is also referred to as False Alarm, False Detection or yet False Positive.
• Type II error: This error occurs when H 0 is accepted when it should be rejected. The
probability of making a type II error is denoted by β (which is generally unknown). Type II
error is also referred to as Absence of Alarm or False Negative. A type II error is frequently
due to sample sizes N being too small.
Novelty detection systems are usually evaluated by the number of false alarms and absence
of alarms they produce. The ideal novelty detector would have α = 0 and β = 0, but this is
not really possible in practice. So, one tries to manage α and β error probabilities based on the
overall consequences (e.g. high costs, death, machine breakdown, virus infection, etc.) for the
system being analyzed.
For example, a system that reports false alarms too frequently will gradually lose the operators'
confidence in its decisions, to the point that they may refuse to believe that an actual problem
is occurring. In medical testing, absences of alarm (false negatives) give patients and physicians
the false reassurance that a disease is absent when it is actually present, which in turn leads
to inappropriate understanding and a lack of better advice and treatment.
The difficulty is that, for any fixed sample size N , a decrease in α will cause an increase in β .
Conversely, an increase in α will cause a decrease in β . To decrease both α and β to acceptable
levels, we may increase the sample size N. Also, for any fixed α, an increase in the sample size
N will cause a reduction in β, i.e., a greater number of samples reduces the probability of
accepting the null hypothesis when it is false.
Usually, in neural-based novelty detection the number of samples is fixed and strongly related
to the number of neurons. If one increases the number of neurons in order to decrease both α
and β, the computational cost also increases rapidly. This can be problematic if the novelty
detection system is supposed to work in real time, such as in computer-network monitoring or spam
detection software. Even for offline applications, higher computational effort demands more
computational power, increasing hardware costs. In this paper, an alternative way to increase
the number of samples, which demands far less additional computational effort, is proposed: the
number of samples of the variable of interest is increased through statistical resampling
techniques, such as the bootstrap [24].
III. NEURAL METHODS FOR NOVELTY DETECTION
Supervised and unsupervised artificial neural network (ANN) algorithms have been used in a
wide range of novelty detection tasks, mainly due to their nonparametric1 nature and powerful
generalization performance [25]. In this section we briefly review the most common ANN
approaches to novelty detection. It is not our intention to provide a comprehensive survey of
possible approaches, but rather to give an introduction to the issue.
1By nonparametric we mean methods that make none or very few assumptions about the statistical distribution of the data.
A. Optimal Linear Associative Memory
One of the first approaches to novelty detection, called the Novelty Filter was proposed by
Kohonen and Oja [26]. The following theoretical development is a special case of the Optimal
Linear Associative Memory (OLAM) [27], in which one tries to learn a linear mapping y = Mx
from a finite set of input-output pairs (xi,yi), i = 1, . . . , m.
For novelty detection purpose, we are interested only in the autoassociative OLAM. In this
case, the pair (xi,yi) reduces to the redundant pair (xi,xi). So, given a set of sample vectors
x1, x2, . . . , xm, it is possible to compute the matrix M as follows:

M = X∗X, (1)

where the rows of X are the (transposed) training vectors xi and X∗ = XT(XXT)−1 denotes the
pseudo-inverse matrix of X; input vectors are treated as row vectors throughout this section.
Let the known vectors x1, x2, . . . , xm span some unique linear subspace L(x1, x2, . . . , xm) of
ℝⁿ, or alternatively,

L = L(x1, x2, . . . , xm) = { x | x = Σ_{i=1}^{m} ci xi } (2)

where c1, . . . , cm are arbitrary real scalars from the domain (−∞, ∞). It can be shown
that the matrix M behaves as a projection operator: M projects ℝⁿ onto L. There
is another operator, called the dual operator, that projects ℝⁿ onto L⊥, the orthogonal
complement space {x ∈ ℝⁿ : xyᵀ = 0, ∀y ∈ L}.

It can be shown that the dual operator is given by I − X∗X, where I denotes the n × n identity
matrix. Every vector in ℝⁿ can be uniquely decomposed as follows:

x = x(X∗X) + x(I − X∗X) = x̂ + x̃, (3)
in which the projection x̂ measures what is known about the input x relative to the vectors
x1, x2, . . . , xm stored in matrix M as shown in (1).

In its turn, the projection x̃ is called the novelty vector, since it measures what is maximally
unknown or novel in the measured input vector x. Thus, the magnitude of x̃ can be used for
novelty detection purposes. In such applications, the larger the norm ||x̃||, the less certain we
are of judging the vector x as belonging to the linear subspace L. In [26], Kohonen and Oja
implemented the novelty filter as a fully-connected adaptive neural feedback system.
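The projection and dual operators above can be sketched numerically with NumPy's pseudo-inverse; the data and variable names below are illustrative, not from the paper:

```python
import numpy as np

# Autoassociative OLAM novelty filter sketch (row-vector convention):
# rows of X are the m stored training vectors in R^n.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))            # m = 3 known vectors spanning a subspace L

P = np.linalg.pinv(X) @ X              # X*X: projection of R^n onto L
D = np.eye(X.shape[1]) - P             # dual operator I - X*X: projection onto L-perp

def novelty(x):
    """Norm of the novelty vector x_tilde = x(I - X*X)."""
    return np.linalg.norm(x @ D)

x_known = 2.0 * X[0] - 0.5 * X[1]      # lies in L, so its novelty is ~0
x_novel = rng.normal(size=8)           # generic vector with a component outside L

print(novelty(x_known))                # close to 0
print(novelty(x_novel))                # clearly positive
```

Using `np.linalg.pinv` avoids forming (XXᵀ)⁻¹ explicitly and remains well defined even when X is rank-deficient.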
B. The Self-Organizing Map
The Self-Organizing Map (SOM) [28], [29] is one of the most popular neural network
architectures. It belongs to the category of competitive learning algorithms and it is usually
designed to build an ordered representation of spatial proximity among vectors of an unlabelled
data set.
The neurons in the SOM are arranged in an output layer, A, as a one-, two- or even three-
dimensional array. Each neuron i ∈ A has a weight vector wi ∈ ℝⁿ with the same dimension as
the input vector x ∈ ℝⁿ. The network weights are trained according to a competitive-cooperative
scheme in which the weight vectors of a winning neuron and its neighbors in the output array
are updated after the presentation of an input vector. Roughly speaking, the functioning of this
type of learning algorithm is based on the concept of winning neuron, defined as the neuron
whose weight vector is the closest to the current input vector.
During the learning phase, the weight vectors of the winning neurons are modified incremen-
tally in time in order to extract average features from the set of input patterns. The SOM has
been widely applied to pattern recognition and classification tasks, such as clustering and vector
quantization. In these applications, the weight vectors are called prototypes or centroids of a
given class or category, since through learning they become the most representative element of
a given group of input vectors.
Using the Euclidean distance, the simplest strategy to find the winning neuron, i∗(t), is given by:

i∗(t) = arg min_{∀i} ||x(t) − wi(t)|| (4)

where x(t) ∈ ℝⁿ denotes the current input vector, wi(t) ∈ ℝⁿ is the weight vector of neuron i,
and t symbolizes the time steps associated with the iterations of the algorithm. Accordingly, the
weight vectors of the winning neuron and its neighbors are modified as follows:

wi(t + 1) = wi(t) + η(t)h(i∗, i; t)[x(t) − wi(t)] (5)

where h(i∗, i; t) is a Gaussian function which controls the degree of change imposed on the weight
vectors of those neurons in the neighborhood of the winning neuron:

h(i∗, i; t) = exp( −||ri(t) − ri∗(t)||² / σ²(t) ) (6)

where σ(t) defines the radius of the neighborhood function, and ri(t) and ri∗(t) are, respectively,
the positions of neurons i and i∗ in the array. The learning rate, 0 < η(t) < 1, should decay
with time to guarantee convergence of the weight vectors to stable states. In this paper, we use
η(t) = η0 (ηT /η0)(t/T ), where η0 and ηT are the initial and final values of η(t), respectively. The
variable σ(t) should decay in time similarly to the learning rate η(t).
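Equations (4)-(6), together with the decay schedules just described, can be sketched for a one-dimensional array of neurons as follows (the data, array size, and σ(t) endpoints are illustrative assumptions):

```python
import numpy as np

# Minimal SOM training sketch: 1-D output array, 2-D inputs.
rng = np.random.default_rng(42)
n_neurons, dim = 10, 2
W = rng.uniform(size=(n_neurons, dim))          # weight vectors w_i
positions = np.arange(n_neurons, dtype=float)   # r_i: neuron positions in the array

def train_step(W, x, eta, sigma):
    i_star = np.argmin(np.linalg.norm(x - W, axis=1))                 # Eq. (4)
    h = np.exp(-((positions - positions[i_star]) ** 2) / sigma ** 2)  # Eq. (6)
    W += eta * h[:, None] * (x - W)                                   # Eq. (5)
    return W

T = 500
eta0, etaT = 0.5, 0.01
data = rng.uniform(size=(T, dim))
for t, x in enumerate(data):
    eta = eta0 * (etaT / eta0) ** (t / T)    # eta(t) = eta0 (etaT/eta0)^(t/T)
    sigma = 3.0 * (0.1 / 3.0) ** (t / T)     # sigma(t) decays similarly
    W = train_step(W, x, eta, sigma)
```

Because η(t)h(i∗, i; t) < 1 here, each update is a convex step toward the input, so the weights remain inside the data range.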
The SOM has several beneficial features which make it a valuable tool in data mining
applications [30]. In particular, the use of a neighborhood function imposes an order to the
weight vectors, so that, at the end of the training phase, input vectors that are close in the input
space are mapped onto the same winning neuron or onto winning neurons that are close in the
output array. This is the so-called topology-preserving property of the SOM, which has been
particularly useful for data visualization purposes [31].
Once the SOM algorithm has converged, the set of ordered weight vectors summarizes important
statistical characteristics of the input. The SOM should reflect variations in the statistics of
the input distribution: regions in the input space X from which a sample x is drawn with
a high probability of occurrence are mapped onto larger domains of the output space A, and
therefore with better resolution than regions in X from which sample vectors are drawn with a
low probability of occurrence.
This density matching property is one of the most important for novelty detection purposes. For
example, once the SOM has been trained with unlabelled vectors that one believes to consist only
of data representing the normal state of the system being analyzed, we can use the quantization
error, e(x, wi∗, t), between the current input vector x(t) and the winning weight vector wi∗(t)
as a measure of the degree of proximity of x(t) to the distribution of “normal” data vectors
encoded by the weight vectors of the SOM.
The quantization error is computed as follows:

e(x, wi∗, t) = ||x(t) − wi∗(t)|| = sqrt( Σ_{j=1}^{n} (xj(t) − wi∗j(t))² ) (7)

where t is an index denoting the current discrete time step. Roughly speaking, if e(x, wi∗, t)
is larger than a certain threshold ρ, one assumes that the current input is far from the region of
the input space representing normal behavior as modelled by the SOM weights, thus revealing a
novelty or an anomaly in the system being analyzed. Several procedures to compute the threshold
ρ have been developed in recent years, most of them based on well-established statistical
techniques (see e.g. [16], [18]). In the following sections we describe some of these techniques
in the context of SOM-based novelty detection.
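The quantization-error test of Eq. (7) can be sketched as follows; the prototype values and the threshold ρ below are made up for illustration (in practice W comes from a trained SOM and ρ from one of the techniques described next):

```python
import numpy as np

# Hypothetical trained SOM prototypes (weight vectors).
W = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

def quantization_error(x, W):
    """e(x, w_i*): distance from x to the closest prototype (Eq. 7)."""
    return np.min(np.linalg.norm(x - W, axis=1))

rho = 0.5                                  # illustrative decision threshold
x_normal = np.array([0.9, 1.1])            # near a prototype
x_novel = np.array([5.0, -3.0])            # far from all prototypes

print(quantization_error(x_normal, W) > rho)   # False -> accept H0 (normal)
print(quantization_error(x_novel, W) > rho)    # True  -> reject H0 (novelty)
```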
C. Computing Single-Thresholds
In [12], the SOM is trained with data representing the activity of normal users within a
computer network. The threshold ρ is determined by computing the statistical p-value associated
with the distribution of training quantization errors. The p-value defines the probability of
observing the test statistic e(x,wi∗, t) as extreme as or more extreme than the observed value,
assuming that the null hypothesis is true. This novelty detection procedure is implemented as
follows:
• Step 1: After training is finished, the quantization errors for the training set vectors are
computed (e1, e2, . . . , em).
• Step 2: The quantization error for a new input vector is computed, enew.
• Step 3: The p-value for any new input vector, denoted by Pnew, is computed as follows. Let
B be the number of distances in (e1, e2, . . . , em) that are greater than enew. Thus,

ρ = Pnew = B/m (8)

• Step 4: If ρ > α, then H0 is accepted; otherwise it is rejected. A significance level α = 0.05
is commonly used.
• Step 5: Steps 2-4 are repeated for every new input vector.
According to the authors, the system is very reliable and has presented acceptable rates of
false negatives and false positives; they conclude that these errors were caused by normal changes
in user profiles. Similar approaches have been applied to novelty detection in cellular networks [32],
time series modelling [33] and machine monitoring [2].
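The p-value steps above reduce to a few lines; the training-error distribution below is illustrative:

```python
import numpy as np

# Step 1 (sketch): quantization errors of the training set, e_1, ..., e_m.
rng = np.random.default_rng(1)
train_errors = rng.exponential(scale=1.0, size=200)

def p_value(e_new, train_errors):
    """Step 3 / Eq. (8): fraction of training errors greater than e_new."""
    B = np.sum(train_errors > e_new)
    return B / len(train_errors)

# Steps 2 and 4: compute e_new for an input and test it against alpha.
alpha = 0.05
e_new = 0.8
accept_H0 = p_value(e_new, train_errors) > alpha
```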
A single-threshold SOM-based method for fault detection in rotating machinery is presented
in [3]. The procedure follows the same steps described previously, except that in this case, the
novelty threshold is computed as follows. For each neuron in the immediate neighborhood of
the winning neuron (also called 1-neighborhood neurons), one computes its distance, Di∗j =
||wi∗ − wj||, to the winning neuron. The novelty threshold is taken as the maximum value of
these distances:

ρ = max_{∀j∈V1} {Di∗j} (9)

where V1 is the set of 1-neighborhood neurons of the current winning neuron. Thus, if enew > ρ,
then the input vector carries novel or anomalous information, i.e. the null hypothesis should be
rejected.
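For a one-dimensional SOM array, the threshold of Eq. (9) can be sketched as below; the prototype values are illustrative:

```python
import numpy as np

# Hypothetical trained prototypes on a 1-D SOM array (1-D inputs).
W = np.array([[0.0], [0.4], [1.0], [2.5]])

def threshold(W, i_star):
    """Eq. (9): largest distance from the winner to its 1-neighborhood."""
    neighbors = [j for j in (i_star - 1, i_star + 1) if 0 <= j < len(W)]
    return max(np.linalg.norm(W[i_star] - W[j]) for j in neighbors)

x = np.array([1.2])
i_star = int(np.argmin(np.linalg.norm(x - W, axis=1)))   # winner: index 2
rho = threshold(W, i_star)                               # max(0.6, 1.5) = 1.5
e_new = float(np.linalg.norm(x - W[i_star]))             # 0.2
print(e_new > rho)                                       # False -> known input
```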
D. Computing Double-Thresholds
In this section, we describe techniques that compute two thresholds for evaluating the degree
of novelty in the input vector. The rationale behind this approach is based on the fact that, for
certain applications, not only a very high quantization error is indicative of novelty, but also
very small ones.
One can argue that a small quantization error means that the input vector is almost surely
normal. This is true if no outliers are present in the data. However, in more realistic scenarios,
there is no guarantee that the training data is outlier-free, and a given neuron could be representing
exactly the region the outliers belong to.
Thus, since outliers can be handled directly by double-threshold methodologies, these are more
robust than single-threshold approaches, as will be shown in the simulations. In addition, double-
threshold methodologies are well-suited for outlier removal purposes, as we propose in Section
V.
In [14], the authors proposed a novel technique to detect faults in cellular systems by com-
puting the Bootstrap Prediction Interval (BOOPI) for the distribution of quantization errors. The
lower and upper limits of the BOOPI define the novelty thresholds. Several competitive models
are analyzed and the SOM has provided the best results, generating low false-alarm rates.
To implement this procedure, a sample of M bootstrap instances (eb1, eb2, . . . , ebM) is drawn
with replacement from the original sample of m (m ≪ M) quantization errors (e1, e2, . . . , em),
where each original instance has equal probability of being sampled. Then, the lower and upper
limits of the BOOPI method are computed via percentiles.2
It is shown that prediction (or confidence) intervals can be computed from the bootstrap
samples without making any assumption about the distribution of the original data, provided the
number M of bootstrap samples is large, e.g. M > 1000 [34], [35], [24]. For a given significance
level α, we are interested in an interval within which we can certainly find a percentage 100(1 − α)
(e.g. α = 0.05) of normal values of the quantization error. Hence, we compute the lower and
upper limits of this interval as follows:
2The percentile of a distribution of values is a number Nα such that a percentage 100(1 − α) of the population values are
less than or equal to Nα. For example, the 75th percentile (also referred to as the 0.75 quantile) is a value Nα such that 75%
of the values of the variable fall below it.
• Lower Limit (ρ−): the 100(α/2)th percentile.
• Upper Limit (ρ+): the 100(1 − α/2)th percentile.

The interval [ρ−, ρ+] can then be used to classify a new state vector as normal/abnormal
by means of a simple hypothesis test:
IF enew ∈ [ρ−, ρ+]
THEN xnew is NORMAL (10)
ELSE xnew is ABNORMAL
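The BOOPI procedure, from bootstrap resampling to the test of Eq. (10), can be sketched as follows (the error distribution is illustrative):

```python
import numpy as np

# Sketch: m training quantization errors e_1, ..., e_m (made-up distribution).
rng = np.random.default_rng(7)
errors = rng.gamma(shape=2.0, scale=0.5, size=100)

# Draw M bootstrap instances with replacement and take the percentile limits.
M, alpha = 2000, 0.05
boot = rng.choice(errors, size=M, replace=True)          # e_b1, ..., e_bM
rho_minus = np.percentile(boot, 100 * alpha / 2)         # lower limit
rho_plus = np.percentile(boot, 100 * (1 - alpha / 2))    # upper limit

def is_normal(e_new):
    """Hypothesis test of Eq. (10)."""
    return rho_minus <= e_new <= rho_plus
```

Note that very small quantization errors fall below ρ− and are flagged as abnormal, which is exactly what makes the double-threshold scheme sensitive to outliers absorbed by a prototype.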
Instead of computing the 100(α/2)th and 100(1 − α/2)th percentiles, one can also use the
well-known statistical box-plot technique3 to determine the interval [ρ−, ρ+]. As will be shown
in the simulations, this box-plot approach proved to be one of the most robust approaches to
novelty detection.
It is worth noting that this use of the box plot for novelty detection is very similar to the one
introduced by [36]. However, there are two important differences: (i) in our case, the interval
[ρ−, ρ+] is computed from the set of M bootstrap instances (eb1, eb2, . . . , ebM), while in [36] the
interval is computed from the quantization errors generated by a “cleaned” training data set from
which the outliers were removed; (ii) in order to detect and remove outliers, the method of [36]
demands the additional computation of the MID matrix4 and Sammon's mapping [37], which
makes it unsuitable for online applications due to the excessive computational burden required5.
E. Habituation-based Methods
In psychology, habituation is defined as a response diminishment as a function of stimulus
repetition, when no reward or punishment follows, and it is a constant finding in almost any
3In box plots, ranges or distribution characteristics of the values of a selected variable (or variables) are plotted separately for
groups of cases defined by the values of a categorical (grouping) variable. The central tendency (e.g., median or mean) and range
or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases.
4The Median Interneuron Distance (MID) matrix is the matrix whose mij entry is the median of the Euclidean distances between
the weight vector wi and all neurons within its L-neighborhood.
5Sammon's mapping is a non-linear mapping that projects a set of input vectors onto a plane while trying to approximately
preserve the relative distances between them. It is widely used to visualize the SOM ordering by mapping the weight
vectors onto a plane. Sammon's mapping can be applied directly to data sets, but it is computationally very intensive.
behavioral response [38]. For instance, if you make an unusual sound in the presence of the
family dog, it will respond – usually by turning its head toward the sound. If the stimulus is
given repeatedly and nothing either pleasant or unpleasant happens to the dog, it will soon cease
to respond. This lack of response is not a result of fatigue nor of sensory adaptation and is
long-lasting; when fully habituated, the dog will not respond to the stimulus even though weeks
or months have elapsed since it was last presented. Recently, some authors have proposed the
use of mathematical models of habituation together with the SOM applied to novelty detection
task.
Marsland et al. [39], [7] presented an unsupervised algorithm, called Habituating Self-Organizing
Map (HSOM), for detecting novel stimuli encountered by a mobile robot during navigation. The
HSOM is a two-layered network in which each neuron of the first layer (a standard SOM) is
connected to the output neuron via a habituable synapse, so that the more frequently a first-layer
neuron is chosen as the winner, the lower the efficacy of its output synapse and hence the lower the
activation of the output neuron. The output value associated with the winning neuron is taken
as the novelty value and the more familiar the input vector is, the faster the output value decays
to zero.
In [8] the authors proposed an alternative to HSOM, called Grow When Required (GWR)
network, that allows the insertion of new neurons when necessary. In the GWR, both the synapses
and the neurons have counters that indicate how many times they have fired (i.e. have been
selected as winner). Using these counters it is possible to determine whether a given neuron is
still learning the inputs or if it is ‘confused’ (i.e., it tries to encode input vectors from different
classes). If this is the case, then a new neuron is added to the network between the input and
the winning neuron that caused the confusion.
The insertion of neurons depends upon two user-defined thresholds. The first is a minimum
activity threshold, below which the current node is not considered a sufficiently good match; the second is a maximum habituation threshold, above which the current node is not considered
to have learnt sufficiently well. The GWR network can be used as a novelty filter without any
modification: if the neuron that fires has not fired before, or has fired very infrequently, then the
input is novel.
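The habituation mechanism shared by the HSOM and the GWR can be illustrated with a minimal sketch. This is not the authors' implementation: the constants h0, tau, alpha and the threshold h_T are hypothetical values, and the decay follows a simple discrete version of Stanley's habituation model.

```python
def habituate(h, h0=1.0, tau=3.33, alpha=1.05, S=1.0):
    """One discrete step of a habituable synapse, following
    tau * dh/dt = alpha * (h0 - h) - S(t) (Stanley's model).
    Each time the node fires (S = 1), its efficacy h decays."""
    return h + (alpha * (h0 - h) - S) / tau

def gwr_is_novel(h_winner, h_T=0.9):
    """GWR used as a novelty filter: h starts at h0 = 1 and decays
    every time its node fires, so a winner whose efficacy is still
    above h_T has fired rarely, and the input is flagged as novel."""
    return h_winner > h_T

# A fresh node flags novelty; after repeated firings it habituates.
h = 1.0
for _ in range(50):
    h = habituate(h)
```

Note that the fixed point of the decay is h = h0 - S/alpha, so a frequently firing node settles at a low efficacy and its inputs are no longer flagged as novel.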
November 24, 2004 DRAFT
7/28/2019 116927830 Principios Del Filtro de Novedad
http://slidepdf.com/reader/full/116927830-principios-del-filtro-de-novedad 14/32
SUBMITTED TO IEEE TKDE - SPECIAL ISSUE ON INTELLIGENT DATA PREPARATION 14
F. Multilayer Feedforward Supervised Networks
The most popular supervised ANN algorithm, the multilayer Perceptron (MLP), learns an
input-output mapping through minimization of some objective function, usually the mean squared
error [25]. Due to its popularity, it is natural that MLPs have also been widely used for novelty
detection. Two main approaches are common for this purpose: (i) If examples of normal and
abnormal behaviors are available, the MLP is used as a nonlinear classifier [4], [22], [23]; (ii)
If only data representing normal behavior is available, then the MLP is commonly used as an
auto-associator [40]. These possibilities are better described next.
The single-hidden layer MLP classifier, using the logistic or the hyperbolic tangent function
for the activation function of the hidden and output layer neurons, implements very general
nonlinear discriminant functions [20], [25]. Usually, if there are q classes of data, denoted here
(C_1, C_2, . . . , C_q), we will need q output neurons. These neurons are then trained to produce
output values y_i, i = 1, . . . , q, that encode the different classes. For example, neuron i should
produce an output value y_i close to 1 if the input vector belongs to class C_i; otherwise, y_i = 0
(or y_i = −1). To assess the classification performance, we select the neuron with the highest
output value:
y_k = max_{∀i} {y_i}    (11)
Then, we assign the new input vector x to class C_k, in a “winner-take-all” (WTA) classification
scheme.
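The WTA rule of Eq. (11) amounts to a simple argmax over the output activations. A minimal sketch (the function name is ours):

```python
import numpy as np

def wta_classify(y):
    """Winner-take-all rule of Eq. (11): pick the output neuron with
    the highest activation; its index is the predicted class."""
    k = int(np.argmax(y))
    return k, float(y[k])

# Three output neurons encoding classes C_1, C_2, C_3 (indices 0..2).
k, y_k = wta_classify(np.array([0.10, 0.85, 0.30]))
```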
For novelty detection purposes, given a new input vector, once we find the neuron with the
highest output value as in (11), we verify whether y_k is below a preset threshold (ρ). If so, then a novelty
is declared. This approach was used by Singh and Markou [4], Augusteijn and Folkert [22] and
Vasconcelos et al. [23]. Additionally, Augusteijn and Folkert argued that the WTA classification
scheme is unsuitable for novelty detection, since it takes into account only the information
carried by a single output neuron. Hence, they suggested taking the entire output pattern
into account: the distance between this output pattern and each one of the target patterns
(used during training) is computed, and if the smallest of these distances is above a preset
threshold then the input pattern is considered to belong to a novel class.
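Both rejection tests described above can be sketched in a few lines. The threshold values used here are hypothetical; in the paper they would be computed by the methodology of Section IV.

```python
import numpy as np

def wta_novelty(y, rho):
    """Threshold test on the WTA winner: if even the largest output
    falls below rho, a novelty is declared."""
    return float(np.max(y)) < rho

def pattern_distance_novelty(y, targets, rho_d):
    """Augusteijn-Folkert variant: take the whole output pattern into
    account and declare novelty when its distance to every target
    pattern used during training exceeds rho_d."""
    d_min = min(np.linalg.norm(np.asarray(y) - np.asarray(t)) for t in targets)
    return d_min > rho_d

targets = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 1-of-q target patterns
```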
To improve the MLP performance in novelty/outlier detection tasks, Vasconcelos et al. [23]
suggested the use of the Gaussian Multilayer Perceptron (GMLP) [41]. This network uses
a Gaussian activation function for the hidden-layer neurons, instead of the usual sigmoids. This
simple modification provided better results: Gaussian activation functions force the receptive
field of a neuron to be more selective, active only over a narrow partition of the input space,
since they tend to produce closed regions surrounding the training data.
Finally, the MLP is also commonly used for novelty detection tasks as an autoassociative
architecture [42], [40]. The autoassociative MLP is designed to learn an input-output mapping
in which the target vectors are the input vectors themselves. This is usually implemented through
a hidden layer whose number of neurons is lower than the dimension of the input vectors.
The network is trained to reconstruct as well as possible a training set consisting of vec-
tors representing normal behavior. In this sense, the autoassociative MLP can be viewed as a
nonlinear version of the Novelty Filter presented in Section III-A. Hence, it should be able to
adequately reconstruct subsequent normal input vectors, but should perform poorly on the task
of reconstructing abnormal (novel) ones.
Thus, the detection of novel or anomalous input patterns reduces to the task of assessing how
well such vectors are reconstructed by the autoassociative MLP. Quantitatively, this procedure
consists in computing an upper-bound for the reconstruction error of all the training set vectors
at the end of training. For testing purposes, this upper bound is usually relaxed a little, by
a certain percentage. New input patterns are subsequently classified by checking whether the
reconstruction error of the new input pattern is above the relaxed upper bound, thus revealing
novel data, or below it (if the data is normal).
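The reconstruction-error test can be sketched as follows. The 10% relaxation and the linear stand-in `model` are hypothetical; in practice `model` would be the trained autoassociative MLP.

```python
import numpy as np

def reconstruction_threshold(model, X_train, slack=0.10):
    """Upper bound on the reconstruction error over the training set,
    relaxed by a percentage (a hypothetical 10% here)."""
    errors = [np.linalg.norm(x - model(x)) for x in X_train]
    return (1.0 + slack) * max(errors)

def is_novel(model, x, rho):
    """Novelty test: the input is novel if it is poorly reconstructed."""
    return np.linalg.norm(x - model(x)) > rho

# Toy stand-in for a trained autoassociator: shrinks inputs slightly,
# so the unit-norm "normal" training vectors reconstruct with small error.
model = lambda x: 0.9 * x
X_train = np.eye(4)
rho = reconstruction_threshold(model, X_train)
```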
Another popular supervised multilayer ANN, the Radial Basis Function (RBF) network, has also
been used for novelty detection [43]. In such applications, RBF networks frequently use the
WTA classification rule. This rule is very simple to use, and is often an appropriate solution
[44]. However, the same issues discussed for the MLP concerning the WTA classification rule also
apply to RBF-based novelty detectors, and the method proposed by Augusteijn and Folkert [22] can be used instead.
A mathematically well-founded alternative has been proposed by Li et al. [1]. Let the output
value of neuron i be given by:
y_i(x) = w_i^T φ(x) + b_i    (12)

where φ(x) = [φ_1(x) · · · φ_q(x)]^T is the vector of Gaussian basis functions and b_i is the bias
of output neuron i. Li et al. developed a method to set the threshold value for each output
neuron of an RBF network as follows:
ρ_i = b_i + ε_i    (13)

where b_i is the bias of neuron i computed during training and 0 < ε_i ≪ 1 is a very small positive
constant, introduced to make the classifier robust to noise and disturbances while causing little
increase in the misclassification rate. By using this method, outputs may be readily interpreted
as an 'unknown fault' when none of the 'normal' or 'fault' output neurons exceeds its threshold
ρ_i.
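Eqs. (12)-(13) can be sketched as follows. The centers, weights, biases and ε used below are hypothetical illustration values, not taken from the paper.

```python
import numpy as np

def rbf_outputs(x, centers, sigma, W, b):
    """Eq. (12): y_i(x) = w_i^T phi(x) + b_i, with Gaussian basis
    functions phi_j(x) centered at the rows of `centers`."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W @ phi + b

def classify_with_rejection(y, b, eps=1e-3):
    """Eq. (13): output neuron i is accepted only if y_i exceeds its
    threshold rho_i = b_i + eps; if no neuron does, the input is an
    'unknown fault' (novelty), signalled here by returning None."""
    fired = np.flatnonzero(y > b + eps)
    return int(fired[np.argmax(y[fired])]) if fired.size else None

centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # hypothetical centers
W, b = np.eye(2), np.zeros(2)                  # hypothetical weights/biases
```

An input close to a center fires the corresponding class; an input far from all centers produces near-zero activations everywhere and is rejected as unknown.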
IV. A GENERAL METHODOLOGY FOR COMPARISON
As pointed out by Markou and Singh [17], there are a number of studies and proposals in the
field of novelty detection, but comparative work has been much scarcer. To our knowledge, only
a few papers have compared different neural models on the same data set [4], [10], [11], [45].
None of them provided general guidelines on which technique works best on which types of
data, which one is more robust to outliers, or which data preprocessing method provides better
results. In this paper we take a first step towards answering some of these questions by providing
a general methodology to compare the performance of neural-based novelty detection systems
under a more statistically oriented framework, thus avoiding ad hoc approaches.
The rationale behind the proposal of a general methodology was the observation that the
decision thresholds of many neural-based novelty detection systems, especially those based on
MLP and RBF networks, were computed heuristically, without clearly stated principles. For
example, a commonly used heuristic for MLP- or RBF-based novelty detectors is to set the
decision threshold to ρ = 0.5: if all the outputs of the network fall below this value, then
an unknown (novel) activity is detected.
This problem is also observed in many unsupervised methods, but to a lesser extent. In general,
the decision thresholds in these cases are more statistically inspired, such as the p-value or the
BOOPI approach. In this paper, we argue that most of the techniques described for SOM-based
novelty detectors can equally be used by MLP- and RBF-based novelty detectors. To this end,
once a neural method to be evaluated is defined, the approach we propose to compute the decision
threshold is a combination of the four main steps listed below:
Step 1: Define the output variable, z_t, to be evaluated at a given time step t. It is worth
emphasizing that z_t should reflect the statistical variability of the training data. For that
purpose, we list some possibilities next.
• OLAM: One can choose the Euclidean norm of the novelty vector, z_t = ‖x̃(t)‖ (see
Section III-A).
• SOM: The quantization error, z_t = ‖x(t) − w_{i*}(t)‖, is the usual choice (see Sec-
tion III-B).
• MLP/RBF: In this case, we have two situations.
– For single-output networks, it can be the output value of the network itself, i.e.,
z_t = y(t).
– For multi-output networks, it can be the Euclidean norm of the difference be-
tween the vector of desired outputs, d(t), and the vector of actual outputs, y(t).
Then, we have z_t = ‖d(t) − y(t)‖. If the autoassociative MLP is being used, z_t
can be chosen as the norm of the reconstruction error.
Step 2: After the learning machine has been trained, compute the values of z_t for each
vector of the training set, Z = (z_1, z_2, . . . , z_m).
Step 3: Generate a sample of M bootstrap instances Z_b = (z_b1, z_b2, . . . , z_bM) drawn with
replacement from the original sample (z_1, z_2, . . . , z_m), where each original value of z_i
has equal probability of being sampled.
Step 4: Compute the threshold for the novelty detection tests using the bootstrap sample
Z_b. In this case, we again have two possibilities.
• For single-threshold methods, one can choose, e.g., the p-value approach described
in (8) or Tanaka's method described in (9).
• For double-threshold methods, one can compute prediction intervals
through percentiles [14] or by the Box-plot method, both described in Section III-D.
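The four steps above can be sketched as follows. This is a generic percentile-based stand-in: the exact p-value formula of (8) and the BOOPI construction of [14] are not reproduced here, and alpha = 0.05 is a hypothetical significance level.

```python
import numpy as np

def bootstrap_thresholds(z, M=5000, alpha=0.05, seed=0):
    """Steps 3-4: draw M bootstrap instances of z_t with replacement
    (each original value equally likely), then derive the decision
    thresholds from the bootstrap sample Z_b."""
    rng = np.random.default_rng(seed)
    zb = rng.choice(z, size=M, replace=True)            # Step 3
    single = np.percentile(zb, 100.0 * (1.0 - alpha))   # one-sided upper threshold
    double = (np.percentile(zb, 100.0 * alpha / 2.0),   # two-sided prediction interval
              np.percentile(zb, 100.0 * (1.0 - alpha / 2.0)))
    return single, double
```

Here `z` would hold the training-set values of z_t computed in Step 2; a test value falling above `single` (or outside `double`) would be flagged as novel.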
Several advantages of the proposed methodology are listed below:
• Reliability: It is a statistically well-founded approach, since its functioning is based on the
bootstrap resampling method. In addition, if one adopts the BOOPI, the computed thresholds
will correspond exactly to prediction (confidence) intervals for the output variable z_t.
• Nonparametric: No assumptions about the statistical properties of the output variable are
made in any stage of the procedure.
• Generality: It allows the comparison of supervised and unsupervised learning methods
on a common basis.
• Robustness: The bootstrap resampling technique allows the generation of a large number
of samples, improving the estimates of parameters.
• Simplicity: The method is logical, very easy to understand and apply.
As will be shown in the simulations (Section VI), one of the main conclusions we have drawn
from the comparison of the neural methods under the proposed methodology is that training
with outliers is not as beneficial as suggested by some authors. In addition, an interesting by-product
of the proposed methodology is the development of a simple data cleaning strategy, as described
in the next section.
V. DATA PREPARATION STRATEGIES
In [46], Ypma and Duin comment on the usual unavailability of samples that accurately describe
the faults in a system, and claim that the best solution is to build an accurate
representation of the normal operation of the system and measure faults as deviations from this
normality. In [47], this problem is addressed using Vapnik's principle of never solving a
problem that is more general than the one we are actually interested in.
If we are only interested in detecting novelty, it is not always necessary to estimate a full
density model of the data. As pointed out earlier in this paper, a common approach to novelty
detection (for some authors, the genuine one!) is to treat the problem as a single-class mod-
elling/classification problem, in which we are interested in building a good representation of
a single restricted class and then devising a method to test whether novel events are members of this
class. In this kind of method, the training set must ideally be outlier-free, including unknown outliers. The
supporters of this viewpoint argue that a novelty detection system may have its performance
improved if associated with some mechanism of outlier cleaning.
The general methodology proposed in the previous section lends itself to automatic data
cleaning, removing anomalous (undesirable) vectors from the training set, and then retraining
the neural model with the cleaned data set. The proposed data cleaning procedure is detailed
next:
• Step 1: Choose a neural model and compute decision thresholds according to the general
methodology described in Section IV.
• Step 2: Apply novelty tests using the training data vectors. Obviously, for a well-trained
network, only a few of these vectors will be considered as novel ones.
• Step 3: Exclude those “abnormal” vectors from the original training set and retrain the
network with the new (cleaned) set.
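The three steps above can be sketched as a single cleaning pass, with hypothetical stand-ins: `fit` plays the role of training the chosen neural model, `novelty_scores` computes z_t for each training vector, and `threshold_fn` is any of the decision-threshold methods of Section IV.

```python
import numpy as np

def clean_and_retrain(train, fit, novelty_scores, threshold_fn):
    """Steps 1-3: train, flag training vectors whose score exceeds the
    computed threshold, drop them, and retrain on the cleaned set."""
    model = fit(train)                      # Step 1: model + thresholds
    z = novelty_scores(model, train)
    rho = threshold_fn(z)
    keep = z <= rho                         # Step 2: novelty tests
    cleaned = train[keep]                   # Step 3: exclude and retrain
    return fit(cleaned), cleaned

# Toy example: the "model" is just the data centroid, the score is the
# distance to it, and the threshold is the 95th percentile of scores.
data = np.vstack([np.zeros((98, 2)), [[10.0, 10.0], [12.0, 12.0]]])
_, cleaned = clean_and_retrain(
    data,
    fit=lambda X: X.mean(axis=0),
    novelty_scores=lambda m, X: np.linalg.norm(X - m, axis=1),
    threshold_fn=lambda z: np.percentile(z, 95.0),
)
```

In the toy run the two far-away vectors are flagged as novel and removed before retraining.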
In Section VI, we report simulations showing the benefits of this data cleaning procedure.
A. Data Scaling Strategies
Data scaling is one important issue that is usually underemphasized in applications of neural
methods to novelty detection. In this paper, we also evaluate the neural algorithms in this
respect, assessing the influence of different methods on their performances. For this purpose,
three techniques are utilized: two of them apply to the components of the input vectors
individually, and one applies to the vectors as a whole.
• Soft Normalization - The distributions of the individual components, x_j, j = 1, . . . , m, are
standardized to zero mean and unit variance:

x_j^new = (x_j − x̄_j) / σ_j    (14)

where

x̄_j = (1/m) Σ_{j=1}^{m} x_j   and   σ_j = sqrt( (1/(m−1)) Σ_{j=1}^{m} (x_j − x̄_j)² )    (15)
• Hard Normalization - The components x_j are rescaled to the [0, 1] range:

x_j^new = (x_j − min(x_j)) / (max(x_j) − min(x_j))    (16)

in which max(x_j) and min(x_j) are the maximum and minimum values of x_j, respectively.
• Whitening and Sphering - The data vectors x are transformed into new vectors v whose
components are uncorrelated and whose variances equal unity. In other words, the covariance
matrix of v equals the identity matrix, E{vv^T} = I. This is usually performed
through the eigenvalue decomposition (EVD) of the covariance matrix R_x = E{xx^T} of
the original data vectors [20], [25].
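The three scaling strategies can be sketched as follows, as a straightforward reading of Eqs. (14)-(16) and of the whitening transform; the function names are ours.

```python
import numpy as np

def soft_norm(X):
    """Eq. (14): standardize each attribute to zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def hard_norm(X):
    """Eq. (16): rescale each attribute to the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def whiten(X):
    """Whitening via the EVD of the covariance matrix R_x = E{xx^T}:
    the transformed vectors are uncorrelated with unit variances."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ vecs) / np.sqrt(vals)
```

After `whiten`, the sample covariance of the transformed data is (up to numerical error) the identity matrix.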
VI. SIMULATIONS AND RESULTS
In this section we evaluate the performance of the neural network methods discussed in Section
III through simulations on a breast cancer data set [48], available through the UCI Machine
Learning Repository [49]. This dataset was chosen because biomedical applications require high
accuracy due to the human factors involved. False positive and false negative errors in diagnosis have
different implications for the person being examined, but both should be reduced. Unsupervised
and supervised architectures were assessed under the proposed methodology by their robustness
to outliers and their sensitivity to training parameters, such as data scaling, number of neurons,
training epochs and size of the training set.
Let a cancer detection test be performed under the null hypothesis H_0 that the person is healthy
(i.e., normal behavior). If an actual cancer is not detected (false negative), the person will most
probably go home and forget about his or her health for a while, until the next visit to the doctor. This is
a serious problem, since the detection of a malignant tumor in the early stages of development
is crucial for the success of the treatment. If a false cancer is detected (false positive), the
person will probably make further investigations about the disease and will finally discover that
the previous diagnosis was wrong. In this case, besides additional costs for new exams, the
person is exposed to undesirable psychological stress while waiting for the final results. For
these reasons, we will put more emphasis on false negative error rates, by virtue of their greater
importance to the patient's health.
For SOM-based novelty detection, the output variable z_t was the quantization error. For MLP-
and RBF-based novelty detectors, the output variable z_t was the output of the network itself,
except for the autoassociative MLP, for which we selected the norm of the reconstruction error.
For all the neural algorithms, the decision thresholds were determined from the bootstrap samples
of z_t using the following methods: p-value, Box-plot and BOOPI. Additionally, for SOM-based
novelty detectors, Tanaka's method was also used to compute decision thresholds.
All tests were performed using a 1-dimensional SOM. The MLP consisted of a single hidden
layer of neurons trained with the standard backpropagation algorithm with momentum term. The
logistic function was adopted for all neurons. The RBF consisted of a first layer of Gaussian
basis functions whose centers c_i were computed by the SOM algorithm. A single radius is
defined for all the Gaussians, computed as a fraction of the maximum distance among all the
centers, i.e., σ = d_max(c_i, c_j)/√(2q), ∀i ≠ j, where q is the number of basis functions and
d_max(c_i, c_j) = max_{∀i≠j}{‖c_i − c_j‖}. In the simulations, we are interested in the evaluation of the
following issues:
• Novelty detection using the aforementioned neural network techniques.
• Performance improvement through the proposed outlier removal procedure.
• Performance sensitivity to different data preprocessing methodologies.
Unsupervised ANNs: The first set of simulations compares the novelty detection ability of
the neural methods. The entire data set consisted of 699 nine-dimensional feature vectors, whose
attributes x_i, i = 1, . . . , 9, are the following: clump thickness, uniformity of cell size, uniformity
of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal
nucleoli and mitoses. All the attributes assume values within the range [1, 10]. The hard
normalization method was used to rescale the data to the [0, 1] range.
Sixteen instances, each containing a single missing attribute value, were excluded from the
original data set. From the remaining set of 683 vectors, 444 vectors corresponded to benign
tumors and 239 to malignant ones. From the total of 444 “normal” vectors, 355 of them (about
80%) were selected for training purposes. From the remaining 89 “normal” vectors, 30 of them
were replaced by “abnormal” vectors, randomly chosen from the set of 239 “abnormal” vectors.
The inclusion of abnormal vectors in the testing set was necessary in order to evaluate the
false negative (Error II) rates. If only examples of normal vectors were present in the testing
set, we could estimate only the false positive (Error I) rates. For each combination of neural
network model and decision threshold computation strategy, this procedure was repeated for 100
simulation runs, and the final error rates were averaged accordingly.
The false negative rates obtained for the SOM-based novelty detectors as a function of the
number of neurons are shown in Figure 1. Each neural model in this figure was trained for 100
epochs. It can be noted that the pair (SOM, Box-plot) produced the lowest rates, followed very
closely by the pair (SOM, p-value). The pairs (SOM, BOOPI) and (SOM, TANAKA) provided
the worst rates.
The second set of simulations evaluates the sensitivity of the SOM-based novelty detectors to
changes in the number of training epochs, as shown in Figure 2. The training parameters used
[Figure 1: false negative rate curves for BOOPI, P-VALUE, BOX-PLOT and TANAKA.]
Fig. 1. False negative rates (in percentage) as a function of the number of neurons in the SOM.
were the same as those used for the first set of simulations, except that the number of neurons
was set to 40. The overall performances remain the same as in Figure 1, with the pair (SOM,
Box-plot) achieving the lowest false negative rates.
As a double-threshold decision test, the Box-plot method can detect outliers in regions of high quantization
error (QE) as well as in regions of low QE. As discussed in Section III-D, the latter type of outlier
(referred to as unknown outliers) can be the result of erroneous labelling. If unknown outliers
are present in the training set, some neurons may be attracted to these spurious patterns, so that
in the future some outliers will probably fire these neurons, yielding low quantization errors. Only
novelty decisions based on double-threshold methods, such as the Box-plot or the BOOPI, can
detect outliers in this case.
The pair (SOM, BOOPI), which in theory could also detect outliers in the low-QE region,
achieved a performance better only than that of the pair (SOM, TANAKA). This may be due to the
fact that the great majority of unknown outliers lie in the high-QE region, probably because of
the low occurrence of unknown outliers (such as mislabelled normal data) in the training set,
thus implicitly revealing the good quality of the data set.
It is interesting to note that the performance of the pair (SOM, TANAKA) gets worse as the
number of training epochs increases. This may occur because of the very nature of
[Figure 2: false negative rate curves for BOOPI, P-VALUE, BOX-PLOT and TANAKA.]
Fig. 2. False negative rates (in percentage) as a function of the number of training epochs.
Tanaka's test. Once the SOM network has more time to converge, it better fits the data manifold.
We can then observe that the quantization error e_new tends to decrease even further, while the
novelty threshold ρ computed in (9) tends to stabilize, remaining constant. So, as the network
achieves a better representation of the data, it becomes rarer and rarer to observe e_new > ρ,
and hence the test is almost never positive for novelty, even when the presented data vector is
truly novel. This contradicts the common-sense notion that the better the representation of the
data, the better the network's results.
Finally, another interesting conclusion drawn from Figures 1 and 2 is that, for a large number of
neurons or a very long training period, the pairs (SOM, Box-plot) and (SOM, p-value) produced
very similar false negative rates. This may be due to the fact that increasing the number of
neurons of the SOM or the number of training epochs decreases the mean value of the quantization
error, which makes few real outliers fall below the decision threshold computed according
to the p-value method.
The third set of simulations compares the accuracy of SOM-based novelty detectors with
respect to the size of the training and testing sets. The purpose of this test is to give a rough
idea of which method requires less data to achieve high accuracy. In Figure 3 we observe that
no relevant changes in performance were verified as the sample size varied significantly. For
[Figure 3: false negative rate curves for BOX-PLOT, P-VALUE and BOOPI, vs. size of the training data (% of total data set).]
Fig. 3. False negative rates as a function of the training set size.
these tests, the number of neurons and the number of training epochs were set to 40 and 100,
respectively.
Supervised ANNs: The same tests described above for SOM-based novelty detectors were
repeated here for the supervised methods (MLP, Autoassociative MLP, GMLP and RBF). The
first set of simulations evaluates the false negative rate as a function of the number of hidden
neurons. For these tests, each MLP network was trained for 1000 epochs with normal data
vectors only. The learning rate and the momentum factor were set to 0.35 and 0.5, respectively.
For the sake of clarity, the results are shown only for the p-value (Figure 4) and the Box-plot (Figure
5) decision threshold methods, since they provided the best overall results. The best individual
performances were produced by the pairs (MLP, p-value) and (RBF, Box-plot). These figures also
illustrate that some methods of computing decision thresholds (e.g., the p-value) are unsuitable
for certain supervised neural networks (e.g., the RBF).
To illustrate how the presence of outliers in the training set influences the performance of
unsupervised and supervised novelty detectors, we simulated the pairs (SOM, Box-plot) and
(MLP, p-value) on a training set containing a given number of fake outliers, i.e., originally
abnormal data vectors that we intentionally labelled as normal ones. The result is shown in
Figure 6. For comparison purposes, we also simulated a standard MLP classifier for a two-class
[Figure 4: false negative rate curves for GMLP, MLP, RBF and AA.]
Fig. 4. False negative rates for supervised novelty detectors as a function of the number of hidden neurons, using the p-value
decision threshold method.
[Figure 5: false negative rate curves for MLP, GMLP, AA and RBF.]
Fig. 5. False negative rates for supervised novelty detectors as a function of the number of hidden neurons, using the Box-plot
decision threshold method.
(normal/abnormal) problem, using the WTA decision scheme. The pair (Two-class MLP, WTA)
was trained according to the guidelines presented in Section III-F, using the actual labels of the
data vectors.
Figure 6 shows how the false negative rates vary with the number of outliers. As expected,
the performance of the single-class methods, (SOM, Box-plot) and (MLP, p-value), deteriorates
with the presence of outliers, while the performance of the two-class methods improves. This
occurs because single-class methods learn erroneously to consider outliers as normal data vectors,
diminishing their sensitivity to novelty. For the two-class method, the sensitivity to novelty is
increased, since the classifier learns to separate better and better what is normal from what is
abnormal. However, this occurs only when more than 30% of the training patterns are abnormal
ones. Since it is generally unrealistic to have such a high number of abnormal vectors, the overall
conclusion is that if only few abnormal patterns are available the best thing to do is to exclude
them from the training data and to choose a single-class approach. Note that in Figure 6 the
performance of the single-class methods is much better than the two-class approach when the
percentage of outliers is lower than 10%.
TABLE I
BEST RESULTS OBTAINED FOR THE NOVELTY DETECTION TASK.

                                False Negative        False Positive
MODEL                           Mean    Variance      Mean    Variance
(RBF, Box-plot)                 0.1     0.1           9.9     11.8
(MLP, p-value)                  0.6     0.5           3.4     3.5
(SOM, Box-plot)                 2.0     1.0           3.7     2.9
(Novelty Filter, Box-plot)      3.3     41.0          5.2     11.9
(Two-Class MLP, WTA)            0.9     0.9           3.5     3.2
Finally, Table I presents the best results obtained for novelty detection for the dataset used in
this paper. The best overall performance was obtained by the pair (RBF, Box-plot). Note that
the result shown for the pair (Two-class MLP, WTA) is for a training set containing 50% of
abnormal vectors.
[Figure 6: false negative rate curves for (SOM, BOX-PLOT), (MLP, P-VALUE) and the MLP classifier, vs. percentage of outliers in the data.]
Fig. 6. False negative rates (in percentage) as a function of the number of outliers in the training set.
Data Cleaning: It was observed that all novelty detection methods presented a consid-
erable reduction in their false negative rates after the application of the data cleaning procedure
proposed in Section V, as shown in Figure 7 for the pair (SOM, Box-plot). The number of
neurons was set to 40, and the number of training epochs was varied from 1 to 200. It is worth
noting that training on a cleaned data set yielded the best performance of a SOM-based novelty detector, achieving a false negative rate below 3%.
Data Scaling: For the simulations carried out so far, we rescaled the data vectors through the
hard normalization method. Soft normalization and data decorrelation were also tested, but
for this particular data set they performed worse than hard normalization, as can be seen
in Figure 8, where the number of neurons for the pair (SOM, Box-plot) is varied. The general
conclusion we draw from these results is that different data scalings produce different error rates,
and thus a number of them should be tested whenever possible.
VII. CONCLUSION AND FURTHER WORK
In this paper we have introduced a systematic methodology to compare the performance of
neural methods applied to novelty detection tasks. This methodology allowed us not only to
evaluate the computational properties of both supervised and unsupervised algorithms on a
common basis, but also paved the way for the proposal of a data cleaning strategy for outlier
[Figure 7: curves 'After Data Cleaning' and 'No Data Cleaning'.]
Fig. 7. False negative rates for the pair (SOM, Box-plot) trained on the original and the cleaned data set as the number of
training epochs varies.
[Figure 8: curves for Soft Normalization, Hard Normalization and Whitening Transf.]
Fig. 8. False negative rates obtained by the pair (SOM, Box-plot) for three different data scaling methods, as the number of
neurons varies.
removal. This outlier removal strategy has proven very effective in reducing the false
negative error rates of all simulated neural methods.
The proposed methodology was also used to assess the effectiveness of existing decision
threshold computation techniques when used in conjunction with different neural network algo-
rithms, such as the SOM, MLP and RBF. The influence of different data scaling methods and
the robustness of several neural-based novelty detectors to outliers in the training data were also
evaluated.
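A bootstrap-based decision threshold of the kind assessed above can be computed roughly as follows. This is a sketch, not the exact scheme of the paper: we assume the novelty scores come from normal-class data only, and we take the threshold as the average, over bootstrap replications, of an upper percentile of the resampled scores; the percentile level and parameter names are illustrative.

```python
import numpy as np

def bootstrap_threshold(scores, level=95.0, n_boot=1000, seed=0):
    """Average of the `level`-th percentile over bootstrap resamples
    of the normal-class novelty scores."""
    rng = np.random.RandomState(seed)
    n = len(scores)
    reps = [np.percentile(rng.choice(scores, size=n, replace=True), level)
            for _ in range(n_boot)]
    return float(np.mean(reps))
```

At test time, a pattern whose novelty score exceeds this threshold would be flagged as novel; averaging over resamples makes the threshold less sensitive to the particular training sample drawn.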
Further work is under way to extend the applicability of the proposed methodology
to novelty detection in time series data. For this purpose, we are currently evaluating several
recurrent neural network architectures, such as the Elman network and the Recursive SOM, using
different decision threshold computation methods. The chosen application is a network
intrusion detection task, in which anomalous behavior is to be detected through the analysis of
network traffic time series.
ACKNOWLEDGMENT
This work was developed under the financial support of CNPq (grant DCR:305275/2002-0).
The first author also thanks FUNCAP for supporting his graduate studies.