lsj1360 anomaly detection via online oversampling
TRANSCRIPT
-
8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling
1/5
A Novel on Intrusion Detection Using Outlier Detection PCA
and ICA Techniques
ABSTRACT
Outlier detection is the process of identifying unusual behavior. It is widely used in datamining, for example, to identify customer behavioral change, fraud and manufacturingflaws. In recent years many researchers had proposed several concepts to obtain the
optimal result in detecting the anomalies. But the process of PCA made it challenging due to its
computations. In order to overcome the computational complexity, online oversampling PCAhas been used. The algorithm enables quick Online updating of the principal directions for
the effective computation and satisfying the online detecting demand and also oversampling
will improve the impact of outliers which leads to accurate detection of outliers. Experimental
results show that this method is effective in computation time and need less memory
requirements also clustering technique is added to it for optimization.
KEYWORDS: Outlier Detection PCA, Description of KDD cup dataset, Use of PCA forAnomaly Detection.
1.
INTRODUCTION
In the modern world, computer has become an
inevitable resource. To get connected with one
another and to share the information, network has
become an emerging technology. As the users are
geographically distributed, resources and
information sharing are done only via internet.
The users are not aware that the people with
whom they are connected are authorized users. So
that many hackers pretend to be the authorized
users and try to hack the private information. To
overcome this problem Intrusion Detection System
(IDS) is developed [2].
The idea of Intrusion Detection was started in 1980.
Any action that is performed by a user illegally
is called Intrusion. Intrusion Detection is detecting
such anomalous activity at computing and
networking resources [1].
The techniques for intrusion detection are of two
major categories as shown in Fig. 1: Misuse
Detection and Anomaly Detection. Misuse detection-
Catch the intrusions in terms of the characteristics ofknown attacks or system vulnerabilities [3, 4 And 5].
Anomaly detection- Detect any action that
significantly deviates from the normal behavior.
Anomaly detection is robust against new types of
attacks and it can learn by examples no need to write
rules by hand. So, it has become hot topic in the field
of computer security [1], [2].
Fig. 1Classification of Intrusion Detection
Data Mining is the growing technology in the field of
Computer Science. The trend of research and
development on data mining is expected to be
flourishing because the huge amounts of data
have been collected in databases and the necessity
of understanding and making good use of such
data in decision making has served as the driving
force in data mining. Moreover, with the fast
computerization of the society, the social impact of
data mining should not be under-estimated. When a
large amount of interrelated data is effectively
analyzed from different perspectives, it can pose
threats to the goal of protecting data security and
guarding against the incursion of privacy. It is a
tricky task to develop effective techniques for
preventing the disclosure of sensitive information
in data mining, especially as the use of data
mining is rapidly increasing in various domains
-
8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling
2/5
ranging from business examination, customer
analysis to medicine and government.
Data mining involves six familiar classes of tasks:
Anomaly detection
Association rule learning
Clustering
Classification
Regression
Sequential pattern mining
The presence of errors, missing values and outliers
may affect the tasks performed on normal data. So
the outliers are predicted and removed from normal
data. The removal of outliers introduced the
emerging task called Anomaly Detection. It has
used in many applications like NIDS (Network
Intrusion Detection System), credit card fraud
detection, medical application, industrial damage etc.
In the Intrusion Detection System, anomaly
detection has been performed by many techniques.
Among which outlier detection technique is
applicable for many online applications and it also
involves many approaches. This paper is based on
the survey made on intrusion detection system
that includes outlier detection techniques [1]. It
includes both supervised as well as unsupervised
learning approaches.
2.
OUTLIER DETECTION PCA.
2.1 Outlier detection via PCA
PCA is a well known unsupervised
dimensionality reduction method which determines
the principal directions of the data distribution.
To obtain these principal directions, one needs to
construct the data covariance matrix and calculate
its dominant eigenvectors. These eigenvectors will
be the most informative among the vectors in the
original data space, and are considered as the
principal directions. The black circle represents
normal data instances, the red circle denotes an
outlier and the arrow is the dominant principal
direction. The principal direction is deviated when an
outlier instance is added. More specifically, thepresence of such an outlier instance produces a large
angle between the resulting and the original principal
directions. On the other hand, this angle will be small
when a normal data point is added. Therefore, we
will use this property to determine the outlier ness
of the target data point using the LOO
strategy.PCA requires the calculation of global
mean and data covariance matrix, we found that
both of them are sensitive to the presence of
outliers. There are outliers present in the data
dominant eigenvectors produced by PCA will be
remarkably affected by them and this will produce a
significant variation of the resulting principal
directions.
Fig 2: The effect of addition and deletion of an
outlier as single/duplicated points
2.2 Oversampling PCA (os PCA)
The os PCA scheme will duplicate the target
instance multiple times, and the idea is to
amplify the effect of outlier rather than that of
normal data. While it might not be sufficient to
perform anomaly detection simply based on the most
dominant eigenvector and ignore the remaining ones,
online os PCA method aims to efficiently determine
the anomaly of each target instance without
sacrificing computation and memory efficiency.
More specifically, if the target instance is an
outlier, this over- sampling scheme allows us to
overemphasize its effect on the most dominant
eigenvector and thus we can focus on extracting
and approximating the dominant principal direction
in an online fashion, instead of calculating multiple
eigenvectors carefully.
The os PCA will duplicate the target instances n
no of times from data set and computes the score
of outlier. If the score is above the threshold it is
determined to be an outlier. The power method for os
PCA is also provides the solution in solving the
Eigen value decomposition problem. The solving of
PCA to calculate the principal directions n times
using n instances and it is very costlier and itprohibits the online data in anomaly detection. Using
LOO strategy it is not necessary to recomputed
covariance matrix for all instances and thus find
the difference between the normal and outlier data is
easily identified.
The dominant Eigen value is more than others
and it is shown that power method requires only
multiplication of matrix not matrix decomposition.
-
8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling
3/5
So this method is better for reducing the computation
cost in calculation of dominant principal direction.
The cost of memory is O (np) for matrix
multiplication. However the use of power method
does not guarantee fast convergence and it also need
to store covariance matrix which not used in high
dimensional data, so we prefer the least square
approximation method.
2.3 Oversampling PCA with Anomaly Detection
Effective and efficient updating technique for
osPCA, which allows us to determine the
principal direction of the data. This updating
process makes anomaly detection in online or
streaming data settings feasible. More importantly,
since we only need to calculate the solution of the
original PCA offline, we do not need to keep the
entire covariance or outer matrix in the entire
updating process. Once the final principal
direction is determined, we use the cosinesimilarity to determine the difference between the
current solution and the original one (without
oversampling), and thus the score of outlierness
for the target instance can be determined
accordingly. An algorithm anomaly detection via
online oversampling PCA in which data matrix
with values are transposed and it is weighted with a
value. Score of outlinerness is found if s is higher
than the threshold x is an outlier for computing the
principal directions and finds the cosine similarity
to obtain the difference between the existing value
and new instance.
2.4 The Use of PCA for Anomaly Detection
In this section, we study the variation of
principal directions when we remove or add a
data instance, and how we utilize this property to
determine the outlierness of the target data point. We
use Figure 1 to illustrate the above observation. We
note that the clustered blue circles in Figure 1
represent normal data instances, the red square
denotes an outlier, and the green arrow is the
dominant principal direction. From Figure 1, we
see that the principal direction is deviated when
an outlier instance is added. More specifically,
the presence of such an outlier instance producesa large angle between the resulting and the
original principal directions. On the other hand,
this angle will be small when a normal data point is
added. Therefore, we will use this property to
determine the outlierness of the target data point
using the LOO strategy. We now present the idea of
combining PCA and the LOO strategy for anomaly
detection. Given a data set A with n data
instances, we first extract the dominant principal
direction u from it. If the target instance is x t, we
next compute the leading principal direction without
x t present. To identify the outliers in a dataset,
we simply repeat this procedure times with the
LOO strategy (one for each target instance).
3.
EXPERIMENTAL WORK
The above four techniques are evaluated with KDD
cup data set. The description about the dataset and
the results are tabulated below.
A. Description of KDD cup dataset
Software to detect intrusions protects the computer
network from unauthorised users. Intrusion
Detection System is capable of distinguishing bad
connections, called intrusions, and good normal
connections [8]. A connection is a sequence of TCP
packets between which data flow from source totarget under some well defined protocol.
This dataset was collected from a LAN setup which
is operated as a true Air Force environment After
two week of heavy network traffic the collected
test data yields 2 million connection records.
Each connection is labelled as either normal data or
an attack [8].
Denial-of-service attacks (DOS)
Probing surveillance attacks e.g., port
scanning
Remote-to-local attacks (R2L) e.g.
guessing password
User-to-root attacks (U2R) e.g., various
buffer overflow attacks the test data may
include some other attacks to make it more
realistic.
B. Performance Evaluation
The outlier detection techniques that are surveyed
in this paper are evaluated with KDD cup
dataset. Then the results are tabulated.
Common measures for evaluating anomaly detection
problems:
Recall (Detection rate) ratio between the number
of correctly detected anomalies and the total number
of anomalies
False alarm (false positive) rate ratio between
the number of data records from normal class
that are misclassified as anomalies and total
number of data records from normal class ROC
-
8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling
4/5
Curve is a trade-off between Recall and False alarm
Area under the ROC curve (AUC) is computed using
a trapezoid rule
TABLE I
AUC Scores of oversampling PCA (osPCA), Local
Outlier Factor (LOF), Distance-Based on KDD cupdataset
From the above tabulation it is clear that osPCA
approach provides accurate detection with high AUC
score and minimum false rate [2]-[7].
4. RELATED WORK
Most of the anomaly detection algorithms are
proposed in [2], [6], [8].The methods can be
density based and distance based, statistical
approach. In density based method [8] it uses Local
Outlier Factor (LOF) in which it assigns a degree to
each of the data instance and LOF is approximately
determined to be 1.Higher the LOF declares that the
data point is an outlier and lower the LOF declares
that the data point is not an outlier. LOF judges the
structure of data through the density of the data point.
In distance based method [6] the distance between
the data point and the neighbor data point is
evaluated, if the distance between them is
beyond the estimated threshold value then it is
considered as an outlier. In statistical approach in
which data is provide with certain standard and it
identifies the outlier when data deviate the standard
and also it is unfair for the data with noise. It is also
not feasible for assumption of online and offline data.
The latest outlier detection methodology is Angle
Based Outlier Detection(ABOD) method in which
it measure the angle between target instance and
the existing data points and it determines data point
as an outlier only when the angle between them is
large. It has the limitation that it requires high
computation complexity for measuring the angle
for instance pairs is tedious. The above methods arenot suitable for online and continuous data and also it
requires higher memory requirement and
computation in high dimensional space. In this
paper, we propose the ICA will substantially used
among multiple instances to find an outlier.
5. FUTURESCOPE
The advanced PCA is the efficient technique as it
reduces the memory requirement and the
computational cost. This technique cab be able
to handle the high dimensional data without the
large iterations. Communication protocol helps in
reducing the computational cost. Due to the online
storage the memory requirement is also reduced.
6.
FUTURE WORK
The future work can be carried on the normal data
with the multi clustering structure. It is also required
to focus on the extremely high dimensional data so as
to solve the problem of curse of dimensionality.
7. CONCLUSION
An online anomaly detection method based on
oversample PCA, and thus can successfully use
the variation of the dominant principal direction
to identify the presence of rare but abnormal data,online osPCA is preferable for online large-scale
or streaming data problems, compared with other
anomaly detection methods, this approach is able
to achieve satisfactory results while significantly
reducing computational costs and memory
requirements. This paper provides a solution to the
curse of dimensionality problem in the pair wise
scoring techniques. Using dynamic programming
alignment matrix is calculated.(similar to
Needleman-W unsch method).The scoring method
is used to calculate alignment matrix. Trace back
begins at the element in the matrix with the
maximum score. Trace back continues until a cell
with a score of zero is reached. Local alignment
based on trace back path has been calculated.
REFERENCES
[1] W. Lee and S. Stolfo, Data mining approaches
for intrusion detection Proc. of the 7th USENIX
security symposium, 1998.
[2] E. Bloedorn, et al., Data Mining for Network
Intrusion Detection: How to Get Started, MITRE
Technical Report, August 2001.
[3 ] A.K. Jones and R.S. Sielken, Computer SystemIntrusion Detection: A Survey. Technical report,
University of Virginia Computer Science
Department, 1999
[4] V.Barnett and T. Lewis, Outliers in Statistical
Data, John Wiley Sons, 2006
[5] D.M. Hawkins, Identification of Outliers.
Chapman and Hall, 1980
-
8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling
5/5
[6] F.Angiulli, S.Basta, and C.Pizzuti, Distance
Based Detection and Prediction of Outliers, IEEE
Trans.on Knowledge and Data Eng., vol. 18, no. 2,
pp.145-160, May 2006.
[7] A.Lazarevic, L.Ertoz, V.Kumar, A.Ozgur, and
J.Srivastava,A Comparitive Study of Anomaly
Detection Schemes In Network Intrusion Detection,
Proc. Third SIAM Intl Conf. Data Mining 2003
[8] M.Breunig, H.-P.Kriegel, R.T. Ng, and
J.Sander,LOF Identifying Density Based Local
Outliers, Proc. ACM SIGMOD Intl Conf.
Management of Data 2000