lsj1360 anomaly detection via online oversampling

8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

1/5

A Novel on Intrusion Detection Using Outlier Detection PCA

and ICA Techniques

ABSTRACT

Outlier detection is the process of identifying unusual behavior. It is widely used in datamining, for example, to identify customer behavioral change, fraud and manufacturingflaws. In recent years many researchers had proposed several concepts to obtain the

optimal result in detecting the anomalies. But the process of PCA made it challenging due to its

computations. In order to overcome the computational complexity, online oversampling PCAhas been used. The algorithm enables quick Online updating of the principal directions for

the effective computation and satisfying the online detecting demand and also oversampling

will improve the impact of outliers which leads to accurate detection of outliers. Experimental

results show that this method is effective in computation time and need less memory

requirements also clustering technique is added to it for optimization.

KEYWORDS: Outlier Detection PCA, Description of KDD cup dataset, Use of PCA forAnomaly Detection.

1.

INTRODUCTION

In the modern world, computer has become an

inevitable resource. To get connected with one

another and to share the information, network has

become an emerging technology. As the users are

geographically distributed, resources and

information sharing are done only via internet.

The users are not aware that the people with

whom they are connected are authorized users. So

that many hackers pretend to be the authorized

users and try to hack the private information. To

overcome this problem Intrusion Detection System

(IDS) is developed [2].

The idea of Intrusion Detection was started in 1980.

Any action that is performed by a user illegally

is called Intrusion. Intrusion Detection is detecting

such anomalous activity at computing and

networking resources [1].

The techniques for intrusion detection are of two

major categories as shown in Fig. 1: Misuse

Detection and Anomaly Detection. Misuse detection-

Catch the intrusions in terms of the characteristics ofknown attacks or system vulnerabilities [3, 4 And 5].

Anomaly detection- Detect any action that

significantly deviates from the normal behavior.

Anomaly detection is robust against new types of

attacks and it can learn by examples no need to write

rules by hand. So, it has become hot topic in the field

of computer security [1], [2].

Fig. 1Classification of Intrusion Detection

Data Mining is the growing technology in the field of

Computer Science. The trend of research and

development on data mining is expected to be

flourishing because the huge amounts of data

have been collected in databases and the necessity

of understanding and making good use of such

data in decision making has served as the driving

force in data mining. Moreover, with the fast

computerization of the society, the social impact of

data mining should not be under-estimated. When a

large amount of interrelated data is effectively

analyzed from different perspectives, it can pose

threats to the goal of protecting data security and

guarding against the incursion of privacy. It is a

tricky task to develop effective techniques for

preventing the disclosure of sensitive information

in data mining, especially as the use of data

mining is rapidly increasing in various domains


2/5

ranging from business examination, customer

analysis to medicine and government.

Data mining involves six familiar classes of tasks:

Anomaly detection

Association rule learning

Clustering

Classification

Regression

Sequential pattern mining

The presence of errors, missing values and outliers

may affect the tasks performed on normal data. So

the outliers are predicted and removed from normal

data. The removal of outliers introduced the

emerging task called Anomaly Detection. It has

used in many applications like NIDS (Network

Intrusion Detection System), credit card fraud

detection, medical application, industrial damage etc.

In the Intrusion Detection System, anomaly

detection has been performed by many techniques.

Among which outlier detection technique is

applicable for many online applications and it also

involves many approaches. This paper is based on

the survey made on intrusion detection system

that includes outlier detection techniques [1]. It

includes both supervised as well as unsupervised

learning approaches.

2.

OUTLIER DETECTION PCA.

2.1 Outlier detection via PCA

PCA is a well known unsupervised

dimensionality reduction method which determines

the principal directions of the data distribution.

To obtain these principal directions, one needs to

construct the data covariance matrix and calculate

its dominant eigenvectors. These eigenvectors will

be the most informative among the vectors in the

original data space, and are considered as the

principal directions. The black circle represents

normal data instances, the red circle denotes an

outlier and the arrow is the dominant principal

direction. The principal direction is deviated when an

outlier instance is added. More specifically, thepresence of such an outlier instance produces a large

angle between the resulting and the original principal

directions. On the other hand, this angle will be small

when a normal data point is added. Therefore, we

will use this property to determine the outlier ness

of the target data point using the LOO

strategy.PCA requires the calculation of global

mean and data covariance matrix, we found that

both of them are sensitive to the presence of

outliers. There are outliers present in the data

dominant eigenvectors produced by PCA will be

remarkably affected by them and this will produce a

significant variation of the resulting principal

directions.

Fig 2: The effect of addition and deletion of an

outlier as single/duplicated points

2.2 Oversampling PCA (os PCA)

The os PCA scheme will duplicate the target

instance multiple times, and the idea is to

amplify the effect of outlier rather than that of

normal data. While it might not be sufficient to

perform anomaly detection simply based on the most

dominant eigenvector and ignore the remaining ones,

online os PCA method aims to efficiently determine

the anomaly of each target instance without

sacrificing computation and memory efficiency.

More specifically, if the target instance is an

outlier, this oversampling scheme allows us to

overemphasize its effect on the most dominant

eigenvector and thus we can focus on extracting

and approximating the dominant principal direction

in an online fashion, instead of calculating multiple

eigenvectors carefully.

The os PCA will duplicate the target instances n

no of times from data set and computes the score

of outlier. If the score is above the threshold it is

determined to be an outlier. The power method for os

PCA is also provides the solution in solving the

Eigen value decomposition problem. The solving of

PCA to calculate the principal directions n times

using n instances and it is very costlier and itprohibits the online data in anomaly detection. Using

LOO strategy it is not necessary to recomputed

covariance matrix for all instances and thus find

the difference between the normal and outlier data is

easily identified.

The dominant Eigen value is more than others

and it is shown that power method requires only

multiplication of matrix not matrix decomposition.


3/5

So this method is better for reducing the computation

cost in calculation of dominant principal direction.

The cost of memory is O (np) for matrix

multiplication. However the use of power method

does not guarantee fast convergence and it also need

to store covariance matrix which not used in high

dimensional data, so we prefer the least square

approximation method.

2.3 Oversampling PCA with Anomaly Detection

Effective and efficient updating technique for

osPCA, which allows us to determine the

principal direction of the data. This updating

process makes anomaly detection in online or

streaming data settings feasible. More importantly,

since we only need to calculate the solution of the

original PCA offline, we do not need to keep the

entire covariance or outer matrix in the entire

updating process. Once the final principal

direction is determined, we use the cosinesimilarity to determine the difference between the

current solution and the original one (without

oversampling), and thus the score of outlierness

for the target instance can be determined

accordingly. An algorithm anomaly detection via

online oversampling PCA in which data matrix

with values are transposed and it is weighted with a

value. Score of outlinerness is found if s is higher

than the threshold x is an outlier for computing the

principal directions and finds the cosine similarity

to obtain the difference between the existing value

and new instance.

2.4 The Use of PCA for Anomaly Detection

In this section, we study the variation of

principal directions when we remove or add a

data instance, and how we utilize this property to

determine the outlierness of the target data point. We

use Figure 1 to illustrate the above observation. We

note that the clustered blue circles in Figure 1

represent normal data instances, the red square

denotes an outlier, and the green arrow is the

dominant principal direction. From Figure 1, we

see that the principal direction is deviated when

an outlier instance is added. More specifically,

the presence of such an outlier instance producesa large angle between the resulting and the

original principal directions. On the other hand,

this angle will be small when a normal data point is

added. Therefore, we will use this property to

determine the outlierness of the target data point

using the LOO strategy. We now present the idea of

combining PCA and the LOO strategy for anomaly

detection. Given a data set A with n data

instances, we first extract the dominant principal

direction u from it. If the target instance is x t, we

next compute the leading principal direction without

x t present. To identify the outliers in a dataset,

we simply repeat this procedure times with the

LOO strategy (one for each target instance).

3.

EXPERIMENTAL WORK

The above four techniques are evaluated with KDD

cup data set. The description about the dataset and

the results are tabulated below.

A. Description of KDD cup dataset

Software to detect intrusions protects the computer

network from unauthorised users. Intrusion

Detection System is capable of distinguishing bad

connections, called intrusions, and good normal

connections [8]. A connection is a sequence of TCP

packets between which data flow from source totarget under some well defined protocol.

This dataset was collected from a LAN setup which

is operated as a true Air Force environment After

two week of heavy network traffic the collected

test data yields 2 million connection records.

Each connection is labelled as either normal data or

an attack [8].

Denial-of-service attacks (DOS)

Probing surveillance attacks e.g., port

scanning

Remote-to-local attacks (R2L) e.g.

guessing password

User-to-root attacks (U2R) e.g., various

buffer overflow attacks the test data may

include some other attacks to make it more

realistic.

B. Performance Evaluation

The outlier detection techniques that are surveyed

in this paper are evaluated with KDD cup

dataset. Then the results are tabulated.

Common measures for evaluating anomaly detection

problems:

Recall (Detection rate) ratio between the number

of correctly detected anomalies and the total number

of anomalies

False alarm (false positive) rate ratio between

the number of data records from normal class

that are misclassified as anomalies and total

number of data records from normal class ROC


4/5

Curve is a trade-off between Recall and False alarm

Area under the ROC curve (AUC) is computed using

a trapezoid rule

TABLE I

AUC Scores of oversampling PCA (osPCA), Local

Outlier Factor (LOF), Distance-Based on KDD cupdataset

From the above tabulation it is clear that osPCA

approach provides accurate detection with high AUC

score and minimum false rate [2]-[7].

4. RELATED WORK

Most of the anomaly detection algorithms are

proposed in [2], [6], [8].The methods can be

density based and distance based, statistical

approach. In density based method [8] it uses Local

Outlier Factor (LOF) in which it assigns a degree to

each of the data instance and LOF is approximately

determined to be 1.Higher the LOF declares that the

data point is an outlier and lower the LOF declares

that the data point is not an outlier. LOF judges the

structure of data through the density of the data point.

In distance based method [6] the distance between

the data point and the neighbor data point is

evaluated, if the distance between them is

beyond the estimated threshold value then it is

considered as an outlier. In statistical approach in

which data is provide with certain standard and it

identifies the outlier when data deviate the standard

and also it is unfair for the data with noise. It is also

not feasible for assumption of online and offline data.

The latest outlier detection methodology is Angle

Based Outlier Detection(ABOD) method in which

it measure the angle between target instance and

the existing data points and it determines data point

as an outlier only when the angle between them is

large. It has the limitation that it requires high

computation complexity for measuring the angle

for instance pairs is tedious. The above methods arenot suitable for online and continuous data and also it

requires higher memory requirement and

computation in high dimensional space. In this

paper, we propose the ICA will substantially used

among multiple instances to find an outlier.

5. FUTURESCOPE

The advanced PCA is the efficient technique as it

reduces the memory requirement and the

computational cost. This technique cab be able

to handle the high dimensional data without the

large iterations. Communication protocol helps in

reducing the computational cost. Due to the online

storage the memory requirement is also reduced.

6.

FUTURE WORK

The future work can be carried on the normal data

with the multi clustering structure. It is also required

to focus on the extremely high dimensional data so as

to solve the problem of curse of dimensionality.

7. CONCLUSION

An online anomaly detection method based on

oversample PCA, and thus can successfully use

the variation of the dominant principal direction

to identify the presence of rare but abnormal data,online osPCA is preferable for online large-scale

or streaming data problems, compared with other

anomaly detection methods, this approach is able

to achieve satisfactory results while significantly

reducing computational costs and memory

requirements. This paper provides a solution to the

curse of dimensionality problem in the pair wise

scoring techniques. Using dynamic programming

alignment matrix is calculated.(similar to

Needleman-W unsch method).The scoring method

is used to calculate alignment matrix. Trace back

begins at the element in the matrix with the

maximum score. Trace back continues until a cell

with a score of zero is reached. Local alignment

based on trace back path has been calculated.

REFERENCES

[1] W. Lee and S. Stolfo, Data mining approaches

for intrusion detection Proc. of the 7th USENIX

security symposium, 1998.

[2] E. Bloedorn, et al., Data Mining for Network

Intrusion Detection: How to Get Started, MITRE

Technical Report, August 2001.

[3 ] A.K. Jones and R.S. Sielken, Computer SystemIntrusion Detection: A Survey. Technical report,

University of Virginia Computer Science

Department, 1999

[4] V.Barnett and T. Lewis, Outliers in Statistical

Data, John Wiley Sons, 2006

[5] D.M. Hawkins, Identification of Outliers.

Chapman and Hall, 1980


5/5

[6] F.Angiulli, S.Basta, and C.Pizzuti, Distance

Based Detection and Prediction of Outliers, IEEE

Trans.on Knowledge and Data Eng., vol. 18, no. 2,

pp.145-160, May 2006.

[7] A.Lazarevic, L.Ertoz, V.Kumar, A.Ozgur, and

J.Srivastava,A Comparitive Study of Anomaly

Detection Schemes In Network Intrusion Detection,

Proc. Third SIAM Intl Conf. Data Mining 2003

[8] M.Breunig, H.-P.Kriegel, R.T. Ng, and

J.Sander,LOF Identifying Density Based Local

Outliers, Proc. ACM SIGMOD Intl Conf.

Management of Data 2000

lsj1360 anomaly detection via online oversampling

Documents