lsj1360 anomaly detection via online oversampling

Upload: nagabhushanamdonthineni

Post on 02-Jun-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

    1/5

    A Novel on Intrusion Detection Using Outlier Detection PCA

    and ICA Techniques

    ABSTRACT

    Outlier detection is the process of identifying unusual behavior. It is widely used in datamining, for example, to identify customer behavioral change, fraud and manufacturingflaws. In recent years many researchers had proposed several concepts to obtain the

    optimal result in detecting the anomalies. But the process of PCA made it challenging due to its

    computations. In order to overcome the computational complexity, online oversampling PCAhas been used. The algorithm enables quick Online updating of the principal directions for

    the effective computation and satisfying the online detecting demand and also oversampling

    will improve the impact of outliers which leads to accurate detection of outliers. Experimental

    results show that this method is effective in computation time and need less memory

    requirements also clustering technique is added to it for optimization.

    KEYWORDS: Outlier Detection PCA, Description of KDD cup dataset, Use of PCA forAnomaly Detection.

    1.

    INTRODUCTION

    In the modern world, computer has become an

    inevitable resource. To get connected with one

    another and to share the information, network has

    become an emerging technology. As the users are

    geographically distributed, resources and

    information sharing are done only via internet.

    The users are not aware that the people with

    whom they are connected are authorized users. So

    that many hackers pretend to be the authorized

    users and try to hack the private information. To

    overcome this problem Intrusion Detection System

    (IDS) is developed [2].

    The idea of Intrusion Detection was started in 1980.

    Any action that is performed by a user illegally

    is called Intrusion. Intrusion Detection is detecting

    such anomalous activity at computing and

    networking resources [1].

    The techniques for intrusion detection are of two

    major categories as shown in Fig. 1: Misuse

    Detection and Anomaly Detection. Misuse detection-

    Catch the intrusions in terms of the characteristics ofknown attacks or system vulnerabilities [3, 4 And 5].

    Anomaly detection- Detect any action that

    significantly deviates from the normal behavior.

    Anomaly detection is robust against new types of

    attacks and it can learn by examples no need to write

    rules by hand. So, it has become hot topic in the field

    of computer security [1], [2].

    Fig. 1Classification of Intrusion Detection

    Data Mining is the growing technology in the field of

    Computer Science. The trend of research and

    development on data mining is expected to be

    flourishing because the huge amounts of data

    have been collected in databases and the necessity

    of understanding and making good use of such

    data in decision making has served as the driving

    force in data mining. Moreover, with the fast

    computerization of the society, the social impact of

    data mining should not be under-estimated. When a

    large amount of interrelated data is effectively

    analyzed from different perspectives, it can pose

    threats to the goal of protecting data security and

    guarding against the incursion of privacy. It is a

    tricky task to develop effective techniques for

    preventing the disclosure of sensitive information

    in data mining, especially as the use of data

    mining is rapidly increasing in various domains

  • 8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

    2/5

    ranging from business examination, customer

    analysis to medicine and government.

    Data mining involves six familiar classes of tasks:

    Anomaly detection

    Association rule learning

    Clustering

    Classification

    Regression

    Sequential pattern mining

    The presence of errors, missing values and outliers

    may affect the tasks performed on normal data. So

    the outliers are predicted and removed from normal

    data. The removal of outliers introduced the

    emerging task called Anomaly Detection. It has

    used in many applications like NIDS (Network

    Intrusion Detection System), credit card fraud

    detection, medical application, industrial damage etc.

    In the Intrusion Detection System, anomaly

    detection has been performed by many techniques.

    Among which outlier detection technique is

    applicable for many online applications and it also

    involves many approaches. This paper is based on

    the survey made on intrusion detection system

    that includes outlier detection techniques [1]. It

    includes both supervised as well as unsupervised

    learning approaches.

    2.

    OUTLIER DETECTION PCA.

    2.1 Outlier detection via PCA

    PCA is a well known unsupervised

    dimensionality reduction method which determines

    the principal directions of the data distribution.

    To obtain these principal directions, one needs to

    construct the data covariance matrix and calculate

    its dominant eigenvectors. These eigenvectors will

    be the most informative among the vectors in the

    original data space, and are considered as the

    principal directions. The black circle represents

    normal data instances, the red circle denotes an

    outlier and the arrow is the dominant principal

    direction. The principal direction is deviated when an

    outlier instance is added. More specifically, thepresence of such an outlier instance produces a large

    angle between the resulting and the original principal

    directions. On the other hand, this angle will be small

    when a normal data point is added. Therefore, we

    will use this property to determine the outlier ness

    of the target data point using the LOO

    strategy.PCA requires the calculation of global

    mean and data covariance matrix, we found that

    both of them are sensitive to the presence of

    outliers. There are outliers present in the data

    dominant eigenvectors produced by PCA will be

    remarkably affected by them and this will produce a

    significant variation of the resulting principal

    directions.

    Fig 2: The effect of addition and deletion of an

    outlier as single/duplicated points

    2.2 Oversampling PCA (os PCA)

    The os PCA scheme will duplicate the target

    instance multiple times, and the idea is to

    amplify the effect of outlier rather than that of

    normal data. While it might not be sufficient to

    perform anomaly detection simply based on the most

    dominant eigenvector and ignore the remaining ones,

    online os PCA method aims to efficiently determine

    the anomaly of each target instance without

    sacrificing computation and memory efficiency.

    More specifically, if the target instance is an

    outlier, this over- sampling scheme allows us to

    overemphasize its effect on the most dominant

    eigenvector and thus we can focus on extracting

    and approximating the dominant principal direction

    in an online fashion, instead of calculating multiple

    eigenvectors carefully.

    The os PCA will duplicate the target instances n

    no of times from data set and computes the score

    of outlier. If the score is above the threshold it is

    determined to be an outlier. The power method for os

    PCA is also provides the solution in solving the

    Eigen value decomposition problem. The solving of

    PCA to calculate the principal directions n times

    using n instances and it is very costlier and itprohibits the online data in anomaly detection. Using

    LOO strategy it is not necessary to recomputed

    covariance matrix for all instances and thus find

    the difference between the normal and outlier data is

    easily identified.

    The dominant Eigen value is more than others

    and it is shown that power method requires only

    multiplication of matrix not matrix decomposition.

  • 8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

    3/5

    So this method is better for reducing the computation

    cost in calculation of dominant principal direction.

    The cost of memory is O (np) for matrix

    multiplication. However the use of power method

    does not guarantee fast convergence and it also need

    to store covariance matrix which not used in high

    dimensional data, so we prefer the least square

    approximation method.

    2.3 Oversampling PCA with Anomaly Detection

    Effective and efficient updating technique for

    osPCA, which allows us to determine the

    principal direction of the data. This updating

    process makes anomaly detection in online or

    streaming data settings feasible. More importantly,

    since we only need to calculate the solution of the

    original PCA offline, we do not need to keep the

    entire covariance or outer matrix in the entire

    updating process. Once the final principal

    direction is determined, we use the cosinesimilarity to determine the difference between the

    current solution and the original one (without

    oversampling), and thus the score of outlierness

    for the target instance can be determined

    accordingly. An algorithm anomaly detection via

    online oversampling PCA in which data matrix

    with values are transposed and it is weighted with a

    value. Score of outlinerness is found if s is higher

    than the threshold x is an outlier for computing the

    principal directions and finds the cosine similarity

    to obtain the difference between the existing value

    and new instance.

    2.4 The Use of PCA for Anomaly Detection

    In this section, we study the variation of

    principal directions when we remove or add a

    data instance, and how we utilize this property to

    determine the outlierness of the target data point. We

    use Figure 1 to illustrate the above observation. We

    note that the clustered blue circles in Figure 1

    represent normal data instances, the red square

    denotes an outlier, and the green arrow is the

    dominant principal direction. From Figure 1, we

    see that the principal direction is deviated when

    an outlier instance is added. More specifically,

    the presence of such an outlier instance producesa large angle between the resulting and the

    original principal directions. On the other hand,

    this angle will be small when a normal data point is

    added. Therefore, we will use this property to

    determine the outlierness of the target data point

    using the LOO strategy. We now present the idea of

    combining PCA and the LOO strategy for anomaly

    detection. Given a data set A with n data

    instances, we first extract the dominant principal

    direction u from it. If the target instance is x t, we

    next compute the leading principal direction without

    x t present. To identify the outliers in a dataset,

    we simply repeat this procedure times with the

    LOO strategy (one for each target instance).

    3.

    EXPERIMENTAL WORK

    The above four techniques are evaluated with KDD

    cup data set. The description about the dataset and

    the results are tabulated below.

    A. Description of KDD cup dataset

    Software to detect intrusions protects the computer

    network from unauthorised users. Intrusion

    Detection System is capable of distinguishing bad

    connections, called intrusions, and good normal

    connections [8]. A connection is a sequence of TCP

    packets between which data flow from source totarget under some well defined protocol.

    This dataset was collected from a LAN setup which

    is operated as a true Air Force environment After

    two week of heavy network traffic the collected

    test data yields 2 million connection records.

    Each connection is labelled as either normal data or

    an attack [8].

    Denial-of-service attacks (DOS)

    Probing surveillance attacks e.g., port

    scanning

    Remote-to-local attacks (R2L) e.g.

    guessing password

    User-to-root attacks (U2R) e.g., various

    buffer overflow attacks the test data may

    include some other attacks to make it more

    realistic.

    B. Performance Evaluation

    The outlier detection techniques that are surveyed

    in this paper are evaluated with KDD cup

    dataset. Then the results are tabulated.

    Common measures for evaluating anomaly detection

    problems:

    Recall (Detection rate) ratio between the number

    of correctly detected anomalies and the total number

    of anomalies

    False alarm (false positive) rate ratio between

    the number of data records from normal class

    that are misclassified as anomalies and total

    number of data records from normal class ROC

  • 8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

    4/5

    Curve is a trade-off between Recall and False alarm

    Area under the ROC curve (AUC) is computed using

    a trapezoid rule

    TABLE I

    AUC Scores of oversampling PCA (osPCA), Local

    Outlier Factor (LOF), Distance-Based on KDD cupdataset

    From the above tabulation it is clear that osPCA

    approach provides accurate detection with high AUC

    score and minimum false rate [2]-[7].

    4. RELATED WORK

    Most of the anomaly detection algorithms are

    proposed in [2], [6], [8].The methods can be

    density based and distance based, statistical

    approach. In density based method [8] it uses Local

    Outlier Factor (LOF) in which it assigns a degree to

    each of the data instance and LOF is approximately

    determined to be 1.Higher the LOF declares that the

    data point is an outlier and lower the LOF declares

    that the data point is not an outlier. LOF judges the

    structure of data through the density of the data point.

    In distance based method [6] the distance between

    the data point and the neighbor data point is

    evaluated, if the distance between them is

    beyond the estimated threshold value then it is

    considered as an outlier. In statistical approach in

    which data is provide with certain standard and it

    identifies the outlier when data deviate the standard

    and also it is unfair for the data with noise. It is also

    not feasible for assumption of online and offline data.

    The latest outlier detection methodology is Angle

    Based Outlier Detection(ABOD) method in which

    it measure the angle between target instance and

    the existing data points and it determines data point

    as an outlier only when the angle between them is

    large. It has the limitation that it requires high

    computation complexity for measuring the angle

    for instance pairs is tedious. The above methods arenot suitable for online and continuous data and also it

    requires higher memory requirement and

    computation in high dimensional space. In this

    paper, we propose the ICA will substantially used

    among multiple instances to find an outlier.

    5. FUTURESCOPE

    The advanced PCA is the efficient technique as it

    reduces the memory requirement and the

    computational cost. This technique cab be able

    to handle the high dimensional data without the

    large iterations. Communication protocol helps in

    reducing the computational cost. Due to the online

    storage the memory requirement is also reduced.

    6.

    FUTURE WORK

    The future work can be carried on the normal data

    with the multi clustering structure. It is also required

    to focus on the extremely high dimensional data so as

    to solve the problem of curse of dimensionality.

    7. CONCLUSION

    An online anomaly detection method based on

    oversample PCA, and thus can successfully use

    the variation of the dominant principal direction

    to identify the presence of rare but abnormal data,online osPCA is preferable for online large-scale

    or streaming data problems, compared with other

    anomaly detection methods, this approach is able

    to achieve satisfactory results while significantly

    reducing computational costs and memory

    requirements. This paper provides a solution to the

    curse of dimensionality problem in the pair wise

    scoring techniques. Using dynamic programming

    alignment matrix is calculated.(similar to

    Needleman-W unsch method).The scoring method

    is used to calculate alignment matrix. Trace back

    begins at the element in the matrix with the

    maximum score. Trace back continues until a cell

    with a score of zero is reached. Local alignment

    based on trace back path has been calculated.

    REFERENCES

    [1] W. Lee and S. Stolfo, Data mining approaches

    for intrusion detection Proc. of the 7th USENIX

    security symposium, 1998.

    [2] E. Bloedorn, et al., Data Mining for Network

    Intrusion Detection: How to Get Started, MITRE

    Technical Report, August 2001.

    [3 ] A.K. Jones and R.S. Sielken, Computer SystemIntrusion Detection: A Survey. Technical report,

    University of Virginia Computer Science

    Department, 1999

    [4] V.Barnett and T. Lewis, Outliers in Statistical

    Data, John Wiley Sons, 2006

    [5] D.M. Hawkins, Identification of Outliers.

    Chapman and Hall, 1980

  • 8/11/2019 LSJ1360 Anomaly Detection via Online Oversampling

    5/5

    [6] F.Angiulli, S.Basta, and C.Pizzuti, Distance

    Based Detection and Prediction of Outliers, IEEE

    Trans.on Knowledge and Data Eng., vol. 18, no. 2,

    pp.145-160, May 2006.

    [7] A.Lazarevic, L.Ertoz, V.Kumar, A.Ozgur, and

    J.Srivastava,A Comparitive Study of Anomaly

    Detection Schemes In Network Intrusion Detection,

    Proc. Third SIAM Intl Conf. Data Mining 2003

    [8] M.Breunig, H.-P.Kriegel, R.T. Ng, and

    J.Sander,LOF Identifying Density Based Local

    Outliers, Proc. ACM SIGMOD Intl Conf.

    Management of Data 2000