


A COMPARATIVE STUDY OF DATA MINING ALGORITHMS FOR NETWORK INTRUSION DETECTION

Mrutyunjaya Panda
Dept. of Electronics and Comm. Engg., GIET, Gunupur-765022, India
E-Mail: [email protected]

Manas Ranjan Patra
Dept. of Computer Science, Berhampur University-760 007, India
E-Mail: [email protected]

Abstract

Data mining techniques are being applied to build intrusion detection systems that protect computing resources against unauthorised access. In this paper, the performance of three well known data mining classifier algorithms, namely ID3, J48 and Naïve Bayes, is evaluated using the 10-fold cross-validation test. Experimental results on the KDDCup'99 IDS data set demonstrate that while Naïve Bayes is one of the most effective inductive learning algorithms, decision trees are more interesting as far as the detection of new attacks is concerned.

Key Words: IDS, ID3, J48, Naïve Bayes, Cross-Validation, Confusion matrix, ROC, Precision-Recall Characteristic.

1. Introduction

Modern computer networks must be equipped with appropriate security mechanisms in order to protect the information resources maintained by them. Intrusion detection systems (IDSs) are integral parts of any well configured and managed computer network system. An IDS is a combination of software and hardware components capable of monitoring different activities in a network and analysing them for signs of security threats. There are two major approaches to intrusion detection: anomaly detection and misuse detection. Misuse detection uses patterns of well known intrusions to match and identify unlabelled data; in fact, many commercial and open source intrusion detection systems are misuse based. Anomaly detection, on the other hand, consists of building models from normal data which can be used to detect deviations of the observed data from the normal model. The advantage of anomaly detection algorithms is that they can detect new forms of attack, which are likely to deviate from normal behaviour [1].

In this paper, various supervised learning algorithms, in particular the ID3 and J48 decision tree algorithms and the Naïve Bayes algorithm, are explored for network intrusion detection. It has been argued that the construction of classifiers often involves a sophisticated design stage, whereas the performance assessment that follows is less sophisticated and sometimes inadequate [2]. It is also important to note that nearly all methods measure classifier performance based solely on the classification accuracy over a very limited set of given instances; examples of such methods include cross-validation tests, confidence tests, least-squared error, lift charts and ROC curves [3]. We use the cross-validation test for performance evaluation. The 10% KDDCup'99 intrusion detection dataset [4], generated by the MIT Lincoln Laboratory, has been used in our experiments. The rest of the paper is organised as follows: Section 2 provides a detailed insight into the various data mining algorithms used. In Section 3, cross-validation is discussed. Finally, results and discussions are outlined in Section 4, followed by the conclusion and future scope in Section 5.

2. Classification Algorithms

Systems that construct classifiers are among the most commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs. Datasets can have nominal, numeric or mixed attributes and classes. Not all classification algorithms perform equally well for different types of attributes and classes, or for datasets of different sizes. In order to design a generic classification tool, one should consider the behaviour of various existing classification algorithms on different datasets. In this work, classification algorithms based on decision trees and Naïve Bayes are analysed. These are applied to the KDDCup'99 dataset and then evaluated for accuracy using the 10-fold cross-validation strategy [5]. The selection of a classification algorithm depends on the requirements and the nature of the classification required.

First International Conference on Emerging Trends in Engineering and Technology, Nagpur, Maharashtra, India, 16-18 July 2008. 978-0-7695-3267-7/08 $25.00 © 2008 IEEE. DOI 10.1109/ICETET.2008.80


2.1 Decision Trees

Decision tree induction has been studied in detail in both pattern recognition and machine learning. Work in this area synthesizes the experience gained in machine learning and describes a computer program called ID3, which has evolved into a new system named C4.5 [6]. J48 is an enhanced version of C4.5.

2.1.1 ID3 Algorithm

ID3 is a well known inductive learning method developed by Quinlan [7]. It is essentially an attribute-based machine learning algorithm that constructs a decision tree from a training set of data, using an entropy measure to build the leaves of the tree. Informally, ID3 proceeds as follows (a code sketch is given after the list):

• Determine the attribute that has the highest information gain on the training set.
• Use this attribute as the root of the tree, and create a branch for each of the values that the attribute can take.
• For each of the branches, repeat this process with the subset of the training set that is classified by that branch.
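To make the attribute-selection step concrete, the following minimal Python sketch (an illustration, not code from the paper; it assumes nominal attributes and discrete class labels) computes the entropy and information gain that ID3 maximises:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in entropy obtained by splitting on one nominal attribute."""
    n = len(labels)
    # Partition the labels by the value the attribute takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# ID3 makes the highest-gain attribute the root of the tree, then recurses
# on the subset of rows that reaches each branch.
rows = [("tcp", "SF"), ("udp", "SF"), ("tcp", "REJ")]
labels = ["normal", "normal", "attack"]
root_attr = max(range(len(rows[0])),
                key=lambda i: information_gain(rows, labels, i))
```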

2.1.2 J48 Algorithm

J48 (an enhanced version of C4.5) is based on the ID3 algorithm developed by Ross Quinlan [6], with additional features to address problems that ID3 was unable to deal with. In practice, C4.5 uses a successful method for finding high-accuracy hypotheses, based on pruning the rules issued from the tree constructed during the learning phase. However, the principal disadvantage of C4.5 rule sets is the amount of CPU time and memory they require. Given a set S of cases, J48 first grows an initial tree using the divide-and-conquer algorithm as follows:

• If all the cases in S belong to the same class, or S is small, the tree is a leaf labelled with the most frequent class in S.
• Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree, with one branch for each outcome of the test; partition S into corresponding subsets S1, S2, … according to the outcome for each case; and apply the same procedure recursively to each subset.

There are usually many tests that could be chosen in this last step. J48 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets {Si} (but is heavily biased towards tests with numerous outcomes), and the default gain ratio, which divides the information gain by the information provided by the test outcomes, as sketched below.
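A minimal sketch of the gain ratio criterion (illustrative only; attribute values are assumed nominal, and the information gain is taken as given):

```python
from collections import Counter
from math import log2

def split_information(values):
    """Entropy of the split itself: the information generated by the
    outcomes of the test, independent of the class labels."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(info_gain, attr_values):
    """Information gain normalised by split information, penalising tests
    that fragment the data into many small outcomes."""
    si = split_information(attr_values)
    return info_gain / si if si > 0 else 0.0

# Two tests with equal information gain: the one with more distinct
# outcomes has higher split information, so its gain ratio is lower.
print(gain_ratio(0.5, ["tcp", "tcp", "udp", "udp"]))  # 0.5 / 1.0 = 0.5
print(gain_ratio(0.5, ["a", "b", "c", "d"]))          # 0.5 / 2.0 = 0.25
```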

After the building process, each attribute test along the path from the root to a leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition). To illustrate the post-pruning of the rules, consider the following rule generated from the tree:

IF (service = login) ^ (flag = SF) THEN class = ftp_write

This rule is pruned by removing any antecedent whose removal does not worsen its estimated accuracy. The pruning algorithm is based on a pessimistic estimate of the error rate associated with a set of N cases, E of which do not belong to the most frequent class. Instead of E/N, J48 determines the upper limit of the binomial probability when E events have been observed in N trials, using a user-specified confidence whose default value is 0.25, as illustrated below.
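One way to realise this pessimistic estimate (a sketch under the assumption that the Clopper-Pearson upper bound is an acceptable stand-in; C4.5's exact formula uses a slightly different approximation of the binomial limit):

```python
from scipy.stats import beta

def pessimistic_error(E, N, CF=0.25):
    """Upper confidence limit of the binomial error probability when E of N
    cases are misclassified, at the user-specified confidence CF
    (0.25 is J48's default)."""
    if E >= N:
        return 1.0
    # Clopper-Pearson upper confidence bound for a binomial proportion.
    return beta.ppf(1.0 - CF, E + 1, N - E)

# The observed error rate E/N is replaced by this larger, pessimistic value,
# so an antecedent is dropped only if the estimate does not get worse.
print(pessimistic_error(1, 20))   # noticeably larger than 1/20 = 0.05
```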

In addition to the advantages cited by Mitchell [8], the pruned rules have several advantages for intrusion detection. Since the rules have the IF…THEN… format, they can be used as a model for rule-based intrusion detection. The J48 rules that are generated are concise and intuitive; therefore, they can be checked and inspected by a security expert for further investigation. J48 rules thus have interesting properties for intrusion detection, since they generalize well: new intrusions may appear after the building process whose forms are quite similar to attacks known a priori, and using the generalization ability of the rules, such attack variations can still be detected. Real-time IDSs require short rules for efficiency; post-pruning the rules produces accurate conditions and avoids overfitting, thus improving the execution time for real-time intrusion detection.

2.2 Naïve Bayes Algorithm

Naïve Bayes [9] is an application of Bayes' theorem under the assumption that the attributes are independent. This assumption is not strictly correct, for example when classifying text extracted from a document, since there are relationships between the words that accumulate into concepts. Problems of this kind, called problems of supervised classification, are ubiquitous. Naïve Bayes is sometimes also called idiot's Bayes, simple Bayes, or independence Bayes. The method is important for several reasons: it is easy to construct, without any need for complicated iterative parameter estimation schemes, which means it can readily be applied to huge data sets; it is robust and easy to interpret; and it often does surprisingly well, even though it may not be the best classifier in any particular application. A general discussion of the Naïve Bayes method and its merits can be found in [10]. A brief sketch of such a classifier is given below.
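As an illustration (a sketch using scikit-learn's CategoricalNB on toy connection records; the paper's experiments use WEKA's implementation, and the feature encoding here is an assumption):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy connection records: (protocol, service, flag) -> class.
X_raw = [["tcp", "http", "SF"], ["udp", "dns", "SF"],
         ["tcp", "ftp", "REJ"], ["tcp", "http", "SF"]]
y = ["normal", "normal", "attack", "normal"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

# P(class | x) is proportional to P(class) * prod_i P(x_i | class):
# the attributes are treated as conditionally independent given the class.
clf = CategoricalNB(alpha=1.0).fit(X, y)
print(clf.predict(enc.transform([["tcp", "http", "SF"]])))  # -> ['normal']
```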

3. Cross-Validation Test

Cross-validation (CV) tests exist in a number of variants, but the general idea is to divide the training data into a number of partitions or folds. The classifier is evaluated by its classification accuracy on one partition after having learned from the others; this procedure is repeated until all partitions have been used for evaluation. Some of the most common types are 10-fold, n-fold and bootstrap CV [9]; the difference between these types lies in the way the data is partitioned. Leave-one-out is the same as n-fold CV, where n stands for the number of instances in the data set: one instance is left out for testing while the classifier is trained on the remaining instances, and this is repeated until every instance has been left out once. It has been argued that the design of 10-fold cross-validation introduces a source of correlation, since one uses examples for training in one trial and for testing in another [11]. The procedure is sketched in code below.
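A minimal sketch of 10-fold cross-validation (illustrative scikit-learn code with stand-in data; in the paper, X and y come from the 10% KDDCup'99 set and the runs are performed in WEKA):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the paper uses the 10% KDDCup'99 connection records.
X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)

# Each of the 10 folds is held out once for testing while the classifier
# is trained on the remaining 9 folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print("mean accuracy =", scores.mean(), "+/-", scores.std())
```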


4. Experimental Results and Discussions

In this section, the performance of each of the above classification-based data mining algorithms is discussed along with its significance. The experiments are carried out using WEKA (Waikato Environment for Knowledge Analysis) [12], with the full data set for training and 10-fold cross-validation for testing. The results presented in the following sections were acquired on a Pentium 4 CPU at 2.8 GHz with 512 MB of RAM.

Evaluation and Discussion

This section presents experimental results using decision trees with the ID3 and J48 (extension of C4.5) algorithms [13], along with the results obtained from the Naïve Bayes algorithm. Table 1 compares the three data mining algorithms with respect to the time taken to build the model and the overall error rate. The confusion matrices for the Naïve Bayes, J48 and ID3 algorithms are shown in Table 2.

Table 1. Comparison of data mining algorithms

Experiment      Overall error rate   Time to build the model (s)
Best KDD        7.29%                -
Naïve Bayes     3.56%                0.11
J48             3.47%                0.88
ID3             3.47%                1.8

The performance of each algorithm is evaluated with the help of ROC (Receiver Operating Characteristic) analysis [14]. Measuring the area under the ROC curve (AUC) [15] gives a performance figure that reflects the effectiveness of an IDS, as shown in Figure 1; it can be viewed as a performance measure integrated over a region of possible operating points.

In this paper, precision-recall analysis, which is also treated as a performance evaluation criterion, is shown in Figure 2 in order to demonstrate the efficacy of the proposed analysis. It can be observed from all these analyses that Naïve Bayes performs better than the two decision tree algorithms; however, the decision trees are more robust than Naïve Bayes in detecting new intrusions. A sketch of how both criteria can be computed is given below.
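The following sketch traces both curves for a probability-scoring classifier (an illustration with stand-in imbalanced data, not the paper's WEKA workflow):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (auc, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data with a 9:1 class imbalance, mimicking the rarity of attacks.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scores are the predicted probabilities of the positive (attack) class.
scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)           # Figure 1 plots tpr vs fpr
print("AUC =", roc_auc_score(y_te, scores))

precision, recall, _ = precision_recall_curve(y_te, scores)
print("area under PR curve =", auc(recall, precision))  # Figure 2 analogue
```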

Figure 1. ROC analysis (detection performance of ID3, J48 and Naïve Bayes)

Figure 2. Precision-recall characteristic (recall rate versus precision rate for ID3, Naïve Bayes and J48)

The kappa statistics for all values, shown in Figure 3, suggest that the ability of the various classification methods examined can be considered good, i.e. the classifier stability is very strong. The root mean square error likewise suggests that the error rate is very small, which can be taken as a measure of the effectiveness of the model. It can also be observed from Figure 3 that the overall classification accuracy of Naïve Bayes across all classes is better than the overall accuracy obtained with the J48 and ID3 algorithms.
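The kappa statistic plotted in Figure 3 can be recovered from a confusion matrix as follows (a minimal sketch; the example matrix is hypothetical, not taken from the experiments):

```python
import numpy as np

def kappa(cm):
    """Cohen's kappa from a confusion matrix: agreement beyond chance."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    observed = np.trace(cm) / n                        # plain accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical 2-class matrix: rows = actual, columns = predicted.
print(kappa([[90, 10], [5, 95]]))   # -> 0.85
```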



Figure 3. Comparison of data mining algorithms (kappa statistic, accuracy and root mean square error for ID3, J48 and Naïve Bayes)

Table 2. Confusion matrix and comparison (rows: actual class; columns: predicted class)

Actual  Classifier    Probe   DoS     U2R   R2L
Probe   Naïve Bayes   756     4       1     27
        J48           735     41      0     0
        ID3           741     30      0     0
DoS     Naïve Bayes   0       23349   19    297
        J48           0       23400   0     0
        ID3           0       23400   0     0
U2R     Naïve Bayes   1       1       38    2
        J48           5       16      18    0
        ID3           5       15      18    0
R2L     Naïve Bayes   0       1       5     54
        J48           0       3       0     51
        ID3           0       1       0     53

Derived measures, per class (Probe / DoS / U2R / R2L):

FPR        Naïve Bayes   0.0014    0.26      0.000163   0.00025
           J48           0.0008    0.0458    0.0004     0.00026
           ID3           0.0006    0.0458    0.00038    0.000229
Precision  Naïve Bayes   96%       99%       90.47%     90%
           J48           93.16%    92.41%    40%        75%
           ID3           94.75%    92.41%    41.86%     77.94%
Recall     Naïve Bayes   99.8%     99.5%     60.3%      14.2%
           J48           86.17%    99.24%    100%       100%
           ID3           86.26%    99.3%     100%       100%
FNR        Naïve Bayes   0.13%     0.02%     39.68%     85.8%
           J48           13.83%    0.76%     0%         0%
           ID3           13.74%    0.69%     0%         0%
F-Value    Naïve Bayes   1         0.992     0.72       0.25
           J48           0.8953    0.9571    0.5714     0.8571
           ID3           0.9031    0.9573    0.5901     0.876

PSP for Naïve Bayes = 96.4349%; PSP for J48 = 96.5296%; PSP for ID3 = 96.5372%
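The per-class measures in Table 2 follow from the confusion counts by the standard definitions, sketched below for the Naïve Bayes rows (an illustration only; the printed figures may reflect different rounding or labelling conventions in the original):

```python
import numpy as np

# Naive Bayes confusion counts from Table 2 (rows = actual, cols = predicted),
# class order: Probe, DoS, U2R, R2L.
cm = np.array([[756,     4,  1,  27],
               [  0, 23349, 19, 297],
               [  1,     1, 38,   2],
               [  0,     1,  5,  54]], dtype=float)

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp      # predicted as the class but actually another
fn = cm.sum(axis=1) - tp      # actually the class but predicted as another
tn = cm.sum() - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)                    # per-class detection rate
fpr = fp / (fp + tn)                       # false positive rate
f_value = 2 * precision * recall / (precision + recall)
print(np.round(precision, 4))
print(np.round(recall, 4))
```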

5. Conclusions and Future Scope

In this paper, we have compared the effectiveness of the Naïve Bayes classification algorithm with two decision tree algorithms, namely ID3 and J48, which helps one to construct an effective network intrusion detection system. It is observed from our experimental results that the Naïve Bayes model is quite appealing because of its simplicity, elegance, robustness and effectiveness. On the other hand, decision trees have proven their efficiency in both generalization and detection of new attacks. The results show that there is no single best algorithm that outperforms the others in all situations; in certain cases performance depends on the characteristics of the data. To choose a suitable algorithm, a domain expert or expert system may employ the results of the classification in order to make better decisions. Since these models do not deal with unknown attacks, our future investigation is directed towards handling new attacks.

References

[1] D. Denning, "An Intrusion Detection Model", IEEE Transactions on Software Engineering, 13(2), 1987, pp. 222-232.
[2] N. M. Adams and D. J. Hand, "Improving the practice of classifier performance assessment", Neural Computation, vol. 12, no. 2, MIT Press, 2000, pp. 305-312.
[3] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", Morgan Kaufmann Publishers, ISBN 1-55860-552-5, 1999.
[4] KDD Cup'99 Intrusion Detection datasets. Available at: http://kdd.ics.uci.edu/databases/kddcup1999/kddcup99.html, 1999.
[5] S. White and I. Jagielska, "Investigation into the application of data mining techniques to classification of call centre data", in Decision Support in an Uncertain and Complex World: The IFIP TC8/WG8.3 Int. Conference, 2004, pp. 793-802.
[6] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, 1993.
[7] J. R. Quinlan, "Decision trees and decision-making", IEEE Transactions on Systems, Man, and Cybernetics, 20(2), March/April 1990.
[8] T. M. Mitchell, "Machine Learning", McGraw Hill, 1997.
[9] N. Lavesson and P. Davidsson, "Multi-dimensional measures function for classifier performance", in Proc. 2nd IEEE International Conference on Intelligent Systems, June 2004, pp. 508-513.
[10] M. Panda and M. R. Patra, "Network intrusion detection using Naïve Bayes", Int. Journal of Computer Science and Network Security, vol. 7, no. 12, 2007, pp. 258-263.
[11] M. A. Maloof, "On machine learning, ROC analysis and statistical tests of significance", in Proc. 16th International Conference on Pattern Recognition, IEEE, vol. 2, 2002, pp. 204-207.
[12] WEKA: http://www.cs.waikato.ac.nz/ml/weka.
[13] F. Provost and T. Fawcett, "Robust classification for imprecise environments", Machine Learning, 42, 2001, pp. 203-231.
[14] T. C. W. Landgrebe, A. P. Bradley, et al., "Precision-recall operating characteristic (P-ROC) curves in imprecise environments", IEEE, 2006.
[15] N. Kerdprasop and K. Kerdprasop, "Data Partitioning for Incremental Data Mining", in Proc. 1st Int. Forum on Information and Computer Technology, Jan. 9-10, 2003, Shizuoka University, Hamamatsu, Japan, pp. 114-118.
