
ALADIN: Active Learning of Anomalies to Detect Intrusion

Jack W. Stokes, John C. Platt
Microsoft Research
One Microsoft Way
Redmond, WA 98052
Email: {jstokes,jplatt}@microsoft.com

Joseph Kravis
Microsoft Network Security
Redmond, WA 98052 USA
Email: [email protected]

Michael Shilman
ChatterPop, Inc.
Email: [email protected]

March 4, 2008

Technical Report MSR-TR-2008-24


Abstract

This paper proposes using active learning combined with rare class discovery and uncertainty identification to statistically train a network traffic classifier. For ingress traffic, a classifier can be trained for a network intrusion detection or prevention system (IDS/IPS), while a classifier trained on egress traffic can detect malware on a corporate network. Active learning selects “interesting traffic” to be shown to a security expert for labeling. Unlike previous statistical misuse or anomaly-detection-based approaches to training an IDS, active learning substantially reduces the number of labels required from an expert to reach an acceptable level of accuracy and coverage.

Our system defines “interesting traffic” in two ways, based on two goals for the system. The system is designed to discover new categories of traffic by showing the analyst examples of traffic that do not fit a pre-existing model of a known category of traffic. The system is also designed to accurately classify known categories of traffic by requesting labels for examples which it cannot classify with high certainty. Combining these two goals overcomes many problems associated with earlier anomaly-detection-based IDSs.

Once trained, the system can be run as a fixed classifier with no further learning. Alternatively, it can continue to learn by labeling data on a particular network. In either case, the classifier is efficient enough to run in real-time for an IPS.

We tested the system on the KDD-Cup-99 Network Intrusion Detection dataset, where the algorithm identifies more rare classes with approximately half the number of labels required by previous active-learning-based systems. We have also used the algorithm to find previously unknown malware on a large corporate network from a set of firewall logs.

1 Introduction

With the proliferation of on-line network attacks and new forms of malware, computer security is an increasingly important area of research. Network intrusion detection involves identifying malicious activity by analyzing the transfer of packets across a computer network, while system and security events generated by a computer’s operating system can be used to detect malware or unauthorized logons on individual computers (i.e. host intrusion detection). In the past, two styles of intrusion detection and prevention systems (IDS/IPS) have been proposed in the literature: misuse systems and anomaly detection. Misuse IDSs can either be rule-based, where an expert analyst designs rules to match known traffic patterns [1], or statistically based, where rules (i.e. classifier weights) are learned from the data itself [2]. Anomaly detection systems typically learn a statistical model of normal traffic, and anomalies are identified as traffic patterns which do not fit the normal model [3, 4]. Hybrid models attempt to merge both styles of intrusion detection.

In the past, anomaly detection systems have performed poorly for a number of reasons. However, given the increasing complexity of network traffic in an adversarial world, we believe that, in conjunction with a misuse system, anomaly detection must be employed in order to discover new network attacks masquerading as legitimate network transfers. One main problem of previously proposed anomaly detection algorithms is that, given the huge number of different types of traffic on the network or in a computer, it is extremely difficult to design a single model to represent all normal traffic. Therefore, anomalies must be identified within each separate class of traffic [5]. Furthermore, we believe that misuse detection can also benefit from classifying known classes of normal traffic in addition to malicious traffic.

Another problem with standard anomaly detection systems is that rare events are often uninteresting. Typically, an IDS based on anomaly detection presents analysts with rare events, assuming that rare events are the result of intrusions. However, “rare” is not necessarily synonymous with “interesting.” In our experience with network packet logs, we have found many examples of normal traffic that have unusual features. We need a system to filter out these rare but uninteresting items in order to find the very rare but interesting samples, such as a zero-day worm.

Expert analysts have an amazing ability to identify strange traffic patterns within a cluster of traffic with similar characteristics. Training both rule-based and statistically based systems requires a huge amount of time from expert analysts, either designing rules for the former or labeling traffic for the latter. Employing security experts is difficult and expensive; thus, in addition to achieving the highest possible detection rate, minimizing the amount of time required by the analysts is a primary objective in training an IDS/IPS. Active learning is a methodology for statistically based systems where the algorithm identifies the best samples to label, thereby producing the highest classification accuracy with the least amount of human effort [6]. Active learning provides a tool to deal with the huge amounts of computer or network traffic in corporate computer networks. For example, in October 2006, corporate proxy firewalls from Microsoft Corp. running Internet Security and Acceleration (ISA) Server logged up to 25 million entries of outbound network packets on a single day to a SQL database, where an entry corresponds to a corporate computer sending data to another computer located outside of the corporate intranet. The ISA logs contained only 10% of the total amount of outbound firewall traffic for any given day: analysts could be responsible for examining billions of events per month.

Recently, active learning has been used to minimize the number of samples (i.e. human effort) required for anomaly detection [7]. As the number of distinct classes of network traffic continues to increase, identifying anomalies within each class of traffic requires more and more labeled data. Again, active anomaly detection significantly reduces the effort required by analysts in order to discover the “rare” and interesting traffic which is most likely to be new instances of network attacks.

This paper proposes ALADIN: a synthesis of the active learning approach for classification and active anomaly detection. ALADIN stands for “Active Learning of Anomalies to Detect INtrusions” and is applicable to both host-based and network intrusion detection. Instead of presenting a firehose of anomalies to a security analyst, we first present a small number. The analyst then labels these into categories. We use these labels to update both a classifier and an anomaly detector. We repeat alternating labeling and learning: once an item gets labeled, items similar to it are no longer anomalies, and hence get classified correctly. This allows analysts to quickly filter out common categories of traffic and identify rare anomalies that are new security risks. The main contribution of this paper is a new framework that both quickly finds new classes of traffic using anomaly detection and also creates classifiers with high prediction accuracy.

Network intrusion detection systems try to detect inbound attacks from ingress network traffic. A related problem is searching for computers infected with malware which are transmitting data outside of a company, by analyzing egress traffic. The ALADIN algorithm can also be used to design classifiers based on egress network traffic features. In this paper, we describe the design and implementation of a malware classification system based on ALADIN.

To be clear, the combination of classification and anomaly detection is only used during the training phase. For real-time (IPS) or off-line (IDS) detection, the classifier’s weights are used as a statistical misuse system. Weights for a general-purpose IDS/IPS can be generated by a relatively small number of expert security analysts working for the IDS manufacturer who label data for a large number of customers, as shown in figure 1. As new samples are labeled, updated classification weights can be downloaded, much like signatures for anti-virus engines are continuously updated today. Most small to medium size businesses should be able to use the general-purpose weights with excellent performance. However, the framework is extensible; for those companies which use the IDS/IPS and employ a small number of analysts who understand the specific nature of their corporate computing environment, the general-purpose weights can further be adaptively trained based on any labeled samples provided by the end-user analysts. When new general-purpose weights are available, samples previously labeled by the end-user can be used again to re-adapt the new general-purpose weights.

Figure 1: General Classifier Training with Optional End-User Adaptive Training.

We first review details of related work in active anomaly detection (section 2.1), in using active learning for an IDS (section 2.2), and in multi-mode and hybrid intrusion detection (section 2.3). We then present our proposed system for combining active learning, anomaly detection, and high classification accuracy in section 3. We describe our experiences with the use of the ALADIN algorithm for detecting malware on a corporate network from firewall logs of outbound network packets in section 4, and show experimental results for network intrusion detection on the KDD-Cup-99 data set in section 5. We present conclusions and propose future research topics in section 6.

2 Related Work

2.1 Active Anomaly Detection

Recent work in the machine learning community has investigated algorithms that combine active learning and anomaly detection. The machine learning algorithms discussed in this paper act on generic items, such as the entries in an event log. Each item is represented as a vector of features, \vec{x}, that describes a set of traffic (see section 4 for our definition of features). Features can be either continuous variables or a discretization of a categorical variable.

Pelleg and Moore [7] proposed a system for active anomaly detection.

They view the goal of anomaly detection as the detection, with as few samples as possible, of all of the categories of data in an unlabeled data set. Thus, they share one of the goals of ALADIN: the equivalent task of discovering new types of network attacks or malware.

The key insight in Pelleg and Moore is that anomalies can be disregarded once they are labeled by an expert and accounted for by a model. Therefore, they maintain a model of input vectors \vec{x}: P(\vec{x}|i), one model for each class label i. They measure anomalousness by how unlikely a data point is according to its class’ model. In Pelleg and Moore, an unlabeled data point undergoes four steps:

1. The class label is estimated by finding the most likely class via

c = \arg\max_i P(i|\vec{x}) = \arg\max_i \frac{P(\vec{x}|i)\,P(i)}{\sum_j P(\vec{x}|j)\,P(j)}   (1)

where P(i|\vec{x}) is the probability of a set of traffic described by feature vector \vec{x} being in class i, P(\vec{x}|i) is a model of traffic in class i, i.e. the probability of observing \vec{x} given that we are certain that the traffic is in class i, and P(i) is the prior probability of observing traffic in class i (without observing any features).

2. The anomaly score for an item is −log P(\vec{x}|c).

3. Items with high anomaly scores are shown to an expert for labeling.

4. The models P(\vec{x}|i) are updated with the new labeled data. Repeat.

Pelleg and Moore propose several different ways of selecting items in step 3. The method they favor (called interleave) is to select a fixed number of outliers per class, where the outliers for each class are those with the highest anomaly scores.

Note that Pelleg and Moore’s work is optimized for finding anomalies as quickly as possible. The classifier used in step 1 of their algorithm, trained on the data labeled in step 3, may not provide the highest-accuracy classification results.
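As a concrete illustration, the classify-then-score steps above can be sketched with one-dimensional per-class Gaussian models; the class names, means, and priors below are a toy setup of our own, not Pelleg and Moore’s implementation:

```python
import math

# Hypothetical per-class models: each class i has a 1-D Gaussian P(x|i)
# with a prior P(i). A real system would model full feature vectors.
models = {
    "web":  {"mean": 80.0, "var": 25.0, "prior": 0.9},
    "scan": {"mean": 5.0,  "var": 4.0,  "prior": 0.1},
}

def likelihood(x, m):
    """Gaussian density P(x | class)."""
    return math.exp(-(x - m["mean"]) ** 2 / (2 * m["var"])) / math.sqrt(2 * math.pi * m["var"])

def classify(x):
    """Step 1: most likely class via Bayes rule, eq. (1)."""
    post = {i: likelihood(x, m) * m["prior"] for i, m in models.items()}
    z = sum(post.values())
    return max(post, key=post.get), {i: p / z for i, p in post.items()}

def anomaly_score(x):
    """Step 2: -log P(x | c) under the predicted class's own model."""
    c, _ = classify(x)
    return -math.log(likelihood(x, models[c]))

# Step 3 would show the highest-scoring unlabeled items to an expert;
# 40.0 fits neither class well, so it ranks first.
ranked = sorted([78.0, 6.0, 40.0], key=anomaly_score, reverse=True)
```

Once the expert labels the anomalous items, step 4 refits the per-class models and the loop repeats.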

Another active-learning-like algorithm for anomaly detection was presented in [21]. That algorithm used active learning as a method for creating diverse anomaly detectors: the number of labeled samples presented to an expert may become quite large.


2.2 Active Learning for Intrusion Detection

Another related work is that of Almgren and Jonsson [6], which uses active learning to make an accurate, statistical misuse intrusion detector. Almgren and Jonsson use a form of active learning [22] where an expert labels unlabeled items that lie near the decision boundaries of a classifier. The items near the decision boundary are those that the classifier is most uncertain about; therefore, labeling them has the most value. In addition, the highly accurate Support Vector Machine (SVM) classifier only pays attention to items near the decision boundary: points far away are ignored. In [6], Almgren and Jonsson show that active learning provides much better results for classification accuracy in an IDS when compared to random sampling.

The Almgren algorithm operates in the following way:

1. A two-class radial basis function (RBF) SVM is evaluated on each data point \vec{x}:

z = \sum_j \alpha_j y_j K(\vec{x}_j, \vec{x}) + b   (2)

where z is the score for the item belonging to the positive class, y_j is the label for a set of traffic j (+1 if positive, −1 if negative), (\vec{x}_j, y_j) are the labeled items, and \alpha_j is the result of the SVM algorithm. K(\vec{x}_j, \vec{x}) is a measure of the similarity between a new input vector \vec{x} and a stored memory vector \vec{x}_j: typically K is large when the two vectors are similar, and close to 0 if not.

2. The certainty score for an item is |z|.

3. Items with low certainty scores are shown to an expert for labeling.

4. The SVM is re-run with the new labeled data. Repeat.

Since Almgren’s algorithm is optimized to accurately predict the labels of the unlabeled samples, it will not typically find anomalies as quickly as Pelleg’s anomaly detection algorithm. The paper also only presented a 2-class algorithm; we show in section 5 that its performance is worse on multi-class problems.
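Steps 1–3 of the Almgren algorithm can be sketched as follows; the support vectors, multipliers \alpha_j, bias b, and kernel width are made-up stand-ins for the output of an actual SVM trainer:

```python
import math

# Hypothetical trained 2-class RBF SVM: (support vector x_j, label y_j, alpha_j).
support = [([1.0, 0.0], +1, 0.8), ([0.0, 1.0], -1, 0.8)]
b = 0.0
gamma = 1.0  # assumed RBF kernel width

def rbf(u, v):
    """K(u, v) = exp(-gamma * ||u - v||^2): near 1 for similar vectors, near 0 otherwise."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(u, v)))

def svm_score(x):
    """Eq. (2): z = sum_j alpha_j * y_j * K(x_j, x) + b."""
    return sum(alpha * y * rbf(sv, x) for sv, y, alpha in support) + b

def select_uncertain(unlabeled, n):
    """Steps 2-3: show the expert the n items with the lowest certainty |z|."""
    return sorted(unlabeled, key=lambda x: abs(svm_score(x)))[:n]

# The point equidistant from both support vectors sits on the boundary
# (z near 0), so it is the one queried for a label.
queries = select_uncertain([[1.0, 0.1], [0.5, 0.5], [0.0, 0.9]], 1)
```

Step 4 would then retrain the SVM on the enlarged labeled set and repeat.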

2.3 Multi-Mode and Hybrid Intrusion Detection

Typical anomaly detection algorithms consider two modes: normal and anomalous. By modeling normal behavior correctly, the goal is to identify anomalous behavior, which hopefully corresponds to detecting malicious activity. However, learning a single model representing all normal behavior is problematic and leads to high false positive rates, where anomalous but normal behavior is incorrectly predicted to be malicious. As in the algorithm proposed in this paper, Valdes [5] also approaches anomaly detection for intrusion detection using multiple classes, or modes. Valdes analyzes multiple modes for sequences corresponding to the categorical destination TCP port.

Recently, Lane [23] proposed a hybrid IDS based on combining a statistical misuse IDS with an IDS based on anomaly detection. This algorithm uses semi-supervised learning with a partially observable Markov decision process (POMDP). Lane’s algorithm is applied to host-based time series of UNIX commands.

3 Proposed Active Learning Algorithm for High Classifier Accuracy and Rare-Class Discovery

It is possible to simply run both Pelleg’s algorithm and Almgren’s algorithm to create an intrusion detection system which both finds new intrusions and refines the rules for existing known categories. However, a security analyst would then be bombarded with label requests: neither algorithm would cooperate or share labels.

Therefore, we propose the ALADIN algorithm: a single intrusion detection framework for both anomaly detection and classification. In ALADIN, the labels provided by the experts update both the classifier and the anomaly detector. The classifier finds uncertain items, which get labeled by an analyst. The anomaly detector consists of one model per class, where items that do not fit the corresponding model are considered anomalous. Anomalous items are also labeled by the analyst. The overall architecture of ALADIN is shown in figure 2.

3.1 Learning the Classifier

The first stage of the proposed algorithm is illustrated in the left-hand side of figure 3. After the analyst has finished labeling one or more of the samples proposed by the active learning algorithm, illustrated by the larger dots in the figure, a multi-class, discriminative classifier is trained. In our work, we choose to use logistic regression [24], which learns models of the form

P(\text{class } i \,|\, \vec{x}) = \frac{1}{1 + \exp\left(-\left(\sum_j w_{ij} x_j + b_i\right)\right)}   (3)

Figure 2: ALADIN algorithm for detecting malware from intrusion detection logs.

where x_j is the value of the jth feature of an item. Logistic regression chooses the parameters w_{ij} and b_i to minimize the cross-entropy loss function E [24]

E = -\sum_n \sum_{i=1}^{I} \left[ t_{in} \log P(i|\vec{x}_n) + (1 - t_{in}) \log\left(1 - P(i|\vec{x}_n)\right) \right]   (4)

where t_{in} is 1 if the nth input vector (\vec{x}_n) is in class i and 0 otherwise, and I is the total number of classes.

Logistic regression is chosen so that the classifier can be quickly retrained for each labeling iteration, without forcing the analyst to wait for a lengthy training algorithm. Other classifiers, such as a linear SVM or an RBF SVM [24], can also be used.
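A minimal sketch of this training step, fitting one row of weights of eq. (3) by stochastic gradient steps on the cross-entropy loss (4); the toy data and learning rate are our own choices, not the paper's production settings:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, t, lr=0.5, epochs=2000):
    """Fit the weights w_i and bias b_i of one class in eq. (3) by
    minimizing that class's term of eq. (4); the gradient of E with
    respect to w_ij is sum_n (P(i|x_n) - t_in) * x_nj, applied here
    one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = p - target
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b

# Toy data: a single binary feature separates the class perfectly.
X = [[0.0], [0.0], [1.0], [1.0]]
t = [0, 0, 1, 1]
w, b = train_logistic(X, t)
p1 = sigmoid(w[0] * 1.0 + b)  # P(class | x = 1), driven toward 1
```

On such a small labeled set the retraining is nearly instantaneous, which is the property that makes per-iteration retraining practical for the analyst.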

Note that for features x_j that are categorical (i.e., their domain is a discrete set), ALADIN uses the standard one-of-N encoding [6], where each categorical feature is represented by N binary inputs to the logistic regression. Only one of these inputs has a value of one, corresponding to the feature’s category; the rest are zero. Continuous features are scaled so that the mean over the entire (labeled + unlabeled) data set is zero and the variance is one.
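Sketched in code, with an invented category set and invented continuous values for illustration:

```python
def one_of_n(value, categories):
    """Encode a categorical value as N binary inputs, exactly one set to 1."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize(values):
    """Scale a continuous feature to zero mean, unit variance over all data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = var ** 0.5 or 1.0  # guard against a constant feature
    return [(v - mean) / sd for v in values]

ports = ["http", "smtp", "dns"]           # assumed category set
encoded = one_of_n("smtp", ports)         # -> [0.0, 1.0, 0.0]
scaled = standardize([10.0, 20.0, 30.0])  # mean 0, variance 1
```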

In order to increase the accuracy of the logistic regression, ALADIN selects items for analyst labeling that have high uncertainty. That is, ALADIN chooses items with a low certainty score:

\min_{j \neq i} \left| P(i|\vec{x}_n) - P(j|\vec{x}_n) \right|   (5)

where i = \arg\max_k P(k|\vec{x}_n). Traffic with a low certainty score has the property that ALADIN has difficulty assigning the traffic to one class, because two classes are almost equally likely.

[Figure 3 graphic: a map of feature values showing analyst-labeled items for MSN Messenger, Outlook, and Skype; learned decision boundaries; unlabeled items; samples closest to the hyperplanes; and anomalies (potential malware) for which the analyst is asked for labels.]

Figure 3: The left figure shows the result of training a classifier: decision boundaries in input space are hyperplanes. The right figure shows the selection of the anomalies by the per-class models.

3.2 Anomaly Detection

The anomaly detection of ALADIN is similar to that used in [7]. A model of all (labeled + unlabeled) data is built for each class. For each unlabeled item, the classifier is applied and the unlabeled item is temporarily assigned to its most likely class.

A model is then trained using all of the items assigned to a class. ALADIN uses naive Bayes for the model, which assumes that all of the input features are statistically independent (given an item in a class). Thus, the probability of an item is the product of the probabilities of each feature. Equivalently, an anomaly score can be computed by taking the negative of the sum of the logs of the probabilities (i.e. the log-likelihood) of each feature:

-\log P(\vec{x} \,|\, \text{class } c) = -\sum_j \log P(x_j \,|\, \text{class } c)   (6)

where a large anomaly score indicates a low-probability (anomalous) item. We build a model of each feature independently. Traffic tends to be anomalous if at least one feature is out of the ordinary; the more features that are strange, the higher the anomaly score. In addition, the more unusual a feature is, the higher the anomaly score. For anomaly detection, ALADIN presents to the analyst the unlabeled items with large anomaly scores for each class.

The form of the feature probability P(x_j|c) depends on the type of feature. If the feature is continuous, the probability is a Gaussian distribution whose mean and variance are measured over all items assigned to class c. For a categorical feature, the probability is a histogram (a multinomial) estimated over all items assigned to class c.
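A sketch of this per-class anomaly score over one continuous and one categorical feature; the traffic values and the small floor probability for unseen categories are our own assumptions:

```python
import math

def gaussian_logp(x, mean, var):
    """Log of a Gaussian density, for a continuous feature."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def histogram_logp(value, counts, floor=1e-6):
    """Log of a multinomial (histogram) estimate, for a categorical feature.
    Unseen categories get an assumed floor probability instead of zero."""
    total = sum(counts.values())
    return math.log(counts.get(value, 0) / total or floor)

def anomaly_score(item, model):
    """Eq. (6): negative sum of per-feature log probabilities under class c."""
    score = -gaussian_logp(item["bytes"], model["bytes_mean"], model["bytes_var"])
    score += -histogram_logp(item["port"], model["port_counts"])
    return score

# Hypothetical model for one class, fit from the items assigned to it.
model = {"bytes_mean": 500.0, "bytes_var": 100.0 ** 2,
         "port_counts": {"http": 90, "dns": 10}}

typical = anomaly_score({"bytes": 510.0, "port": "http"}, model)
weird = anomaly_score({"bytes": 9000.0, "port": "ssh"}, model)  # much larger
```

Both features of the second item are out of the ordinary, so its score is driven up by both terms of the sum, matching the behavior described above.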

In order to search for new malware or styles of intrusion, ALADIN selects a number of anomalies per class to show to a security analyst for labeling. These are the items that are far away from the centers of the clusters in the right side of figure 3: these items do not fit into the current model of a class and hence are potential new styles of intrusion or malware.

At every labeling iteration, ALADIN presents the same number of items to an analyst for labeling, evenly distributed between all known categories: half of the items are uncertain, the other half are anomalous. Thus, if all samples were labeled in the order suggested by the algorithm, the analyst would spend half of the time improving the classification performance and the other half finding potential new problems. It should be noted that anomalous items which lie near the decision boundaries may also be considered to be uncertain items. These items may belong to a different class, and hence are labeled incorrectly, or may belong to a new, previously unlabeled class. However, the opposite may not be true: anomalous items which are located far from the decision boundaries may turn out to be new categories which require a new class label, but most likely do not belong to an existing class.

Instead of ranking based on (6) for anomaly detection, another potential algorithm could rank anomalousness based on the output (posterior) probability, P(c|\vec{x}), resulting from the classification stage. This method does not tend to perform well, since the output probability is a function of the classification boundaries and tends to be dominated by the distance to the nearest boundary. As a result, this type of algorithm may fail to accurately detect anomalies located far away from the boundaries.

Sometimes, the logistic regression identifies very few unlabeled items as belonging to a particular class (due to a severe lack of labeled data). In this case, there are fewer uncertain or anomalous items in the rare category compared to other categories. When this happens, ALADIN simply ranks all unlabeled data in order of the logistic regression output P(c|\vec{x}), and provides the most likely items for analyst labeling. This balances the number of labels per category and ensures that very rare categories do not get starved for labels during active learning.

The summary of one iteration of ALADIN’s training phase is shown infigure 4.

1. Learn a classifier by minimizing the loss (4) on the labeled samples.

2. Evaluate the classifier (3) and the certainty score (5) for the unlabeled samples.

3. Assign all unlabeled samples to the most likely category.

4. Compute the model parameters (mean, variance, or histogram) for every P(x_j|c).

5. Compute the anomaly score (6) for all unlabeled samples.

6. Select the next group of samples to be labeled as follows, sweeping through the categories:

(a) select the next unlabeled sample with the highest anomaly score (6) in each class;

(b) OR, select the sample with the smallest certainty score (5) for each class;

(c) if not enough samples for a class are found from 6(a) or 6(b), select the unlabeled sample with the next-highest output probability P(c|\vec{x}) for the desired class c.

7. Repeat step 6 until the desired number of samples have been labeled for the iteration.

Figure 4: One iteration of ALADIN training, as pseudo-code
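The selection sweep in step 6 of figure 4 can be sketched as follows, with each unlabeled item carrying the scores pre-computed in steps 2–5; the item structure and tie-breaking details are our own simplification of the algorithm:

```python
# Each unlabeled item: id, predicted class, certainty (5), anomaly (6),
# and the full posterior P(c|x) per class.
def select_batch(items, classes, budget):
    """Sweep the classes, alternating anomalous (6a) and uncertain (6b)
    picks, falling back to the highest remaining posterior (6c), until
    `budget` items are chosen for analyst labeling."""
    chosen, used = [], set()
    pick_anomaly = True
    while len(chosen) < budget:
        for c in classes:
            if len(chosen) >= budget:
                break
            pool = [it for it in items if it["pred"] == c and it["id"] not in used]
            if pool:
                key = (lambda it: -it["anomaly"]) if pick_anomaly else (lambda it: it["certainty"])
                pick = min(pool, key=key)
            else:
                # 6(c): no items predicted as c; take the remaining item
                # most likely to belong to c.
                rest = [it for it in items if it["id"] not in used]
                if not rest:
                    return chosen
                pick = max(rest, key=lambda it: it["post"][c])
            used.add(pick["id"])
            chosen.append(pick)
        pick_anomaly = not pick_anomaly  # alternate 6(a) and 6(b) sweeps
    return chosen

items = [
    {"id": 1, "pred": "web", "certainty": 0.9, "anomaly": 1.0, "post": {"web": 0.95, "rare": 0.05}},
    {"id": 2, "pred": "web", "certainty": 0.1, "anomaly": 2.0, "post": {"web": 0.55, "rare": 0.45}},
    {"id": 3, "pred": "web", "certainty": 0.8, "anomaly": 9.0, "post": {"web": 0.90, "rare": 0.10}},
]
batch = select_batch(items, ["web", "rare"], 3)
```

In this toy run the most anomalous "web" item is taken first, the starved "rare" class is fed its most likely candidate via 6(c), and the least certain remaining item follows on the uncertainty sweep.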

3.3 Adaptive Training for the End-User Analyst

The framework can evaluate new samples in two modes, as shown in figure 1: a standalone mode, where the generic classifier weights are provided by the third party which developed the system, and an adaptive mode, where an end-user analyst provides additional labeled samples (uncertain or anomalous) using the active learning algorithm proposed above. To be clear, additional training by the end-user analyst is optional, and very good results can be obtained without further adaptive training.

3.4 Evaluating New Samples

After the weights for the linear classifier have been trained using logistic regression, new samples can be evaluated in an IDS or IPS. In both standalone and adaptive modes, all samples are evaluated using (3). Although the input vector can have over 100,000 dimensions, the data is sparse and (3) can be evaluated very efficiently. As a result, evaluation can be performed in real-time for an IPS, or off-line for an IDS which analyzes large log files.
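A sketch of this sparse evaluation, storing only the nonzero features of an item; the feature names and weight values are invented for illustration:

```python
import math

def evaluate(weights, bias, item):
    """Eq. (3) over a sparse item: only the nonzero features contribute,
    so the sum touches a handful of weights even in a 100,000-dimensional
    feature space."""
    z = sum(weights.get(f, 0.0) * v for f, v in item.items()) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for one class; all unlisted features weigh zero.
weights = {"port=http": 1.2, "port=ssh": -2.0, "bytes": 0.004}
bias = -0.5

p = evaluate(weights, bias, {"port=http": 1.0, "bytes": 250.0})
```

Because the evaluation cost scales with the number of nonzero features rather than the dimensionality, the same weights serve both the real-time IPS path and off-line log analysis.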

3.5 Relationship to the Interleave and Mix-Ambig-Lowlik Methods

In [7], Pelleg and Moore propose four algorithms: lowlik, ambig, mix-ambig-lowlik, and interleave. The first three algorithms find anomalies considering all of the data at once. Lowlik identifies anomalies as the samples with the lowest likelihood among all of the samples, while ambig proposes the anomalies which are most ambiguous (i.e. samples closest to the decision boundaries), again on a global scale. Mix-ambig-lowlik is a hybrid method which combines both of these strategies over all data samples. By contrast, the interleave method proposes samples with low likelihood for each individual class. Thus interleave is similar to step 6(a) in ALADIN. Iterating between samples with low likelihood and high uncertainty in 6(a) and 6(b) is similar in nature to Pelleg’s mix-ambig-lowlik, but differs in the following way: ALADIN combines the selection of the two types of samples (i.e. anomalous and uncertain) on a per-class basis, while mix-ambig-lowlik mixes the two types of samples across the entire data set. Just as Pelleg and Moore show that the interleave method outperforms the other methods by searching for rare classes on a per-class basis, ALADIN extends this notion by combining both strategies in the active learning scenario on a per-class basis.

4 Malware Detection from Firewall Logs of Outbound Network Traffic

A prototype, large-scale malware detection system has been implemented based on ALADIN. Security analysts have used this system to analyze daily corporate network transmission logs from Microsoft Corp. generated by the proxy firewalls using Microsoft’s ISA Server. ISA allows network packet metadata to be collected and stored in SQL databases. The metadata corresponding to the packets are the basis of the features used in the system and are a subset of those captured by the standard ISA logs [25].

The data from each of the separate SQL databases collected by the approximately twenty corporate network ISA servers is aggregated on a daily basis based on Greenwich Mean Time (GMT). The resulting combined daily log file for the corporation is then preprocessed, using the currently available models created from any previously labeled data, to predict a label for each of the unlabeled samples for a particular day.

4.1 Reviewing the Current Results

Scalability is an issue when dealing with extremely large data sets such as those produced by the corporate ISA logs. To combat this problem, the analyst first selects the top N (e.g. 1000) samples to label, where N is a user-defined input, which have been ranked according to the proposed algorithm. As explained previously, approximately N/2 samples are used for anomaly detection and N/2 samples are used for improving classification accuracy. Analyzing the top N samples addresses two issues. First, once the samples have been selected initially, any subsequent labeling of data, updating of the models, and re-ranking of the results is only performed on this initial group of samples. This architecture allows a fast, interactive experience when the analyst labels new samples. Second, the analyst would be overwhelmed by being presented with all of the samples to be labeled at once. By providing only the top N samples for labeling, the analyst can select a comfortable number of samples to analyze.

Another issue to explore is whether or not an automatic stopping criterion should be provided in the algorithm for presenting items from a particular class to the analyst for labeling. We believe that, with the analyst in the loop, any labeled data can always be used to improve the algorithm. The algorithm will always offer new samples to label if the analyst is so inclined. If the algorithm automatically decided to hide a class from further review, it is possible that new malware whose features lie in the same space as a previously labeled class would be hidden with no chance for discovery. Furthermore, using the current UI, the analyst can choose to label any item they desire in the top N list: they can choose to avoid labeling new items if a class seems to be well formed. Therefore, we do not suggest providing an automatic stopping criterion.


For a large number of classes, the analyst may be presented with only a single item from some classes and no items from other classes. Consider the case where N = 1000 but there are currently 2000 labeled classes; the question becomes which items should be presented to the analyst to label. In this case, we suggest providing the analyst with the N samples corresponding to the largest negative anomaly scores for one iteration and the N samples with the smallest uncertainty scores for the next iteration.

4.2 Reprocessing the Data

During a labeling session, any updated models are applied only to the unlabeled samples from the original top N samples. The analyst also has the option of reprocessing all of the unlabeled samples from the daily log files. In the current implementation, this requires an hour or so depending on the size of the file, although much of this time is used to update the predicted label and anomaly score that are also stored in the daily ISA SQL table. If the analyst exits the tool, they return to analyzing the same top N samples ranked initially by the algorithm or during the most recent reprocessing step.

4.3 Implementation Details

The application, including all of the underlying machine learning algorithms, is written in C#. The UI consists of a DataGridView on a Windows Form which is used to display the rank, user label, predicted label, and the features. The DataGridView allows the analyst to pivot on any column (rank, label, feature), providing additional insight into the ranking process. The application is multi-threaded: labeling one or more samples causes the application to create an additional thread which is used to update the models based on the new class labels.
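The retrain-on-label pattern described above can be sketched as follows. This is illustrative Python rather than the paper's C#, and `train_fn` is a hypothetical stand-in for fitting the underlying models:

```python
import threading


class ModelUpdater:
    """Retrain the models on a background thread so the labeling UI
    stays responsive; a lock ensures only one retraining pass runs at
    a time."""

    def __init__(self, train_fn):
        self.train_fn = train_fn   # e.g. fits the classification models
        self.model = None
        self._lock = threading.Lock()

    def on_label(self, labeled_samples):
        # Called from the UI thread whenever the analyst labels samples;
        # returns immediately while retraining proceeds in the background.
        t = threading.Thread(target=self._retrain,
                             args=(labeled_samples,), daemon=True)
        t.start()
        return t

    def _retrain(self, labeled_samples):
        with self._lock:
            self.model = self.train_fn(labeled_samples)
```

The design choice mirrors the paper's: model updates never block the analyst, who can continue labeling while the classifier refreshes.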

4.4 Malware Detection

We have applied ALADIN to several daily logs averaging 13 million captured events per day. During testing of the tool, a new trojan ("5.exe") was discovered on the network which had not been previously identified by the corporate rule-based misuse NIDS. In addition, other worms and trojans were also identified as anomalies during various iterations: these additional malware samples were also identified by the Network Security team's rule-based misuse NIDS but were waiting to be classified by analysts.


5 Network Intrusion Detection Experiments and Results

In this section, we analyze the components of ALADIN in the network intrusion detection setting to show that the combination of classification and anomaly detection is required to attain high accuracy quickly. We evaluate ALADIN's performance by comparing it to two different versions of Almgren's active learning intrusion detection system [6]. The first version simply drops the anomaly detection from ALADIN, while the second version uses RBF SVMs, to be more similar to [6]. Following [6], the SVM uses C = 1 and a Gaussian kernel with σ² = 0.5.
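For reference, the Gaussian kernel with σ² = 0.5 used by the SVM baseline can be written directly. This sketch assumes the common exp(−‖x−y‖²/(2σ²)) parameterization; [6] is not quoted here on its exact convention:

```python
import math


def rbf_kernel(x, y, sigma_sq=0.5):
    """Gaussian (RBF) kernel as used by the SVM baseline:
    K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), with sigma^2 = 0.5."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist_sq / (2.0 * sigma_sq))
```

Under this convention, σ² = 0.5 corresponds to γ = 1/(2σ²) = 1.0 in libraries that parameterize the kernel as exp(−γ‖x−y‖²).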

To compare the algorithms, we use the data from the 1999 KDD-Cup contest for network intrusion detection systems [26]. Although the data set is somewhat dated, we use the packet labels to conduct oracular experiments, which are required to simulate how the algorithm performs with perfect knowledge of the class labels. These labels allow the comparison of the proposed active learning algorithm and the active learning algorithm presented in [6]. To be consistent with [6], all features from the 1999 KDD-Cup data are used: no feature selection was performed, which might otherwise give an unfair advantage to the proposed method.

Specifically, the first 100,000 samples from the file kddcup.data_10_percent are used, and the distribution of the data is shown in Table 1. As the table shows, the data contains three common classes of traffic (normal, smurf, neptune) and seventeen rare classes. From a small number of labeled samples, ALADIN attempts to identify labels for all of the unlabeled samples. As a result, the algorithm may quickly identify all packets belonging to a single attack. For example, for a single Neptune SYN flood attack, the algorithm attempts to label all packets in the attack with the same label. If a variant of a particular attack occurs, a new label can be created representing the new attack example.

Class     Count   Class          Count   Class          Count   Class             Count
normal    56237   satan            539   pod               40   multihop              6
neptune   20482   portsweep        278   warezmaster       20   buffer_overflow       5
smurf     19104   nmap             231   land              17   phf                   3
back       2002   teardrop         199   imap              12   loadmodule            2
ipsweep     760   guess_passwd      53   ftp_write          8   perl                  2

Table 1: Distribution of classes in the first 100,000 samples of kddcup.data_10_percent


To begin, 10 randomly chosen samples are labeled before running the first iteration of the algorithm for all experiments in this section. For all experiments, 100 samples are labeled in each iteration so that after 10 iterations, approximately 1% of the samples have been labeled. Choosing 100 samples to label per iteration corresponds to approximately 5-10 samples labeled per class, depending on the number of previously labeled classes.

5.1 Anomaly Detection

In this first experiment, we seek to compare how quickly the algorithms discover classes in the data set. The results are illustrated in Figure 5 for ALADIN and the two supervised algorithms using the KDD-Cup 99 data set. Three classes were sampled in the original 10 labeled samples. ALADIN quickly identifies six new classes in the first iteration and eighteen of the twenty classes after the fourth iteration of the algorithm. The other two algorithms, labeled "Logistic Regression" and "SVM", are standard active learning algorithms without the additional anomaly detection stage; dropping anomaly detection from ALADIN degrades the detection performance of the system (as expected). Using an RBF SVM identifies only three classes in addition to the three in the original labeled data, while using logistic regression identifies sixteen classes. The results clearly show that ALADIN's use of active anomaly detection significantly outperforms standard active learning. Typically, adding the second stage of anomaly detection requires only half the number of samples to be investigated and labeled by an analyst compared to the next best alternative.

5.2 Error Rates for Classification of Unlabeled Data

Next, we analyze the error rates of the three algorithms in Figure 6, generated by comparing the predicted labels of the unlabeled data from the supervised classifier with the oracle labels from the data set. ALADIN performs extremely well, with an error rate ranging from 5.9% at iteration two to 3.1% at iteration 10. The other two algorithms also converge to low error rates of 3-4% due to their reliance on discriminative classifiers for prediction, but they exhibit high error probabilities in the initial iterations, during which a sufficient number of samples have not yet been labeled to train accurate class estimators. These large spikes in error rate correspond to a single classifier mispredicting one of the common attacks (e.g. smurf, neptune) as normal.
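The error rate plotted in Figure 6 is of the form sketched below: predicted labels on the remaining unlabeled samples are compared against the oracle labels from the data set (function and variable names are illustrative, not from the paper):

```python
def error_rate(predicted, oracle, labeled_idx):
    """Fraction of *unlabeled* samples whose predicted label disagrees
    with the oracle label; samples already labeled by the analyst are
    excluded from the computation."""
    labeled = set(labeled_idx)
    wrong = total = 0
    for i, (p, o) in enumerate(zip(predicted, oracle)):
        if i in labeled:
            continue  # analyst-labeled samples do not count
        total += 1
        wrong += (p != o)
    return wrong / total if total else 0.0
```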

In Tables 2 and 3, we investigate the false positive (FP) and false negative (FN) rates for the seven most prevalent classes. In this multiclass setting,


[Figure 5 plot: x-axis "Iteration" (0-9), y-axis "Number of Identified Classes" (0-25), with curves for ALADIN, Logistic Regression, and SVM.]

Figure 5: Number of identified classes using the proposed ALADIN algorithm and two versions of the supervised active learning algorithm.

the FP and FN rates are calculated for each class relative to all of the other classes. Tables 2 and 3 provide the results after labeling 10 iterations (1000 total samples) and 20 iterations (2000 total samples), respectively. These tables show that the asymptotic error rate for ALADIN in Figure 6 is mostly due to false negatives, whereby the anomalous attack classes are incorrectly labeled as normal. After labeling 1000 samples, ALADIN misclassifies most or all of the samples in the back and ipsweep classes. The classifier learns the ipsweep class but still misclassifies most of the back class after labeling 2000 samples.
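In this one-vs-rest formulation, the per-class rates reported in Tables 2 and 3 can be computed as sketched below (a minimal illustration; the function name is hypothetical):

```python
def per_class_rates(true_labels, pred_labels, cls):
    """One-vs-rest false-positive / false-negative rates for one class:
    FP rate (%) = FP / (# samples not truly cls) * 100,
    FN rate (%) = FN / (# samples truly cls) * 100."""
    fp = sum(1 for t, p in zip(true_labels, pred_labels)
             if t != cls and p == cls)
    fn = sum(1 for t, p in zip(true_labels, pred_labels)
             if t == cls and p != cls)
    pos = sum(1 for t in true_labels if t == cls)
    neg = len(true_labels) - pos
    fp_rate = 100.0 * fp / neg if neg else 0.0
    fn_rate = 100.0 * fn / pos if pos else 0.0
    return fp_rate, fn_rate
```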

After analyzing the KDD-Cup 99 data set, we believe that the normal class is overly broad and that the anomaly detection is often identifying subclasses within the normal class for more granular labeling. Thus, normal may not be easily classified with a single logistic regression classifier. ALADIN may work even better in practice, where analysts are able to further label normal traffic as belonging to a known class (e.g. MSN Messenger, Skype, etc.).


[Figure 6 plot: x-axis "Iteration" (1-10), y-axis "Error Rate (%)" (0-30), with curves for ALADIN, Logistic Regression, and SVM.]

Figure 6: Error rates for the three anomaly detection algorithms using active learning.

6 Conclusions and Future Research

This paper presents ALADIN, an active learning framework for creating an intrusion detection or prevention system. The ALADIN framework can also be used to detect malware from outbound network traffic. In prior work, active learning was used either to improve the accuracy of a misuse classifier or to detect new rare categories of traffic. ALADIN combines both of these styles of active learning into one hybrid framework: items provided to an analyst are used for both goals.

Experiments on the 1999 KDD-Cup data set show that ALADIN achieves both goals: high accuracy on known classes and good detection of unknown rare categories. Because ALADIN is based on logistic regression and naive Bayes, it can scale to process very large log files. We have tested ALADIN on real corporate network firewall logs that contain 13 million events.

Although the current UI provides an effective method for examining anomalies, we are exploring new visualization techniques for efficiently analyzing the logs. In this paper, we have not performed feature selection to understand which features improve ALADIN's ability to quickly identify anomalies and improve classification accuracy. In the future, we plan to investigate


True        Num Labeled   TP      FP     Incorrectly       FN     FP        FN
Label       Samples       Count   Count  Predicted Label   Count  Rate (%)  Rate (%)
normal      477           55745   3060   portsweep         9      6.59      0.0269
                                         guess_passwd      3
                                         ipsweep           3
neptune     47            20435   0      --                0      0.00      0.00
smurf       81            19018   0      normal            5      0.00      0.026
back        29            0       0      normal            1973   0.00      100
ipsweep     47            56      3      normal            657    0.003     92.14
satan       39            455     1      normal            45     0.001     9.000
portsweep   54            223     9      satan             1      0.009     0.446

Table 2: False Positive and False Negative Rates after 10 Iterations.

True        Num Labeled   TP      FP     Incorrectly       FN     FP        FN
Label       Samples       Count   Count  Predicted Label   Count  Rate (%)  Rate (%)
normal      923           55275   2306   back              39     5.11      0.071
neptune     137           20435   0      --                0      0.00      0.00
smurf       162           18942   0      --                0      0.00      0.00
back        111           7       39     normal            1884   0.00      99.6
ipsweep     157           593     0      normal            10     0.003     1.66
satan       111           389     0      normal            39     0.001     9.11
portsweep   98            180     0      --                0      0.00      0.00

Table 3: False Positive and False Negative Rates after 20 Iterations.

feature selection on the ISA data once we have a large number of labeled samples.

As with all algorithms which process extremely large databases, scalability is an issue that requires additional investigation. A tradeoff exists between the fast response required for an interactive algorithm such as ALADIN and processing more data in order to detect additional anomalies. In the current design, we focus on the most recent data in order to catch new malware outbreaks. However, if older data is stored, the architecture can be extended so that a separate thread processes the previous data in reverse time order (i.e. processing the most recent data first and working backwards towards the oldest data). Several open questions include the following. Does predicting the labels of the older, unlabeled samples affect the performance of the models on the most recent unlabeled data due to the often dynamic nature of malware? How well does the effectiveness of the algorithm scale with large numbers of classes? After long periods of labeling data, a system may have thousands of classes depending on the level of analysis desired by the analysts. Processing scalability should not be an issue since the algorithm can easily be implemented to run in parallel on


multiple processors. More investigation is needed to ensure that the predictive accuracy and the ability to discover anomalies do not degrade as the number of categories increases over time.

References

[1] M. Roesch, "Snort - lightweight intrusion detection for networks," in 13th USENIX Systems Administration Conference, 1999, pp. 229-238.

[2] W. Lee, S. Stolfo, and K. Mok, "A data mining framework for building intrusion detection models," in IEEE Symp. on Security and Privacy, 1999, pp. 120-132.

[3] A. Ghosh, A. Schwartzbard, and M. Schatz, "Learning program behavior profiles for intrusion detection," in Proc. 1st USENIX Workshop on Intrusion Detection and Network Monitoring, 1999, pp. 51-62.

[4] K. Sequeira and M. Zaki, "ADMIT: anomaly-based data mining for intrusions," in Proc. ACM SIGKDD Conf., 2002, pp. 386-395.

[5] A. Valdes, "Detecting novel scans through pattern anomaly detection," in Proc. DISCEX, 2003, pp. 140-151.

[6] M. Almgren and E. Jonsson, "Using active learning in intrusion detection," in Proc. IEEE Computer Security Foundations Workshop, 2004, pp. 88-98.

[7] D. Pelleg and A. Moore, "Active learning for anomaly and rare-category detection," in Proc. Advances in Neural Information Processing Systems, 2004, pp. 1073-1080.

[8] D. Endler, "Intrusion detection: Applying machine learning to Solaris audit data," in Proc. ACSAC, 1998, pp. 267-279.

[9] L. Portnoy, E. Eskin, and S. Stolfo, "Intrusion detection with unlabeled data using clustering," in Proc. ACM Workshop on Data Mining Applied to Security, 2001.

[10] I. Green, T. Raz, and M. Zviran, "Analysis of active intrusion prevention data for predicting hostile activity in computer networks," in Communications of the ACM, 2007, pp. 63-68.


[11] Z. Zhang and H. Shen, "Online training of SVMs for real-time intrusion detection," in Proc. Int. Conf. on Advanced Information Networking and Application, 2004, pp. 568-573.

[12] Y. Liao and V. R. Vemuri, "Use of k-nearest neighbor classifier for intrusion detection," in Computers and Security, 2002, pp. 439-448.

[13] Z. Liu, J. Almhana, V. Choulakian, and R. McGorman, "Online EM algorithm for mixture with application to internet traffic modeling," in Computational Statistics and Data Analysis, no. 4, 2005, pp. 1052-1071.

[14] D. Gao, M. K. Reiter, and D. Song, "Behavioral distance for intrusion detection," in Proc. RAID, 2006, pp. 63-81.

[15] D. Denning, "An intrusion detection model," in IEEE Trans. on Software Engineering, 1987, pp. 222-232.

[16] Q. Yin, R. Zhang, and X. Li, "A new intrusion detection method based on linear prediction," in Proc. International Conf. Computer Security, 2004, pp. 160-165.

[17] Y. Gu, A. McCallum, and D. Towsley, "Detecting anomalies in network traffic using maximum entropy estimation," in Internet Measurement Conference, 2005, pp. 345-350.

[18] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar, "A comparative study of anomaly detection schemes in network intrusion detection," in Proceedings SIAM Conference on Data Mining, 2003.

[19] Z. Anming and J. Chunfu, "Study on the applications of hidden Markov models to computer intrusion detection," in Proceedings World Conference on Intelligent Control, 2004, pp. 4352-4356.

[20] X. Li and N. Ye, "Decision tree classifiers for computer intrusion detection," in Parallel and Distributed Computing Practices, 2001, pp. 179-190.

[21] N. Abe, B. Zadrozny, and J. Langford, "Outlier detection by active learning," in Proc. ACM SIGKDD Conf., 2006, pp. 504-509.

[22] G. Schohn and D. Cohn, "Less is more: Active learning with support vector machines," in Proc. Int'l. Conf. Machine Learning, 2000, pp. 839-846.


[23] T. Lane, "Machine learning and data mining for computer security: Methods and applications," in Semi-supervised learning, M. Maloof, Ed. Springer-Verlag, 2006.

[24] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[25] Microsoft, "Web proxy log fields," Tech. Rep., 2004, http://www.microsoft.com/technet/isa/2004/help/FW C WebLogFields.mspx.

[26] S. Stolfo, "KDD cup 1999 dataset," UCI KDD repository, Tech. Rep., 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
