7/31/2019 Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm
Int. J. Inf. Secur. (2009) 8:25–35
DOI 10.1007/s10207-008-0061-2
REGULAR CONTRIBUTION
Large-scale network intrusion detection based on distributed learning algorithm
Daxin Tian · Yanheng Liu · Yang Xiang
Published online: 14 November 2008
© Springer-Verlag 2008
Abstract As network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors. Besides, increasingly complex intrusion detection methods only add further to the pressure on network intrusion detection system (NIDS) platforms, so the continuously increasing speed and throughput of networks pose new challenges to NIDS. To make NIDS usable in Gigabit Ethernet, the ideal policy is to use a load balancer to split the traffic data and forward it to different detection sensors, which can analyze the split data in parallel. In order to make each slice contain all the evidence necessary to detect a specific attack, the load balancer design must be complicated, and it becomes a new bottleneck of the NIDS. To simplify the load balancer, this paper puts forward a distributed neural network learning algorithm (DNNL). Using DNNL, a large data set can be split randomly and each slice of data is presented to an independent neural network; these networks can be trained in a distributed manner, and each one in parallel. Completeness analysis shows that DNNL's learning algorithm is equivalent to training by one neural network which uses the technique of regularization. The experiments to check the completeness and efficiency of DNNL are performed on the KDD'99 data set, which is a standard intrusion detection benchmark. Compared with other approaches on the same benchmark, DNNL achieves a high detection rate and a low false alarm rate.
D. Tian · Y. Liu (B)
College of Computer Science and Technology,
Jilin University, 130012 Changchun, China
e-mail: [email protected]
Y. Xiang
School of Management and Information Systems,
Central Queensland University,
Rockhampton, QLD 4702, Australia
Keywords Intrusion detection system · Distributed learning · Neural network · Network behavior
1 Introduction
With the widespread use of networked computers for critical systems, computer security is attracting increasing attention and intrusions have become a significant threat in recent years. As a second line of defense for computer and network systems, intrusion detection systems (IDS) have been deployed more and more widely along with network security techniques such as firewalls. Intrusion detection techniques can be classified into two categories: misuse detection and anomaly detection. Misuse detection looks for signatures of known attacks, and any matched activity is considered an attack; anomaly detection models a user's behaviors, and any significant deviation from the normal behaviors is considered the result of an attack. The main shortcoming of IDS is false alarms, caused by misinterpreting normal packets as an attack or misclassifying an intrusion as normal behavior. This problem is more severe under fast Ethernet, with the result that network IDS (NIDS) cannot be adapted to protect the backbone network. Since network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors.
To resolve the problem and make NIDS usable in Gigabit Ethernet, one approach is to improve the detection speed by moving the matching away from the processor and onto an FPGA [1–4], using high-performance string matching algorithms [5–7], or reducing the dimensionality of the data, thereby minimizing computational time [8,9]. Another approach is to use both distributed and parallel detecting methods; this is the best way to make NIDS keep up with the speed of networks. The main idea of distributed NIDS
is splitting the traffic data and forwarding them to detection sensors, so that these sensors can analyze the data in parallel. Paper [10] presents an approach which allows for meaningful slicing of the network traffic into portions of manageable size. However, their approach uses a simple round-robin algorithm for load balancing. The splitting algorithm of [11] ensures that a single slice contains all the evidence necessary to detect a specific attack, making sensor-to-sensor interaction unnecessary. Although the algorithm can dynamically balance the sensors' loads by choosing the sensor with the lightest load to process a new connection's packets, it still may lead to some sensor losing packets if the traffic of one connection is heavy. Paper [12] presents a design for a flow-based dynamic load-balancing algorithm, which divides the data stream based on the current value of each analyzer's load function. Incoming data packets which belong to a new session are forwarded to the analyzer that currently has the least load. Paper [13] presents an active splitter architecture and three methods for improving performance: the first is early filtering/forwarding, where a fraction of the packets is processed on the splitter instead of the sensors; the second is the use of locality buffering, where the splitter reorders packets in a way that improves memory access locality on the sensors; the third is the use of cumulative acknowledgments, a method that optimizes the coordination between the traffic splitter and the sensors. The load balancer of SPANIDS [14] employs multiple levels of hashing and incorporates feedback from the sensor nodes to distribute network traffic over the sensors without overloading any of them. Although the methods of [12–14] reduce the load on the sensors, they complicate the splitting algorithm and make the splitter become the bottleneck of the system.
The traffic splitter is the key component of a distributed intrusion detection system. An ideal splitting algorithm should satisfy these requirements: (1) the algorithm divides the whole traffic into slices of equal size; (2) each slice contains all the evidence necessary to detect a specific attack; (3) the algorithm is simple and efficient [11]. Through the above analysis we can find that the primary goal of a NIDS load balancer is to distribute network packets across a set of sensor hosts, thus reducing the load on each sensor to a level that the sensor can handle without dropping packets. However, the connection-oriented characteristic of network traffic makes the load balancer of a NIDS different from those of other environments such as web servers, distributed systems or clusters. In order to satisfy requirement (2), all the distributed intrusion detection systems pay more attention to the load balancer, and thus cannot satisfy requirements (1) and (3). In this paper a distributed neural network learning algorithm (DNNL) is presented which can be used in a distributed anomaly detection system. The idea of DNNL is different from that of the common distributed intrusion detection systems. While the usual methods try to satisfy requirement (2) through weakening requirements (1) and (3), DNNL takes the opposite approach: it first considers satisfying requirements (1) and (3), and then satisfies requirement (2) through the learning algorithm.
Two important characteristics of neural networks are: distributed, where knowledge representation is distributed across many processing units; and parallel, where computations take place in parallel across these distributed representations. Although each neural network can run in parallel, a group of neural networks cannot run in a distributed manner to cope with one problem cooperatively. This is because the learning algorithm requires all the training data to be submitted to the network one by one until the network is stable after one or more epochs. This requirement becomes untenable when the amount of data exceeds the size of main memory, which is obviously possible for any realistic database, such as astronomy data [15], biomedical data [16], bioinformatics data [17], etc. DNNL is not only a parallel but also a distributed learning algorithm which uses independent neural networks to process parts of the training data. These independent neural networks can run in a distributed manner, and each one processes in parallel; thus it can not only take advantage of the neural network's parallel character but also overcome the drawback of concentrated training. DNNL can also be used in mobile agents [18], distributed data mining [19], distributed monitoring [20] and ensemble systems [21].
The rest of this paper is organized as follows. Section 2
describes the main idea of DNNL, and details the basic learn-
ing algorithm. Section 3 presents a metric embedding method
and dissimilarity measure algorithm to make DNNL suit the
data which contains categorical and numerical features. The
experimental results on dataset KDDCUP99 are given in
Sect. 4, and conclusions are made in Sect. 5.
2 DNNL
2.1 The process of DNNL
The main process of DNNL is: first, splitting the large sample data into small subsets and forwarding these slices to distributed sensors; second, each sensor's neural network is trained on the sliced data in parallel until all of them are stable; third, rebuilding the new training data based on the training results of each neural network (the amount of new training data is much less than the total amount of all the sliced data); last, a concentrated learning is carried out on the new training data. The process is shown in Fig. 1.
DNNL involves two phases of learning. In the first phase (distributed learning), the large data set is split randomly and sent to independent neural networks; all the independent neural networks learn the knowledge of each slice in a distributed manner, and every one in parallel. In the second phase (concentrated learning), the training data is built from the training results of the distributed neural networks. Since the new data is much less than the original training data, it can be learned by one neural network in finite time and memory.

Fig. 1 The process of DNNL
2.2 Analysis of DNNL's completeness

The key issue in DNNL is how to build the new data to ensure the training is complete, that is, the result is equal to that of training on the whole data by one neural network. Next we first present the method of building the new data and then analyze the completeness of DNNL.
A stable neural network maintains the knowledge learned from the sample data in the weight matrix $W_{(m \times n)}$, where $m$ is the number of neurons and $n$ is the dimension of each neuron. In DNNL, the dimension of each neuron of a distributed neural network is equal to the dimension of the sample vector $x_{(1 \times n)}$. After the distributed neural network is stable, each row of $W$ can be regarded as one clustering center of the sliced data. The new data are generated from a Gaussian distribution around each point of $W$. For example: the whole original data set $X$ has $p \times q$ samples; $X$ is split into $p$ slices and each slice $^{(i)}X$ $(i = 1, \ldots, p)$ has $q$ samples; after the $i$th neural network trained on the $i$th slice $^{(i)}X$ is stable, its weight matrix $^{(i)}W$ is composed of $r$ rows ($r \ll q$). The $i$th slice of the new data set $^{(i)}X'$ is generated from the Gaussian distribution around each point of $^{(i)}W$; after generation, $^{(i)}X'$ contains $t$ ($r \le t \ll q$) samples. Since $t$ is also much less than $q$, the whole new data set $X'$ is much smaller than the whole original data set $X$. From the above discussion we can find that, after training, each row of the $i$th neural network's weight matrix $^{(i)}W$ represents some samples of $^{(i)}X$, where the distance between these samples and the corresponding row is lower than a threshold value. So we can get

$$^{(i)}W_j = {}^{(i)}X_k + A_k \quad (1)$$

where $k$ identifies the samples whose clustering center is the $j$th row of $^{(i)}W$, and $A_k$ is the vector representing the difference between $^{(i)}W_j$ and $^{(i)}X_k$. The $i$th slice of the new data set $^{(i)}X'$ is generated from the $i$th neural network as

$$^{(i)}X'_l = {}^{(i)}W_j + B_m \quad (2)$$

where $B_m$ is a random vector whose components are generated from a Gaussian distribution. Substituting Eq. (1) into Eq. (2) gives

$$^{(i)}X'_l = {}^{(i)}X_k + A_k + B_m \quad (3)$$
The neural network can be represented by a function $f(W, x)$. After training, given an input of unlabelled data $x$, the output of $f(\cdot)$ is close or equal to the future result $y$. For a feedforward network, the training is a process of finding $W$:

$$W = \arg\min \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| y_{ij} - f(W, x_i)_j \right\| \quad (4)$$

A common choice of the error function is the least mean square error of the form

$$C(x) = \sum_{i=1}^{m} \left\| y_i - f(W, x_i) \right\|^2 \quad (5)$$

Its expected value is

$$E(C(x)) = \int C(x) f_d(x, y) \, dx \, dy \quad (6)$$

where the function $f_d(x, y)$ represents the probability density of the training data. Substituting Eq. (5) into Eq. (6) gives

$$E(C(x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \int \left( y_{ij} - f(W, x_i)_j \right)^2 f_d(x_i, y_i) \, dx_i \, dy_i \quad (7)$$

When training with $X'$, the function $f(\cdot)$ becomes $f(W, x + a + b)$, which can be expanded into a Taylor series:

$$f(W, x+a+b) = f(W, x) + \nabla f(W, x+a+b)^{T}(a+b) + \frac{1}{2}(a+b)^{T} \nabla^2 f(W, x+a+b)(a+b) + \cdots = f(W, x) + h(x) \quad (8)$$

where $\nabla f(\cdot)$ is the gradient and $\nabla^2 f(\cdot)$ is the Hessian matrix. The expected error value when training with the new data set $X'$ can be written in the form

$$E'(C(x)) = E(C(x)) + \Omega(f(W, x)) \quad (9)$$
where $\Omega(f(W, x))$ is

$$\Omega(f(W,x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \int \left[ 2\left( y_{ij} - f(W, x_i)_j \right) h(x_i) + h(x_i)^2 \right] f_d(x_i, y_i) \, f_d(a_i) \, f_d(b_i) \, dx_i \, dy_i \, da_i \, db_i \quad (10)$$

From Eq. (9) we can find that training with the new data set $X'$ is equivalent to the technique of regularization, which adds a penalty term to the error function to control the bias and variance of a neural network [22].
This neural network learning rule can be considered as a gradient optimization process: when an appropriate energy function $E(w)$ is selected, the gradient direction is

$$\frac{dw}{dt} = -\frac{\partial E(w)}{\partial w} \quad (11)$$

and the synaptic weights are adjusted in the gradient direction

$$w(k+1) = w(k) - \eta \frac{\partial E(w)}{\partial w} \quad (12)$$

If the data set $S$ in Fig. 2 is trained following this process, after the neural network is stable, that is, the energy function $E$ reaches a minimum (local or global), the synaptic weights are the black points in Fig. 2.
If the data set $S$ is randomly split (that is, not following the partition boundary) into two data sets $S1$ and $S2$, shown in Figs. 3 and 4, distributed learning first trains on $S1$ and $S2$ independently. After both are stable, $S1$'s energy function $E1$ and $S2$'s energy function $E2$ both reach a minimum, and the concentrated learning is then carried out on the learning results of data sets $S1$ and $S2$. During the concentrated learning: since the triangle class and the plus-sign class have been generalized very well by their synaptic weights, these weights will not be adjusted to a great degree to minimize the energy function; however, the cross class and the six-pointed-star class do not reach the optimal state, and therefore their synaptic weights generated during the distributed learning will continue to be adjusted until reaching a minimum. Although the training results (synaptic weight number and value) on the split data set and the whole data set may be different, they have the same generalization ability because they all aim to make $S$'s energy function $E$ reach a minimum.
2.3 Competitive learning algorithm based on kernel function

In order to gain the advantages of being able to learn from new data, a neural network must be adaptive or exhibit plasticity, possibly allowing the creation of new neurons. On the other hand, if the training data structures are unstable and the most recently acquired piece of information can cause major reorganization, then it is difficult to ascribe much significance to any particular clustering description. This problem is even more serious in distributed data training. SOM [23], dART [24], RPCL [25], etc. have presented some methods to overcome this problem; this paper introduces a competitive mechanism which absorbs the ideas of the above methods. The learning algorithm is based on Hebb learning and kernel functions. To prevent the knowledge included in different slices from being ignored, DNNL adopts the resonance mechanism of ART and adds neurons whenever the network in its current state does not sufficiently match the input. Thus the learning results of the sensors contain complete or partial knowledge, and the whole knowledge can be learned by the concentrated learning.

Fig. 2 The data set S and its training result
Fig. 3 The data set S1 and its training result
Fig. 4 The data set S2 and its training result
2.3.1 Hebb learning

In DNNL the learning algorithm is based on Hebb's postulate, which states: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
The learning rule for a single neuron can be derived from an energy function defined as

$$E(w) = -\varphi\left(w^{T} x\right) + \frac{\beta}{2} \|w\|_2^2 \quad (13)$$

where $w$ is the synaptic weight vector (including a bias or threshold), $x$ is the input to the neuron, $\varphi(\cdot)$ is a differentiable function, and $\beta \ge 0$ is the forgetting factor. Also,

$$y = \frac{d\varphi(v)}{dv} = f(v) \quad (14)$$

is the output of the neuron, where $v = w^{T} x$ is the activity level of the neuron. Taking the steepest descent approach to derive the continuous-time learning rule

$$\frac{dw}{dt} = -\eta \nabla_w E(w) \quad (15)$$

where $\eta > 0$ is the learning rate parameter, we see that the gradient of the energy function in Eq. (13) must be computed with respect to the synaptic weight vector, that is, $\nabla_w E(w) = \partial E(w)/\partial w$. The gradient of Eq. (13) is

$$\nabla_w E(w) = -f(v)\frac{\partial v}{\partial w} + \beta w = -y x + \beta w \quad (16)$$

Therefore, by using the result in Eq. (16) along with that of Eq. (15), the continuous-time learning rule for a single neuron is

$$\frac{dw}{dt} = \eta \left[ y x - \beta w \right] \quad (17)$$

and the discrete-time learning rule (in vector form) is

$$w(t+1) = w(t) + \eta \left[ y(t+1) x(t+1) - \beta w(t) \right] \quad (18)$$
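As a minimal sketch, the discrete-time rule above for one neuron, with the learning rate and forgetting factor written out explicitly as parameters:

```c
#include <stddef.h>

/* One discrete-time Hebbian step with forgetting (Eq. 18):
   w <- w + eta * (y * x - beta * w), for an n-dimensional weight vector.
   eta is the learning rate and beta the forgetting factor of Eq. (13). */
void hebb_step(double *w, const double *x, double y,
               double eta, double beta, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        w[i] += eta * (y * x[i] - beta * w[i]);
}
```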
2.3.2 Competitive mechanism based on kernel function

To overcome the problem induced by the traffic splitter, an inverse distance kernel function is used in Hebb learning. The basic idea is that not only is the winner rewarded, but all the losers are also penalized, at different rates calculated by the inverse distance function, whose input is the dissimilarity between the sample data and the neuron.

The dissimilarity measure function is the Minkowski metric:

$$d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p} \quad (19)$$

where $x_i$, $y_i$ are the $i$th coordinates of $x$ and $y$, $i = 1, \ldots, l$, and $w_i \ge 0$ is the $i$th weight coefficient.

When the $j$th neuron is most similar to the sample, the learning rule of the $i$th neuron is

$$W_i(t+1) = W_i(t) + \alpha_i \left[ x(t+1) - W_i(t) \right] \quad (20)$$

where

$$\alpha_i = \begin{cases} 1 & \text{winner, } i = j \\ -K(d_i) & \text{others, } i = 1, \ldots, m \text{ and } i \ne j \end{cases} \quad (21)$$

and $K(d_i)$ is the inverse distance kernel,

$$K(d_i) = \frac{1}{1 + d_i^p} \quad (22)$$

If the winner's dissimilarity measure $d_j < \theta$ ($\theta$ is the threshold of dissimilarity), then update the synaptic weights by learning rule Eq. (20); else add a new neuron and set its synaptic weight $w = x$.
2.4 Post-prune algorithm

One of the central issues in network training is to find the optimal model $f(\cdot)$. Judging the efficiency of $f(\cdot)$ can be broken into two fundamental aspects: bias and variance. Bias measures the expected value of the estimator relative to the true value, and variance measures the variability of the estimator about the expected value. Since DNNL determines the network size by adding neurons incrementally, it may model noise data into $f(\cdot)$ and lead to high variance (the phenomenon of overfitting). To prevent overfitting, DNNL uses a post-prune method whose strategy is based on a distance threshold: if two weights are too similar they will be substituted by a new weight. The new weight is calculated as

$$W_{new} = (W_{old1} \cdot t_1 + W_{old2} \cdot t_2)/(t_1 + t_2) \quad (23)$$

where $t_1$ is the number of times $W_{old1}$ has been trained and $t_2$ is the number of times $W_{old2}$ has been trained.
The pruning process is illustrated in Fig. 5: after pruning, E and F are aggregated to EF, and A and B to AB. The prune algorithm is shown below:

Step 0: If the old weight set (oldW) is empty then the algorithm is over, else proceed;
Step 1: Calculate the distance between the first weight (fw) and the other weights;
Step 2: Find the weight (sw) which is most similar to fw;
Step 3: If the distance between sw and fw is larger than the pruning threshold, then delete fw from oldW, add fw into the new weight set (newW), and go to Step 0; else continue;
Step 4: Get fw's training times value (ft) and sw's training times value (st);
Step 5: Calculate the new weight (nw) and nw's training times value (nt): nw = (fw × ft + sw × st)/(ft + st), nt = ft + st;
Step 6: Delete fw and sw from oldW and add nw into newW; go to Step 0.

Fig. 5 The pruning process
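The count-weighted merge of Eq. (23), used in Step 5 above, can be sketched as:

```c
#include <stddef.h>

/* Count-weighted merge of two similar weights (Eq. 23 / Step 5):
   Wnew = (Wold1*t1 + Wold2*t2) / (t1 + t2), over n dimensions.
   t1 and t2 are the training counts of the two old weights. */
void merge_weights(const double *w1, unsigned t1,
                   const double *w2, unsigned t2,
                   double *wnew, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        wnew[i] = (w1[i] * t1 + w2[i] * t2) / (double)(t1 + t2);
}
```

Weighting by the training counts keeps the merged weight closer to the neuron that has absorbed more samples, so the pruning does not erase well-supported cluster centers.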
2.5 Learning algorithm of DNNL

The main learning process of DNNL is:

Step 0: Initialize the learning rate parameter $\eta$ and the threshold of dissimilarity $\theta$;
Step 1: Obtain the first input $x$ and set $w_0 = x$ as the initial weight;
Step 2: If training is not over, randomly take a feature vector $x$ from the feature sample set $X$ and compute the dissimilarity measure between $x$ and each synaptic weight using Eq. (19);
Step 3: Decide the winner neuron $j$ and test tolerance: if $d_j \ge \theta$, add a new neuron, set its synaptic weight $w = x$, and go to Step 2; else continue;
Step 4: Compute $\alpha_i$ by using the result of the inverse distance kernel $K(d_i)$;
Step 5: Update the synaptic weights as in Eq. (20); go to Step 2.
3 Data preprocessing

The KDD CUP 99 data was collected through a simulation on a U.S. military network by the 1998 DARPA Intrusion Detection Evaluation Program, aiming at obtaining a benchmark data set for the field of intrusion detection. The full data set contains training data consisting of 7 weeks of network-based intrusions inserted in normal data, and 2 weeks of network-based intrusions in normal data, for a total of 4,999,000 connection records described by 41 characteristics. The records are mainly divided into four types of attack: probe, denial of service (DoS), user-to-root (U2R) and remote-to-local (R2L).
3.1 Metric embedding

The set of features presented in the KDD Cup data set contains categorical and numerical features of different sources and scales. An essential step in handling such data is metric embedding, which transforms the data into a metric space. In this paper the categorical features are represented by $A$: each categorical feature $A_i$ expressing $g$ possible categorical values is defined as $A_i = \{A_i^1, A_i^2, \ldots, A_i^g\}$; the numerical features are represented by $B$; then the metric space $X$ can be defined as $X = \{A_1, \ldots, A_m, B_1, \ldots, B_{n-m}\}$. That means each sample is described by $n$ features.
3.2 Dissimilarity measure

For numerical features, the value $|x_i - y_i|$ of the Minkowski metric can be calculated directly after normalization. But for categorical features, we need to define a new calculation method. The Hamming distance is often used to quantify the extent to which two strings of the same dimension differ. An early application was in the theory of error-correcting codes, where the Hamming distance measured the error introduced by noise over a channel when a message, typically a sequence of bits, is sent between its source and destination. In DNNL the calculation of $|x_i - y_i|$ for categorical features is similar to the Hamming distance. If $x_i$ and $y_i$ are categorical features, $x_i$ is the feature of the sample data, $y_i$ is the corresponding feature of one training neuron $N$, and $x_i = A_i^k$ $(k \in 1, \ldots, g)$, then

$$|x_i - y_i| = 1 - \frac{c_k}{C} \quad (24)$$

where $c_k$ is the number which represents how many times $A_i^k$ has been learned by neuron $N$,

$$c_k = num\left(A_i^k\right) \quad (25)$$

and $C$ is the total number of all the categorical features that have been learned by neuron $N$,

$$C = \sum_{i=1}^{m} \sum_{j=1}^{g} num\left(A_i^j\right) \quad (26)$$

If the neuron $N$ is the winner of this training epoch, its value of $c_k$ will be increased by 1. Using this method to calculate the Minkowski metric of categorical features, if the value of one sample's categorical feature $A_i$ is $A_i^k$, then the neurons with a larger $num(A_i^k)$ are more similar to this sample regarding this categorical feature, that is, the value $|x_i - y_i|$ is smaller.
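A sketch of Eq. (24) for a single categorical feature; keeping one count per possible value per neuron is the bookkeeping described above, while collapsing the double sum of Eq. (26) to one feature is a simplification of this sketch:

```c
#include <stddef.h>

/* Dissimilarity of one categorical feature (Eq. 24):
   |x_i - y_i| = 1 - c_k / C, where counts[v] records how often value v has
   been learned by the neuron (Eq. 25) and C is the total count (Eq. 26,
   restricted here to a single feature with n_values possible values). */
double categorical_diff(const unsigned *counts, size_t n_values, size_t k)
{
    unsigned C = 0;
    for (size_t i = 0; i < n_values; ++i)
        C += counts[i];
    if (C == 0)
        return 1.0;   /* nothing learned yet: maximally dissimilar */
    return 1.0 - (double)counts[k] / (double)C;
}
```

A value the neuron has seen often yields a small dissimilarity, and a value it has never seen yields the maximum of 1, matching the Hamming-style behavior described above.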
4 Experiments

4.1 Benchmark test

In the KDD CUP 99 data set, a smaller data set consisting of 10% of the overall data set is generally used to evaluate algorithm performance. The smaller data set contains 22 kinds of intrusion behaviors and 494,019 records, among which 97,276 are normal connection records. The test set is another data set which contains 37 kinds of intrusion behaviors and 311,029 records, among which 60,593 are normal.
4.1.1 Performance measures

The recording format of test results is shown in Table 1. False alarms are partitioned into False Positives (FP, normal is detected as intrusion) and False Negatives (FN, intrusion is not detected). True detections are also partitioned into True Positives (TP, intrusion is detected correctly) and True Negatives (TN, normal is detected correctly).

Table 1 Recording format of test results

                      Detection results
Actual behaviors      Normal    Intrusion-1   ...   Intrusion-n
Normal                TN_00     FP_01         ...   FP_0n
Intrusion-1           FN_10     TP_11         ...   FP_1n
Intrusion-2           FN_20     FP_21         ...   FP_2n
...                   ...       ...           ...   ...
Intrusion-n           FN_n0     FP_n1         ...   TP_nn
Definition 1 The right detection rate of the $i$th behavior (TR) is

$$TR = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ij}} \quad (27)$$

where $T_{ii}$ is the value which lies in Table 1's $i$th row and $i$th column, and $R_{ij}$ is the value which lies in Table 1's $i$th row and $j$th column.

Definition 2 The right prediction rate of the $i$th behavior (PR) is

$$PR = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ji}} \quad (28)$$

where $R_{ji}$ is the value which lies in Table 1's $j$th row and $i$th column.

Definition 3 The detection rate (DR) is computed as the ratio between the number of correctly detected intrusions and the total number of intrusions. If we regard Table 1's record as an $(n+1) \times (n+1)$ matrix $R$, then

$$DR = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} R_{ij}}{\sum_{i=1}^{n} \sum_{j=0}^{n} R_{ij}} \quad (29)$$

Definition 4 The false positive rate (FPR) is computed as the ratio between the number of normal behaviors that are incorrectly classified as intrusions and the total number of normal connections; according to Table 1's record,

$$FPR = \frac{\sum_{i=1}^{n} FP_{0i}}{\sum_{i=1}^{n} FP_{0i} + TN_{00}} \quad (30)$$
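The two summary metrics can be computed directly from the confusion matrix of Table 1; this sketch assumes the denominators follow the verbal definitions (all intrusion records for DR, all normal records for FPR):

```c
#include <stddef.h>

/* DR (Eq. 29) and FPR (Eq. 30) from the (n+1) x (n+1) matrix R of Table 1,
   stored row-major: row/column 0 is normal, rows/columns 1..n are the
   intrusion classes. */
double detection_rate(const double *R, size_t n)
{
    double detected = 0.0, total = 0.0;
    for (size_t i = 1; i <= n; ++i)        /* intrusion rows only */
        for (size_t j = 0; j <= n; ++j) {
            total += R[i * (n + 1) + j];
            if (j >= 1)                    /* flagged as some intrusion */
                detected += R[i * (n + 1) + j];
        }
    return detected / total;
}

double false_positive_rate(const double *R, size_t n)
{
    double fp = 0.0;
    for (size_t j = 1; j <= n; ++j)
        fp += R[j];              /* row 0 of R: FP_01 .. FP_0n */
    return fp / (fp + R[0]);     /* R[0] is TN_00 */
}
```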
4.1.2 Experiment results

To test the performance of DNNL, we first divide the 494,019 records into 50 slices. Each slice contains 10,000 records except the last, which contains 4,019 records. With learning rate $\eta = 0.1$ and threshold of dissimilarity $\theta = 1.5$, the learning results are shown in Figs. 6 and 7. In Fig. 6, the X-axis represents each slice and the Y-axis records the corresponding number of neurons after the neural networks are stable. In Fig. 7, the Y-axis records the corresponding number of behaviors included in each slice. From the results we can find that the distributions of neurons and behaviors are similar, which indicates the sensors have learned the knowledge. Since the behaviors recorded in the 18th–36th and 43rd–46th slices are all smurf intrusions, the number of behaviors is 1 and the number of neurons is 51. After the training on the distributed learning results, the knowledge is represented by 368 neurons.
There are 37 kinds of intrusion behaviors in the test set.
We first separate them into four kinds of attacks:
123
-
7/31/2019 Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm
8/12
32 D. Tian et al.
Fig. 6 The number of neurons for each corresponding slice
Fig. 7 The number of behaviors for each corresponding slice
Probe: {portsweep, mscan, saint, satan, ipsweep, nmap}
DOS: {udpstorm, smurf, pod, land, processtable, warezmaster, apache2, mailbomb, neptune, back, teardrop}
U2R: {httptunnel, ftp_write, sqlattack, xterm, multihop, buffer_overflow, perl, loadmodule, rootkit, ps}
R2L: {guess_passwd, phf, snmpguess, named, imap, snmpgetattack, xlock, sendmail, xsnoop, worm}
The test results are summarized in Table 2.
Comparing the result with that of the first winner of KDD CUP 99, we see that the TR of DNNL is almost equal to the first winner's. There are two reasons leading to the low TR of U2R and R2L: first, the number of attack instances pertaining to U2R and R2L is much smaller than that of the other types of attack; second, U2R and R2L are host-based attacks which exploit vulnerabilities of the operating systems, not of the network protocol. Therefore, these are very similar to the normal data. Table 3 shows the DR and FPR of the first and second winners of the KDD CUP 99 competition, other approaches [21] and DNNL. From the comparison we can find that DNNL provides superior performance.
4.2 Prototype system test

4.2.1 Test environment

In order to address the problem of intrusion detection analysis in high-speed networks, the data stream on the high-speed network link is divided into several smaller slices that are fed into a number of distributed neural networks. In order to evaluate the effectiveness of DNNL, we developed a prototype IDS using Libpcap. The test environment is shown in Fig. 8. We used 12 PCs with 100 Mbps Ethernet cards to serve as background traffic generators, which could generate more than 1,000 Mbps of TCP and UDP streams with an average packet size of 1,024 bytes. One IBM server which runs Web services is the attack object, and one attacker sends attack packets to the server. All 14 computers are connected to 100 Mbps ports on a Huawei Quidway S3526C switch. All the packets through these ports are mirrored to a defined mirror port and then distributed to the neural networks.
4.2.2 Packet capture

Every station on a LAN hears every packet transmission, so there is a destination field and a source field in each packet. The Ethernet card can be in promiscuous mode or normal mode. Under promiscuous mode, the card will receive and deliver every packet. Under normal mode, if the packet destination address is identical to the station address, the card will receive and pass the packet up to the software; if it is not, the card will just drop the packet (filter it). An IDS can run under the promiscuous mode of the Ethernet card to analyze every packet passing through the LAN. Libpcap is the library we use to grab packets from the network card directly. The main functions used are:

pcap_open_live() is used to obtain a packet capture descriptor to look at packets on the network.
pcap_lookupnet() is used to determine the network number and mask associated with the network device.
pcap_lookupdev() returns a pointer to a network device suitable for use with pcap_open_live() and pcap_lookupnet().
pcap_loop() is used to collect and process packets. Each captured packet is parsed to form the network behavior vector.
Our method of parsing is based on the layered structure of network software. In the TCP/IP Reference Model, the Internet layer defines an official packet format and protocol called IP (Internet Protocol); the layer above
Table 2 Testing results

                     Detection results
Actual behaviors     Normal   Probe   DoS      U2R   R2L    TR (%)
Normal               58120    927     649      64    833    96.0
Probe                357      3546    174      21    118    85.1
DoS                  256      5092    223518   52    435    97.2
U2R                  143      39      0        23    23     10.1
R2L                  14443    14      1        271   1460   9.0
PR (%)               79.3     36.9    99.6     5.3   50.9
Table 3 Comparison with other approaches

Algorithm                  DR (%)   FPR (%)
Winning entry              91.9     0.5
Second place               91.5     0.6
Best linear GP - FP rate   89.4     0.7
Best GEdIDS - FP rate      91       0.4
DNNL                       93.9     0.4
Fig. 8 Test environment
the Internet layer is the transport layer, where two end-to-end
protocols are defined. The first, TCP (Transmission
Control Protocol), is a reliable connection-oriented protocol
that allows a byte stream originating on one machine to be
delivered without error to any other machine in the Internet.
The second protocol in this layer, UDP (User Datagram
Protocol), is an unreliable, connectionless protocol. In the
experiments we use the packet headers to define the data
structure of network behavior:
typedef struct _EthernetBehavior
{
    u_int8_t ethernet_dest[6];  /* destination Ethernet address (6-byte MAC) */
    u_int8_t ethernet_sour[6];  /* source Ethernet address (6-byte MAC) */
    u_int16_t ethernet_type;    /* packet type ID field */
} EthernetBehavior;
The ethernet_type field identifies the nested protocol
header, which may be IP, ARP, or some other protocol. For
instance, an IP header can be defined as:
typedef struct _IPBehavior
{
    unsigned int header_len:4;       /* header length, in 32-bit words */
    unsigned int version:4;          /* version of the protocol */
    u_int8_t tos;                    /* type of service */
    u_short total_len;               /* total length of datagram */
    u_short identification;         /* identification */
    u_int16_t flag_off;              /* flags and fragment offset */
    u_int8_t time_live;              /* time to live: limit on packet lifetime */
    u_int8_t protocol;               /* upper-layer protocol (TCP, UDP, ...) */
    u_int16_t checksum;              /* header checksum */
    struct in_addr source_addr;      /* source address */
    struct in_addr destination_addr; /* destination address */
} IPBehavior;
The protocol field tells which protocol is used in the layer
above; it can be TCP, UDP, ICMP, etc. For example, the
TCP header is defined as:
typedef struct _TCPBehavior
{
    u_int16_t sour_port;    /* source port */
    u_int16_t dest_port;    /* destination port */
    tcp_seq seq_num;        /* sequence number */
    tcp_seq ack_num;        /* acknowledgement number */
    u_int16_t flag;         /* data offset, reserved bits, and flags */
    u_int16_t win_size;     /* window size */
    u_int16_t check_sum;    /* checksum */
    u_int16_t urg_pointer;  /* urgent pointer */
} TCPBehavior;
Fig. 9 Test result
During training, the IDS uses the behavior variables of the
normal packets to form the binary behavior matrix. At
detection time, if an intrusion is detected, the IDS raises
an alarm and displays detailed information about the
intruder, parsed from the behavior variables.
4.2.3 Experiment result
In this experiment, we evaluate our proposed method. We
first train the neural network with different normal features,
then use the stabilized neural network to monitor the system
while some abnormal behaviors take place under the
same environment. A series of experiments is conducted to
analyze the effect of varying the intrusion threshold on
system errors. The test results are shown graphically
in Fig. 9.
We find that the performance of the IDS is sensitive
to the intrusion threshold. As the threshold value
increases, false positive errors increase while false negative
errors decrease. Since a false negative error is more costly
in an IDS, we concentrate on the decrease of false negative
errors as the threshold value changes.
The optimal threshold value lies between 1.5 and 1.6.
5 Conclusions
The bandwidth of networks increases faster than the speed
of processors, so it is impossible to keep up with network
speeds merely by increasing the processor speed of an NIDS.
To resolve this problem, this paper presents DNNL, which
can be used in anomaly detection methods. Completeness
analysis shows that DNNL's learning algorithm is equivalent
to training a single neural network that adds a penalty term
to the error function to control the bias and variance of
the network. The main contributions of this approach are:
reducing the complexity of load balancing while still
maintaining the completeness of the network behavior, putting
forward a dissimilarity measure for categorical and
numerical features, and increasing the speed of the whole
system. In the experiments, the KDD data set, the most
common data set in IDS research papers, is used. Training
with one neural network takes 67 h, whereas DNNL
takes less than 1 h. Comparisons with other approaches
on the same benchmark show that DNNL's false alarm rate
is very low.
Acknowledgments This research is supported by both the National
Natural Science Foundation of China under Grant No. 60573128 and
the National Research Foundation for the Doctoral Program of Higher
Education of China under Grant No. 20060183043.