7/31/2019 Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm
Int. J. Inf. Secur. (2009) 8:25–35
DOI 10.1007/s10207-008-0061-2
REGULAR CONTRIBUTION
Large-scale network intrusion detection based on distributed learning algorithm
Daxin Tian · Yanheng Liu · Yang Xiang
Published online: 14 November 2008
© Springer-Verlag 2008
Abstract As network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors. Besides, increasingly complex intrusion detection methods only add further to the pressure on network intrusion detection system (NIDS) platforms, so the continuously increasing speed and throughput of networks pose new challenges to NIDS. To make NIDS usable in Gigabit Ethernet, the ideal policy is to use a load balancer to split the traffic data and forward it to different detection sensors, which can analyze the split data in parallel. In order to make each slice contain all the evidence necessary to detect a specific attack, the load balancer design must be complicated, and it becomes a new bottleneck of the NIDS. To simplify the load balancer, this paper puts forward a distributed neural network learning algorithm (DNNL). Using DNNL, a large data set can be split randomly and each slice of data is presented to an independent neural network; these networks can be trained in a distributed manner, and each one in parallel. Completeness analysis shows that DNNL's learning algorithm is equivalent to training by one neural network which uses the technique of regularization. The experiments to check the completeness and efficiency of DNNL are performed on the KDD'99 data set, which is a standard intrusion detection benchmark. Compared with other approaches on the same benchmark, DNNL achieves a high detection rate and a low false alarm rate.
D. Tian · Y. Liu (B)
College of Computer Science and Technology,
Jilin University, 130012 Changchun, China
e-mail: [email protected]
Y. Xiang
School of Management and Information Systems,
Central Queensland University,
Rockhampton, QLD 4702, Australia
Keywords Intrusion detection system · Distributed learning · Neural network · Network behavior
1 Introduction
With the widespread use of networked computers for critical systems, computer security is attracting increasing attention and intrusions have become a significant threat in recent years. As a second line of defense for computer and network systems, intrusion detection systems (IDS) have been deployed more and more widely along with network security techniques such as firewalls. Intrusion detection techniques can be classified into two categories: misuse detection and anomaly detection. Misuse detection looks for signatures of known attacks, and any matched activity is considered an attack; anomaly detection models a user's behaviors, and any significant deviation from the normal behaviors is considered the result of an attack. The main shortcoming of IDS is false alarms, caused by misinterpreting normal packets as an attack or misclassifying an intrusion as normal behavior. This problem is more severe under fast Ethernet, with the result that network IDS (NIDS) cannot be adapted to protect the backbone network. Since network traffic bandwidth is increasing at an exponential rate, it is impossible to keep up with the speed of networks by just increasing the speed of processors.
To resolve the problem and make NIDS usable in Gigabit Ethernet, one approach is to improve the detection speed by moving the matching away from the processor and onto an FPGA [1–4], using high-performance string matching algorithms [5–7], or reducing the dimensionality of the data, thereby minimizing computational time [8,9]. Another approach is to use both distributed and parallel detecting methods; this is the best way to make NIDS keep up with the speed of networks. The main idea of distributed NIDS
is splitting the traffic data and forwarding them to detection sensors, so that these sensors can analyze the data in parallel. Paper [10] presents an approach which allows for meaningful slicing of the network traffic into portions of manageable size. However, their approach uses a simple round-robin algorithm for load balancing. The splitting algorithm of [11] ensures that a single slice contains all the evidence necessary to detect a specific attack, making sensor-to-sensor interaction unnecessary. Although the algorithm can dynamically balance the sensors' loads by choosing the sensor with the lightest load to process a new connection's packets, it still may lead to some sensor losing packets if the traffic of one connection is heavy. Paper [12] presents a design for a flow-based dynamic load-balancing algorithm, which divides the data stream based on the current value of each analyzer's load function. Incoming data packets which belong to a new session are forwarded to the analyzer that currently has the least load. Paper [13] presents an active splitter architecture and three methods for improving performance: the first is early filtering/forwarding, where a fraction of the packets is processed on the splitter instead of the sensors; the second is the use of locality buffering, where the splitter reorders packets in a way that improves memory access locality on the sensors; the third is the use of cumulative acknowledgments, a method that optimizes the coordination between the traffic splitter and the sensors. The load balancer of SPANIDS [14] employs multiple levels of hashing and incorporates feedback from the sensor nodes to distribute network traffic over the sensors without overloading any of them. Although the methods of [12–14] reduce the load on the sensors, they complicate the splitting algorithm and make the splitter become the bottleneck of the system.
The traffic splitter is the key component of a distributed intrusion detection system. An ideal splitting algorithm should satisfy these requirements: (1) the algorithm divides the whole traffic into slices of equal size; (2) each slice contains all the evidence necessary to detect a specific attack; (3) the algorithm is simple and efficient [11]. Through the above analysis we can find that the primary goal of a NIDS load balancer is to distribute network packets across a set of sensor hosts, thus reducing the load on each sensor to a level that the sensor can handle without dropping packets. However, the connection-oriented characteristic of network traffic makes the load balancer of a NIDS different from those of other environments such as web servers, distributed systems or clusters. In order to satisfy requirement (2), all the distributed intrusion detection systems pay more attention to the load balancer, and thus cannot satisfy requirements (1) and (3). In this paper a distributed neural network learning algorithm (DNNL) is presented which can be used in a distributed anomaly detection system. The idea of DNNL is different from that of the common distributed intrusion detection systems. While the usual methods try to satisfy requirement (2) through weakening requirements (1) and (3), DNNL takes the opposite approach: it first considers satisfying requirements (1) and (3), and then satisfies requirement (2) through the learning algorithm.
Two important characteristics of neural networks are: distributed, where knowledge representation is distributed across many processing units; and parallel, where computations take place in parallel across these distributed representations. Although each neural network can run in parallel, a group of neural networks cannot run in a distributed manner to cope with one problem cooperatively. This is because the learning algorithm requires all the training data to be submitted to the network one by one until the network is stable after one or more epochs. This requirement becomes untenable when the amount of data exceeds the size of main memory, which is obviously possible for any realistic database, such as astronomy data [15], biomedical data [16], bioinformatics data [17], etc. DNNL is not only a parallel but also a distributed learning algorithm which uses independent neural networks to process parts of the training data. These independent neural networks can run in a distributed manner, and each one processes in parallel; thus it can not only take advantage of the neural network's parallel character but also overcome the drawback of concentrated training. DNNL can also be used in mobile agents [18], distributed data mining [19], distributed monitoring [20] and ensemble systems [21].
The rest of this paper is organized as follows. Section 2
describes the main idea of DNNL, and details the basic learn-
ing algorithm. Section 3 presents a metric embedding method
and dissimilarity measure algorithm to make DNNL suit the
data which contains categorical and numerical features. The
experimental results on dataset KDDCUP99 are given in
Sect. 4, and conclusions are made in Sect. 5.
2 DNNL
2.1 The process of DNNL
The main process of DNNL is: first, splitting the large sample data into small subsets and forwarding these slices to distributed sensors; second, each sensor's neural network is trained on the sliced data in parallel until all of them are stable; third, rebuilding the new training data based on the training results of each neural network (the amount of new training data is much less than the total amount of all the sliced data); last, a concentrated learning is carried out on the new training data. The process is shown in Fig. 1.
DNNL involves two phases of learning. In the first phase (distributed learning), the large data set is split randomly and sent to independent neural networks; all the independent neural networks learn the knowledge of each slice in a distributed manner, and every one in parallel. In the second phase (concentrated learning), the training data is built from the training results of the distributed neural networks. Since the new data is much less than the original training data, it can be learned by one neural network in finite time and memory.

Fig. 1 The process of DNNL
2.2 Analysis of DNNL's completeness

The key issue in DNNL is how to build the new data to ensure the training is complete, that is, the result is equal to that of training on the whole data by one neural network. Next we first present the method of building the new data and then analyze the completeness of DNNL.
A stable neural network maintains the knowledge learned from the sample data in the weight matrix $W_{(m \times n)}$, where $m$ is the number of neurons and $n$ is the dimension of each neuron. In DNNL, the dimension of each neuron of a distributed neural network is equal to the dimension of the sample vector $x_{(1 \times n)}$. After the distributed neural network is stable, each row of $W$ can be regarded as one clustering center of the sliced data. The new data are generated from a Gaussian distribution around each point of $W$. For example: the whole original data set $X$ has $p \times q$ samples; $X$ is split into $p$ slices and each slice $^{(i)}X$ $(i = 1, \ldots, p)$ has $q$ samples; after the $i$th neural network trained on the $i$th slice $^{(i)}X$ is stable, its weight matrix $^{(i)}W$ is composed of $r$ rows ($r \ll q$). The $i$th slice of the new data set $^{(i)}X'$ is generated from the Gaussian distribution around each point of $^{(i)}W$; after generation, $^{(i)}X'$ contains $t$ ($r \le t \ll q$) samples. Since $t$ is also much less than $q$, the whole new data set $X'$ is much smaller than the whole original data set $X$. From the above discussion we can find that, after training, each row of the $i$th neural network's weight matrix $^{(i)}W$ represents some samples of $^{(i)}X$, where the distance between these samples and the corresponding row is lower than a threshold value. So we can get

$$^{(i)}W_j = {}^{(i)}X_k + A_k \quad (1)$$

where $k$ identifies the samples whose clustering center is the $j$th row of $^{(i)}W$, and $A_k$ is the vector representing the difference between $^{(i)}W_j$ and $^{(i)}X_k$. The $i$th slice of the new data set $^{(i)}X'$ is generated from the $i$th neural network as

$$^{(i)}X'_l = {}^{(i)}W_j + B_m \quad (2)$$

where $B_m$ is a random vector whose components are generated from a Gaussian distribution. Substituting Eq. (1) into Eq. (2) gives

$$^{(i)}X'_l = {}^{(i)}X_k + A_k + B_m \quad (3)$$
The neural network can be represented by a function $f(W, x)$. After training, given an input of unlabelled data $x$, the output of $f(\cdot)$ is close or equal to the future result $y$. For a feedforward network, the training is a process of finding $W$:

$$W = \arg\min \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| y_{ij} - f(W, x_i)_j \right\| \quad (4)$$

A common choice of the error function is the least mean square error of the form

$$C(x) = \sum_{i=1}^{m} \left\| y_i - f(W, x_i) \right\|^2 \quad (5)$$

Its expected value is

$$E(C(x)) = \int C(x) f_d(x, y) \, dx \, dy \quad (6)$$

where the function $f_d(x, y)$ represents the probability density of the training data. Substituting Eq. (5) into Eq. (6) gives

$$E(C(x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \int \left( y_{ij} - f(W, x_i)_j \right)^2 f_d(x_i, y_i) \, dx_i \, dy_i \quad (7)$$

When training with $X'$, the function $f(\cdot)$ becomes $f(W, x + a + b)$, which can be expanded into a Taylor series:

$$f(W, x+a+b) = f(W, x) + \nabla f(W, x+a+b)^{T}(a+b) + \frac{1}{2}(a+b)^{T} \nabla^2 f(W, x+a+b)(a+b) + \cdots = f(W, x) + h(x) \quad (8)$$

where $\nabla f(\cdot)$ is the gradient and $\nabla^2 f(\cdot)$ is the Hessian matrix. The expected error value when training with the new data set $X'$ can be written in the form

$$E'(C(x)) = E(C(x)) + \Omega(f(W, x)) \quad (9)$$
where $\Omega(f(W, x))$ is

$$\Omega(f(W,x)) = \sum_{i=1}^{m} \sum_{j=1}^{n} \int \left[ 2\left( y_{ij} - f(W, x_i)_j \right) h(x_i) + h(x_i)^2 \right] f_d(x_i, y_i) \, f_d(a_i) \, f_d(b_i) \, dx_i \, dy_i \, da_i \, db_i \quad (10)$$

From Eq. (9) we can find that training with the new data set $X'$ is equivalent to the technique of regularization, which adds a penalty term to the error function to control the bias and variance of a neural network [22].
This neural network learning rule can be considered as a gradient optimization process: when an appropriate energy function $E(w)$ is selected, the gradient direction is

$$\frac{dw}{dt} = -\frac{\partial E(w)}{\partial w} \quad (11)$$

and the synaptic weights are adjusted in the gradient direction

$$w(k+1) = w(k) - \eta \frac{\partial E(w)}{\partial w} \quad (12)$$

If the data set $S$ in Fig. 2 is trained following this process, after the neural network is stable, that is, the energy function $E$ reaches a minimum (local or global), the synaptic weights are the black points in Fig. 2.
If the data set $S$ is randomly split (that is, not following the partition boundary) into two data sets $S1$ and $S2$, shown in Figs. 3 and 4, distributed learning first trains on $S1$ and $S2$ independently. After both are stable, $S1$'s energy function $E1$ and $S2$'s energy function $E2$ both reach a minimum, and the concentrated learning is then carried out on the learning results of data sets $S1$ and $S2$. During the concentrated learning: since the triangle class and the plus-sign class have been generalized very well by their synaptic weights, these weights will not be adjusted to a great degree to minimize the energy function; however, the cross class and the six-pointed-star class do not reach the optimal state, and therefore their synaptic weights generated during the distributed learning will continue to be adjusted until reaching a minimum. Although the training results (synaptic weight number and value) on the split data set and the whole data set may be different, they have the same generalization ability because they all aim to make $S$'s energy function $E$ reach a minimum.
2.3 Competitive learning algorithm based on kernel function

In order to gain the advantages of being able to learn from new data, a neural network must be adaptive or exhibit plasticity, possibly allowing the creation of new neurons. On the other hand, if the training data structures are unstable and the most recently acquired piece of information can cause major reorganization, then it is difficult to ascribe much significance to any particular clustering description. This problem is even more serious in distributed data training. SOM [23], dART [24], RPCL [25], etc. have presented some methods to overcome this problem; this paper introduces a competitive mechanism which absorbs the ideas of the above methods. The learning algorithm is based on Hebb learning and kernel functions. To prevent the knowledge included in different slices from being ignored, DNNL adopts the resonance mechanism of ART and adds neurons whenever the network in its current state does not sufficiently match the input. Thus the learning results of the sensors contain complete or partial knowledge, and the whole knowledge can be learned by the concentrated learning.

Fig. 2 The data set S and its training result
Fig. 3 The data set S1 and its training result
Fig. 4 The data set S2 and its training result
2.3.1 Hebb learning

In DNNL the learning algorithm is based on Hebb's postulate, which states: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
The learning rule for a single neuron can be derived from an energy function defined as

$$E(w) = -\varphi\left(w^{T} x\right) + \frac{\beta}{2} \|w\|_2^2 \quad (13)$$

where $w$ is the synaptic weight vector (including a bias or threshold), $x$ is the input to the neuron, $\varphi(\cdot)$ is a differentiable function, and $\beta \ge 0$ is the forgetting factor. Also,

$$y = \frac{d\varphi(v)}{dv} = f(v) \quad (14)$$

is the output of the neuron, where $v = w^{T} x$ is the activity level of the neuron. Taking the steepest descent approach to derive the continuous-time learning rule

$$\frac{dw}{dt} = -\eta \nabla_w E(w) \quad (15)$$

where $\eta > 0$ is the learning rate parameter, we see that the gradient of the energy function in Eq. (13) must be computed with respect to the synaptic weight vector, that is, $\nabla_w E(w) = \partial E(w)/\partial w$. The gradient of Eq. (13) is

$$\nabla_w E(w) = -f(v)\frac{\partial v}{\partial w} + \beta w = -y x + \beta w \quad (16)$$

Therefore, by using the result in Eq. (16) along with that of Eq. (15), the continuous-time learning rule for a single neuron is

$$\frac{dw}{dt} = \eta \left[ y x - \beta w \right] \quad (17)$$

and the discrete-time learning rule (in vector form) is

$$w(t+1) = w(t) + \eta \left[ y(t+1) x(t+1) - \beta w(t) \right] \quad (18)$$
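As a minimal sketch, the discrete-time rule above for one neuron, with the learning rate and forgetting factor written out explicitly as parameters:

```c
#include <stddef.h>

/* One discrete-time Hebbian step with forgetting (Eq. 18):
   w <- w + eta * (y * x - beta * w), for an n-dimensional weight vector.
   eta is the learning rate and beta the forgetting factor of Eq. (13). */
void hebb_step(double *w, const double *x, double y,
               double eta, double beta, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        w[i] += eta * (y * x[i] - beta * w[i]);
}
```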
2.3.2 Competitive mechanism based on kernel function

To overcome the problem induced by the traffic splitter, an inverse distance kernel function is used in Hebb learning. The basic idea is that not only is the winner rewarded, but all the losers are also penalized, at different rates calculated by the inverse distance function, whose input is the dissimilarity between the sample data and the neuron.

The dissimilarity measure function is the Minkowski metric:

$$d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p} \quad (19)$$

where $x_i$, $y_i$ are the $i$th coordinates of $x$ and $y$, $i = 1, \ldots, l$, and $w_i \ge 0$ is the $i$th weight coefficient.

When the $j$th neuron is most similar to the sample, the learning rule of the $i$th neuron is

$$W_i(t+1) = W_i(t) + \alpha_i \left[ x(t+1) - W_i(t) \right] \quad (20)$$

where

$$\alpha_i = \begin{cases} 1 & \text{winner, } i = j \\ -K(d_i) & \text{others, } i = 1, \ldots, m \text{ and } i \ne j \end{cases} \quad (21)$$

and $K(d_i)$ is the inverse distance kernel,

$$K(d_i) = \frac{1}{1 + d_i^p} \quad (22)$$

If the winner's dissimilarity measure $d_j < \theta$ ($\theta$ is the threshold of dissimilarity), then update the synaptic weights by learning rule Eq. (20); else add a new neuron and set its synaptic weight $w = x$.
2.4 Post-prune algorithm

One of the central issues in network training is to find the optimal model $f(\cdot)$. Judging the efficiency of $f(\cdot)$ can be broken into two fundamental aspects: bias and variance. Bias measures the expected value of the estimator relative to the true value, and variance measures the variability of the estimator about the expected value. Since DNNL determines the network size by adding neurons incrementally, it may model noise data into $f(\cdot)$ and lead to high variance (the phenomenon of overfitting). To prevent overfitting, DNNL uses a post-prune method whose strategy is based on a distance threshold: if two weights are too similar they will be substituted by a new weight. The new weight is calculated as

$$W_{new} = (W_{old1} \cdot t_1 + W_{old2} \cdot t_2)/(t_1 + t_2) \quad (23)$$

where $t_1$ is the number of times $W_{old1}$ has been trained and $t_2$ is the number of times $W_{old2}$ has been trained.
The pruning process is illustrated in Fig. 5: after pruning, E and F are aggregated to EF, and A and B to AB. The prune algorithm is shown below:

Step 0: If the old weight set (oldW) is empty then the algorithm is over, else proceed;
Step 1: Calculate the distance between the first weight (fw) and the other weights;
Step 2: Find the weight (sw) which is most similar to fw;
Step 3: If the distance between sw and fw is larger than the pruning threshold, then delete fw from oldW, add fw into the new weight set (newW), and go to Step 0; else continue;
Step 4: Get fw's training times value (ft) and sw's training times value (st);
Step 5: Calculate the new weight (nw) and nw's training times value (nt): nw = (fw × ft + sw × st)/(ft + st), nt = ft + st;
Step 6: Delete fw and sw from oldW and add nw into newW; go to Step 0.

Fig. 5 The pruning process
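The count-weighted merge of Eq. (23), used in Step 5 above, can be sketched as:

```c
#include <stddef.h>

/* Count-weighted merge of two similar weights (Eq. 23 / Step 5):
   Wnew = (Wold1*t1 + Wold2*t2) / (t1 + t2), over n dimensions.
   t1 and t2 are the training counts of the two old weights. */
void merge_weights(const double *w1, unsigned t1,
                   const double *w2, unsigned t2,
                   double *wnew, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        wnew[i] = (w1[i] * t1 + w2[i] * t2) / (double)(t1 + t2);
}
```

Weighting by the training counts keeps the merged weight closer to the neuron that has absorbed more samples, so the pruning does not erase well-supported cluster centers.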
2.5 Learning algorithm of DNNL

The main learning process of DNNL is:

Step 0: Initialize the learning rate parameter $\eta$ and the threshold of dissimilarity $\theta$;
Step 1: Obtain the first input $x$ and set $w_0 = x$ as the initial weight;
Step 2: If training is not over, randomly take a feature vector $x$ from the feature sample set $X$ and compute the dissimilarity measure between $x$ and each synaptic weight using Eq. (19);
Step 3: Decide the winner neuron $j$ and test tolerance: if $d_j \ge \theta$, add a new neuron, set its synaptic weight $w = x$, and go to Step 2; else continue;
Step 4: Compute $\alpha_i$ by using the result of the inverse distance kernel $K(d_i)$;
Step 5: Update the synaptic weights as in Eq. (20); go to Step 2.
3 Data preprocessing

The KDD CUP 99 data was collected through a simulation on a U.S. military network by the 1998 DARPA Intrusion Detection Evaluation Program, aiming at obtaining a benchmark data set for the field of intrusion detection. The full data set contains training data consisting of 7 weeks of network-based intrusions inserted in normal data, and 2 weeks of network-based intrusions in normal data, for a total of 4,999,000 connection records described by 41 characteristics. The records are mainly divided into four types of attack: probe, denial of service (DoS), user-to-root (U2R) and remote-to-local (R2L).
3.1 Metric embedding

The set of features presented in the KDD Cup data set contains categorical and numerical features of different sources and scales. An essential step in handling such data is metric embedding, which transforms the data into a metric space. In this paper the categorical features are represented by $A$: each categorical feature $A_i$ expressing $g$ possible categorical values is defined as $A_i = \{A_i^1, A_i^2, \ldots, A_i^g\}$; the numerical features are represented by $B$; then the metric space $X$ can be defined as $X = \{A_1, \ldots, A_m, B_1, \ldots, B_{n-m}\}$. That means each sample is described by $n$ features.
3.2 Dissimilarity measure

For numerical features, the value $|x_i - y_i|$ of the Minkowski metric can be calculated directly after normalization. But for categorical features, we need to define a new calculation method. The Hamming distance is often used to quantify the extent to which two strings of the same dimension differ. An early application was in the theory of error-correcting codes, where the Hamming distance measured the error introduced by noise over a channel when a message, typically a sequence of bits, is sent between its source and destination. In DNNL the calculation of $|x_i - y_i|$ for categorical features is similar to the Hamming distance. If $x_i$ and $y_i$ are categorical features, $x_i$ is the feature of the sample data, $y_i$ is the corresponding feature of one training neuron $N$, and $x_i = A_i^k$ $(k \in 1, \ldots, g)$, then

$$|x_i - y_i| = 1 - \frac{c_k}{C} \quad (24)$$

where $c_k$ is the number which represents how many times $A_i^k$ has been learned by neuron $N$,

$$c_k = num\left(A_i^k\right) \quad (25)$$

and $C$ is the total number of all the categorical features that have been learned by neuron $N$,

$$C = \sum_{i=1}^{m} \sum_{j=1}^{g} num\left(A_i^j\right) \quad (26)$$

If the neuron $N$ is the winner of this training epoch, its value of $c_k$ will be increased by 1. Using this method to calculate the Minkowski metric of categorical features, if the value of one sample's categorical feature $A_i$ is $A_i^k$, then the neurons with a larger $num(A_i^k)$ are more similar to this sample regarding this categorical feature, that is, the value $|x_i - y_i|$ is smaller.
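A sketch of Eq. (24) for a single categorical feature; keeping one count per possible value per neuron is the bookkeeping described above, while collapsing the double sum of Eq. (26) to one feature is a simplification of this sketch:

```c
#include <stddef.h>

/* Dissimilarity of one categorical feature (Eq. 24):
   |x_i - y_i| = 1 - c_k / C, where counts[v] records how often value v has
   been learned by the neuron (Eq. 25) and C is the total count (Eq. 26,
   restricted here to a single feature with n_values possible values). */
double categorical_diff(const unsigned *counts, size_t n_values, size_t k)
{
    unsigned C = 0;
    for (size_t i = 0; i < n_values; ++i)
        C += counts[i];
    if (C == 0)
        return 1.0;   /* nothing learned yet: maximally dissimilar */
    return 1.0 - (double)counts[k] / (double)C;
}
```

A value the neuron has seen often yields a small dissimilarity, and a value it has never seen yields the maximum of 1, matching the Hamming-style behavior described above.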
4 Experiments

4.1 Benchmark test

In the KDD CUP 99 data set, a smaller data set consisting of 10% of the overall data set is generally used to evaluate algorithm performance. The smaller data set contains 22 kinds of intrusion behaviors and 494,019 records, among which 97,276 are normal connection records. The test set is another data set which contains 37 kinds of intrusion behaviors and 311,029 records, among which 60,593 are normal.
4.1.1 Performance measures

The recording format of test results is shown in Table 1. False alarms are partitioned into False Positives (FP, normal is detected as intrusion) and False Negatives (FN, intrusion is not detected). True detections are also partitioned into True Positives (TP, intrusion is detected correctly) and True Negatives (TN, normal is detected correctly).

Table 1 Recording format of test results

                      Detection results
Actual behaviors      Normal    Intrusion-1   ...   Intrusion-n
Normal                TN_00     FP_01         ...   FP_0n
Intrusion-1           FN_10     TP_11         ...   FP_1n
Intrusion-2           FN_20     FP_21         ...   FP_2n
...                   ...       ...           ...   ...
Intrusion-n           FN_n0     FP_n1         ...   TP_nn
Definition 1 The right detection rate of the $i$th behavior (TR) is

$$TR = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ij}} \quad (27)$$

where $T_{ii}$ is the value which lies in Table 1's $i$th row and $i$th column, and $R_{ij}$ is the value which lies in Table 1's $i$th row and $j$th column.

Definition 2 The right prediction rate of the $i$th behavior (PR) is

$$PR = \frac{T_{ii}}{\sum_{j=0}^{n} R_{ji}} \quad (28)$$

where $R_{ji}$ is the value which lies in Table 1's $j$th row and $i$th column.

Definition 3 The detection rate (DR) is computed as the ratio between the number of correctly detected intrusions and the total number of intrusions. If we regard Table 1's record as an $(n+1) \times (n+1)$ matrix $R$, then

$$DR = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} R_{ij}}{\sum_{i=1}^{n} \sum_{j=0}^{n} R_{ij}} \quad (29)$$

Definition 4 The false positive rate (FPR) is computed as the ratio between the number of normal behaviors that are incorrectly classified as intrusions and the total number of normal connections; according to Table 1's record,

$$FPR = \frac{\sum_{i=1}^{n} FP_{0i}}{\sum_{i=1}^{n} FP_{0i} + TN_{00}} \quad (30)$$
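The two summary metrics can be computed directly from the confusion matrix of Table 1; this sketch assumes the denominators follow the verbal definitions (all intrusion records for DR, all normal records for FPR):

```c
#include <stddef.h>

/* DR (Eq. 29) and FPR (Eq. 30) from the (n+1) x (n+1) matrix R of Table 1,
   stored row-major: row/column 0 is normal, rows/columns 1..n are the
   intrusion classes. */
double detection_rate(const double *R, size_t n)
{
    double detected = 0.0, total = 0.0;
    for (size_t i = 1; i <= n; ++i)        /* intrusion rows only */
        for (size_t j = 0; j <= n; ++j) {
            total += R[i * (n + 1) + j];
            if (j >= 1)                    /* flagged as some intrusion */
                detected += R[i * (n + 1) + j];
        }
    return detected / total;
}

double false_positive_rate(const double *R, size_t n)
{
    double fp = 0.0;
    for (size_t j = 1; j <= n; ++j)
        fp += R[j];              /* row 0 of R: FP_01 .. FP_0n */
    return fp / (fp + R[0]);     /* R[0] is TN_00 */
}
```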
4.1.2 Experiment results

To test the performance of DNNL, we first divide the 494,019 records into 50 slices. Each slice contains 10,000 records except the last, which contains 4,019 records. With learning rate $\eta = 0.1$ and threshold of dissimilarity $\theta = 1.5$, the learning results are shown in Figs. 6 and 7. In Fig. 6, the X-axis represents each slice and the Y-axis records the corresponding number of neurons after the neural networks are stable. In Fig. 7, the Y-axis records the corresponding number of behaviors included in each slice. From the results we can find that the distributions of neurons and behaviors are similar, which indicates the sensors have learned the knowledge. Since the behaviors recorded in the 18th–36th and 43rd–46th slices are all smurf intrusions, the number of behaviors is 1 and the number of neurons is 51. After the training on the distributed learning results, the knowledge is represented by 368 neurons.
There are 37 kinds of intrusion behaviors in the test set.
We first separate them into four kinds of attacks:
123
-
7/31/2019 Large-Scale Network Intrusion Detection Based on Distributed Learning Algorithm
8/12
32 D. Tian et al.
Fig. 6 The number of neurons for each corresponding slice
Fig. 7 The number of behaviors for each corresponding slice
Probe: {portsweep, mscan, saint, satan, ipsweep, nmap}
DOS: {udpstorm, smurf, pod, land, processtable, warezmaster, apache2, mailbomb, neptune, back, teardrop}
U2R: {httptunnel, ftp_write, sqlattack, xterm, multihop, buffer_overflow, perl, loadmodule, rootkit, ps}
R2L: {guess_passwd, phf, snmpguess, named, imap, snmpgetattack, xlock, sendmail, xsnoop, worm}
The test results are summarized in Table 2.
Comparing the result with that of the first winner of KDD CUP 99, we see that the TR of DNNL is almost equal to the first winner's. There are two reasons leading to the low TR of U2R and R2L: first, the number of attack instances pertaining to U2R and R2L is much smaller than that of the other types of attack; second, U2R and R2L are host-based attacks which exploit vulnerabilities of the operating systems, not of the network protocol. Therefore, these are very similar to the normal data. Table 3 shows the DR and FPR of the first and second winners of the KDD CUP 99 competition, other approaches [21] and DNNL. From the comparison we can find that DNNL provides superior performance.
4.2 Prototype system test

4.2.1 Test environment

In order to address the problem of intrusion detection analysis in high-speed networks, the data stream on the high-speed network link is divided into several smaller slices that are fed into a number of distributed neural networks. In order to evaluate the effectiveness of DNNL, we developed a prototype IDS using Libpcap. The test environment is shown in Fig. 8. We used 12 PCs with 100 Mbps Ethernet cards to serve as background traffic generators, which could generate more than 1,000 Mbps of TCP and UDP streams with an average packet size of 1,024 bytes. One IBM server which runs Web services is the attack object, and one attacker sends attack packets to the server. All 14 computers are connected to 100 Mbps ports on a Huawei Quidway S3526C switch. All the packets through these ports are mirrored to a defined mirror port and then distributed to the neural networks.
4.2.2 Packet capture

Every station on a LAN hears every packet transmission, so there is a destination field and a source field in each packet. The Ethernet card can be in promiscuous mode or normal mode. Under promiscuous mode, the card will receive and deliver every packet. Under normal mode, if the packet destination address is identical to the station address, the card will receive and pass the packet up to the software; if it is not, the card will just drop the packet (filter it). An IDS can run under the promiscuous mode of the Ethernet card to analyze every packet passing through the LAN. Libpcap is the library we use to grab packets from the network card directly. The main functions used are:

pcap_open_live() is used to obtain a packet capture descriptor to look at packets on the network.
pcap_lookupnet() is used to determine the network number and mask associated with the network device.
pcap_lookupdev() returns a pointer to a network device suitable for use with pcap_open_live() and pcap_lookupnet().
pcap_loop() is used to collect and process packets. Each captured packet is parsed to form the network behavior vector.
Our method of parsing is based on the layered structure of network software. In the TCP/IP Reference Model, the Internet layer defines an official packet format and protocol called IP (Internet Protocol); the layer above
Table 2 Testing results

                     Detection results
Actual behaviors     Normal   Probe   DoS      U2R   R2L    TR (%)
Normal               58120    927     649      64    833    96.0
Probe                357      3546    174      21    118    85.1
DoS                  256      5092    223518   52    435    97.2
U2R                  143      39      0        23    23     10.1
R2L                  14443    14      1        271   1460   9.0
PR (%)               79.3     36.9    99.6     5.3   50.9
Table 3 Comparison with other approaches

Algorithm                  DR (%)   FPR (%)
Winning entry              91.9     0.5
Second place               91.5     0.6
Best linear GP - FP rate   89.4     0.7
Best GEdIDS - FP rate      91       0.4
DNNL                       93.9     0.4
Fig. 8 Test environment
the Internet layer is the transport layer, where two end-to-end
protocols are defined. The first, TCP (Transmission
Control Protocol), is a reliable connection-oriented protocol
that allows a byte stream originating on one machine to be
delivered without error to any other machine in the Internet.
The second protocol in this layer, UDP (User Datagram
Protocol), is an unreliable, connectionless protocol. In the
experiments we use the packet headers to define the data
structure of network behavior:
typedef struct _EthernetBehavior
{
    u_int8_t ethernet_dest[6];  /* destination Ethernet address (6-byte MAC) */
    u_int8_t ethernet_sour[6];  /* source Ethernet address (6-byte MAC) */
    u_int16_t ethernet_type;    /* packet type ID field */
} EthernetBehavior;
The ethernet_type field identifies the nested protocol
header, which may be IP, ARP, or some other protocol. For
instance, an IP header can be defined as:
typedef struct _IPBehavior
{
    unsigned int header_len:4;       /* header length, in 32-bit words */
    unsigned int version:4;          /* version of the protocol */
    u_int8_t tos;                    /* type of service */
    u_short total_len;               /* total length of datagram */
    u_short identification;         /* identification */
    u_int16_t flag_off;              /* flags and fragment offset */
    u_int8_t time_live;              /* time to live: limit on packet lifetime */
    u_int8_t protocol;               /* upper-layer protocol (TCP, UDP, ...) */
    u_int16_t checksum;              /* header checksum */
    struct in_addr source_addr;      /* source address */
    struct in_addr destination_addr; /* destination address */
} IPBehavior;
The protocol field tells which protocol is used in the layer
above; it can be TCP, UDP, ICMP, etc. For example, the
TCP header is defined as:
typedef struct _TCPBehavior
{
    u_int16_t sour_port;    /* source port */
    u_int16_t dest_port;    /* destination port */
    tcp_seq seq_num;        /* sequence number */
    tcp_seq ack_num;        /* acknowledgement number */
    u_int16_t flag;         /* data offset, reserved bits, and flags */
    u_int16_t win_size;     /* window size */
    u_int16_t check_sum;    /* checksum */
    u_int16_t urg_pointer;  /* urgent pointer */
} TCPBehavior;
Fig. 9 Test result
During training, the IDS uses the behavior variables of the
normal packets to form the binary behavior matrix. At
detection time, if an intrusion is detected, the IDS raises
an alarm and displays detailed information about the
intruder, parsed from the behavior variables.
4.2.3 Experiment result
In this experiment, we evaluate our proposed method. We
first train the neural network with different normal features,
then use the stabilized neural network to monitor the system
while some abnormal behaviors take place under the
same environment. A series of experiments is conducted to
analyze the effect of varying the intrusion threshold on
system errors. The test results are shown graphically
in Fig. 9.
We find that the performance of the IDS is sensitive
to the intrusion threshold. As the threshold value
increases, false positive errors increase while false negative
errors decrease. Since a false negative error is more costly
in an IDS, we concentrate on the decrease of false negative
errors as the threshold value changes.
The optimal threshold value lies between 1.5 and 1.6.
5 Conclusions
The bandwidth of networks increases faster than the speed
of processors, so it is impossible to keep up with network
speeds merely by increasing the processor speed of an NIDS.
To resolve this problem, this paper presents DNNL, which
can be used in anomaly detection methods. Completeness
analysis shows that DNNL's learning algorithm is equivalent
to training a single neural network that adds a penalty term
to the error function to control the bias and variance of
the network. The main contributions of this approach are:
reducing the complexity of load balancing while still
maintaining the completeness of the network behavior, putting
forward a dissimilarity measure for categorical and
numerical features, and increasing the speed of the whole
system. In the experiments, the KDD data set, the most
common data set in IDS research papers, is used. Training
with one neural network takes 67 h, whereas DNNL
takes less than 1 h. Comparisons with other approaches
on the same benchmark show that DNNL's false alarm rate
is very low.
Acknowledgments This research is supported by both the National
Natural Science Foundation of China under Grant No. 60573128 and
the National Research Foundation for the Doctoral Program of Higher
Education of China under Grant No. 20060183043.