Spam filtering for network traffic security on a multi-core environment



CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2009; 21:1307–1320
Published online 24 April 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1435

Spam filtering for network traffic security on a multi-core environment

Rafiqul Islam1,∗,†, Wanlei Zhou1, Yang Xiang2 and Abdun Naser Mahmood3

1School of Engineering and Information Technology, Deakin University, Melbourne, Australia
2School of Management and Information Systems, Centre for Intelligent and Networked Systems, Central Queensland University, Australia
3Computer Science and Information Technology, RMIT University, Melbourne, Australia

SUMMARY

This paper presents an innovative fusion-based multi-classifier e-mail classification on a ubiquitous multi-core architecture. Many previous approaches used text-based single classifiers to identify spam messages from a large e-mail corpus with some amount of false positive tradeoffs. Researchers are trying to prevent false positives in their filtering methods, but so far none of the current research has claimed zero false positive results. In e-mail classification, a false positive can potentially cause serious problems for the user. In this paper, we use a fusion-based multi-classifier classification technique in a multi-core framework. By running each classifier process in parallel within its dedicated core, we greatly improve the performance of our multi-classifier-based filtering system in terms of running time, false positive rate and filtering accuracy. Our proposed architecture also safeguards the user mailbox from different malicious attacks. Our experimental results show that we achieved an average of 30% speedup at an average cost of 1.4 ms. We also reduced the instances of false positives, which are one of the key challenges in a spam filtering system, and increased e-mail classification accuracy substantially compared with single classification techniques. Copyright © 2009 John Wiley & Sons, Ltd.

Received 26 November 2008; Revised 8 February 2009; Accepted 12 February 2009

KEY WORDS: ubiquitous multi-core framework; multi-core; text classifier; multiple classifiers; spam filters; classifier; spam

∗Correspondence to: Rafiqul Islam, School of Engineering and Information Technology, Deakin University, Melbourne, Australia.

†E-mail: [email protected]

Copyright © 2009 John Wiley & Sons, Ltd.


1. INTRODUCTION

The problem of unsolicited bulk e-mails, more commonly known as spam, has been around since e-mail was first used by the general public. In 2005, 80% of the total e-mail volume on the Internet was considered spam [1]. The cost of managing spam is disproportionate to the cost of sending it: while the cost of sending spam is negligible, the cost to corporations in terms of network resources, delayed e-mails and employee productivity needs to be addressed. It is estimated that an average Internet user spends 10 days a year dealing with spam [2]. There have been many proposals for dealing with spam, from legislation to recent advances in machine learning content classification [2].

While previous research in spam classification is primarily concerned with using a text-based single classifier to detect spam messages, we have developed a novel spam filter architecture using a multi-classifier approach [3–5].

The use of multi-classifier systems provides a very high spam detection rate, but comes with a high processing cost for each of the serially executed classifiers. For multi-classifier systems to be more efficient, classification needs to be done in parallel. To improve its efficiency, we can run our multi-classifier classification (MCC) spam filtering system on either a single-core cluster-based computing system or a multi-core system.

Cluster-based single-core systems are usually built from computers that are coupled together to form a single virtual machine. By using such a cluster of computers to execute our multi-classifier algorithms, we are able to perform parallel operations and achieve improved speed and performance. Multi-core CPUs, which combine two or more independent microprocessor cores into a single chip [6], were released in the early 2000s but have become more affordable to the general public since 2005.

In this paper we build upon our previous work [7,8] by combining our fusion-based multi-classifier architecture with our ubiquitous multi-core framework (MuM), in order to provide high performance while at the same time improving the accuracy of spam detection. Based on our previous work on security and multimedia, we have found that our ubiquitous multi-core framework can be applied to most areas of computer science (as long as the system is multi-core).

Some of the benefits of MuM are as follows. First, it is cheaper to run and maintain than many high-end single-core clustered computers. This is because parallel processing of data occurs within one CPU, keeping communication overheads much lower as the signal has to travel a shorter distance [9]. Second, we have found that its implementation complexity is much lower than that of a cluster of different machines. MuM also has weaknesses. First, cluster-based computing provides better redundancy than MuM, although running a redundant multi-core machine would still be cheaper than running multiple high-end cluster-based computers. Second, if MuM goes off-line, then the spam filter is completely off-line as well. However, distributing MuM over multiple multi-core machines can provide a backup that overcomes this problem.

The remainder of this paper is organized as follows: Section 2 provides a review of multi-core and multi-classifier spam filtering architectures. Section 3 details our proposed multi-core fusion-based multi-classifier spam filter architecture. Section 4 presents the results of our proposed architecture, followed by the conclusion.


2. RELATED WORK

2.1. Multi-core

The ability to manufacture faster single-core systems has reached a threshold due to the heat dissipation caused by placing too many transistors on a single chip. To overcome this problem, hardware manufacturers have developed multi-core CPU architectures [10,11]. Multi-core or chip-level microprocessor systems combine two or more independent microprocessor cores into a single chip [6]. Each microprocessor core has its own independent cache memory (L1) and shares a common cache (L2) with other cores and peripheral devices. The next version of AMD multi-core processors will come with their own independent L1 and L2 caches.

In terms of software, a multi-core architecture provides improved multitasking performance by concurrently executing software code on separate cores. Thus, multi-core architectures are ideal for parallel computing, where process threads can be run in parallel on different cores. Unfortunately, most existing software does not take advantage of parallel processing. Researchers are currently revisiting parallel programming for use in a multi-core environment [7,8].

2.2. Bodyguard framework

We have developed a multi-core defence framework called bodyguard [7]. From this framework, we developed the Farmer bodyguard system. The basic strategy of the bodyguard framework is to separate the security processes from all other processes (e-mail, browser, etc.), and assign them to a set of cores. The remaining cores within the system are assigned to the applications that require security. The bodyguard framework is made up of a forward bodyguard (FB) and a side bodyguard (SB). For example, in our Farmer bodyguard system, the SB is responsible for providing a fast decision on whether to filter out any attack traffic. Upon detecting an attack, FB will then move in front of the application in order to protect it and initiate a filtering procedure. There are many advantages that come with the use of the bodyguard framework, such as efficient use of resources, performance increases and real-time detection and filtering.

2.3. Multi-classifier classification of spam e-mail

Automated spam classification has traditionally been done using a single classifier technique. Classification algorithms such as Support Vector Machine (SVM) [12], Naïve Bayesian (NB) [13] and Boosting [14] are used in content-based spam filtering. While single classifier techniques are fairly accurate, they require a lot of data for training. We performed comprehensive analyses of these classifiers in [5] and found that different classification algorithms return different results. The accuracy of the classification is reduced if the classifiers are used to classify generic content. In order to improve the classification accuracy, multi-classifier techniques were proposed [15–17].

The multi-classifier technique for spam detection uses an ensemble of classifiers to classify e-mail content. These classifiers are arranged in a two-level hierarchy, with one classifier overseeing the results provided by two or more classifiers below it. These lower level classifiers are usually weakly trained, so as to reduce processing time. The lower level classifiers provide a score for a message. The top level classifier aggregates the results from the lower level classifiers and provides the final decision on whether an e-mail message is spam or legitimate.

One of the first multi-classifier techniques for spam detection used a combination of NB and k-Nearest Neighbour (k-NN) classifier techniques in a stacking framework overseen by a memory-based classifier [15]. Several other ensemble techniques for spam detection include NB bagging [16] and semi-supervised labelled messages [17].

Current research has shown that aggregating scores from multiple classifiers improves the accuracy of classifying an e-mail message. However, current spam filter systems use only one or two similar types of classifiers in their architecture, instead of a diverse range of classifiers. Since different classifiers provide different results, the reliance on similar types of classifiers limits the classification accuracy.

In this paper, we are interested in using multi-classifier spam filters on a multi-core system to further improve the accuracy of detecting spam messages. There is currently no research on using multi-core systems for improving the accuracy of spam detection. In Section 3, we present an innovative multi-core framework for implementing the MCC architecture to detect spam e-mails.

3. MULTI-CLASSIFIER UBIQUITOUS MULTI-CORE (MUM)

We have developed a generic fusion-based MCC spam filter architecture that eliminates misidentification of legitimate e-mails as spam (false positive) during spam detection. The spam filter architecture was originally designed and developed for a single SVM classifier as detailed in [3,4]. We extended this architecture to be used for an ensemble-based generic multi-classifier that processes the information serially [5]. In this section, we propose a modified version of our spam classification architecture so that it can be executed in a multi-core environment, using different machine learning classification algorithms.

3.1. Design of MCC filter

Figure 1 provides a description of our proposed e-mail classification using the MCC technique.

Our architecture should be used in a two-stage approach, at the mail server and at the recipient’s mailbox. The e-mail server automates the e-mail classification process, while the user is given the option to manually identify messages that the classifiers do not collectively place in either the legitimate or the spam category. E-mail messages that cannot be identified as either legitimate (TP) or spam (TN) are termed grey list (GL) messages. The characteristics of e-mail messages that have been successfully identified as either legitimate or spam are used to retrain the multi-classifiers in order to reduce the requirement for the user to manually identify spam messages.

Before our architecture can be used to classify e-mail messages, all of the classifiers need to be trained to recognize the attributes to be classified, which may be Boolean, frequency or N-gram attributes. The classifiers can be used to check for spam as well as non-spam attributes in an e-mail message. This training is done using attributes extracted from training (Tr) data. The training process is an off-line process that is done when the classifiers are idle.

In the first stage, the e-mail server receives all incoming e-mails for the organization. The server indexes the e-mail corpora. All of the incoming indexed e-mails (Ts) are sent to the adaptive section. The main function of the adaptive section is to allocate the e-mail messages to the classifiers and collect all of the results from the classifiers. Each e-mail classification result is given a value of 1 for true positive (legitimate) or 0 for true negative (spam).

Figure 1. Multi-classifier classification (MCC) spam filter architecture.

These results are forwarded to the decision fusion component to calculate the final result for an e-mail message. If the decision fusion component receives the same result from all classifiers for a particular e-mail message, the message can be classified as either legitimate (TP) or spam (TN). If the total result is neither 0 nor 1, then the e-mail is considered a GL e-mail. This process can be represented as follows:

$$f(W_{M_e}) = \frac{\sum_{i=1}^{N} C_i M_e}{N} \qquad (1)$$

The results of the $N$ classifiers ($C_i$) for message ($M_e$) return $f(W_{M_e}) = 1$ for true positive (legitimate), $f(W_{M_e}) = 0$ for true negative (spam) and $0 < f(W_{M_e}) < 1$ for grey list.

All of these messages, legitimate, spam and grey, will be forwarded to the user’s mailbox.
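The averaging rule in formula (1) and its three outcomes can be illustrated with a minimal Python sketch. The function name `fuse_decisions` and the label strings are hypothetical choices made for this illustration; the paper defers its actual decision fusion algorithm to [3].

```python
from typing import Sequence, Tuple


def fuse_decisions(votes: Sequence[int]) -> Tuple[float, str]:
    """Average binary classifier outputs (1 = legitimate, 0 = spam).

    Mirrors formula (1): the score is the sum of the N classifier
    results divided by N. The function name and labels are illustrative.
    """
    if not votes:
        raise ValueError("at least one classifier vote is required")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be 0 (spam) or 1 (legitimate)")

    score = sum(votes) / len(votes)
    if score == 1:
        label = "legitimate"   # all classifiers agree: TP
    elif score == 0:
        label = "spam"         # all classifiers agree: TN
    else:
        label = "grey"         # classifiers disagree: GL
    return score, label
```

With three classifiers, `fuse_decisions([1, 1, 1])` yields the legitimate label, `fuse_decisions([0, 0, 0])` yields spam, and any mixed vote such as `[1, 0, 1]` falls into the grey list.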

The detailed functions and algorithms for the message transformation, adaptive section, feature extraction/feature selection (FE/FS) and decision fusion mechanisms are not shown in this paper as they can be found in [3].

Our approach of using three categories for e-mail messages, legitimate (white list), spam (black list) and unidentified (grey list), provides greater accuracy in separating spam messages and legitimate messages from the vast bulk of e-mails received by the e-mail server. Once the messages have been categorized, they are sent to the user (stage 2).

The second stage occurs once the e-mail messages are received by the user. These messages are presented in their respective mailboxes. The true positive (legitimate) e-mails are sent to the user’s inbox (TP), while the true negative (spam) messages are sent to the spam mailbox (TN). The unidentified messages (GL) are analysed by the user, who categorizes them as legitimate or spam. Such an approach is beneficial since categorizing e-mail messages is subjective. Some users might consider a message to be spam while other users might consider the same message legitimate.

Those GL messages that are legitimate are sent to the user’s inbox, while the spam messages go to the spam mailbox. In order to automate the process of detecting spam messages, our architecture has a dynamic feature selection component. This component (FE/FS) extracts the relevant features from the legitimate and spam e-mail messages and sends this data to the mail server in order to train the classifiers. This training data (Tr) is updated every time the user identifies a grey list e-mail. This ensures that messages are identified as legitimate or spam according to the personal preference of the user.

In our approach, the use of grey lists provides the user with fine-grained control over classifying messages as legitimate or spam. We also update the training data on the mail server using dynamic feature selection, so that most of the e-mail messages will be identified correctly before reaching the user. These two stages ensure that our MCC architecture is scalable and can eliminate the false positive problem.

3.2. Design of multi-classifier classification filter with ubiquitous multi-core framework(MuM)

In order to improve the classification of e-mail messages, we build upon the previous work with multi-core systems by developing a fusion-based ubiquitous multi-classifier architecture (FUMA). FUMA uses a fully trained data set to generate results and supports any machine learning classifier technique such as SVM, NB, k-NN and Boosting.

The execution of all the classifiers is done in parallel in order to reduce the time taken to classify a message. On a single-core CPU, executing multiple classifiers at the same time will use most, if not all, of the available CPU power. We believe that by applying the ubiquitous multi-core framework we can greatly improve the performance and resource usage of our multi-classifier architecture.

While most multi-core research looks at improving the communication between cores and application, through our development of a ubiquitous multi-core framework we reduce the CPU computation of n full-featured classifiers in order to correctly separate legitimate e-mail messages from spam messages.

Our proposed multi-classifier classification filter with ubiquitous multi-core (MuM) framework used by the classifiers is shown in Figure 2. Each of the classifiers (Ci) in the spam filter system runs on its own independent core. The same e-mail input is sent by the adaptive section to each of the classifiers. The classifiers run in parallel, thus improving the speed of analysing the e-mails. Each classifier process (Cn) executes its sub-processes (Pm) in parallel. Once the classifiers have completed their analysis, they send the results back to the adaptive section as described in Section 3.1.
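The dispatch pattern described above, in which the adaptive section sends the same message to every classifier and collects the results, can be sketched in Python as follows. This is only an illustration under stated assumptions: the three keyword rules are hypothetical stand-ins for trained classifiers such as SVM, AdaBoost and NB, and a thread pool is used so that the sketch runs anywhere. Passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would instead run each classifier in its own process, which the operating system can schedule on separate cores; pinning a process to a specific core (affinity) is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Classifier = Callable[[str], int]  # returns 1 = legitimate, 0 = spam


# Hypothetical stand-ins for trained classifiers (not the paper's models).
def keyword_free(message: str) -> int:
    return 0 if "free" in message.lower() else 1


def keyword_prize(message: str) -> int:
    return 0 if "prize" in message.lower() else 1


def all_caps(message: str) -> int:
    return 0 if message.isupper() else 1


def classify_in_parallel(
    message: str,
    classifiers: Sequence[Classifier],
    executor_cls=ThreadPoolExecutor,
) -> List[int]:
    """Send the same message to every classifier and collect the votes."""
    with executor_cls(max_workers=len(classifiers)) as pool:
        futures = [pool.submit(clf, message) for clf in classifiers]
        # Results are gathered in submission order, one vote per classifier.
        return [future.result() for future in futures]


CLASSIFIERS = [keyword_free, keyword_prize, all_caps]
```

Calling `classify_in_parallel("Win a FREE prize now", CLASSIFIERS)` returns one vote per classifier; the votes would then go to the decision fusion step described in Section 3.1.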


Figure 2. Multi-classifier classification filter with ubiquitous multi-core (MuM).

3.3. Benefits and weaknesses of MuM

Using the ubiquitous multi-core framework has the following benefits for MuM. First, partitioning each classifier and its tasks onto a separate core reduces the computation burden on the overall mail server system.

Second, the memory storage requirement for e-mail messages (Ts) is reduced. As the same e-mail messages (Ts) are sent to all classifiers, the system buffer stores each e-mail message only once.

Third, in terms of processing time, all of the classifiers process the e-mail messages in parallel. Unlike ensemble-based multi-core architectures, our approach allows a classifier to process an e-mail message independently of the other classifiers. This allows faster processing of messages compared with other architectures.

Fourth, the multi-classifiers are trained using (Tr) data when the system is idle. As each core is independent, the training can be done at different times, since some cores will complete their classification tasks faster than others. This means that the mail server resource usage is optimized.

Lastly, MuM is robust, as the adaptive section can still provide accurate e-mail classification if one of the cores fails. The adaptive section can choose either to reduce the number of classifiers or to redistribute the classifiers (and their sub-tasks) to another core. Although this is a non-optimum solution, it is a robust solution in the event that one of the cores fails during operation.
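The first fallback mentioned above, reducing the number of classifiers when a core fails, might look like the following Python sketch. How the paper's adaptive section actually detects and handles a failed core is not specified, so the exception-based failure signal and the names `collect_votes`, `fuse`, `healthy` and `failing` are assumptions made purely for illustration.

```python
from typing import Callable, List, Optional, Sequence

Classifier = Callable[[str], int]  # returns 1 = legitimate, 0 = spam


def collect_votes(message: str, classifiers: Sequence[Classifier]) -> List[int]:
    """Gather votes, skipping any classifier whose core is unavailable.

    A raised exception stands in for a failed core (an assumption).
    """
    votes: List[int] = []
    for classify in classifiers:
        try:
            votes.append(classify(message))
        except Exception:
            # Reduce the ensemble: continue with the remaining classifiers.
            continue
    return votes


def fuse(votes: Sequence[int]) -> Optional[float]:
    """Average the surviving votes; None if no classifier responded."""
    if not votes:
        return None
    return sum(votes) / len(votes)


# Hypothetical classifiers used only to exercise the sketch.
def healthy(message: str) -> int:
    return 1


def failing(message: str) -> int:
    raise RuntimeError("core unavailable")
```

Averaging over the surviving votes keeps the score in the same 0 to 1 range as formula (1), so the legitimate, spam and grey list thresholds do not need to change when the ensemble shrinks.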

4. RESULTS

We evaluate the performance and accuracy of our MuM by simulating our fusion-based MCC architecture on a 4-core system. We dedicated 3 cores to implementing 3 different classifiers and used the fourth core to implement the other components of our multi-classifier architecture, such as initial transformation, adaptive section and decision fusion (Figure 1). This method ensures that the performance of the classifiers is not affected by the adaptive section and decision fusion components. The performance of MuM is described in detail below.

4.1. Multi-core performance benchmark

Once we have the execution time $t_s$, computation time $t_{comp}$ and communication time $t_{com}$, we can establish the speedup factor (formula 2) and the computation/communication (C/C) ratio (formula 3) from a single core to a multi-core system as

$$\text{speedup factor} = \frac{t_s}{t_{cp}} = \frac{t_s}{t_{comp} + t_{com}} \qquad (2)$$

where $t_s$ is the execution time on a single-core processor and $t_{cp}$ is the execution time on the multi-core system, which includes computation time and communication time, and

$$\text{C/C ratio} = \frac{t_{comp}}{t_{com}} \qquad (3)$$
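As a worked illustration of formulas (2) and (3), the short Python sketch below computes both quantities. The function names are hypothetical. The computation and communication times in the usage note are the Core 1 entries of Table I, while the single-core time of 2.1 ms is an invented value used only to exercise the formula; it is not a measurement from the paper.

```python
def speedup_factor(t_s: float, t_comp: float, t_com: float) -> float:
    """Formula (2): single-core time over multi-core time (t_comp + t_com)."""
    return t_s / (t_comp + t_com)


def cc_ratio(t_comp: float, t_com: float) -> float:
    """Formula (3): computation time divided by communication time."""
    return t_comp / t_com
```

For example, `cc_ratio(0.4, 1.0)` gives 0.4, matching the C/C entry for Core 1 in Table I, and `speedup_factor(2.1, 0.4, 1.0)` gives approximately 1.5 for the invented single-core time.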

Apart from the speedup and computation/communication ratios, we also evaluate the multi-classifier algorithm through its time complexity, or ‘big-oh’, also referred to as ‘order of magnitude’ [18]:

$$f(x) = O(g(x)), \qquad 0 \le f(x) \le c\,g(x) \ \text{ for all } x \ge 0 \qquad (4)$$

where $f(x)$ and $g(x)$ are functions of $x$, and a positive constant $c$ must exist for which the inequality holds for all such $x$. To evaluate time complexity, we use the total sum of computation and communication (formula 2):

$$\text{Time complexity} = T_P = t_{comp} + t_{com} = \left(\frac{n}{cp} + 1\right) + \left(2\,t_{startup} + \left(\frac{n}{cp} + 1\right) t_{msgdata}\right) \qquad (5)$$

Table I. Results of multi-core speedup and the costs.

| | Core 1 | Core 2 | Core 3 | Core 4 |
|---|---|---|---|---|
| Exe time (ms) | 1.4 | 1.5 | 1.4 | 1.4 |
| Comp time (ms) | 0.4 | 0.10 | 0.4 | 0.09 |
| Comm time (ms) | 1 | 0.5 | 1 | 0.5 |
| Speed ratio (%) | 50 | 20 | 50 | 20 |
| C/C | 0.4 | 2.0 | 0.4 | 1.8 |
| Time complex | 3.0 | 3.0 | 3.0 | 3.0 |
| Cost | 1.4 | 1.5 | 1.4 | 1.4 |
| Cost-optimal | 2.0 | 2.0 | 2.0 | 2.0 |


Figure 3. Min(90%)–Max(100%) CPU usage that was achieved during the time of simulation.

Table II. Comparison of precision-recall of individual cores with combined cores.

| Data | Condition variable | Core 1 | Core 2 | Core 3 | Combined-cores |
|---|---|---|---|---|---|
| Data 1 | −1 | −0.5555556 | −0.7777778 | −0.5555556 | −0.7777778 |
| | +1 | 1.000 | 0.96875 | 1.000 | 1.000 |
| Data 2 | −1 | −0.4285714 | −0.7142857 | −0.7142857 | −0.9285714 |
| | +1 | 0.7777778 | 1.000 | 0.6666667 | 1.000 |
| Data 3 | −1 | −0.4545455 | −0.6363636 | −0.2727273 | −0.6363636 |
| | +1 | 0.8571429 | 1.000 | 0.7142857 | 1.000 |
| Data 4 | −1 | −0.5555556 | −0.5555556 | −0.5555556 | −0.9444444 |
| | +1 | 1.000 | 0.875 | 1.000 | 1.000 |
| Data 5 | −1 | −0.5384616 | −0.2307692 | −0.3846154 | −0.6923077 |
| | +1 | 0.8333333 | 1.000 | 0.8333333 | 1.000 |
| Data 6 | −1 | −0.6000 | −0.200 | −0.400 | −0.800 |
| | +1 | 0.8666667 | 1.000 | 0.7333333 | 1.000 |

In formula (5), $n$ is the number of threads on each core processor. The last benchmarks we use are the cost and the cost-optimal value:

$$\text{Cost} = (\text{execution time}) \times (\text{total number of processors used})$$

$$\text{Cost-optimal} = (\text{time complexity}) \times (\text{number of processors}) = (n \log n)$$
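To make formula (5) and the cost definition concrete, here is a small Python sketch. It assumes that the fraction in formula (5) reads $n/cp$ followed by $+1$, which is one reading of the printed layout. The function names are hypothetical, and the values used in the usage note (`cp = 2` and zero communication parameters) are assumptions chosen because they reproduce the figures quoted in Section 4.2 (4 computational steps giving 3 data items, 5 steps giving 3.5); the paper does not state them.

```python
def time_complexity(n: float, cp: float, t_startup: float, t_msgdata: float) -> float:
    """Formula (5), read as T_P = (n/cp + 1) + (2*t_startup + (n/cp + 1)*t_msgdata)."""
    t_comp = n / cp + 1
    t_com = 2 * t_startup + (n / cp + 1) * t_msgdata
    return t_comp + t_com


def cost(execution_time: float, processors: int) -> float:
    """Cost = execution time multiplied by the number of processors used."""
    return execution_time * processors
```

Setting the communication parameters to zero isolates the computation term: `time_complexity(4, 2, 0, 0)` gives 3.0 and `time_complexity(5, 2, 0, 0)` gives 3.5. For a single core with the 1.4 ms execution time listed in Table I, `cost(1.4, 1)` gives 1.4.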

4.2. Multi-core system evaluation

To measure and evaluate the performance of MuM, we wrote 4 simple programs to simulate the multi-classifier applications. We assigned them to 4 cores within our multi-core system by using affinity methods [7,8]. The 4 programs simulate the multi-classifier functions only to demonstrate our framework; 3 actual multi-classifier programs are planned for the future.

Based on our evaluations, displayed in Table I and Figure 3, we see that an average speedup of 30% was achieved at an average cost of 1.4 ms. This was achieved by separating the applications and allowing each to run on its own core. The time complexity results also show that the efficiency of our algorithm is 3.0. This means that for 4 computational steps (estimate) we achieved 3 data items. Hence, the more computations that are done, the more data items we complete. For example, 5 computational steps will give us 3.5 data items. The computation/communication ratio was lower than the time complexity, which means that speedup or efficiency will not improve beyond the figures we already have. Lastly, we see that the cost of running our program was below the cost-optimal value while achieving an average CPU usage of 95% (see Figure 3). This means that our model/program was quite cost efficient to run and the resource usage (CPU) was almost fully optimized. In addition, because the time complexity is higher than the computation/communication ratio, it would not be worthwhile to raise our costs to reach the cost-optimal value, as we would gain no performance benefit.

Figure 4. Precision-recall curve for six data sets (panels Data_1 to Data_6; series Core_1, Core_2, Core_3 and Combined; axes labelled ‘% of value’ and ranging from −1.0 to 1.0).

Table III. Average ROC of classification algorithms.

| Data | ROC estimation | Core 1 | Core 2 | Core 3 | Combined cores |
|---|---|---|---|---|---|
| Data (1–6) | AUC | 0.852818 | 0.87327 | 0.826265 | 0.949143 |
| | AUC’s St Err | 0.037893 | 0.033725 | 0.039185 | 0.0228333 |
| | 95% of CI | 0.75851–0.911685 | 0.7847066–0.923985 | 0.73109–0.8886267 | 0.86701167–0.9777817 |
| | P-value | (<0.001) | (<0.001) | (<0.001) | (<0.001) |

4.3. Performance of MCC

We implemented 3 text-based classifier algorithms to compare their accuracy on a single-core system with that on our multi-core system. We executed the SVM classifier on core-1, the AdaBoost classifier on core-2 and the Naive Bayesian classifier on core-3. Each of these classifiers was implemented in its own process thread to simulate a multi-core system. We ran each type of classifier through 6 e-mail data sets from the public data set PUA [13]. We divided the data into six different parts based on our experimental set-up to test the accuracy of the classifiers in correctly detecting spam messages. Table II shows the precision (false positive) and recall (false negative) in classifying e-mail messages. In all the results, our proposed multi-classifier system did not return any false positives. We strongly state the case that the use of a fusion-based multi-classifier will eliminate the false positive problem. Our simulations also clearly show that a multi-classifier returns very few false negative results compared with just using single classifier algorithms.

Figure 4 shows the comparison of the precision-recall curves of six data sets for the output from individual cores along with the combined output from three cores. It can be seen from Figure 4 that precision is always better for the combined result than for the individual results, and it is 100% for all data sets, which is promising. It is clear that the MCC approach reduces the number of false positive instances to zero.

The average receiver operating characteristic (ROC) report of our multi-classifier approach is shown in Table III. From the results of our experiments with 6 data sets, the accuracy of our fusion-based multi-classifier system is 10–13% higher than any single classifier algorithm.

Figure 5 shows the ROC curve for sensitivity and specificity analysis of the classification algorithm accuracy in separating spam from legitimate e-mail messages. The figures show the ROC curve for all 6 data sets.
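The experimental set-up described in this section can be illustrated with a minimal sketch: three classifiers run concurrently, their votes are fused, and precision and recall are computed for the combined output. Everything in it is an assumption made for illustration. The three keyword rules are hypothetical stand-ins for the SVM, AdaBoost and Naive Bayesian classifiers, the unanimous-vote fusion rule is one simple choice and not necessarily the fusion used in our system, and the four messages are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three classifiers; each returns True for "spam".
def clf_a(msg):  # stand-in for the SVM classifier
    return "free" in msg.lower()

def clf_b(msg):  # stand-in for the AdaBoost classifier
    return "winner" in msg.lower() or "free" in msg.lower()

def clf_c(msg):  # stand-in for the Naive Bayesian classifier
    return msg.count("!") >= 2

CLASSIFIERS = [clf_a, clf_b, clf_c]

def classify_all(clf, messages):
    """Apply one classifier to every message."""
    return [clf(m) for m in messages]

def fuse_unanimous(votes_per_classifier):
    """Assumed fusion rule: flag as spam only if every classifier agrees."""
    return [all(votes) for votes in zip(*votes_per_classifier)]

def run_mcc(messages):
    """Run each classifier in its own worker thread, then fuse the votes."""
    with ThreadPoolExecutor(max_workers=len(CLASSIFIERS)) as pool:
        futures = [pool.submit(classify_all, clf, messages) for clf in CLASSIFIERS]
        votes = [f.result() for f in futures]
    return votes, fuse_unanimous(votes)

def precision_recall(predicted, actual):
    """Precision and recall for the spam class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Invented toy data: True means the message really is spam.
messages = [
    "FREE prize!! You are a winner!",
    "Free lunch at the seminar today",
    "Meeting moved to 3pm",
    "Winner!! Claim your free gift!!",
]
labels = [True, False, False, True]

votes, fused = run_mcc(messages)
print("single classifier:", precision_recall(votes[0], labels))
print("fused output:     ", precision_recall(fused, labels))
```

On this toy data the first stand-in classifier alone flags a legitimate message, while the unanimous rule does not. In general a unanimous rule favours precision at the expense of recall. Note also that threads in CPython do not execute CPU-bound code on separate cores; `ProcessPoolExecutor` from the same module would be the closer analogue of one classifier per core.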

Figure 5. The sensitivity and specificity curve for six data sets.
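As a reminder of how the points on a sensitivity and specificity curve of this kind are obtained, the sketch below sweeps a decision threshold over classifier scores and records one point per threshold. The scores and labels are hypothetical and are not taken from our data sets; the helper names `roc_points` and `auc` are likewise illustrative.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    `labels` uses 1 for spam and 0 for legitimate mail. Assumes both
    classes are present, otherwise the rates are undefined.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        sensitivity = tp / positives          # true positive rate
        false_positive_rate = fp / negatives  # equals 1 - specificity
        points.append((false_positive_rate, sensitivity))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical spam scores (higher means more spam-like) and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
print(f"AUC = {auc(points):.3f}")
```

A curve that rises quickly towards the top-left corner, giving an area close to 1, indicates that spam and legitimate messages are well separated.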

5. CONCLUSION AND FUTURE WORK

In this paper, we introduced a novel multi-core-based framework that we used in our fusion-based MCC spam filter architecture. Our proposed multi-core framework ensures that the multi-classifiers perform more efficiently, thus reducing the burden on system resources while detecting spam from e-mail messages. We have shown through simulations that our proposed multi-classifier architecture performs better than any other text-based single classifier system. Our multi-classifier architecture eliminates all false positive results when detecting spam.

For future work, we plan to implement our multi-classifier system on an actual multi-core-based enterprise grid to gain better results, particularly in terms of the number of classifiers required, to provide high accuracy without incurring high system resource usage.

ACKNOWLEDGEMENTS

We would like to express our sincere thanks to our research group members Mr Ashley Chonka and Mr Jaipal Sing for their help and support during this research.

REFERENCES

1. Barracuda. Barracuda Networks Releases Annual Spam Report, 2008. Available at: http://www.barracudanetworks.com/ns/news and events/index.php?nid=232, 2007 [21 February 2008].

2. Carpinter JM. Evaluating ensemble classifiers for spam filtering. Honours Thesis, University of Canterbury, 2005.

3. Islam MR, Zhou W, Choudhury MU. Dynamic feature selection for spam filtering using support vector machine. Proceedings of the 6th IEEE/ACIS International Conference on Computer and Information Science, Melbourne, Australia, 11–13 July 2007; 757–762.

4. Islam MR, Zhou W. An innovative analyser for email classification based on grey list analysis. Proceedings of the IFIP International Conference on Network and Parallel Computing, China, 18–21 September 2007; 176–182.

5. Islam MR, Zhou W. Email categorization using multi-stage classification technique. Proceedings of the 8th International Conference on Parallel and Distributed Computing, Applications and Technologies, Adelaide, Australia, 3–6 December 2007; 51–58.

6. Advanced Micro Devices, Multi-core processors the next evolution in computing, 2005. Available at: http://multi-core.amd.com/Resources/33211A, Multi-Core WP en.pdf [Accessed on: 18 April 2008].

7. Chonka A, Zhou W, Knapp K, Xiang Y. Protecting information systems from DDoS attack using multi-core methodology. Proceedings of the IEEE 8th International Conference on Computer and Information Technology, Sydney, Australia, 2008; 270–275.

8. Chonka A, Zhou W, Ngo L, Xiang Y. Ubiquitous multicore (UM) methodology for multimedia. The 2008 International Symposium on Computer Science and its Applications. IEEE: New York, 2008.

9. Chai L, Gao Q, Panda DK. Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid, May 2007; 471–478.

10. Intel. Intel multi-core technology. 2008. Available at: http://www.intel.com/multi-core/index.htm [Accessed on: 20 April 2008].

11. Advanced Micro Devices. Amd multi-core products. 2008. Available from: http://multi-core.amd.com/us-en/AMD-Multi-Core roducts.aspx [Accessed on: 20 April 2008].

12. Drucker H, Wu D, Vapnik V. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 1999; 10:1048–1054.

13. Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G, Spyropoulos CD. An evaluation of naive Bayesian anti-spam filtering. Proceedings of the 11th European Conference on Machine Learning, Spain (Lecture Notes in AI), 2000; 9–17.

14. Carreras X, Marquez L. Boosting trees for anti-spam email filtering. Proceedings of the European Conference in Recent Advances in NLP, Bulgaria, 2001; 58–64.

15. Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P. Stacking classifiers for anti-spam filtering of e-mail. Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, 2001; 44–50.

16. Yang Z, Nie X, Xu W, Guo J. An approach to spam detection by Naive Bayes ensemble based on decision induction. Proceedings of the 6th International Conference on Intelligent Systems Design and Applications, China, 2006; 861–866.

17. Cheng V, Li C. Personalized spam filtering with semi-supervised classifier ensemble. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, China, 2006; 195–201.

18. Knuth DE. Big omicron and big omega and big theta. ACM SIGACT News 1976; 8:18–24.
