
Effective Malware Detection based on Behaviour and Data Features

Zhiwu Xu, Cheng Wen, Shengchao Qin, and Zhong Ming

College of Computer Science and Software Engineering, Shenzhen University, China
{xuzhiwu,sqin,mingz}@szu.edu.cn, [email protected]

Abstract. Malware is one of the most serious security threats on the Internet today. Traditional detection methods become ineffective as malware continues to evolve. Recently, various machine learning approaches have been proposed for detecting malware. However, they either focus on behaviour information, leaving the data information out of consideration, or pay little attention to new malware with different behaviours and to new malware versions produced by obfuscation techniques. In this paper, we propose an effective approach to malware detection using machine learning. Different from most existing work, we take into account not only the behaviour information but also the data information, namely, the opcodes, data types and system libraries used in executables. We employ various machine learning methods in our implementation. Several experiments are conducted to evaluate our approach. The results show that (1) the classifier trained by Random Forest performs best, with an accuracy of 0.9788 and an AUC of 0.9959; (2) all the features (including data types) are effective for malware detection; (3) our classifier is capable of detecting some fresh malware; (4) our classifier is resistant to some obfuscation techniques.

1 Introduction

Malware, or malicious software, is a generic term that encompasses viruses, trojans, spyware and other intrusive code. Malware is spreading all over the world through the Internet and is increasing day by day, thus becoming a serious threat. According to a recent report from McAfee [1], one of the world's leading independent cybersecurity companies, more than 650 million malware samples were detected in Q1 2017, of which more than 30 million were new. The detection of malware is therefore of major concern to both the anti-malware industry and researchers.

To protect legitimate users from these threats, anti-malware software products from different companies, such as Comodo, McAfee, Kaspersky, Kingsoft, and Symantec, provide the major defence against malware, wherein the signature-based method is employed. However, this method can be easily evaded by malware writers through evasion techniques such as packing, variable renaming, and polymorphism [2]. To overcome the limitation of the signature-based method, heuristic-based approaches have been proposed, aiming to identify malicious behaviour patterns through either static or dynamic analysis.


But the increasing number of malware samples makes this method no longer effective. Recently, various machine learning approaches such as Support Vector Machine, Decision Tree and Naive Bayes have been proposed for detecting malware [3]. These techniques rely on data sets that include several characteristic features of both malware and benign software to build classification models to detect (unknown) malware. Although these approaches can achieve a high accuracy (on stationary data sets), this is still not enough for malware detection. On one hand, most of them focus on behaviour features such as binary codes [4–6], opcodes [6–8] and API calls [9–11], leaving the data information out of consideration. The few that do consider the data information consider only simple features such as strings [12, 13] and file relations [14, 15]. On the other hand, as malware continues to evolve, some new and unseen malware have different behaviours and features, and obfuscation techniques can make malware difficult to detect. Hence, more datasets and experiments are still needed to keep the detection effective.

In this paper, we propose an effective approach to detecting malware based on machine learning. Different from most existing work, we take into account not only the behaviour information but also the data information. Generally, the behaviour information reflects how the software intends to behave, while the data information indicates which data the software intends to operate on and how the data are organised. Our approach first learns a classifier from existing executables with known categories, and then uses this classifier to detect new, unseen executables. In detail, we take the opcodes, data types and system libraries used in executables, collected through static analysis, as representative features. As far as we know, our approach is the first to consider data types as features for malware detection. Moreover, in our implementation, we employ various machine learning methods, such as K-Nearest Neighbour, Naive Bayes, Decision Tree, Random Forest, and Support Vector Machine, to train our classifier.

Several experiments are conducted to evaluate our approach. Firstly, we conducted 10-fold cross validation experiments to see how well various machine learning methods perform. The experimental results show that the classifier trained by Random Forest performs best here, with an accuracy of 0.9788 and an AUC of 0.9959. Secondly, we conducted experiments to illustrate that all the features are effective for malware detection. The results also show that in some cases using type information is better than using the other two. Thirdly, to test our approach's ability to detect genuinely new malware or new malware versions, we ran a time-split experiment: we used our classifier to detect malware samples that are newer than the ones in our data set. Our classifier detects 81% of the fresh samples, which indicates that it is capable of detecting some fresh malware. The results also suggest that malware classifiers should be updated often with new data or new features in order to maintain the classification accuracy. Finally, one reason that makes malware detection difficult is that malware writers can use obfuscation techniques to evade detection.


For that, we performed experiments to test our approach's ability to detect new malware samples obtained by obfuscating the existing ones with some obfuscation tools. All the obfuscated malware samples can be detected by our classifier, demonstrating that our classifier is resistant to some obfuscation techniques.

The remainder of this paper is organised as follows. Section 2 presents the related work. Section 3 describes our approach, followed by the implementation and experimental results in Section 4. Finally, Section 5 concludes.

2 Related Work

There has been a lot of work on malware detection. Here we focus on some recent related work based on machine learning. Interested readers can refer to the surveys [3, 16] for more details.

Over the past decade, intelligent malware detection systems have been developed by applying data mining and machine learning techniques. Some approaches employ the N-gram method to train a classifier based on the binary code content [4–6]. Rather surprisingly, a 1-gram model can distinguish malware from benign software quite well on some data sets. But these approaches can be evaded easily by control-flow obfuscation. Some approaches [6–8] take the opcodes, which reflect the behaviours of the software of interest quite well, as features to classify malware and benign software. Some other approaches consider API calls, intents, permissions and commands as features [9–11], based on different classifiers. All these works focus on the behaviour information, while our current work considers not only behaviour information but also data information.

There has been some research that takes data information into account. Ye et al. [12] propose a malware detection approach based on interpretable strings using an SVM ensemble with bagging. Islam et al. [13] consider malware classification based on integrated static and dynamic features, which include printable strings. Both Karampatziakis et al. [14] and Tamersoy et al. [15] propose malware detection approaches based on file-to-file relation graphs. More precisely, a file's legitimacy can be inferred by analyzing its relations (co-occurrences) with other labeled (either benign or malicious) peers. Clearly, both the string information and the file relations are quite simple, and not all malware share the same information.

Some recent research employs deep learning to help detect malware. Saxe and Berlin [17] propose a deep neural network malware classifier that achieves a usable detection rate at an extremely low false positive rate and scales to real-world training example volumes on commodity hardware. Hardy et al. [18] develop a deep learning framework for intelligent malware detection based on API calls. Ye et al. [19] propose a heterogeneous deep learning framework composed of an AutoEncoder stacked up with multilayer restricted Boltzmann machines and a layer of associative memory to detect new unknown malware. In addition, concept drift has also been brought into malware detection.


Jordaney et al. [20] propose Transcend, a fully tuneable classification system that can be tailored to be resilient against concept drift to varying degrees depending on user specifications. Our approach employs several supervised machine-learning methods.

3 Approach

In this section, we first give a definition of our malware detection problem, thenpresent our approach.

The malware detection problem can be stated as follows: given a dataset D = {(e1, c1), . . . , (em, cm)}, where ei ∈ E is an executable file, ci ∈ C = {benign, malicious} is the corresponding actual category of ei (which may be unknown), and m is the number of executable files, the goal is to find a function f : E → C such that f(ei) ≈ ci for all i ∈ {1, . . . , m}.

Indeed, the goal is to find a classifier that classifies each executable file as precisely as possible. Our approach first learns a classifier from existing executables with known categories, and then uses this classifier to predict categories for new, unseen executables. Figure 1 shows the framework of our approach, which consists of two components, namely the feature extractor and the malware classifier. The feature extractor extracts the feature information (i.e., opcodes, data types and system libraries) from the executables and represents it as vectors. The malware classifier is first trained on an available dataset of executables with known categories by a supervised machine-learning method, and can then be used to detect new, unseen executables. In the following, we describe both components in more detail.

Fig. 1. The framework of our approach

3.1 Feature Extractor

This section presents the processing of the feature extractor, which consists of the following steps: (1) decompilation, (2) information extraction, and (3) feature selection and representation.


Decompilation In this paper, we focus on executables on Windows, namely, exe files and dll files. An instruction or a datum in an executable file is represented as a sequence of binary codes, which is clearly not easy to read. So the first step is to transform the binary codes into a readable intermediate representation, such as assembly code, by a decompilation tool.

Information Extraction Next, the extractor parses the asm files decompiled from the executables to extract the information, namely, opcodes, data types and system libraries. Generally, the opcodes used in an executable represent its intended behaviours, while the data types indicate the structures of the data it may operate on. In addition, the imported system libraries, which reflect the interaction between the executable and the system, are also considered. All such information describes the possible mission of an executable in some sense, and similar executables share similar information.

The opcode information can be extracted either statically or dynamically. Here we collect and count the opcodes from an asm file instruction by instruction, yielding an opcode list lopcode, which records the opcodes appearing in the asm file and their occurrence counts. For data type information, we first discover the possible variables, including local and global ones, and recover their types. Then we perform a statistical analysis on the recovered types, wherein, for simplicity, we consider only the data sizes of composed types, yielding a type list ltype containing the recovered types and their occurrence counts. The extraction of library information is similar to that of opcodes, yielding the library list llibrary.
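To make this step concrete, the following Python sketch (ours, for illustration; not the authors' released code) counts opcode mnemonics in a decompiled listing. The line format it assumes and the helper name count_opcodes are hypothetical.

    from collections import Counter
    import re

    # Assumed listing format: lines like ".text:00401000  mov  eax, 1".
    # Real IDA listings also contain labels and data directives that a
    # production parser would need to filter out.
    INSN = re.compile(r'^\.\w+:[0-9A-Fa-f]+\s+([a-z]\w*)')

    def count_opcodes(asm_path):
        """Build the opcode list l_opcode for one asm file."""
        counts = Counter()
        with open(asm_path, errors='ignore') as f:
            for line in f:
                m = INSN.match(line)
                if m:
                    counts[m.group(1)] += 1
        return counts  # e.g. Counter({'mov': 9, 'lea': 5, 'add': 1})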

Feature Selection and Representation There may be too many different terms in the lists lopcode, ltype, llibrary collected from all the executables, and not all of them are of interest. For example, push and mov are commonly used in both malware and benign software. For that, the extractor creates a dictionary to keep a record of the interesting terms; only the terms in this dictionary are retained in the lists. Take the opcode lists for example. The extractor performs a statistical analysis on all the opcode lists from an available dataset, using the well-known Term Frequency and Inverse Document Frequency (TF-IDF) scheme [21] to measure the statistical dependence. Next the extractor selects the k opcodes with the highest weights to form an opcode dictionary, and then filters out the terms not in this dictionary for each lopcode. The same applies to ltype and llibrary.
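The selection step could be rendered with scikit-learn as follows. This is a sketch under our own assumptions: the paper does not specify how per-file TF-IDF weights are aggregated, so we simply sum them here, and build_opcode_dictionary is a hypothetical helper.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def build_opcode_dictionary(opcode_counters, k=500):
        """Pick the top-k opcodes by summed TF-IDF weight."""
        # Re-serialise each counter as a pseudo-document so that
        # TfidfVectorizer can compute term and document frequencies.
        docs = [' '.join(' '.join([op] * n) for op, n in c.items())
                for c in opcode_counters]
        vec = TfidfVectorizer(token_pattern=r'\S+')
        weights = vec.fit_transform(docs)  # shape: (num_files, num_terms)
        scores = np.asarray(weights.sum(axis=0)).ravel()
        terms = np.array(vec.get_feature_names_out())
        return list(terms[np.argsort(scores)[::-1][:k]])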

After that, the extractor builds a profile for each executable file by concatenating the three lists collected from its corresponding asm file, namely, the executable is represented as a profile [lopcode, ltype, llibrary]. Assume that there is a fixed order on all the features (i.e., the opcode, type and library terms). Then a profile can be simplified as a vector (x1, x2, ..., xn), where xi is the value of the i-th feature and n is the total number of features.

An example vector is shown in the following:

    x = (9, 1, 5, ..., 2, 3, 0, 2, ..., 0, 1, ...)  ⇐=  profile [lopcode, ltype, llibrary], where

      lopcode:  mov: 9,  add: 1,  lea: 5,  ...
      ltype:    bool: 2, int: 3,  double: 0, 16-bytes: 2, ...
      llibrary: USER32.dll: 0, KERNEL32.dll: 1, ...
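Flattening a profile into such a vector is then a dictionary lookup. The sketch below is our illustration; to_vector and the fixed term order are assumptions.

    def to_vector(profile, dictionary):
        """Flatten a profile [l_opcode, l_type, l_library] into a
        feature vector under a fixed global order of terms."""
        merged = {}
        for part in profile:  # each part is a Counter
            merged.update(part)
        return [merged.get(term, 0) for term in dictionary]

For instance, with dictionary = ['mov', 'add', 'lea', 'bool', 'int', 'double', '16-bytes', 'USER32.dll', 'KERNEL32.dll'], the profile above yields the vector (9, 1, 5, 2, 3, 0, 2, 0, 1).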

3.2 Malware Classifier

In this section, we present the malware classifier. As mentioned before, we first train our malware classifier on an available dataset of executables with known categories by a supervised machine-learning method, and then use it to detect new, unseen executables.

Classifier Training Now, by our feature extractor, an executable e can be represented as a vector x. Let X denote the feature space of all possible vectors, D0 the available dataset with known categories, and 0 and 1 the benign and malicious categories, respectively. Our training problem is to find a classifier C : X → [0, 1] such that

min Σ_{(x,c)∈D0} d(C(x) − c),

where d denotes the distance function. Clearly, there are many classification algorithms for solving this problem, such as K-Nearest Neighbour [22], Naive Bayes [23], Decision Tree [24], Random Forest [25], Support Vector Machine [26], and the SGD Classifier [27]. In our implementation we have made all these methods available, so users can select whichever they like. Moreover, we have conducted experiments with these algorithms and found that the classifier trained by Random Forest performs best here (more details are given in Section 4).
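As a minimal sketch of this step (our illustration, with toy data standing in for the real feature matrix), the best-performing configuration from Section 4, a Random Forest with 10 entropy trees, can be trained as follows:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # One row per executable (vectors as built by the extractor);
    # labels: 0 = benign, 1 = malicious. Tiny dummy data for illustration.
    X = np.array([[9, 1, 5, 2, 3, 0, 2, 0, 1],
                  [1, 0, 2, 0, 1, 1, 0, 3, 0]])
    y = np.array([1, 0])

    clf = RandomForestClassifier(n_estimators=10, criterion='entropy')
    clf.fit(X, y)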

Malware Detection Once trained, the classifier C can be used to detect malware: it returns the category closest to the classifier's output. Formally, given an executable e and its feature vector x, the goal of the detection is to find the c ∈ C that achieves

min d(C(x) − c).
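Reading C(x) as the malicious-class probability, picking the closest category amounts to thresholding at 0.5 (our reading of the min-distance formulation). A sketch, reusing the clf trained above; the detect helper is our own name:

    def detect(clf, x):
        """Return the predicted category for one feature vector x."""
        p_malicious = clf.predict_proba([x])[0][1]
        return 'malicious' if p_malicious >= 0.5 else 'benign'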


4 Implementation and Experiments

4.1 Implementation

We have implemented our approach as a prototype in Python. We use the IDA disassembler [28] as the front end to disassemble the executable files, and use scikit-learn [29], a simple and efficient tool for data mining and data analysis, to implement the various classifiers. We also use BITY [30], a static tool that learns types for binary code, to recover the data types of the discovered variables. For the feature dictionary, we select the top 500 terms for opcodes and all the terms for data types and system libraries (since their numbers are relatively low).

Our dataset consists of a malware dataset and a benign software dataset. The malware dataset consists of the samples of the BIG 2015 Challenge [31] and the samples before 2017 in theZoo (aka Malware DB) [32], with 11376 samples in total, while the benign software dataset was collected from QIHU 360, the biggest Internet security company in China, with 8003 samples in total.

4.2 Experiments

In this section, we conduct a series of experiments to evaluate our approach. Firstly, we conduct a set of cross-validation experiments to evaluate how well each classification method performs based on the behaviour and data features. Meanwhile, we also measure the runtime of each classification method, as well as the runtime of the feature extractor. Secondly, we conduct experiments on each kind of feature to see its effectiveness. Thirdly, to test our approach's ability to detect genuinely new malware or new malware versions, we run a time-split experiment. Finally, we conduct some experiments to test our approach's ability to detect new malware samples obtained by obfuscating the existing ones.

To quantitatively validate the experimental results, we use the performance measures shown in Table 1. All our experiments are conducted in the following environment: 64-bit Windows 10 on an Intel(R) Core i5-4590 processor (3.30GHz) with 8GB of RAM.

Table 1. Performance Measures in Malware Detection

Measure               Description

True Positive (TP)    Number of files correctly classified as malicious
True Negative (TN)    Number of files correctly classified as benign
False Positive (FP)   Number of files mistakenly classified as malicious
False Negative (FN)   Number of files mistakenly classified as benign
Precision (PPV)       TP / (TP + FP)
TP Rate (TPR)         TP / (TP + FN)
FP Rate (FPR)         FP / (FP + TN)
Accuracy (ACY)        (TP + TN) / (TP + TN + FP + FN)
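These measures follow directly from the four counts; a small helper of ours for illustration:

    def metrics(tp, tn, fp, fn):
        """Compute the measures of Table 1 from raw counts."""
        return {
            'PPV': tp / (tp + fp),
            'TPR': tp / (tp + fn),
            'FPR': fp / (fp + tn),
            'ACY': (tp + tn) / (tp + tn + fp + fn),
        }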


Cross Validation Experiments To evaluate the performance of our approach, we conduct 10-fold cross validation [33] experiments: in each experiment, we randomly split our dataset into 10 equal folds; then for each fold, we train the classifier model on the data from the other folds (i.e., the training set) and test the model on this fold (i.e., the testing set). The learning methods we use in our experiments are listed as follows (a scikit-learn sketch of the protocol follows the list):

– K-Nearest Neighbour (KNN): experiments are performed with k = 1, k = 3, k = 5, and k = 7.

– Naive Bayes (NB): several variants, namely Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes, are used in our experiments.

– Decision Trees (DT): we use decision trees with two criteria, namely, "gini" for the Gini impurity and "entropy" for the information gain. Random Forest (RF) with 10 trees is used as well.

– Support Vector Machines (SVM): we perform our experiments with a linear kernel, a sigmoid kernel, and a Radial Basis Function (RBF) kernel, respectively. We also try linear models with stochastic gradient descent (SGD).
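A sketch of the protocol (ours; synthetic stand-in data replaces the real feature matrix, and only a few of the settings above are shown):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Stand-in data; in our setting X, y come from the feature extractor.
    X, y = make_classification(n_samples=200, n_features=50, random_state=0)

    models = {
        'KNN k=1': KNeighborsClassifier(n_neighbors=1),
        'Gaussian NB': GaussianNB(),
        'RF (n=10, entropy)': RandomForestClassifier(n_estimators=10,
                                                     criterion='entropy'),
        "SVM kernel='rbf'": SVC(kernel='rbf'),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=10, scoring='accuracy')
        print(f'{name}: mean accuracy {acc.mean():.4f}')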

Table 2 shows the detailed results of the experiments and Figure 2 shows some selected Receiver Operating Characteristic (ROC) curves. All the KNN classifiers and DT classifiers perform quite well, with an accuracy greater than 97%, among which Random Forest with 10 entropy trees has produced the best result. Rather surprisingly, 1-KNN performs better than the other KNNs. All the NB classifiers perform somewhat worse, with accuracies from 66.38% to 82.79%. Most SVM classifiers achieve an accuracy greater than 95%, except the one with the sigmoid kernel, whose accuracy is 85.80%. Concerning the ROC curves, most classifiers produce much better results than random classification, except Gaussian Naive Bayes, whose Area Under the ROC curve (AUC) of 0.5931 is only a little greater than 0.5. Random Forest with 10 entropy trees performs best here as well (with an AUC of 0.9959). One of the main reasons for Random Forest to outperform the others is that it is an ensemble learning method, which may correct the Decision Trees' habit of overfitting to the training set. In the future, some other ensemble learning methods will also be considered.

Runtime Performance Meanwhile, we measure the training time (the time taken to build the classifier on the training set) and the testing time (the time needed to evaluate the testing set) in seconds for each cross-validation experiment. The results are shown in Table 3. In terms of training time, SVM has been the worst for our data set (with times ranging from 150.022s to 1303.607s), while Naive Bayes has been the best (with times ranging from 1.535s to 3.093s). Random Forest has also performed well on training, with times between 3.791s and 4.115s. Concerning the testing time, KNN has been the worst (due to its lazy learning), while Random Forest, Naive Bayes and the SGD Classifier have performed quite well, with times below 1s.


Table 2. Results of Different Methods

Classifier               PPV     TPR     FPR     ACY

KNN K=1                  0.9609  0.9830  0.0281  0.9764
KNN K=3                  0.9572  0.9798  0.0307  0.9736
KNN K=5                  0.9556  0.9763  0.0319  0.9715
KNN K=7                  0.9556  0.9763  0.0319  0.9715
DT criterion='gini'      0.9658  0.9797  0.0243  0.9773
DT criterion='entropy'   0.9685  0.9787  0.0223  0.9781
RF (n=10, gini)          0.9678  0.9787  0.0228  0.9778
RF (n=10, entropy)       0.9692  0.9800  0.0218  0.9788
Gaussian Naive Bayes     0.9381  0.1990  0.0092  0.6638
Multinomial Naive Bayes  0.9494  0.6590  0.0247  0.8446
Bernoulli Naive Bayes    0.8726  0.6831  0.0701  0.8279
SVM kernel='linear'      0.9581  0.9896  0.0304  0.9778
SVM kernel='rbf'         0.9686  0.9119  0.0207  0.9514
SVM kernel='sigmoid'     0.9671  0.7308  0.0232  0.8580
SGD Classifier           0.9582  0.9862  0.0302  0.9765

Fig. 2. ROC Curve of Different Methods


Considering the accuracy, ROC and runtime together, we suggest users adopt the Random Forest classifier.

Table 3. Runtime for Various Classifiers

Classifier               Training Time (s)  Testing Time (s)

KNN K=1                      16.477             178.789
KNN K=3                      16.369             199.474
KNN K=5                      16.517             207.052
KNN K=7                      16.238             210.557
DT criterion='gini'          23.442               0.067
DT criterion='entropy'       13.485               0.066
RF (n=10, gini)               4.115               0.086
RF (n=10, entropy)            3.791               0.077
Gaussian Naive Bayes          3.093               0.480
Multinomial Naive Bayes       1.535               0.035
Bernoulli Naive Bayes         1.826               0.828
SVM kernel='linear'         150.022              14.494
SVM kernel='rbf'            799.310              50.196
SVM kernel='sigmoid'       1303.607             130.178
SGD Classifier               22.569               0.048

We have also evaluated how the feature extractor performs. We first measure the total time needed by the decompilation tool IDA Pro to decompile the executables. Our data set contains 19379 executables, with a total size of 250GB. It takes 15.6 hours to decompile all the files, an average of 0.22s/MB. IDA Pro is a heavyweight tool, so using a lightweight one might improve this time. Secondly, we measure the total time needed to extract features from the decompiled asm files. The sizes of the asm files range from 142KB to 144.84MB, with a total size of 182GB. It takes 10.5 hours to extract the features from all the asm files, an average of 0.20s/MB. Both the decompilation time and the extraction time are considered acceptable.

Feature Experiments We have also conducted experiments on each kind of feature to see its effectiveness. For that, we run the same experiments as above for each kind of feature. Table 4 gives the experimental results and Figure 3 shows the corresponding ROC curves, where for each kind of learning method only the results of one classifier are presented.

From the results we can see that all the features are effective in helping to detect malware, and using all the features together has produced the best results, with an average accuracy of 0.9431 and an average AUC of 0.9869. The opcode feature has performed better than the other two, with an average accuracy of 0.9339 and an average AUC of 0.9790. The reason is that opcodes describe the intended behaviours of the software quite well, which also explains why most existing work on malware detection employs the opcode feature. Using the opcode feature alone seems to be enough for malware detection in most cases, but the detection may be evaded easily by obfuscation techniques such as adding unreachable code.


Table 4. Results of Different Features

Feature  Classifier      PPV     TPR     FPR     ACY     AUC

Opcode   Naive Bayes     0.9492  0.6128  0.0230  0.8266  0.9455
         KNN             0.9521  0.9777  0.0345  0.9705  0.9842
         Random Forest   0.9595  0.9775  0.0290  0.9736  0.9947
         SGD Classifier  0.9462  0.9701  0.0387  0.9649  0.9914
         Average         0.9518  0.8840  0.0313  0.9339  0.9790

Library  Naive Bayes     0.8508  0.7705  0.0950  0.8494  0.9311
         KNN             0.7439  0.9892  0.2395  0.8549  0.8878
         Random Forest   0.9439  0.8328  0.0348  0.9105  0.9736
         SGD Classifier  0.9401  0.8014  0.0358  0.8969  0.9661
         Average         0.8711  0.8485  0.1004  0.8785  0.9397

Type     Naive Bayes     0.8346  0.1829  0.0254  0.6476  0.9020
         KNN             0.8570  0.9735  0.1142  0.9219  0.9493
         Random Forest   0.8751  0.9790  0.0982  0.9336  0.9741
         SGD Classifier  0.8208  0.9427  0.1447  0.8913  0.9452
         Average         0.8483  0.7701  0.0945  0.8496  0.9427

All      Naive Bayes     0.9494  0.6590  0.0247  0.8446  0.9710
         KNN             0.9572  0.9798  0.0307  0.9736  0.9859
         Random Forest   0.9678  0.9787  0.0228  0.9778  0.9959
         SGD Classifier  0.9581  0.9858  0.0303  0.9763  0.9948
         Average         0.9582  0.9008  0.0272  0.9431  0.9869

Fig. 3. ROC Curve of Different Features


In some cases it is better to use the other two features, and adding them improves the detection, so it is useful to also take the type and library features into account. On average, the type feature has performed better than the library feature in terms of AUC, although the library feature has produced slightly better results than the type feature with respect to accuracy. The results also show that Random Forest has performed the best for each feature, which is consistent with the experiments above. Note that, when Random Forest is employed, the type feature yields the best TPR. The opcode and library features have been used by lots of work in practice, so we believe that the type feature can benefit malware detection in practice as well.

Time-Split Experiments As Saxe and Berlin [17] point out, a shortcoming of standard cross-validation experiments is that they do not separate the ability to detect slightly modified malware from the ability to detect new malware samples and new versions of existing malware samples. In this section, to test our approach's ability to detect genuinely new malware or new malware versions, we run a time-split experiment. We first download malware samples collected between January 2017 and July 2017 from the DAS MALWERK website [34]; that is, all these malware samples are newer than the ones in our data set. We then train a classifier on our data set using Random Forest. Finally, we use this classifier to detect the fresh malware samples.

The experimental results are shown in Table 5. About 81% of the samples are detected by our classifier, which signifies that our approach can detect some new malware samples and new versions of existing malware samples. However, the results also indicate that the classifier becomes less effective as time passes. For example, the results for the samples collected in June and July are worse than random classification (accuracies no greater than 0.5). This is due to the high degree of freedom malware writers have to produce new malware samples continuously. Consequently, most malware classifiers become unsustainable in the long run, becoming rapidly antiquated as malware continues to evolve. This suggests that malware classifiers should be updated often with new data or new features in order to maintain the classification accuracy.

Obfuscation Experiments One thing that makes malware detection even more difficult is that malware writers may use obfuscation techniques, such as inserting NOP instructions, changing which registers are used, changing control flow with jumps, replacing machine instructions with equivalent ones, or reordering independent instructions, to evade detection. In this section, we report some experiments testing our approach's ability to detect new malware samples obtained by obfuscating the existing ones.

The obfuscated malware samples are obtained as follows. Firstly, we use the free version of the commercial tool Obfuscator [35], which supports only changing the code execution flow, to obfuscate 50 malware samples randomly selected from our data set, yielding 50 new malware samples.


Table 5. Results of Fresh Samples

Date            Num   RF Classifier   Accuracy

July, 2017       12          6          0.500
June, 2017       61         29          0.475
May, 2017        87         76          0.874
April, 2017      31         28          0.903
March, 2017      48         42          0.875
February, 2017   69         61          0.884
January, 2017    56         53          0.946

Total           364        295          0.810

Secondly, we use the open source tool Unest [36] to obfuscate 15 obj files, which are compiled from malware samples in C source code through VS 2010¹, using the following four techniques: (a) rewriting numeric values into equivalent forms, (b) obfuscating output strings, (c) pushing the target code segment onto the stack and jumping to it to confuse the target code, and (d) obfuscating the static libraries. This yields 60 new malware samples.

We use our classifier trained by Random Forest to detect the newly generated malware samples. The results are presented in Table 6: all the obfuscated malware samples are detected by our classifier, which indicates that our classifier has some resistance to obfuscation techniques. To change the code execution flow, Obfuscator inserts lots of jump instructions. The jump instructions may change the opcode feature, but the data feature and the library feature remain the same; indeed, jump is commonly used in both malware and benign software. Therefore, this technique barely changes the features we use and thus cannot evade our detection. Similar to Obfuscator, the techniques used in Unest make little change to the features and thus cannot evade our detection either. Nevertheless, the techniques used here are rather simple; the integration of more sophisticated techniques will be considered in the future.

Table 6. Results of Obfuscated Malware Samples

Tool         Number   RF Classifier   Accuracy

Obfuscator     50           50           100%
Unest          60           60           100%

¹ We are able to obfuscate only the obj files compiled from C code through VS 2010.


5 Conclusion

In this work, we have proposed a malware detection approach using various machine learning methods based on the opcodes, data types and system libraries used in executables. To evaluate the proposed approach, we have carried out several experiments, which show that the classifier trained by Random Forest outperforms the others on our data set and that all the features we have adopted are effective for malware detection. The experimental results have also demonstrated that our classifier is capable of detecting some fresh malware and is resistant to some obfuscation techniques.

As future work, we may consider other meaningful features, both static and dynamic, to improve the approach. We may also employ unsupervised machine learning methods or deep learning methods to train the classifiers. More experiments on malware anti-detection techniques are under consideration as well.

References

1. McAfee Labs Threats Report. June 2017.
2. Philippe Beaucamps and Eric Filiol. On the possibility of practically obfuscating programs towards a unified perspective of code protection. Journal in Computer Virology, 3(1):3–21, 2007.

3. Yanfang Ye, Tao Li, Donald Adjeroh, and S. Sitharama Iyengar. A survey on malware detection using data mining techniques. ACM Computing Surveys, 50(3):41, 2017.

4. Yuval Elovici, Asaf Shabtai, and Robert Moskovitch. Applying machine learning techniques for detection of malicious code in network traffic. In German Conference on Advances in Artificial Intelligence, pages 44–50, 2007.

5. Mohammad M. Masud, Latifur Khan, and Bhavani Thuraisingham. A scalable multi-level feature extraction technique to detect malicious executables. Information Systems Frontiers, 10(1):33–45, 2008.

6. Blake Anderson, Curtis Storlie, and Terran Lane. Improving malware classification: bridging the static/dynamic gap. In ACM Workshop on Security and Artificial Intelligence, pages 3–14, 2012.

7. Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. Automatic malware categorization using cluster ensemble. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

8. Igor Santos, Felix Brezo, Xabier Ugarte-Pedrero, and Pablo G. Bringas. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences, 231(9):64–82, 2013.

9. Tzu Yen Wang, Shi Jinn Horng, Ming Yang Su, and Chin Hsiung Wu. A surveillance spyware detection system based on data mining methods. In IEEE International Conference on Evolutionary Computation, pages 3236–3241, 2006.

10. Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. IMDS: intelligent malware detection system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1043–1047, 2007.

11. Yanfang Ye, Tao Li, Kai Huang, Qingshan Jiang, and Yong Chen. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. Journal of Intelligent Information Systems, 35(1):1–20, 2009.


12. Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging. Journal in Computer Virology, 5(4):283, 2009.

13. Rafiqul Islam, Ronghua Tian, Lynn M. Batten, and Steve Versteeg. Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications, 36(2):646–656, 2013.

14. Nikos Karampatziakis, Jack W. Stokes, Anil Thomas, and Mady Marinescu. Using File Relationships in Malware Classification. Springer Berlin Heidelberg, 2013.

15. Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. Guilt by association: large scale malware detection by mining file-relation graphs. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.

16. Gamal Abdel Nassir Mohamed and Norafida Bte Ithnin. Survey on representation techniques for malware detection system. American Journal of Applied Sciences, 2017.

17. Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11–20, 2015.

18. William Hardy, Lingwei Chen, Shifu Hou, Yanfang Ye, and Xin Li. DL4MD: A deep learning framework for intelligent malware detection. In Proceedings of the International Conference on Data Mining, 2016.

19. Yanfang Ye, Lingwei Chen, Shifu Hou, William Hardy, and Xin Li. DeepAM: a heterogeneous deep learning framework for intelligent malware detection. Knowledge and Information Systems, pages 1–21, 2017.

20. Roberto Jordaney, Kumar Sharad, Santanu K. Dash, Zhi Wang, Davide Papini, Ilia Nouretdinov, and Lorenzo Cavallaro. Transcend: Detecting concept drift in malware classification models. In 26th USENIX Security Symposium (USENIX Security 17), pages 625–642, 2017.

21. Stephen Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.

22. G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-neighbor methods in learning and vision. Pattern Analysis & Applications, 19(19), 2006.

23. I. Rish. An empirical study of the naive Bayes classifier. Journal of Universal Computer Science, 1(2):127, 2001.

24. J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

25. Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

26. Alex M. Andrew. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

27. Fasihul Kabir, Sabbir Siddique, and Kotwal. Bangla text document categorization using stochastic gradient descent (SGD) classifier. In International Conference on Cognitive Computing and Information Processing, 2015.

28. IDA Pro. http://www.hex-rays.com/idapro/.

29. Oliver Kramer. Scikit-Learn. Springer International Publishing, 2016.

30. Zhiwu Xu, Cheng Wen, and Shengchao Qin. Learning Types for Binaries. In 19th International Conference on Formal Engineering Methods, 2017.

31. Microsoft Malware Classification Challenge. https://www.kaggle.com/c/malware-classification.

32. theZoo aka Malware DB. http://ytisf.github.io/theZoo/.


33. Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence, pages 1137–1143, 1995.

34. DAS MALWERK. http://dasmalwerk.eu/.
35. Obfuscator. https://www.pelock.com/products/obfuscator.
36. Unest. http://unest.org/.