TRANSCRIPT
Open University, Data Mining Seminar 13802
Semester 2015b
Malware Detection via Data Mining
Prof Roy Gelbard
David Zivi 204785638
Contents
Terminology and Definitions
Introduction
Research Question
    Study goal
    Study importance
Mapping of Knowledge Elements
Bibliography Review
    Signature-based detection
    Heuristic-based detection
    Behavioral-based detection
    Sandbox detection
    Data mining techniques
Research methodology
    Raw Data Acquisition
    Extraction of Significant Data
    Opcode Relevance in Malware
    Average Calculation
    Results Export
    Weka
    Prediction parameters
    Noise method
Results
    Recurrent WEKA process
        First run
        Second run
        Third run
        Fourth run
        Fifth run
        Sixth run
        Seventh run
Results summary per round
Rules generated by WEKA
    Noise on model
Result Discussion and Future Research
    Future Research
        Extension of the opcodes set
        Sensitive system calls
        PE header analysis
Resources List
Terminology and Definitions
Virus: A computer virus is a type of malware that propagates by inserting a copy
of itself into, and becoming part of, another program. It spreads from one
computer to another, leaving infections as it travels. Viruses can range in severity
from causing mildly annoying effects to damaging data or software and causing
denial-of-service (DoS) conditions. Almost all viruses are attached to an
executable file, which means the virus may exist on a system, but will not be
active or able to spread until a user runs or opens the malicious host file or
program. When the host code is executed, the viral code is executed as well. [1]
Disassembler: a computer program that translates machine language into
assembly language—the inverse operation to that of an assembler. Disassembly,
the output of a disassembler, is often formatted for human-readability rather
than suitability for input to an assembler, making it principally a reverse-
engineering tool. [2]
Opcode: In computing, an opcode (abbreviated from operation code) is the
portion of a machine language instruction that specifies the operation to be
performed. Besides the opcode itself, instructions usually specify the data they will
process, in the form of operands. In addition to opcodes used in instruction set
architectures of various CPUs, which are hardware devices, opcodes can also be
used in abstract computing machines as part of their byte code specifications. [3]
x86 instruction set: x86 is a family of backward compatible instruction set
architectures based on the Intel 8086 CPU and its Intel 8088 variant. The 8086
was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit based 8080
microprocessor, with memory segmentation as a solution for addressing more
memory than can be covered by a plain 16-bit address. The term "x86" came into
being because the names of several successors to the Intel's 8086 processor
ended in "86", including 80186, 80286, 80386 and 80486 processors. [4]
WEKA: A workbench that contains a collection of visualization tools and
algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation,
where each data point is described by a fixed number of attributes. [17]
Introduction
In this study I present a technique I developed to determine, using data mining,
whether an application is malware. The underlying method is to teach the system how
to differentiate between software that is and is not malware, using a dataset
represented by a list of instructions that potentially characterize malware. The list
of instructions was collected from previous research done on this topic.
The technique consists of two steps: the first is disassembly and opcode frequency
calculation; the second is application of the J48 learning algorithm provided by the
WEKA library.
For the dataset, I extracted relevant data from 300 known malware samples [6] and 150
types of benign software typically found on a home computer under “c:\program files”.
This data was fed to WEKA, which generated rules to be used to determine whether a
piece of software is malware.
These rules were run on the dataset I compiled (cross-validation) and were able to
predict, with an accuracy of 96%, whether a piece of software is malware.
In order to check the robustness of the rules, noise with intensity ranging from 2%
to 50% was randomly applied to the relevant data, without significant regression in
the score mentioned above: with noise of up to 50%, the prediction score decreased
only to 91%.
Research Question
Study goal: Today, with the exponential growth of “freeware” software [7], users and
corporations can find a large variety of applications and utilities that can be
installed for free. Since those applications come from unknown sources, the question
raised is: can a user or corporation benefit from free applications without
compromising their entire system?
The goal of this study is to build rules that will determine whether an executable or
library received from a third party can be trusted. This study does not purport to
replace known anti-viruses, but to propose a complementary mechanism that makes up
for the weaknesses of known anti-virus programs.
Limitations of current methods:
The common technique used by anti-viruses to determine whether an executable is
malware is scanning. A scanner searches all files in memory and on disk for code
snippets that uniquely identify a file as malware. Such mechanisms have two main
weaknesses:
- Attackers interested in propagating a known malware can simply change the code
snippets that the anti-virus is looking for.
- New malware not yet classified will be considered benign software until it is
analyzed and classified. In such a case the malware will continue to infect the
system and spread to new systems until the anti-virus is updated.
Study importance: According to a newly released report sponsored by McAfee, global
cyber activity is costing up to $500 billion each year, which is almost as much as
the estimated cost of drug trafficking [5]. In the third quarter of 2015, McAfee Labs
detected more than 307 new threats every minute, or more than five every second, with
mobile malware samples growing by 16 percent during the quarter and overall malware
surging by 76 percent year over year [8].
Malware becomes more and more sophisticated and provides high revenues to its owners.
Malware authors have become well organized and structured, with impressive skills and
highly qualified resources. Due to the exponential growth of malware and its agility
in camouflaging itself, keeping a sterile system has become a tremendous task for
users and corporations. Since current anti-viruses run against known-malware
databases that are updated once a day at best, malware has an entire day to infect
systems before it is caught.
Mapping of Knowledge Elements

Characteristic/Process | In Human World | In Machine World
Data | Software editor; software behavior | All the data is saved in an automatic way; every malware has its own signature; malware characteristics are saved in a database
Information | Collect information about known malware; collect information about software to check | Basic statistical calculation on raw data
Knowledge | Can get a global feeling about the software we want to check; a tendency can be deduced | Run an algorithm on the software in order to determine whether we are dealing with malware
Data transformation to Information & Knowledge | With the data we have, we can deduce which software we have to check | Installation validation based on a decision tree
Information & Knowledge transformation to Data | The knowledge can be transformed to data; for example, the software signature can be saved in a database | Exists in a learning system, where the conclusions and knowledge are automatically translated to data
Transformation of tacit knowledge to explicit knowledge | A malware analyst writes in a formal way the conclusions reached about a malware pattern | Does not exist
Transformation of explicit knowledge to tacit knowledge | A malware analyst learns from explicit knowledge | Does not exist
Knowledge contribution in decision | Based on explicit & tacit knowledge, the analyst decides to accept or reject software | According to the decision tree
Knowledge contribution to innovation | Update of anti-virus engines based on knowledge | Does not exist
Learning and knowledge sharing | Learning of new techniques used by malware | The system automatically updates its search criteria
Bibliography Review
The exponential growth of malware encourages security researchers to invent new techniques to
protect computers and networks. The various techniques used for malware detection [9] are
described below:
Signature-based detection:
This technique is the most common method used to identify viruses and other malware. The anti-virus engine compares the contents of a file to its database of known malware signatures. Such a technique requires a daily update of the malware database.
Heuristic-based detection:
This technique is generally used together with signature-based detection. It detects malware based on characteristics typically used in known malware code.
Behavioral-based detection:
This technique is similar to heuristic-based detection and is also used in Intrusion Detection Systems. The main difference is that, instead of relying on characteristics hardcoded in the malware code itself, it is based on the behavioral fingerprint of the malware at run time. Clearly, this technique is able to detect (known or unknown) malware only after it has started doing its malicious actions.
Sandbox detection:
This technique is a particular kind of behavioral-based detection that, instead of detecting the behavioral fingerprint at run time, executes the program in a virtual environment, logging whatever actions the program performs. Depending on the actions logged, the anti-virus engine can determine whether the program is malicious. If not, the program is executed in the real environment. Even though this technique has been shown to be quite effective, it is heavy and slow, so it is rarely used in end-user anti-virus solutions.
Data mining techniques:
Data mining techniques are one of the latest approaches applied in malware detection. Data mining and machine learning algorithms are used to try to classify the behavior of a file (as either malicious or benign) given a series of features extracted from the file itself. In this study the focus was on the following techniques:
- Malware detection via analysis of number of strings, call and binary patterns [10]
- Malware detection via analysis of program executable header [11]
- Malware prediction via function call frequency, usage of non-standard
instructions and use of suspicious system calls [12]
- Malware detection via statistical analysis of opcode distributions [13]
Research methodology
The following picture describes the flow used in this study to generate rules that will be able to
catch malware:
Raw Data Acquisition
In order to learn from malware and benign software behavior, a large database of samples of
both malware and benign software is needed. Since malware has been analyzed and classified
by security researchers, it is quite easy to find malware databases on the internet [14]. In this
study, all the known and classified malware from the year 2014 is used. In order to prevent noise
in the dataset, derivatives of the same malware are not included in the sample. All 400 families
of malware found in 2014 were classified and used in this study. The benign software dataset is
represented by standard applications located under “c:\program files (x86)”, such as “Outlook”,
“Word”, “Excel” and “Calculator”. These were taken from a non-infected computer.
Note: For security reasons, the malware database is protected by a password that can be
retrieved from [14]. All the research and access to malware for this study were done on
dedicated virtual machines in order to prevent unintentional infection.
Extraction of Significant Data
According to the aforementioned bibliography on malware detection via data mining, this study
focuses on the following opcodes: call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and
xor. In this study these opcodes serve as the criteria for malware detection. All the above
opcodes are extracted from executables and libraries using the IDA [15] disassembler.
Opcode Relevance in Malware
One of the main challenges posed by malware is its capability for “camouflage”. In order to survive anti-virus engines and analysis by security researchers, malware must hide itself, since once it is discovered it is automatically removed from the infected computer. Moreover, when a malware sample is discovered, its characteristics are shared with the entire community.

To hide themselves, malware authors use a technique called “packing” [16], which consists of compressing/encrypting the original executable. When the compressed executable is run, a decompression/decryption stub recreates the original code from the compressed/encrypted data before executing it. Executable compression is also frequently used to deter reverse engineering or to obfuscate the contents of the executable, for example to hide the presence of malware from anti-virus scanners. Executable compression can be used to prevent direct disassembly; it masks string literals and modifies signatures. Although this does not eliminate the possibility of reverse engineering, it makes the process more costly. The following picture illustrates how the “packing” mechanism works:
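The packing idea can be sketched as a toy example, separate from the figure referenced above. This is only an illustration, not a real packer: real packers emit a native loader stub, and the use of zlib here is my own choice.

```python
import zlib

def pack(code: bytes) -> bytes:
    """Builder side: compress the original executable code."""
    return zlib.compress(code)

def unpack(packed: bytes) -> bytes:
    """Loader-stub side: recreate the original code before it would be
    executed. Here we only return the recovered bytes instead of running them."""
    return zlib.decompress(packed)

original = b"\x55\x89\xe5\x31\xc0\x5d\xc3"  # stand-in for a .text section
packed = pack(original)
assert unpack(packed) == original           # round trip recovers the code
```

Note that a scanner looking at the packed bytes sees neither the original opcodes nor the original string literals; only the small loader stub remains in the clear, which is why this study analyzes the surface code.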
Average Calculation
A Python script is used to count the number of instances found for each of the relevant
opcodes listed above. Then a value is calculated for every relevant opcode, according to the
following formula (example for the “call” opcode):

call percentage = (number of call instructions * size of the call opcode) / size of the .text section
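The formula can be expressed directly in the counting script. The opcode list is the one used in the study; the counts and the 5-byte size of a near call in the example are illustrative values, not measured data.

```python
RELEVANT_OPCODES = ["call", "nop", "int", "rdtsc", "sbb", "shld",
                    "fdivp", "imul", "pushf", "setb", "fild", "xor"]

def opcode_percentage(count: int, opcode_size: int, text_size: int) -> float:
    """(number of occurrences * size of the opcode's encoding)
    divided by the size of the whole .text section."""
    return (count * opcode_size) / text_size

# e.g. 120 call instructions of 5 bytes each in a 60000-byte .text section
print(opcode_percentage(120, 5, 60000))  # -> 0.01
```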
Results Export
The same Python script then exports an Excel table of the averaged opcode values for every
disassembled file (benign software & malware).
Weka
The generated Excel file is fed to the WEKA data mining tool for analysis. Since the target field is
nominal, the J48 and KStar algorithms are used. The test option used to validate the model is
cross-validation with a percentage split of 66%.
Prediction parameters
TP (true positive): rate of valid prediction of malware
TN (true negative): rate of valid prediction of benign software
FP (false positive): missed malware prediction, i.e. malware predicted as benign software
FN (false negative): missed benign-software prediction, i.e. benign software predicted as
malware
Noise method
Randomized noise with intensity ranging from 2% to 50% is applied to the generated Excel
table. The noise is applied only to the head of the tree, i.e. in our case to the “imul”
instruction. Applying such noise determines the robustness of the model, in other words
whether standard noise undermines the study's conclusion. Since the noise source can be the
result of non-standard code, for example when code is written directly in assembler by
programmers, we assume that typical noise can have an intensity of 30%. If the malware
prediction score remains greater than 90%, we can conclude that the generated model is
robust enough.
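The exact perturbation procedure is not specified, so the sketch below is one plausible reading: each “imul” value is scaled by a random factor within the chosen intensity. The column values and the fixed seed are illustrative.

```python
import random

def add_noise(values, intensity, seed=42):
    """Scale each value by a factor drawn uniformly from
    [1 - intensity, 1 + intensity], e.g. intensity=0.30 for 30% noise."""
    rng = random.Random(seed)
    return [v * rng.uniform(1 - intensity, 1 + intensity) for v in values]

imul_column = [0.0031, 0.0007, 0.0054, 0.0012]  # hypothetical per-file values
noisy = add_noise(imul_column, intensity=0.30)
```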
Results
In this research a deterministic way of predicting malware was formulated and tested. The
method proved highly successful in differentiating between malware and benign software. A
key factor in efficiently identifying malware was having an appropriate set of instructions. To
find the ideal instruction set, I performed the investigation in a recurrent manner, as described
below:
Recurrent WEKA process
In contrast to the cited research on malware detection via data mining, such as Sanjam Singla
et al. [12], in this study the analysis did not stop when a high level of predictability was
achieved. In every WEKA iteration the “head of the tree” was removed and a new analysis was
performed, to determine whether that opcode is key to the prediction. Obtaining a good result
after removing the head of the tree means that the prediction does not depend on that
opcode. The following describes the recurrent WEKA process used:
First run
In the first WEKA run, all the potential opcodes are taken into account, i.e. call, nop, int, rdtsc,
sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 98%, with the “call”
opcode at the head of the tree.

Second run
The call opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb, fild and xor. The prediction score was 97%, with the “xor” opcode at the head of
the tree.

Third run
The xor opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb and fild. The prediction score was 97%, with the “int” opcode at the head of the
tree.

Fourth run
The int opcode was removed, so WEKA was run with: nop, rdtsc, sbb, shld, fdivp, imul, pushf,
setb and fild. The prediction score was 96%, with the “rdtsc” opcode at the head of the tree.

Fifth run
The rdtsc opcode was removed, so WEKA was run with: nop, sbb, shld, fdivp, imul, pushf, setb
and fild. The prediction score was 95%, with the “sbb” opcode at the head of the tree.

Sixth run
The sbb opcode was removed, so WEKA was run with: nop, shld, fdivp, imul, pushf, setb and
fild. The prediction score was 91%, with the “imul” opcode at the head of the tree.

Seventh run
The imul opcode was removed, so WEKA was run with: nop, shld, fdivp, pushf, setb and fild.
The prediction score decreased to 78%, so the recurrent processing was stopped due to the
significant drop. The last acceptable score was 91%, with opcodes: nop, shld, fdivp, imul,
pushf, setb and fild.
Results summary per round

Data set                                  | Algorithm | TP  | FP | TN  | FN | RMSE | Fscore
All                                       | J48       | 307 | 4  | 102 | 4  | 0.13 | 0.98
All                                       | KStar     | 310 | 1  | 99  | 7  | 0.13 | 0.98
call removed                              | J48       | 306 | 5  | 100 | 6  | 0.15 | 0.97
call removed                              | KStar     | 311 | 0  | 95  | 11 | 0.14 | 0.97
call & xor removed                        | J48       | 308 | 3  | 99  | 7  | 0.15 | 0.97
call & xor removed                        | KStar     | 311 | 0  | 89  | 17 | 0.19 | 0.95
call, xor & int removed                   | J48       | 307 | 4  | 101 | 5  | 0.14 | 0.97
call, xor & int removed                   | KStar     | 309 | 2  | 92  | 14 | 0.18 | 0.96
call, xor, int & rdtsc removed            | J48       | 304 | 7  | 97  | 9  | 0.19 | 0.96
call, xor, int & rdtsc removed            | KStar     | 309 | 2  | 92  | 14 | 0.18 | 0.96
call, xor, int, rdtsc & sbb removed       | J48       | 306 | 5  | 98  | 8  | 0.17 | 0.96
call, xor, int, rdtsc & sbb removed       | KStar     | 311 | 0  | 71  | 35 | 0.24 | 0.91
call, xor, int, rdtsc, sbb & imul removed | J48       | 308 | 3  | 33  | 73 | 0.37 | 0.78
call, xor, int, rdtsc, sbb & imul removed | KStar     | 310 | 1  | 45  | 61 | 0.31 | 0.82
Rules generated by WEKA
The following graph represents the rules generated by WEKA for malware prediction, with a
score of 96%.
The following table represents the confusion matrix for the J48 algorithm:

      | clean | virus
clean | 102   | 4
virus | 4     | 307
Noise on model
The following table summarizes the noise intensity applied at the head of the tree, with the
corresponding Fscore (data set: call, xor, int, rdtsc & sbb removed):

Noise intensity in % on "imul" variable | Fscore
 0 | 0.969
 2 | 0.966
 5 | 0.964
 8 | 0.966
10 | 0.966
13 | 0.964
16 | 0.966
20 | 0.954
25 | 0.961
30 | 0.961
35 | 0.946
40 | 0.954
45 | 0.939
50 | 0.916

The following gives a graphical representation of the above results.
Result Discussion and Future Research
The goal of this study was to demonstrate that malware can be caught via analysis of executable
opcodes. I have shown that malware detection via data mining is a very promising method, since
the prediction score achieved is 96% for 2014 malware. This study found a different set of
instructions that point to code being malware, compared to previous research done on this
topic. The final instructions found in the generated tree are “imul”, “pushf” and “fild”. As
explained before, those instructions are commonly used by “packer” and “protector” software in
order to unpack/decrypt the malware code.
The novel aspects of the study, compared to previous research done in this field, are:
- Analysis of the malware surface: Since most malware uses packers and protectors to hide
from security researchers, only a small portion of the malware code can be analyzed after
disassembly. This study showed that analysis of that small portion of code (the loader) is
enough to detect whether an executable is malware.
- Recurrent runs of WEKA: Unlike previous research, the WEKA data mining tool was run many
times, in order to find the instructions that really influence the prediction. Originally the
“call” instruction was singled out as identifying software as malware, as in the research by
Sanjam Singla et al. [12], but after recurrent runs of WEKA the “imul” instruction was found
to be the identifier of malware.
Future Research
Extension of the opcodes set
Since malware changes very frequently, in future research the instruction set used for prediction
must be enlarged and updated according to new malware. Furthermore, malware commonly
uses rare instructions that will never be generated by a compiler, so having such rare
instructions in the instruction dataset will help to recognize it.
Sensitive system calls
In this study, only malware instructions were checked. As claimed in the bibliography above,
malware commonly uses sensitive calls like “VirtualAllocEx” and “IsDebuggerPresent”. Use of
such calls can point to malware as well.
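A crude way to check for such calls is a byte-string scan for the imported API names. The list below is illustrative, and a real check would parse the PE import table rather than search raw bytes:

```python
# Illustrative list of sensitive Windows API names often seen in malware
SUSPICIOUS_CALLS = [b"VirtualAllocEx", b"IsDebuggerPresent",
                    b"WriteProcessMemory"]

def suspicious_calls(binary: bytes):
    """Return the sensitive API names that appear verbatim in the binary."""
    return [name.decode() for name in SUSPICIOUS_CALLS if name in binary]

sample = b"...\x00IsDebuggerPresent\x00kernel32.dll\x00..."
print(suspicious_calls(sample))  # -> ['IsDebuggerPresent']
```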
PE header analysis
Another approach to detecting malware is to check the program header format of the
executable. Since the packing/protection mechanism modifies the “entry point” of the program,
we can find PE header malformations that point to malware as well.
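A minimal sketch of reading the entry point from a PE header, as a starting point for such an analysis. The offsets follow the standard PE/COFF layout (e_lfanew at 0x3C, a 4-byte signature, a 20-byte COFF header, then AddressOfEntryPoint 16 bytes into the optional header); anomaly heuristics on the value itself are left out.

```python
import struct

def pe_entry_point(data: bytes) -> int:
    """Return AddressOfEntryPoint from a PE image's optional header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    pe_off = struct.unpack_from("<I", data, 0x3C)[0]   # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # signature (4 bytes) + COFF header (20 bytes) + offset 16 in optional header
    return struct.unpack_from("<I", data, pe_off + 4 + 20 + 16)[0]
```

An entry point that falls outside the declared code section is one of the malformations packers commonly introduce.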
Resources List
[1] http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[2] https://en.wikipedia.org/wiki/Disassembler
[3] https://en.wikipedia.org/wiki/Opcode
[4] http://www.felixcloutier.com/x86/
[5] http://www.foxbusiness.com/technology/2013/07/22/report-cyber-crime-costs-global-economy-up-to-1-trillion-year/
[6] http://www.nothink.org/honeypots/malware-archives/
[7] https://en.wikipedia.org/wiki/Freeware
[8] http://www.mcafee.com/us/about/news/2014/q4/20141209-01.aspx
[9] https://en.wikipedia.org/wiki/Antivirus_software
[10] M. Schultz, E. Eskin and E. Zadok, "Data Mining Methods for Detection of New Malicious
Executables", 2001
[11] U. Baldangombo et al., "A Static Malware Detection System Using Data Mining"
[12] S. Singla et al., "A Novel Approach to Malware Detection using Static Classification", 2015
[13] D. Bilar, "Opcodes as Predictor for Malware", International Journal of Electronic Security
and Digital Forensics, 2007
[14] http://www.nothink.org/honeypots/malware-archives/
[15] https://www.hex-rays.com/products/ida/index.shtml
[16] https://en.wikipedia.org/wiki/Executable_compression
[17] https://en.wikipedia.org/wiki/Weka_%28machine_learning%29