TRANSCRIPT
Open University, Data Mining Seminar 13802
Semester 2015b
Malware Detection via Data Mining
Prof Roy Gelbard
David Zivi 204785638
Contents
Terminology and Definitions
Introduction
Research Question
    Study goal
    Study importance
Mapping of Knowledge Elements
Bibliography Review
    Signature-based detection
    Heuristic-based detection
    Behavioral-based detection
    Sandbox detection
    Data mining techniques
Research methodology
    Raw Data Acquisition
    Extraction of Significant Data
    Opcode Relevance in Malware
    Average Calculation
    Results Export
    Weka
    Prediction parameters
    Noise method
Results
    Recurrent WEKA process
        First run
        Second run
        Third run
        Fourth run
        Fifth run
        Sixth run
        Seventh run
Results summary per round
Rules generated by WEKA
    Noise on model
Result Discussion and Future Research
    Future Research
        Extension of the opcodes set
        Sensitive system calls
        PE header analysis
Resources List
Terminology and Definitions
Virus: A computer virus is a type of malware that propagates by inserting a copy
of itself into, and becoming part of, another program. It spreads from one
computer to another, leaving infections as it travels. Viruses can range in severity
from causing mildly annoying effects to damaging data or software and causing
denial-of-service (DoS) conditions. Almost all viruses are attached to an
executable file, which means the virus may exist on a system, but will not be
active or able to spread until a user runs or opens the malicious host file or
program. When the host code is executed, the viral code is executed as well. [1]
Disassembler: a computer program that translates machine language into
assembly language—the inverse operation to that of an assembler. Disassembly,
the output of a disassembler, is often formatted for human-readability rather
than suitability for input to an assembler, making it principally a reverse-
engineering tool. [2]
Opcode: In computing, an opcode (abbreviated from operation code) is the
portion of a machine language instruction that specifies the operation to be
performed. Besides the opcode itself, instructions usually specify the data they will
process, in the form of operands. In addition to opcodes used in instruction set
architectures of various CPUs, which are hardware devices, opcodes can also be
used in abstract computing machines as part of their byte code specifications. [3]
x86 instruction set: x86 is a family of backward compatible instruction set
architectures based on the Intel 8086 CPU and its Intel 8088 variant. The 8086
was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit based 8080
microprocessor, with memory segmentation as a solution for addressing more
memory than can be covered by a plain 16-bit address. The term "x86" came into
being because the names of several successors to the Intel's 8086 processor
ended in "86", including 80186, 80286, 80386 and 80486 processors. [4]
WEKA: A workbench that contains a collection of visualization tools and
algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation,
where each data point is described by a fixed number of attributes. [17]
Introduction
In this study I present a technique I developed to determine, using data mining,
whether an application is malware. The underlying method is to teach the system how
to differentiate between software that is and is not malware, using a dataset
represented by a list of instructions that potentially characterize malware. The list
of instructions was collected from previous research done on this topic.
The technique consists of two steps: the first is disassembly and opcode frequency
calculation; the second is application of the J48 learning algorithm provided by the
WEKA library.
For the dataset, I extracted relevant data from 300 known malware samples [6] and 150
types of benign software typically found on a home computer under “c:\program files”.
This data was fed to WEKA, which generated rules to be used to determine whether a
piece of software is malware.
These rules were run on the dataset I compiled (cross-validation) and were able to
predict, with an accuracy of 96%, whether a piece of software is malware.
In order to check the robustness of the rules, noise with intensity ranging from 2%
to 50% was randomly applied to the relevant data, without significant regression in
the score mentioned above: with noise of up to 50%, the prediction score decreased
only to 91%.
Research Question
Study goal: Today, with the exponential growth of “freeware” software [7], users and
corporations can find a large variety of applications and utilities that can be
installed for free. Since those applications come from unknown sources, the question
raised is: can a user or corporation benefit from free applications without
compromising their entire system?
The goal of this study is to build rules that will determine whether an executable or
library received from a third party can be trusted. This study does not purport to
replace known anti-viruses, but to propose a complementary mechanism that makes up
for the weaknesses of known anti-virus programs.
Limitations of current methods:
The common technique used by anti-viruses to determine whether an executable is
malware is scanning. A scanner searches all files in memory and on disk for code
snippets that uniquely identify a file as malware. Such mechanisms have two main
weaknesses:
- Attackers interested in propagating a known malware can simply change the code
snippets that the anti-virus is looking for.
- New malware not yet classified will be considered benign software until it is
analyzed and classified. In such a case the malware will continue to infect the
system and spread to new systems until the anti-virus is updated.
Study importance: According to a newly released report sponsored by McAfee, global
cyber activity is costing up to $500 billion each year, which is almost as much as
the estimated cost of drug trafficking [5]. In the third quarter of 2015, McAfee Labs
detected more than 307 new threats every minute, or more than five every second, with
mobile malware samples growing by 16 percent during the quarter and overall malware
surging by 76 percent year over year [8].
Malware becomes more and more sophisticated and provides high revenues to its owners.
Malware authors have become well organized and structured, with impressive skills and
highly qualified resources. Due to the exponential growth of malware and its agility
in camouflaging itself, keeping a sterile system has become a tremendous task for
users and corporations. Since current anti-viruses run against known-malware
databases that are updated once a day at best, malware has an entire day to infect
systems before it is caught.
Mapping of Knowledge Elements

Characteristic/Process | In Human World | In Machine World
Data | Software editor; software behavior | All the data is saved in an automatic way; every malware has its own signature; malware characteristics are saved in a database
Information | Collect information about known malware; collect information about software to check | Basic statistical calculation on raw data
Knowledge | Can get a global feeling about the software we want to check; a tendency can be deduced | Run an algorithm on the software in order to determine whether we are dealing with malware
Data transformation to Information & Knowledge | With the data we have, we can deduce which software we have to check | Installation validation based on a decision tree
Information & Knowledge transformation to Data | The knowledge can be transformed to data; for example, the software signature can be saved in a database | Exists in a learning system, where the conclusions and knowledge are automatically translated to data
Transformation of tacit knowledge to explicit knowledge | A malware analyst writes in a formal way the conclusions reached about a malware pattern | Does not exist
Transformation of explicit knowledge to tacit knowledge | A malware analyst learns from explicit knowledge | Does not exist
Knowledge contribution in decision | Based on explicit & tacit knowledge, the analyst decides to accept or reject software | According to the decision tree
Knowledge contribution to innovation | Update of anti-virus engines based on knowledge | Does not exist
Learning and knowledge sharing | Learning of new techniques used by malware | The system automatically updates its search criteria
Bibliography Review
The exponential growth of malware encourages security researchers to invent new techniques to
protect computers and networks. The various techniques used for malware detection [9] are
described below:
Signature-based detection:
This technique is the most common method used to identify viruses and other malware. The anti-virus engine compares the contents of a file to its database of known malware signatures. Such a technique requires a daily update of the malware database.
Heuristic-based detection:
This technique is generally used together with signature-based detection. It detects malware based on characteristics typically used in known malware code.
Behavioral-based detection:
This technique is similar to heuristic-based detection and is also used in Intrusion Detection Systems. The main difference is that, instead of relying on characteristics hardcoded in the malware code itself, it is based on the behavioral fingerprint of the malware at run time. Clearly, this technique is able to detect (known or unknown) malware only after it has started doing its malicious actions.
Sandbox detection:
This technique is a particular kind of behavioral-based detection that, instead of detecting the behavioral fingerprint at run time, executes the program in a virtual environment, logging whatever actions the program performs. Depending on the actions logged, the anti-virus engine can determine whether the program is malicious. If not, the program is executed in the real environment. Even though this technique has been shown to be quite effective, it is heavy and slow, so it is rarely used in end-user anti-virus solutions.
Data mining techniques:
Data mining techniques are one of the latest approaches applied in malware detection. Data mining and machine learning algorithms are used to try to classify the behavior of a file (as either malicious or benign) given a series of features extracted from the file itself. In this study the focus was on the following techniques:
- Malware detection via analysis of number of strings, call and binary patterns [10]
- Malware detection via analysis of program executable header [11]
- Malware prediction via function call frequency, usage of non-standard
instructions and use of suspicious system calls [12]
- Malware detection via statistical analysis of opcode distributions [13]
Research methodology
The following picture describes the flow used in this study to generate rules that will be able to
catch malware:
Raw Data Acquisition
In order to learn from malware and benign software behavior, a large database of samples of
both malware and benign software is needed. Since malware has been analyzed and classified
by security researchers, it is quite easy to find malware databases on the internet [14]. In this
study, all the known and classified malware from the year 2014 is used. In order to prevent noise
in the dataset, derivatives of the same malware are not included in the sample. All 400 families
of malware found in 2014 were classified and used in this study. The benign software dataset is
represented by standard applications located under “c:\program files (x86)”, such as “Outlook”,
“Word”, “Excel” and “Calculator”. These were taken from a non-infected computer.
Note: For security reasons, the malware database is protected by a password that can be
retrieved from [14]. All the research and access to malware for this study were done on
dedicated virtual machines in order to prevent unintentional infection.
Extraction of Significant Data
According to the aforementioned bibliography on malware detection via data mining, this study
focuses on the following opcodes: call, nop, int, rdtsc, sbb, shld, fdivp, imul, pushf, setb, fild and
xor. In this study these opcodes serve as the criteria for malware detection. All the above
opcodes are extracted from executables and libraries using the IDA [15] disassembler.
Opcode Relevance in Malware
One of the main challenges posed by malware is its capability for “camouflage”. In order to survive anti-virus engines and analysis by security researchers, malware must hide itself, since once it is discovered it is automatically removed from the infected computer. Moreover, when a malware sample is discovered, its characteristics are shared with the entire community.

To hide themselves, malware authors use a technique called “packing” [16], which consists of compressing/encrypting the original executable. When the compressed executable is run, a decompression/decryption stub recreates the original code from the compressed/encrypted data before executing it. Executable compression is also frequently used to deter reverse engineering or to obfuscate the contents of the executable, for example to hide the presence of malware from anti-virus scanners. Executable compression can be used to prevent direct disassembly; it masks string literals and modifies signatures. Although this does not eliminate the possibility of reverse engineering, it makes the process more costly. The following picture illustrates how the “packing” mechanism works:
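The packing idea can be sketched as a toy example, separate from the figure referenced above. This is only an illustration, not a real packer: real packers emit a native loader stub, and the use of zlib here is my own choice.

```python
import zlib

def pack(code: bytes) -> bytes:
    """Builder side: compress the original executable code."""
    return zlib.compress(code)

def unpack(packed: bytes) -> bytes:
    """Loader-stub side: recreate the original code before it would be
    executed. Here we only return the recovered bytes instead of running them."""
    return zlib.decompress(packed)

original = b"\x55\x89\xe5\x31\xc0\x5d\xc3"  # stand-in for a .text section
packed = pack(original)
assert unpack(packed) == original           # round trip recovers the code
```

Note that a scanner looking at the packed bytes sees neither the original opcodes nor the original string literals; only the small loader stub remains in the clear, which is why this study analyzes the surface code.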
Average Calculation
A Python script is used to count the number of instances found for each of the relevant
opcodes listed above. Then a value is calculated for every relevant opcode, according to the
following formula (example for the “call” opcode):

call percentage = (number of call instructions * size of the call opcode) / size of the .text section
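The formula can be expressed directly in the counting script. The opcode list is the one used in the study; the counts and the 5-byte size of a near call in the example are illustrative values, not measured data.

```python
RELEVANT_OPCODES = ["call", "nop", "int", "rdtsc", "sbb", "shld",
                    "fdivp", "imul", "pushf", "setb", "fild", "xor"]

def opcode_percentage(count: int, opcode_size: int, text_size: int) -> float:
    """(number of occurrences * size of the opcode's encoding)
    divided by the size of the whole .text section."""
    return (count * opcode_size) / text_size

# e.g. 120 call instructions of 5 bytes each in a 60000-byte .text section
print(opcode_percentage(120, 5, 60000))  # -> 0.01
```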
Results Export
The same Python script then exports an Excel table of the averaged opcode values for every
disassembled file (benign software & malware).
Weka
The generated Excel file is fed to the WEKA data mining tool for analysis. Since the target field is
nominal, the J48 and KStar algorithms are used. The test option used to validate the model is
cross-validation with a percentage split of 66%.
Prediction parameters
TP (true positive): rate of valid prediction of malware
TN (true negative): rate of valid prediction of benign software
FP (false positive): missed malware prediction, i.e. malware predicted as benign software
FN (false negative): missed benign-software prediction, i.e. benign software predicted as
malware
Noise method
Randomized noise with intensity ranging from 2% to 50% is applied to the generated Excel
table. The noise is applied only to the head of the tree, i.e. in our case to the “imul”
instruction. Applying such noise determines the robustness of the model, in other words
whether standard noise undermines the study's conclusion. Since the noise source can be the
result of non-standard code, for example when code is written directly in assembler by
programmers, we assume that typical noise can have an intensity of 30%. If the malware
prediction score remains greater than 90%, we can conclude that the generated model is
robust enough.
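The exact perturbation procedure is not specified, so the sketch below is one plausible reading: each “imul” value is scaled by a random factor within the chosen intensity. The column values and the fixed seed are illustrative.

```python
import random

def add_noise(values, intensity, seed=42):
    """Scale each value by a factor drawn uniformly from
    [1 - intensity, 1 + intensity], e.g. intensity=0.30 for 30% noise."""
    rng = random.Random(seed)
    return [v * rng.uniform(1 - intensity, 1 + intensity) for v in values]

imul_column = [0.0031, 0.0007, 0.0054, 0.0012]  # hypothetical per-file values
noisy = add_noise(imul_column, intensity=0.30)
```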
Results
In this research a deterministic way of predicting malware was formulated and tested. The
method proved highly successful in differentiating between malware and benign software. A
key factor in efficiently identifying malware was having an appropriate set of instructions. To
find the ideal instruction set, I performed the investigation in a recurrent manner, as described
below:
Recurrent WEKA process
In contrast to the cited research on malware detection via data mining, such as Sanjam Singla
et al. [12], in this study the analysis did not stop when a high level of predictability was
achieved. In every WEKA iteration the “head of the tree” was removed and a new analysis was
performed, to determine whether that opcode is key to the prediction. Obtaining a good result
after removing the head of the tree means that the prediction does not depend on that
opcode. The following describes the recurrent WEKA process used:
First run
In the first WEKA run, all the potential opcodes are taken into account, i.e. call, nop, int, rdtsc,
sbb, shld, fdivp, imul, pushf, setb, fild and xor. The prediction score was 98%, with the “call”
opcode at the head of the tree.

Second run
The call opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb, fild and xor. The prediction score was 97%, with the “xor” opcode at the head of
the tree.

Third run
The xor opcode was removed, so WEKA was run with: nop, int, rdtsc, sbb, shld, fdivp, imul,
pushf, setb and fild. The prediction score was 97%, with the “int” opcode at the head of the
tree.

Fourth run
The int opcode was removed, so WEKA was run with: nop, rdtsc, sbb, shld, fdivp, imul, pushf,
setb and fild. The prediction score was 96%, with the “rdtsc” opcode at the head of the tree.

Fifth run
The rdtsc opcode was removed, so WEKA was run with: nop, sbb, shld, fdivp, imul, pushf, setb
and fild. The prediction score was 95%, with the “sbb” opcode at the head of the tree.

Sixth run
The sbb opcode was removed, so WEKA was run with: nop, shld, fdivp, imul, pushf, setb and
fild. The prediction score was 91%, with the “imul” opcode at the head of the tree.

Seventh run
The imul opcode was removed, so WEKA was run with: nop, shld, fdivp, pushf, setb and fild.
The prediction score decreased to 78%, so the recurrent processing was stopped due to the
significant drop. The last acceptable score was 91%, with opcodes: nop, shld, fdivp, imul,
pushf, setb and fild.
Results summary per round

Data set                                  | Algorithm | TP  | FP | TN  | FN | RMSE | Fscore
All                                       | J48       | 307 | 4  | 102 | 4  | 0.13 | 0.98
All                                       | KStar     | 310 | 1  | 99  | 7  | 0.13 | 0.98
call removed                              | J48       | 306 | 5  | 100 | 6  | 0.15 | 0.97
call removed                              | KStar     | 311 | 0  | 95  | 11 | 0.14 | 0.97
call & xor removed                        | J48       | 308 | 3  | 99  | 7  | 0.15 | 0.97
call & xor removed                        | KStar     | 311 | 0  | 89  | 17 | 0.19 | 0.95
call, xor & int removed                   | J48       | 307 | 4  | 101 | 5  | 0.14 | 0.97
call, xor & int removed                   | KStar     | 309 | 2  | 92  | 14 | 0.18 | 0.96
call, xor, int & rdtsc removed            | J48       | 304 | 7  | 97  | 9  | 0.19 | 0.96
call, xor, int & rdtsc removed            | KStar     | 309 | 2  | 92  | 14 | 0.18 | 0.96
call, xor, int, rdtsc & sbb removed       | J48       | 306 | 5  | 98  | 8  | 0.17 | 0.96
call, xor, int, rdtsc & sbb removed       | KStar     | 311 | 0  | 71  | 35 | 0.24 | 0.91
call, xor, int, rdtsc, sbb & imul removed | J48       | 308 | 3  | 33  | 73 | 0.37 | 0.78
call, xor, int, rdtsc, sbb & imul removed | KStar     | 310 | 1  | 45  | 61 | 0.31 | 0.82
Rules generated by WEKA
The following graph represents the rules generated by WEKA for malware prediction, with a
score of 96%.
The following table represents the confusion matrix for the J48 algorithm:

      | clean | virus
clean | 102   | 4
virus | 4     | 307
Noise on model
The following table summarizes the noise intensity applied at the head of the tree, with the
corresponding Fscore (data set: call, xor, int, rdtsc & sbb removed):

Noise intensity in % on "imul" variable | Fscore
 0 | 0.969
 2 | 0.966
 5 | 0.964
 8 | 0.966
10 | 0.966
13 | 0.964
16 | 0.966
20 | 0.954
25 | 0.961
30 | 0.961
35 | 0.946
40 | 0.954
45 | 0.939
50 | 0.916

The following gives a graphical representation of the above results.
Result Discussion and Future Research
The goal of this study was to demonstrate that malware can be caught via analysis of executable
opcodes. I have shown that malware detection via data mining is a very promising method, since
the prediction score achieved is 96% for 2014 malware. This study found a different set of
instructions that point to code being malware, compared to previous research done on this
topic. The final instructions found in the generated tree are “imul”, “pushf” and “fild”. As
explained before, those instructions are commonly used by “packer” and “protector” software in
order to unpack/decrypt the malware code.
The novel aspects of the study, compared to previous research done in this field, are:
- Analysis of the malware surface: Since most malware uses packers and protectors to hide
from security researchers, only a small portion of the malware code can be analyzed after
disassembly. This study showed that analysis of that small portion of code (the loader) is
enough to detect whether an executable is malware.
- Recurrent runs of WEKA: Unlike previous research, the WEKA data mining tool was run many
times, in order to find the instructions that really influence the prediction. Originally the
“call” instruction was singled out as identifying software as malware, as in the research by
Sanjam Singla et al. [12], but after recurrent runs of WEKA the “imul” instruction was found
to be the identifier of malware.
Future Research
Extension of the opcodes set
Since malware changes very frequently, in future research the instruction set used for prediction
must be enlarged and updated according to new malware. Furthermore, malware commonly
uses rare instructions that will never be generated by a compiler, so having such rare
instructions in the instruction dataset will help to recognize it.
Sensitive system calls
In this study, only malware instructions were checked. As claimed in the bibliography above,
malware commonly uses sensitive calls like “VirtualAllocEx” and “IsDebuggerPresent”. Use of
such calls can point to malware as well.
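A crude way to check for such calls is a byte-string scan for the imported API names. The list below is illustrative, and a real check would parse the PE import table rather than search raw bytes:

```python
# Illustrative list of sensitive Windows API names often seen in malware
SUSPICIOUS_CALLS = [b"VirtualAllocEx", b"IsDebuggerPresent",
                    b"WriteProcessMemory"]

def suspicious_calls(binary: bytes):
    """Return the sensitive API names that appear verbatim in the binary."""
    return [name.decode() for name in SUSPICIOUS_CALLS if name in binary]

sample = b"...\x00IsDebuggerPresent\x00kernel32.dll\x00..."
print(suspicious_calls(sample))  # -> ['IsDebuggerPresent']
```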
PE header analysis
Another approach to detecting malware is to check the program header format of the
executable. Since the packing/protection mechanism modifies the “entry point” of the program,
we can find PE header malformations that point to malware as well.
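A minimal sketch of reading the entry point from a PE header, as a starting point for such an analysis. The offsets follow the standard PE/COFF layout (e_lfanew at 0x3C, a 4-byte signature, a 20-byte COFF header, then AddressOfEntryPoint 16 bytes into the optional header); anomaly heuristics on the value itself are left out.

```python
import struct

def pe_entry_point(data: bytes) -> int:
    """Return AddressOfEntryPoint from a PE image's optional header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    pe_off = struct.unpack_from("<I", data, 0x3C)[0]   # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # signature (4 bytes) + COFF header (20 bytes) + offset 16 in optional header
    return struct.unpack_from("<I", data, pe_off + 4 + 20 + 16)[0]
```

An entry point that falls outside the declared code section is one of the malformations packers commonly introduce.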
Resources List
[1] http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[2] https://en.wikipedia.org/wiki/Disassembler
[3] https://en.wikipedia.org/wiki/Opcode
[4] http://www.felixcloutier.com/x86/
[5] http://www.foxbusiness.com/technology/2013/07/22/report-cyber-crime-costs-global-economy-up-to-1-trillion-year/
[6] http://www.nothink.org/honeypots/malware-archives/
[7] https://en.wikipedia.org/wiki/Freeware
[8] http://www.mcafee.com/us/about/news/2014/q4/20141209-01.aspx
[9] https://en.wikipedia.org/wiki/Antivirus_software
[10] M. Schultz, E. Eskin and E. Zadok, "Data Mining Methods for Detection of New Malicious
Executables", 2001
[11] U. Baldangombo et al., "A Static Malware Detection System Using Data Mining"
[12] S. Singla et al., "A Novel Approach to Malware Detection using Static Classification", 2015
[13] D. Bilar, "Opcodes as Predictor for Malware", International Journal of Electronic Security
and Digital Forensics, 2007
[14] http://www.nothink.org/honeypots/malware-archives/
[15] https://www.hex-rays.com/products/ida/index.shtml
[16] https://en.wikipedia.org/wiki/Executable_compression
[17] https://en.wikipedia.org/wiki/Weka_%28machine_learning%29