Linux malware detection by hybrid analysis
By
ANMOL KUMAR SHRIVASTAVA
Department of Computer Science
INDIAN INSTITUTE OF TECHNOLOGY, KANPUR
A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF TECHNOLOGY
Under the supervision of:
DR. SANDEEP SHUKLA
MAY 2018
Abstract
Name of the student: Anmol Kumar Shrivastava
Roll No: 16111031
Degree for which submitted: M.Tech.
Department: Computer Science and Engineering
Thesis title: Linux malware detection by hybrid analysis
Thesis supervisor: Dr. Sandeep Shukla
Month and year of thesis submission: May 2018
Over the past two decades, the cyber-security research community has focused on detecting malicious programs for Windows-based operating systems. However, the recent exponential growth in popularity of IoT (Internet of Things) devices is causing the malware landscape to change rapidly. This so-called 'IoT revolution' has fueled the interest of malware authors, leading to an exponential growth in Linux malware. The increasing number of malware samples is becoming a serious threat to data privacy as well as to expensive computing resources. Manual malware analysis is not effective given the large number of such cases. Furthermore, malware authors use various obfuscation techniques to impede detection by traditional signature-based anti-virus systems. As a result, automated yet robust malware analysis is much needed. In this thesis, we develop a hybrid approach that integrates both static and dynamic features of a binary to detect malware efficiently. We performed our analysis on 7717 malware and 2265 benign files and obtained a highly promising detection accuracy of 99.14%. All prior work on Linux malware analysis used fewer than 1000 malware samples, and hence the accuracy numbers they report are not fully validated. Our work improves over prior work in two ways: a substantial enhancement in the dataset, and hybrid analysis based on both static and dynamic features.
Acknowledgements
I would like to extend my sincerest gratitude to my thesis supervisor, Dr. Sandeep Shukla, for his unparalleled guidance and support. I was new to this field, and I cannot be thankful enough for all the wild ideas and new things I got to learn and explore under his guidance. I am grateful for his patience and all those weekly discussion sessions.
I would also like to thank my family for believing in me and my friends for their support. These two years would not have been the same without you all.
Last but not the least, I would like to thank Gaurav Kumar, who helped me in this work and made it possible to complete it on time.
TABLE OF CONTENTS
Page
List of Tables vii
List of Figures viii
1 Introduction 1
2 Problem Background 4
2.1 Linux Malware and types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Existing Malware detection strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Dynamic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Motivation For a new Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Past work 8
3.1 Static analysis approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Dynamic analysis approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Drawback of past works: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Analysis Infrastructure and Feature extraction 11
4.1 Analysis infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2.1 Static feature vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2.2 Dynamic Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 23
4.1.3 Machine learning classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Result And Discussion 31
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.2 Additional metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.5 Comparison To Existing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Scope and Future Work 36
6.1 Supporting Multiple architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Analysis on different file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3 Multi-path execution of files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A Appendix A 38
Bibliography 39
LIST OF TABLES
TABLE Page
4.1 Fields in ELF header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Fields in section header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Fields in segment header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Mean value comparison of different fields in ELF header . . . . . . . . . . . . . . . . . 19
5.1 Confusion matrix for a two class classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Test result on static feature set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Test result on hybrid features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Previous works on Linux malware analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 35
LIST OF FIGURES
FIGURE Page
1.1 VirusTotal stats1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 VirusTotal stats2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1 Architecture of hybrid model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 ELF file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Limon sandbox architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.4 ELF Layout in disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5 Frequency distribution of various sections . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.6 Frequency distribution of various section types . . . . . . . . . . . . . . . . . . . . . . 20
4.7 Frequency distribution of various segment types . . . . . . . . . . . . . . . . . . . . . 20
4.8 Frequency distribution of symbol table features . . . . . . . . . . . . . . . . . . . . . . 21
4.9 Example of GNU string output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.10 Example of strace output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.11 Benign sys call statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.12 Malware sys call statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.13 Benign proc file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.14 Malware proc files access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.15 Benign sys file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.16 Malware sys file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.17 Benign etc files access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.18 Malware etc file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
CHAPTER 1

INTRODUCTION
Over the past few decades, the Internet has grown tremendously, and so has technology. This growth of the Internet and advancement in technology have resulted in a steep increase in the number of Linux-based servers, routers and IoT (Internet of Things) devices. Recent trends show that Linux has now become a worthy new target for hackers.
FIGURE 1.1. Types of files submitted on VirusTotal [9] in the past 7 days
A report from AV-Test [1] shows that in 2016 MacOS computers saw a 360 percent increase in targeted malware compared to the previous year, but Linux was not far behind: it saw a 300 percent increase over the previous year. According to the WatchGuard Security report [11] for Q1 2017, Linux malware made up 36 percent of the top threats. In Figure 1.1, we can see that the number of ELF [3] files, i.e., Linux executables, submitted in the last seven days is very large, and their count is comparable to that of Windows executables. All these statistics show that Linux malware threats have reached an alarming state.
As Linux is open source, the latest threats are identified easily, and there are regular updates to the Linux kernel to catch up with new threats. Developers regularly provide system updates and use system protection mechanisms to protect systems from emerging threats. These make Linux one of the safest platforms. But here lies the problem: most IoT device and router vendors do not ship these system updates as frequently, and when they do, it takes a long time for consumers to download and install the updates on all their devices. This leaves the devices prone to exploitation.
As the Linux malware threat is on the rise, there is a need for anti-threat mechanisms to keep our data and systems secure. In this work, we present an automated Linux malware detection method which currently outperforms all the existing detection methods. Our work focuses on Linux binaries, i.e., ELF (Executable and Linkable Format) files, which is considered the standard binary file format.
FIGURE 1.2. Number of files submitted on VirusTotal in the past 7 days
As we see in Figure 1.2, the number of new and total malware samples submitted to VirusTotal is very high. Analyzing this amount of data by manual reverse engineering and unpacking is an impractical task. This leads to the need for an efficient automatic detection system.
Currently, malware analysts use two approaches for analysis, namely static and dynamic analysis. In static analysis, a file is analyzed just by looking at its structure and the data present in it. As the file is not executed, this is fast and easy to deploy, but there are limitations: for example, polymorphism/metamorphism and packers leave the file data encrypted or packed so that no further static analysis can be done on it. This motivates another approach, dynamic analysis, in which files are executed in a sandbox environment and their behavior is analyzed. Although this analysis is not affected by obfuscation and encryption of data, it has its limitations too. Some malware hide their true behavior when they find out that they are running in a controlled environment. Another limitation is that, since we usually monitor only one of the execution paths of the process, the rest of the code remains unexplored. So there is a need for a new detection mechanism which can overcome the limitations of both approaches.
In the following chapters, we discuss the Linux binary format (ELF), types of malware, and our methodology for detecting Linux malware efficiently.
CHAPTER 2

PROBLEM BACKGROUND
In this chapter, we describe some types of Linux malware, discuss the problems present in existing detection methodologies, and explain how we plan to overcome those problems using our approach.
2.1 Linux Malware and types
Malware stands for malicious software, which has the intention of disrupting infrastructure, stealing useful data, spying on a victim's computer, etc. Malware can be categorized by its actions. Some categories are listed below:
• Exploit:- Exploits are malware which use system vulnerabilities to attack. Some exploits target well-known vulnerabilities found in a system, and some first find a vulnerability in the system and then attack accordingly.
• Backdoor:- Some malware try to find backdoors to steal user information. On Linux, they may try to obtain the list of all registered users to find a backdoor, or sometimes they try to register themselves as a legitimate user to get access to the system.
• Virus:- A virus is malware which, when executed, affects other files by inserting its code and infecting them. Viruses spread very fast and can infect a large server in a very short time.
• DDoS:- DDoS stands for Distributed Denial of Service. This type of attack is seen quite frequently against Linux servers. When an attack succeeds, the server becomes unresponsive.
• Keylogger:- A keylogger is a type of malware which records all the keys pressed by the victim and sends them to its command and control server.
• Digital currency mining malware:- This type of malware tries to gain access to a system and then uses its resources for mining digital currency. Such malware may use machine learning methods to analyze the victim's usage and, on that basis, schedule its resource consumption so that it does not get exposed.
• Dropper:- This type of executable malware contains another executable. When executed, it installs the embedded executable and runs it in parallel, so that even if the dropper gets detected, the actual malware remains in the system.
Some malware combine the above-listed types to perform their malicious activity. So, distinguishing a malicious file from a non-malicious one becomes a major task once a file enters the system.
The work in this thesis aims to detect malware by performing analysis on a large corpus of data to make our model robust to zero-day malware. A zero-day malware exploits a security vulnerability on the same day that the vulnerability becomes known to the public or to the vendor who created the software. Our model outperforms many anti-viruses as it does not depend on malware signatures; it uses both static and dynamic features of an executable for detection.
2.2 Existing Malware detection strategy
Malware analysis is the method of dissecting a binary file to understand how it works and then devising methods to identify it and other similar files. It aims to gain information about the actions performed by the malware and then to develop a method to neutralize its effect and protect our systems from further infection.
Malware analysis can be used both for detecting and for classifying malware. Malware detection means labeling an executable as benign or malicious; it is therefore the first stage of malware analysis. Once an executable is detected as malware, further classification based on malware type and family can be performed on it. Malware analysis can be done in two basic ways: static and dynamic analysis. Static analysis aims to analyze a binary without executing it, whereas in dynamic analysis a binary is executed inside a sandbox and analysis is performed on its behavior.
2.2.1 Static Analysis
Static analysis examines the static properties of a binary. Without executing the binary, an examination can be performed on its ELF header, embedded strings, metadata, disassembly, etc. This analysis is fast, as we do not have to execute the binary. But along with its
pros, it has some limitations too. There are many techniques that malware authors nowadays use to thwart static analysis. Some of these techniques are described below:
• Packing:- In this technique malware authors apply an encryption or compression algorithm to the original executable to create a packed executable which contains an unpacking stub. When the executable is run, the first thing loaded in memory is the unpacking stub. This stub then unpacks the packed payload and transfers control to the actual entry point of the executable.
• Metamorphism:- A metamorphic malware changes its code each time it gets executed, without changing its actual functionality. This can be achieved by replacing an instruction with a similar instruction having a different opcode, inserting garbage code, changing the order of subroutines, etc.
• Polymorphism:- This type of malware also changes its shape as well as its signature. It has two parts: a decryption routine and an encrypted malware body. Malware authors use a randomly generated key to encrypt the malware body. Once the malware is loaded into memory, the decryption routine decrypts the encrypted part to perform the malicious activity. After execution, it encrypts itself again so that it does not get discovered.
Because of these techniques, each time the hash of the malware is taken it yields a different value, which lets the malware bypass signature-based detection systems. Nor can static code analysis be performed on this type of malware, as the code is encrypted. These limitations of static analysis motivate the need for another type of analysis which can overcome them.
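The fragility of hash-based signatures is easy to demonstrate: changing even a single byte of a binary (as re-packing with a fresh key does) produces an entirely different digest. The byte strings below are synthetic stand-ins for real binaries, used only for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()

# Two "binaries" that differ by a single byte, e.g. the same payload
# after re-packing (synthetic data, not real malware).
original = b"\x7fELF" + b"\x00" * 60 + b"payload"
repacked = b"\x7fELF" + b"\x00" * 60 + b"payloae"  # one byte changed

h1, h2 = sha256_hex(original), sha256_hex(repacked)
# A signature database keyed on h1 will never match h2, even though
# the two files are functionally identical.
print(h1)
print(h2)
```

This is why polymorphic and metamorphic malware defeat signature databases: every generation of the same family hashes to a new, unseen value.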
2.2.2 Dynamic Analysis
In dynamic analysis, an executable is run in a controlled environment or virtual machine, and its behavior is observed to deduce whether it is malware or not. When an executable runs, we can track which files it accesses, which IP addresses it tries to connect to, which new files are created, etc. The main advantage of dynamic analysis is that it remains unaffected by polymorphic or metamorphic malware, as it does not depend on the static code of the malware. Despite this advantage, it has some limitations too. Some of them are listed below:
• Incomplete code coverage:- During dynamic analysis we are able to monitor only a single execution path, which leads to incomplete code coverage.
• Detection of Sandbox Environment:- Some malware can detect whether it is running
in a controlled environment. When this happens, the malware does not show its true
behavior.
• Risk to the host machine:- If there is a bug in the sandbox environment, malware may escape the isolation, and the host machine or other computers in the network may get damaged or infected.
2.3 Motivation For a new Approach
As we have seen in the earlier sections, both static and dynamic analysis have limitations. Static analysis can be thwarted once an encryption algorithm is used, while dynamic analysis suffers from the low-code-coverage problem. What if we could combine the feature sets of both approaches? The dynamic features can provide full insight when static analysis is thwarted by obfuscation. On the other hand, static analysis can provide a full overview of the executable when dynamic analysis suffers from the code-coverage issue. Thus the two act as complements to each other. Malware authors can use packers, obfuscation and polymorphism/metamorphism techniques to bypass file-format-based or signature-based analysis. They can make the malware perform additional actions, such as randomly accessing files or calling random system calls, to confuse dynamic analysis. But bypassing both techniques at once is a much tougher job. In this work we have used this hybrid approach, and the results are quite promising. In later chapters, we discuss the architecture and features of our model in further detail.
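The hybrid idea boils down to encoding each sample twice and concatenating the results. A minimal sketch, in which the feature vocabularies (section names for the static side, system calls for the dynamic side) are purely illustrative placeholders, not the actual feature set used in this thesis:

```python
# Fixed vocabularies decided in advance over the whole dataset
# (illustrative names only: "UPX0" is a packer section, the dynamic
# names are common Linux system calls).
STATIC_VOCAB = [".text", ".data", ".bss", "UPX0"]      # section names
DYNAMIC_VOCAB = ["open", "read", "write", "connect"]   # system calls

def encode(counts: dict, vocab: list) -> list:
    """Turn a {feature: count} dict into a fixed-length count vector."""
    return [counts.get(name, 0) for name in vocab]

def hybrid_vector(static_counts: dict, dynamic_counts: dict) -> list:
    """Concatenate the static and dynamic count vectors into one."""
    return encode(static_counts, STATIC_VOCAB) + encode(dynamic_counts, DYNAMIC_VOCAB)

sample = hybrid_vector({".text": 1, "UPX0": 2}, {"connect": 5, "read": 9})
print(sample)  # [1, 0, 0, 2, 0, 9, 0, 5]
```

A classifier trained on such concatenated vectors sees both views at once, so an attacker must evade the static and the dynamic features simultaneously.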
2.4 Contribution
• To the best of our knowledge and literature survey, this is the first time hybrid analysis has been performed on Linux binaries using such a large dataset. Most of the previous works, which we survey in the next chapter, are interesting but are based on experiments on much smaller datasets.
• We have used our new hybrid approach to detect zero-day malware, and the results are quite promising.
• We have created an automated system to deal with a large number of files which would be difficult to analyze manually.
CHAPTER 3

PAST WORK
A large variety of malware is being used to attack critical infrastructure, steal private data, carry out financial fraud, etc. To protect themselves, big corporations and government organizations spend huge amounts of money on anti-virus software or on creating their own malware detection systems. Most of these anti-virus products and detection systems are based on signature-based or anomaly-based detection. A signature is usually a hash that uniquely identifies a specific malware. The signature-based technique uses the signatures of known malware; these signatures are developed by anti-virus companies to capture threats. This technique is efficient, fast and easy to deploy, but once an unseen or new malware enters the system, it fails. Only after that malware has affected numerous systems and analysts have been able to generate its signature can signature-based detection catch it. In the anomaly-based detection technique, anti-virus companies form rules about which actions they consider safe. If any of the rules is broken by a process, it is labeled as malicious. This technique is capable of capturing new malware, but it has a very high false-alarm rate.
Another method which is now gaining popularity is heuristic-based detection. In this technique, static, dynamic or behavior-based features of a dataset containing both benign and malicious files are used to train a machine learning classifier. In this chapter, we survey some of the works that have used these heuristic techniques: Section 3.1 discusses static analysis based approaches and Section 3.2 dynamic analysis based ones.
3.1 Static analysis approaches
In this section, we discuss some of the past works which have used the static analysis approach.
– In Shahzad, F. [15], the authors used the Executable and Linkable Format (ELF) for analysis and extracted 383 features from the ELF header. They used information gain as the feature selection algorithm and four well-known supervised machine learning algorithms, C4.5 Rules, PART, RIPPER and the J48 decision tree, for classification. Their dataset contained 709 benign executables collected from the Linux platform and 709 malware executables downloaded from VX Heavens [10] and Offensive Computing [5]. They reported nearly 99% detection accuracy with a false-alarm rate of less than 0.1.
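Information gain, the feature selection criterion used in the work above, measures how much knowing a feature's value reduces the entropy of the malware/benign label. The following is a generic sketch of the computation (not the cited authors' implementation); the example counts are invented for illustration.

```python
import math

def entropy(pos: int, neg: int) -> float:
    """Shannon entropy (in bits) of a two-class split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(split: dict) -> float:
    """Information gain of a binary feature.

    `split` maps each feature value (True/False) to a
    (malware_count, benign_count) pair for that partition.
    """
    total = sum(m + b for m, b in split.values())
    total_m = sum(m for m, _ in split.values())
    total_b = sum(b for _, b in split.values())
    # Gain = H(labels) - sum over partitions of weighted H(partition).
    gain = entropy(total_m, total_b)
    for m, b in split.values():
        gain -= (m + b) / total * entropy(m, b)
    return gain

# A feature present in 90 malware / 10 benign samples and absent in
# 10 malware / 90 benign samples is highly discriminative:
print(information_gain({True: (90, 10), False: (10, 90)}))
```

Features are then ranked by this score, and only the top-ranked ones are kept for the classifier.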
– Jinrong Bai et al. [12] proposed a new malware detection technique in which they extracted system calls from the symbol table of Linux executables. Out of many system calls, they selected 100 as features. Their method obtained an accuracy of 98% in malware detection. Their dataset contains 756 benign executables collected from Linux systems and 763 malware executables from VX Heavens.
3.2 Dynamic analysis approaches
In this section, we discuss some of the past works which have used dynamic analysis.
– Ashmita, K. et al. [13] proposed an approach based on system call features. They used 'strace' [8] to trace all the system calls of executables running in a controlled environment. The authors used a two-step correlation-based feature reduction: they first calculated feature-class correlation using information gain and entropy to rank the features, and in the next step removed redundant features by calculating feature-feature correlation. They used three supervised machine learning algorithms, J48, Random Forest and AdaBoost, for classification, with a feature vector of length 27. The authors used 668 files in their dataset, of which 442 were benign and 226 were malware executables. With this approach, they reported an accuracy of 99.40%.
– Shahzad, F., Bhatti et al. [14], [16] proposed the concept of genetic footprints, in which information mined from the Process Control Block (PCB) of the kernel is used to model the runtime behavior of a process. In this approach, the authors selected 16 out of the 118 available parameters of the task_struct for each running process; to decide which parameters to select, they claim to have performed a forensic study. They believe these parameters define the semantics and the behavior of the executing process, and call the selected parameters the genetic footprint of the process. The authors then generated dumps of all these parameters for 15 seconds at a resolution of 100 ms. All the instances of benign and malicious processes were classified using an RBF network, SVM, J48 decision tree and a propositional rule learner (JRip) in the Weka environment. The authors analyzed their results and shortlisted the J48 and JRip classifiers as having less class noise than the others. They also compared their work with other existing system-call-based solutions and discussed its robustness to evasion, including access to the task_struct for modification. They used a dataset of 105 benign and 114 malware processes and reported a 96 percent detection rate with a 0 percent false-alarm rate.
3.3 Drawback of past works:
– All of the past works seen in this literature survey used very small sample sizes, and hence their reported accuracy or false-positive rates may not reflect the real power of their malware classification methods.
– In the dynamic cases, they used a very restricted number of features.
– They do not handle zero-day malware at all.
– They do not take advantage of using both static and dynamic features.
CHAPTER 4

ANALYSIS INFRASTRUCTURE AND FEATURE EXTRACTION
In this chapter, we discuss the architecture of our detection system and the various feature engineering and modelling techniques we used to achieve good detection accuracy. Our work is based on 32-bit ELF executables but can be extended to the 64-bit version as well.
4.1 Analysis infrastructure
Figure 4.1 illustrates our analysis infrastructure. It has three phases: data generation, feature extraction and data modelling.
4.1.1 Data Generation
In this section, we describe how we generated static and dynamic reports from executables for further analysis in the subsequent phases. For static reports, we use:
– GNU strings:- strings is one of the utilities in Binutils. When run on a file, it searches for sequences of printable ASCII characters terminated by an unprintable character. Strings in an executable can be useful, as they sometimes give an overview of what the file is going to do: for example, if a malware tries to access /etc/passwd, we may see this path printed in the strings output, or if an executable tries to make a TCP connection, we may see an IP address embedded in the executable in ASCII.
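The core of what strings does can be sketched in a few lines: scan the raw bytes for runs of printable ASCII of some minimum length (GNU strings defaults to 4). This is a simplified sketch, not the real tool, and the sample blob is synthetic.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least `min_len` printable ASCII characters,
    similar in spirit to GNU strings (simplified sketch)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# A synthetic binary blob with two suspicious-looking strings in it;
# short runs like "ab" fall below the length threshold and are dropped.
blob = b"\x00\x01/etc/passwd\x00GET /\x7f\x03ab\x00192.168.0.1\x00"
print(extract_strings(blob))  # ['/etc/passwd', 'GET /', '192.168.0.1']
```

In our pipeline the strings output of each sample is one of the raw static reports fed to feature extraction.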
FIGURE 4.1. Architecture of our hybrid malware detection system
– readelf:- readelf [6] is a Unix binary utility that displays structural information about one or more ELF files. The ELF file format contains a lot of information that can be used in the detection of malware. It consists of the ELF header followed by the file data. More about the ELF format is discussed in later sections.
FIGURE 4.2. ELF file format
– Limon sandbox:- Limon [4] is a tool which allows us to run an executable in a sandboxed/controlled environment and gives us a report about what it did during its runtime. The main components of the Limon sandbox are a host machine which manages a guest machine. We used Ubuntu 16.04 as our host machine and a 32-bit Ubuntu 14.04 machine as our guest. To get a full picture of a file, it is executed with full privileges on the guest machine.
To run an executable in the Limon sandbox, its path is given on the command line. Each analysis is performed in a fresh virtual machine: while setting up the virtual machine, a snapshot is taken so that after execution of the file Limon can revert to it. At the end of the execution, the sandbox returns a text file containing the full trace of system calls and userspace functions. We used the default setting of 60 seconds for which a file is monitored. The architecture of the Limon sandbox is shown in Figure 4.3.
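The trace file returned by the sandbox is strace-style output, one system call per line. A minimal sketch of turning such a trace into per-syscall frequency counts, the kind of dynamic feature summarized later in Figures 4.11 and 4.12 (the sample lines below are illustrative, and real strace output has more variants, e.g. signals and resumed calls):

```python
import re
from collections import Counter

# An strace line normally starts with the syscall name followed by
# an opening parenthesis, e.g.: open("/etc/passwd", O_RDONLY) = 3
SYSCALL_LINE = re.compile(r"^(\w+)\(")

def syscall_counts(trace: str) -> Counter:
    """Count how often each system call appears in a strace-style trace."""
    counts = Counter()
    for line in trace.splitlines():
        m = SYSCALL_LINE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

trace = """\
open("/etc/passwd", O_RDONLY) = 3
read(3, "root:x:0:0"..., 4096) = 1024
read(3, "", 4096) = 0
close(3) = 0
"""
print(syscall_counts(trace))
```

The resulting counts per sample can then be encoded against a fixed syscall vocabulary to give a fixed-length dynamic feature vector.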
FIGURE 4.3. Limon sandbox architecture
4.1.2 Feature Extraction
As we use a hybrid approach, we first extract the static and dynamic features of executables separately and then integrate both for use in our model. In the following sections, we describe the various static and dynamic features we have used in our model.
4.1.2.1 Static feature vector
Static features are extracted from the ELF file format of both malware and benign executables. Before describing the features, we first briefly introduce the ELF format.
– Executable and Linkable Format
ELF (Executable and Linkable Format) is the standard binary file format on Unix and Unix-like systems. ELF binaries on Linux include executables, shared libraries, object files and core dumps. The ELF file format has three major parts:
* ELF header
* Segments
* Sections
FIGURE 4.4. ELF Layout in disk
Each of these parts plays an important role in the loading and linking process. Let us now look at the role and structure of each component one by one:
* ELF header
The ELF header is a data structure which gives information about the organization of the file. The various fields in the ELF header and their definitions are given in the table below:
Table 4.1: Fields in ELF header

e_ident: identification flags that help in decoding and interpreting the file content, e.g. the ELF magic, file class, file version
e_type: the type of binary, e.g. executable, shared library
e_machine: the architecture of the file, e.g. x86, MIPS
e_version: the version of the object file; set to 1 for the original version of ELF
e_entry: address of the process entry point
e_phoff: offset of the start of the program header table
e_shoff: offset of the start of the section header table
e_flags: processor-specific flags
e_ehsize: size of the ELF header
e_phentsize: size of a program header table entry
e_phnum: number of entries in the program header table
e_shentsize: size of a section header table entry
e_shnum: number of entries in the section header table
e_shstrndx: index of the section header table entry which contains the names of all sections
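These fields can be read directly from the first 52 bytes of a 32-bit little-endian ELF file. A sketch using Python's struct module; the field offsets follow the standard ELF32 layout, and the synthetic header built at the end is illustrative only (a full parser would also honour EI_CLASS and EI_DATA in e_ident).

```python
import struct

# The 13 fields that follow the 16-byte e_ident array, in file order.
ELF32_FIELDS = ("e_type", "e_machine", "e_version", "e_entry", "e_phoff",
                "e_shoff", "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
                "e_shentsize", "e_shnum", "e_shstrndx")

def parse_elf32_header(data: bytes) -> dict:
    """Parse a 32-bit little-endian ELF header into a field dict."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # 2 halfwords, 5 words, 6 halfwords, starting right after e_ident.
    values = struct.unpack_from("<HHIIIIIHHHHHH", data, 16)
    header = dict(zip(ELF32_FIELDS, values))
    header["e_ident"] = data[:16]
    return header

# A minimal synthetic header: ET_EXEC (2), EM_386 (3), version 1,
# entry point 0x8048000, one program header of 32 bytes at offset 52.
ident = b"\x7fELF\x01\x01\x01" + b"\x00" * 9
body = struct.pack("<HHIIIIIHHHHHH", 2, 3, 1, 0x8048000, 52, 0, 0,
                   52, 32, 1, 40, 0, 0)
hdr = parse_elf32_header(ident + body)
print(hdr["e_type"], hex(hdr["e_entry"]))  # 2 0x8048000
```

Extracting these fields per sample is exactly what produces header-based static features such as those compared in Table 4.4.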
* Sections
All the information required during the linking process to turn a target object file into a working executable resides in sections. Sections are needed only at link time; they are of no use at run time. The ELF header points to a section header table which contains information about each section present in the file. This table contains a number of section headers, one for each section. The fields of a section header are described in Table 4.2:
Table 4.2: Fields in the section header

sh_name: an offset into the .shstrtab section, pointing at the string which is the section's name
sh_type: the type of the section, e.g. program data, symbol table, string table
sh_flags: the attributes of the section
sh_addr: for sections which are loaded into memory, their virtual address
sh_offset: the offset of the section in the file image
sh_size: the size of the section
sh_link: the section index of another associated section
sh_info: additional information about the section
sh_addralign: the required alignment of the section
sh_entsize: for sections which hold a table of fixed-size entries, the size of each entry
There are a number of sections, each playing a different role in the linking process. Let us look at some of them:
· .text section: contains the user's executable code.
· .data section: contains all the initialized data.
· .rodata section: contains read-only initialized data.
· .bss section: contains all the uninitialized data.
· .got section: for dynamic binaries, this section (the Global Offset Table) contains the addresses of all the variables which are relocated upon loading.
· .got.plt section: the GOT entries assigned to dynamically linked functions.
· .dynamic section: for dynamic binaries, this section contains information about dynamic linking, used by the runtime linker.
· .dynsym section: the runtime symbol table.
· .dynstr section: contains the null-terminated strings that are the names of the dynamic symbols.
· .symtab section: the compile-time symbol table; .dynsym is a subset of this section.
· .strtab section: contains the names of the symbols in the symbol table.
· .shstrtab section: contains the names of the sections.
· .rela.dyn section: the runtime relocation table.
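To make this machinery concrete, resolving section names means following e_shoff to the section header table and looking each sh_name offset up in .shstrtab. A minimal illustrative sketch for 64-bit little-endian ELF images (again, not our production extractor):

```python
import struct

def section_names(data: bytes) -> list:
    """List section names from a 64-bit little-endian ELF image."""
    e_shoff, = struct.unpack_from("<Q", data, 0x28)   # start of section header table
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<3H", data, 0x3A)
    # sh_offset of .shstrtab (sh_offset sits at offset 0x18 in a 64-bit section header)
    shstr_off, = struct.unpack_from("<Q", data, e_shoff + e_shstrndx * e_shentsize + 0x18)
    names = []
    for i in range(e_shnum):
        sh_name, = struct.unpack_from("<I", data, e_shoff + i * e_shentsize)
        end = data.index(b"\x00", shstr_off + sh_name)  # names are null-terminated
        names.append(data[shstr_off + sh_name:end].decode())
    return names
```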
* Segments
Segments are described by program headers. In the execution view, the ELF file is broken into suitable chunks, known as segments, which get loaded into memory. Like the section header table, there is also a program header table; the program header table is optional in the linking view, while the section header table is optional in the runtime view. The program header table contains program headers which give information about the various segments present in the file image. The fields of a program header are defined in the table below:
Table 4.3: Fields in the program (segment) header

p_type: the type of the segment
p_flags: segment-related flags
p_offset: the offset of the segment in the file
p_vaddr: the virtual address of the segment in memory
p_paddr: the physical address of the segment
p_filesz: the size of the segment in the file
p_memsz: the size of the segment in memory
p_align: the required alignment of the segment
There are a number of segment types; some of them are described below:
· NULL: an unused segment entry.
· LOAD: a segment which gets loaded into memory; all other segments are mapped within the memory range of one of the LOAD segments.
· INTERP: the segment to which the .interp section gets mapped.
· DYNAMIC: essentially the .dynamic section in memory.
We have just looked at an overview of the ELF file format. Now let us go back to our static feature extraction, which uses all the information described above.
– ELF structure feature set
In this section we discuss the features we have extracted from the ELF structure.
* ELF Header
This gives us information about the organisation of the ELF file. Out of the various fields present in the ELF header, we picked seven for our feature set. A statistical comparison of these seven features between malware and benign files is shown in Table 4.4:
Table 4.4: Mean value comparison of different fields in the ELF header

Feature                              Mean for benign   Mean for malware
Number of section headers            29.647            97.396
Size of ELF header                   63.466            52.202
Number of program headers            8.921             4.222
Start of section header table        170400.26         346456.475
Start of program header table        63.512            52.189
Size of program headers              55.025            32.404
Section header string table index    27.729            94.977
* Section Header Table
The basic structure of the section header table was discussed in the previous section. From the section headers, we have used the section name and section type in our feature list. The frequency distribution of the various sections in benign and malware files is shown below:
FIGURE 4.5. Frequency distribution of various sections
FIGURE 4.6. Frequency distribution of various section types
* Program Header Table
The structure of the program header table was discussed under segments in the previous section. Out of its various fields, we have used the segment type in our feature list. A comparison of the segment types for benign and malware files is shown below:
FIGURE 4.7. Frequency distribution of various segment types
* Dynamic section
The runtime linker uses this section to find all the information necessary for dynamic linking and relocation. Each entry in the Dynamic section has two fields, 'tag' and 'value', and the number of entries is not fixed. In our work, we have used the contents of the 'tag' field in our feature list.
* Symbol table
The symbol table contains a large amount of data needed to link or debug files. A symbol table entry has five fields: name, value, info, size and section header index. We categorized the symbols according to their 'info' field, and the objects and functions according to their scope and info. As features, we used the 14 categories thus created and the count of dynamic symbols. The frequency distribution of these features for benign and malware files is shown below:
FIGURE 4.8. Frequency distribution of various symbol table features
– Strings based feature extraction
Strings extracted from a file can be informative, but most of the strings we get come from the file structure: names of objects and functions from the symbol tables, arguments of functions, or garbage values.
FIGURE 4.9. Example of GNU string output
All of these we are already using as features in one way or another, so reusing them would introduce redundancy into our feature set. Instead, we use frequency bins over the lengths of the extracted strings as features.
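The string-length binning can be sketched as follows; the minimum length of 4 mirrors the GNU strings default, while the bin edges here are hypothetical choices for illustration:

```python
import re
from collections import Counter

def string_length_bins(data: bytes, min_len: int = 4, bin_edges=(8, 16, 32, 64)):
    """Emulate GNU `strings` (runs of printable ASCII of length >= min_len),
    then count the extracted strings into length bins."""
    runs = re.findall(rb"[ -~]{%d,}" % min_len, data)
    bins = Counter()
    for s in runs:
        for edge in bin_edges:
            if len(s) <= edge:
                bins["<=%d" % edge] += 1
                break
        else:
            bins[">%d" % bin_edges[-1]] += 1  # longer than the largest edge
    return bins
```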
4.1.2.2 Dynamic Feature Extraction
The runtime-behaviour based features are extracted from the reports generated for each file by the Limon sandbox. The full Limon report is a text file which contains static analysis (structure information), ssdeep [7] (fuzzy-hash comparison with other reports), dynamic analysis (system call trace) and network analysis (if the file engaged in some network activity).
Most malware authors now use polymorphism and metamorphism techniques, which result in a single file having multiple signatures. But when the program is loaded into memory, it is decrypted to its original form in order to act. Here ssdeep comes in: it takes fuzzy hashes of the loaded binaries and compares them with those of other files. This lets us remove files which have different signatures but the same fuzzy hash, reducing redundancy in our dataset.
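The deduplication step can be sketched as below; fuzzy_compare is a stand-in for ssdeep's comparison function (which returns a 0-100 similarity score), and the threshold of 90 is a hypothetical choice for illustration:

```python
def deduplicate(samples, fuzzy_compare, threshold=90):
    """Keep one representative per group of near-identical fuzzy hashes.
    samples: iterable of (name, fuzzy_hash) pairs."""
    kept = []
    for name, fhash in samples:
        # keep the sample only if it is not similar to anything already kept
        if all(fuzzy_compare(fhash, kept_hash) < threshold for _, kept_hash in kept):
            kept.append((name, fhash))
    return kept
```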
For this work, we have used the system call trace and the arguments of the system calls in our feature list. Let us look at them one by one.
– System Calls
System calls give us information about what a process wants to perform, in other words the behaviour of the process. The Limon sandbox uses 'strace' to get a full system call trace of a process and of its child processes. An example of strace output is shown below:
FIGURE 4.10. Example of strace output
The output of 'strace', as seen in the picture above, contains the system calls, their arguments and their return values. We have created an architecture through which we extract only the names of the system calls from the strace report.
Linux uses a fixed set of system calls. We have put all of them in our feature list and check, for each file, which system calls it uses. A statistical comparison of the system calls observed for benign and malware executables in our dataset is shown in Fig. 4.11 and Fig. 4.12.
FIGURE 4.11. 20 most frequent system calls of benign
FIGURE 4.12. 20 most frequent system calls of malware
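The syscall-name extraction can be sketched as below; the regular expression is an illustrative approximation of strace's line format, not the exact parser we built:

```python
import re
from collections import Counter

# Illustrative approximation of an strace line: an optional "[pid N]" prefix,
# then the syscall name immediately followed by "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def syscall_counts(strace_lines):
    """Count system-call names in strace output lines; lines that do not
    start with a syscall (signals, exit notices) are skipped."""
    counts = Counter()
    for line in strace_lines:
        m = SYSCALL_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts
```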
– File system based features
* proc and sysfs filesystems
proc and sysfs are virtual filesystems which expose runtime system information: information on processes, system and hardware configuration, and on the kernel subsystems and kernel drivers. A comparison of some of the proc and sys files accessed by malware and benign files is shown below.
FIGURE 4.13. 7 most frequent proc files accessed by benign
FIGURE 4.14. 7 most frequent proc files accessed by malware
FIGURE 4.15. 7 most frequent sys files accessed by benign
FIGURE 4.16. 7 most frequent sys files accessed by malware
From our dataset we observed that a large portion of the malware accesses '/proc/net/route', i.e. the system routing table, to get the list of all active network interfaces. We also find that they access '/proc/net/tcp' and '/proc/net/dev' to get information about active TCP sockets and about sent and received packets, respectively. In sysfs, we saw the malware accessing '/sys/class/net/' to get the length of the transmission queue. This information is very important for performing a DDoS attack.
Some of the sys and proc files, such as '/proc/cpuinfo', '/proc/sysinfo' and '/sys/class/dmi/id/product_name', are used by malware authors for VM detection. In our dataset, we observed these files being accessed considerably more frequently by malware.
* etc filesystem
The /etc directory contains all the system configuration files: system configuration tables, configuration files which can force a service to start or stop, configuration files of installed programs, and configuration files which record allowed or restricted users, permitted IPs, etc.
FIGURE 4.17. 7 most frequent etc files accessed by benign
FIGURE 4.18. 7 most frequent etc files accessed by malware
From our malware dataset we observed that network configuration files like '/etc/resolv.conf' and '/etc/hosts' are accessed more frequently. We also observed that a chunk of the malware accesses the '/etc/passwd' file, which gives information about every registered user. Flooders (a malware type) use this information to find a backdoor account. They also try to edit '/etc/passwd' and '/etc/shadow' to add a new user.
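A sketch of how such file-access features can be pulled from a trace; the open/openat pattern and the prefix filter are illustrative assumptions, not our exact parser:

```python
import re

# Illustrative pattern for open()/openat() lines in an strace log.
OPEN_RE = re.compile(r'open(?:at)?\([^"]*"([^"]+)"')

def accessed_paths(strace_text, prefixes=("/proc/", "/sys/", "/etc/")):
    """Collect opened file paths, filtered to the filesystems we use as features."""
    return sorted({p for p in OPEN_RE.findall(strace_text) if p.startswith(prefixes)})
```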
* Shell commands
The shell acts as an interface between the user and the operating system; through commands we can use the services of the OS. In our dataset, 16 percent of the malware executed at least one external command, while for the benign files the percentage is quite low, nearly 3-4 percent. In total we found 131 unique commands in our dataset. Commands like cp, netstat, iptables, touch and file are the ones most frequently seen being executed by malware. Some of the malware tried to execute the system 'reboot' command, and some executed the 'ufw' command, which can be used to alter the network firewall. In the benign files not many commands were found to be executed; commands like file, grep and basename are the ones mostly seen.
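Executed external commands can be recovered from the execve lines of the trace; a hedged sketch (the regular expression is an illustrative approximation):

```python
import re

EXECVE_RE = re.compile(r'execve\("([^"]+)"')

def executed_commands(strace_text):
    """Return the set of external command basenames seen in execve() calls."""
    return {path.rsplit("/", 1)[-1] for path in EXECVE_RE.findall(strace_text)}
```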
4.1.3 Machine learning classifier
We have used the Python-based machine learning library scikit-learn to evaluate the efficiency of our hybrid approach. The classification algorithms we used are described below:
– KNN
KNN stands for K-Nearest Neighbours. In this algorithm there is a large number of labelled data points in the feature space; when an unlabelled data point arrives, it is labelled by looking at its K nearest neighbours and assigning the label held by the majority of them. The distance generally used is the Euclidean distance, i.e. the square root of the sum of the squared differences between each feature of the two data points.
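The idea can be sketched in a few lines of plain Python (a toy illustration using math.dist for the Euclidean distance, not our scikit-learn setup):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority
    label among the k nearest neighbours of `query`."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```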
– Decision Tree
The decision tree is one of the most intuitive and popular methods of data mining: it provides explicit classification rules and copes well with heterogeneous data, missing data and nonlinear effects. It uses information gain to select as root the node which gives the highest information gain, and proceeds similarly down to the leaf nodes where the decision is made.
– Random Forest
Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The forest it builds is an ensemble of decision trees, usually trained with the "bagging" method, whose general idea is that a combination of learning models improves the overall result. One big advantage of random forest is that it can be used for both classification and regression problems. The algorithm brings extra randomness into the model while growing the trees: instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of the features. This creates wide diversity among the trees, which generally results in a better model.
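A minimal scikit-learn sketch of this training setup, with a synthetic stand-in for our feature matrix (the real features, split and hyperparameters differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (samples x features) matrix and benign/malware labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = [KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0),
               RandomForestClassifier(n_estimators=100, random_state=0)]
scores = {}
for clf in classifiers:
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = clf.score(X_te, y_te)  # held-out accuracy
```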
4.2 Summary
In this chapter we saw the architecture of our model, the types of features we used and how they are extracted, and the different machine learning classifiers used in our model for the detection of malware. In the next chapter we discuss the dataset we used and present the results of our model on it.
CHAPTER 5. RESULT AND DISCUSSION
5.1 Dataset
To make our model robust, the first thing we needed was a large corpus of data. Most previous authors used very small amounts of data, so one of the major challenges we faced was collecting a large dataset. For malware, we collected our data from VXHeavens, VirusTotal and Detux.org [2]. For benign files, we scraped the executables from the /bin, /sbin and /usr/bin directories of a fresh Linux operating system, and we downloaded some open-source C/C++ projects for the Intel architecture and compiled them on our system to get more benign executables. For the final analysis, we used 7717 malware and 2265 benign executables.
5.2 Evaluation Metric
We have used several metrics to evaluate the performance of the classification models. Below is a brief description of each.
5.2.1 Confusion Matrix
A confusion matrix (Kohavi and Provost, 1998) contains information about the actual and predicted classifications made by a classification system. The performance of such systems is commonly evaluated using the data in this matrix. The following table shows the confusion matrix for a two-class classifier.
The entries in the confusion matrix have the following meaning in the context of our study:
– True positive: samples which are correctly predicted as benign
– True negative: samples which are correctly predicted as malware
– False positive: samples which are malware but predicted as benign
– False negative: samples which are benign but predicted as malware
Table 5.1: Confusion matrix for a two-class classifier

                   Predicted positive   Predicted negative
Actual positive          #TP                  #FN
Actual negative          #FP                  #TN
5.2.2 Additional metrics
– TPR: the True Positive Rate, also known as recall, is defined as the ratio of the number of true positives to the total number of positive samples:

TPR = TP / (TP + FN)

– FPR: the False Positive Rate is defined as the ratio of the number of false positives to the total number of negative samples:

FPR = FP / (FP + TN)

– Precision: precision (PV) is the proportion of the predicted positive cases that are correct:

PV = TP / (TP + FP)

– F score: when the dataset is imbalanced, the F score is used to measure how correct the model is. It is calculated as the weighted harmonic mean of precision and recall:

Fscore = (2 * TPR * PV) / (TPR + PV)
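These definitions translate directly into code; a small helper sketch:

```python
def classification_metrics(tp, fn, fp, tn):
    """TPR (recall), FPR, precision and F-score from confusion-matrix counts."""
    tpr = tp / (tp + fn)                               # TP / (TP + FN)
    fpr = fp / (fp + tn)                               # FP / (FP + TN)
    precision = tp / (tp + fp)                         # TP / (TP + FP)
    f_score = 2 * tpr * precision / (tpr + precision)  # harmonic mean
    return tpr, fpr, precision, f_score
```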
5.3 Training and Testing
For our experiments we used Ubuntu 16.04 with 32 GB RAM and an Intel i7 octa-core processor. We used 70% of the data for training and 30% for testing. To minimize the risk of overfitting and to get a generalized result, we used 10-fold cross-validation.
5.4 Result
As already described, we used three machine learning classifiers and checked our model's efficiency with each of them. The results we obtained are highly promising. Tables 5.2 and 5.3 show the results achieved on our dataset for static-only features and for the hybrid approach, respectively.
Table 5.2: Test results on the static feature set

          --------- KNN ---------   ----- Decision Tree ----   ----- Random Forest ----
Class     TPR    FPR    Pr     FM   TPR    FPR    Pr     FM    TPR    FPR    Pr     FM
Benign    0.888  0.067  0.757  0.817  0.978  0.017  0.937  0.957  0.982  0.014  0.950  0.966
Malware   0.932  0.111  0.972  0.951  0.982  0.021  0.994  0.988  0.986  0.017  0.995  0.990
Average   0.922  0.101  0.923  0.921  0.981  0.020  0.981  0.981  0.985  0.016  0.985  0.985
Table 5.3: Test results on the hybrid feature set

          --------- KNN ---------   ----- Decision Tree ----   ----- Random Forest ----
Class     TPR    FPR    Pr     FM   TPR    FPR    Pr     FM    TPR    FPR    Pr     FM
Benign    0.914  0.062  0.777  0.840  0.983  0.010  0.963  0.973  0.989  0.006  0.976  0.982
Malware   0.938  0.085  0.978  0.958  0.989  0.016  0.995  0.992  0.993  0.010  0.997  0.995
Average   0.932  0.079  0.932  0.9314  0.987  0.014  0.987  0.987  0.992  0.009  0.992  0.992
As Tables 5.2 and 5.3 show, Random Forest gives the best detection accuracy. We can also see that the detection efficiency increases in all three models (KNN, Decision Tree, Random Forest) when we move from the static feature set to the hybrid feature set. The best weighted-average F-measure, 0.992, is obtained with Random Forest, which is quite good considering that the TPR for benign is low in comparison with that for malware. The main goal of our work was a high TPR for malware while not misclassifying too many benign files as malware. Figure 5.1 depicts the confusion matrix for the best detection accuracy, obtained with Random Forest.
Confusion matrix statistics:
* malware/malware = 99.69% ( 2313/2320 )
* malware/benign = 0.31% ( 7/2320 )
* benign/benign = 97.62% ( 659/675 )
* benign/malware = 2.37% ( 16/675 )
FIGURE 5.1. Confusion matrix result
It can be observed from the confusion matrix that the false negatives for malware are quite low, which also explains the high precision value for malware, whereas the false negatives for benign are quite high compared to malware, which also explains the lower precision value for benign.
5.5 Comparison To Existing Approaches
In this section we compare our work with other work on Linux malware analysis. To the best of our knowledge, all prior approaches are either purely static or purely dynamic; our work is the first to use a hybrid approach, i.e. one integrating static and dynamic features.
Table 5.4 shows the comparison. As the table makes clear, most prior work performed its analysis on a very small dataset; in our work we used a large corpus of both malware and benign files to make our model robust. Shahzad, F. performed analysis using fields of the static ELF structure with a 99% detection accuracy, but since that approach is purely static, they had to reject samples with forged headers. Ashmita, K. et al. (2014) used a dynamic approach in which they analyzed system calls. They reported a high detection accuracy of 99.40%, but their dataset had only 226 malware samples and their number of features was very small. Our model achieves a comparable average detection accuracy of 99.14%, and the size of our dataset is far larger, which makes our model robust.
Table 5.4: Previous works on Linux malware analysis

Authors                      Features   Accuracy   Dataset                      Type of features
Shahzad, F. (2011)           383        99%        709 benign, 709 malware      Static: ELF structure
Jinrong Bai et al. (2012)    100        98%        756 benign, 763 malware      Static: symbol table
Ashmita, K. et al. (2014)    27         99.40%     442 benign, 226 malware      Dynamic: system calls
Shahzad, F., Bhatti (2013)   16         96%        105 benign, 114 malware      Dynamic: process control block
Ours                         115260     99.14%     2265 benign, 7717 malware    Static: ELF header + strings; Dynamic: system calls + file systems + shell commands
Conclusion: We have used a new approach to Linux malware analysis, combining the traditional static and dynamic ones. Our model has shown very good results, and we used a large dataset to demonstrate its robustness.
CHAPTER 6. SCOPE AND FUTURE WORK
6.1 Supporting Multiple architecture
During the collection of malware samples, we came across various ELF files targeting different architectures. Due to the limited scope of our work, we analyzed only the files built for the Intel architecture, so many files remain un-analyzed because of this limitation. In the future, our work can be extended to files from other architectures.
6.2 Analysis on different file format
In this work our main focus was the ELF file format, but some malware uses other file types, like Perl, Python, shell, Bash and PHP scripts, to perform its malicious activity. The Limon sandbox, which we used in this work for dynamic analysis, is capable of producing runtime reports for these files as well. In the future we can add a module to analyze these file types too.
6.3 Multi-path execution of files
Currently the Limon sandbox gives us the report of a single execution path of an executable file. This is a limitation of the dynamic approach: we are unable to observe all possible execution paths of the malware and thus cover its complete runtime behaviour. In the future we can add modules to our model so that it generates more comprehensive reports.
APPENDIX A
Code base for Linux malware detection: https://github.com/Anmol33/M.Tech_thesis.git
BIBLIOGRAPHY
[1] AV-TEST security report.
https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf
[2] Detux.org.
https://detux.org/
[3] ELF format.
http://www.skyfree.org/linux/references/ELF_Format.pdf
[4] Limon sandbox.
https://github.com/monnappa22/Limon
[5] Offensive Computing.
http://www.offensivecomputing.net/
[6] readelf tool.
https://sourceware.org/binutils/docs/binutils/readelf.html
[7] ssdeep (fuzzy hash).
https://ssdeep-project.github.io/ssdeep/index.html
[8] strace tool.
https://strace.io/
[9] VirusTotal statistics.
https://www.virustotal.com/en/statistics/
[10] VX Heaven.
http://vx.netlux.org/
[11] WatchGuard internet security report.
https://media.scmagazine.com/documents/306/wg-threat-reportq1-2017_76417.pdf
[12] S. M. JINRONG BAI, YANRONG YANG AND Y. MA, Malware detection through
mining symbol table of linux executables., Information Technology Journal,
(2012).
[13] A. K.A AND V. P, Linux malware detection using non-parametric statistical
methods, Chakraborty R.S., Matyas V., Schaumont P. (eds) Security, Privacy,
and Applied Cryptography Engineering. SPACE, (2014).
[14] B. S. S. M. SHAHZAD, F. AND M. FAROOQ, In-execution malware detection
using task structures of linux process, IEEE International Conference on
Communication, pp. 1–6, (2011).
[15] F. SHAHZAD AND M. FAROOQ, Elf-miner: using structural knowledge and data
mining methods to detect new (linux) malicious executables, Knowledge and
Information Systems, 30 (2012), pp. 589–612.
[16] S. M. SHAHZAD, F. AND M. FAROOQ, In-execution dynamic malware analysis
and detection by mining information in process control blocks of linux os, Inf.
Sci. 231, 45–63, (2013).