Linux malware detection by hybrid analysis
By
ANMOL KUMAR SHRIVASTAVA
Department of Computer Science
INDIAN INSTITUTE OF TECHNOLOGY, KANPUR
A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF TECHNOLOGY
Under the supervision of:
DR. SANDEEP SHUKLA
MAY 2018
Abstract
Name of the student: Anmol Kumar Shrivastava
Roll No: 16111031
Degree for which submitted: M.Tech.
Department: Computer Science and Engineering
Thesis title: Linux malware detection by hybrid analysis
Thesis supervisor: Dr. Sandeep Shukla
Month and year of thesis submission: May 2018
Over the past two decades, the cyber-security research community has focused on detecting malicious programs for Windows-based operating systems. However, the recent exponential growth in popularity of IoT (Internet of Things) devices is causing the malware landscape to change rapidly. This so-called 'IoT revolution' has fueled the interest of malware authors, leading to an exponential growth in Linux malware. The increasing number of malware samples is becoming a serious threat to data privacy as well as to expensive computing resources. Manual malware analysis is not effective given the large number of such cases. Furthermore, malware authors use various obfuscation techniques to impede detection by traditional signature-based anti-virus systems. As a result, automated yet robust malware analysis is much needed. In this thesis, we develop a hybrid approach that integrates both static and dynamic features of a binary to detect malware efficiently. We performed our analysis on 7717 malware and 2265 benign files and obtained a highly promising detection accuracy of 99.14%. All prior work on Linux malware analysis used fewer than 1000 malware samples, and hence the accuracy numbers they report are not fully validated. Our work improves over prior work in two ways: a substantial enhancement in the dataset, and hybrid analysis based on both static and dynamic features.
Acknowledgements
I would like to extend my sincerest gratitude to my thesis supervisor, Dr. Sandeep Shukla, for his unparalleled guidance and support. I was new to this field, and I cannot be thankful enough for all the wild ideas and new things I got to learn and explore under his guidance. I am grateful for his patience and all those weekly discussion sessions.
I would also like to thank my family for believing in me and my friends for their support. These two years would not have been the same without you all.
Last but not the least, I would like to thank Gaurav Kumar, who helped me in this work and made it possible to complete it on time.
TABLE OF CONTENTS
Page
List of Tables vii
List of Figures viii
1 Introduction 1
2 Problem Background 4
2.1 Linux Malware and types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Existing Malware detection strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Dynamic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Motivation For a new Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Past work 8
3.1 Static analysis approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Dynamic analysis approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Drawback of past works: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Analysis Infrastructure and Feature extraction 11
4.1 Analysis infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2.1 Static feature vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.2.2 Dynamic Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 23
4.1.3 Machine learning classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Result And Discussion 31
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.2 Additional metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.5 Comparison To Existing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Scope and Future Work 36
6.1 Supporting Multiple architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Analysis on different file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.3 Multi-path execution of files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A Appendix A 38
Bibliography 39
LIST OF TABLES
TABLE Page
4.1 Fields in ELF header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Fields in section header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Fields in segment header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Mean value comparison of different fields in ELF header . . . . . . . . . . . . . . . . . 19
5.1 Confusion matrix for a two class classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Test result on static feature set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Test result on hybrid features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Previous works on Linux malware analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 35
LIST OF FIGURES
FIGURE Page
1.1 VirusTotal stats1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 VirusTotal stats2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1 Architecture of hybrid model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 ELF file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Limon sandbox architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.4 ELF Layout in disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.5 Frequency distribution of various sections . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.6 Frequency distribution of various section types . . . . . . . . . . . . . . . . . . . . . . 20
4.7 Frequency distribution of various segment types . . . . . . . . . . . . . . . . . . . . . 20
4.8 Frequency distribution of symbol table features . . . . . . . . . . . . . . . . . . . . . . 21
4.9 Example of GNU string output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.10 Example of strace output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.11 Benign sys call statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.12 Malware sys call statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.13 Benign proc file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.14 Malware proc files access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.15 Benign sys file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.16 Malware sys file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.17 Benign etc files access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.18 Malware etc file system access stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
CHAPTER 1

INTRODUCTION
Over the past few decades, the Internet has grown tremendously, and so has technology. This growth of the Internet and advancement in technology have resulted in a steep increase in the number of Linux-based servers, routers and IoT (Internet of Things) devices. Recent trends show that Linux has now become a worthy new target for hackers.
FIGURE 1.1. Types of files submitted on VirusTotal [9] in the past 7 days
A report from AV-Test [1] shows that in 2016 MacOS computers saw a 360 percent increase in targeted malware compared to the previous year, but Linux was not far behind: it saw a 300 percent increase over the previous year. According to the WatchGuard Security report [11] for Q1 2017, Linux malware made up 36 percent of the top threats. In Figure 1.1, we can see that the number of ELF [3] files, i.e., Linux executables, submitted in the last seven days is very large, and their count is comparable to that of Windows executables. All these statistics show that Linux malware threats have reached an alarming state.
As Linux is open source, the latest threats are identified easily, and there are regular updates to the Linux kernel to catch up with new threats. Developers regularly provide system updates and use system protection mechanisms to protect systems from emerging threats. These make Linux one of the safest platforms. But here lies the problem: most IoT device and router vendors do not ship these system updates as frequently, and when they do, it takes a long time for consumers to download and install the updates on all their devices. This leaves the devices prone to exploitation.
As the Linux malware threat is on the rise, there is a need for anti-threat mechanisms to keep our data and systems secure. In this work, we present an automated Linux malware detection method which currently outperforms all the existing detection methods. Our work focuses on Linux binaries, i.e., ELF (Executable and Linkable Format) files, which is considered the standard binary file format.
FIGURE 1.2. Number of files submitted on VirusTotal in the past 7 days
As we see in Figure 1.2, the number of new and total malware samples submitted to VirusTotal is very high. Analyzing this amount of data by manual reverse engineering and unpacking is an impractical task. This leads to the need for an efficient automatic detection system.
Currently, malware analysts use two approaches for analysis, namely static and dynamic analysis. In static analysis, a file is analyzed just by looking at its structure and the data present in it. As the file is not executed, this is fast and easy to deploy, but there are limitations: for example, polymorphism/metamorphism and packers leave the file data encrypted or packed so that no further static analysis can be done on it. This motivates another approach, dynamic analysis, in which files are executed in a sandbox environment and their behavior is analyzed. Although this analysis is not affected by obfuscation and encryption of data, it has its limitations too. Some malware hide their true behavior when they find out that they are running in a controlled environment. Another limitation is that, since we usually monitor only one of the execution paths of the process, the rest of the code remains unexplored. So there is a need for a new detection mechanism which can overcome the limitations of both approaches.
In the following chapters, we discuss the Linux binary format (ELF), types of malware, and our methodology for detecting Linux malware efficiently.
CHAPTER 2

PROBLEM BACKGROUND
In this chapter, we describe some types of Linux malware, discuss the problems present in existing detection methodologies, and explain how we plan to overcome those problems using our approach.
2.1 Linux Malware and types
Malware stands for malicious software, which has the intention of disrupting infrastructure, stealing useful data, spying on a victim's computer, etc. Malware can be categorized by its actions. Some categories are listed below:
• Exploit:- Exploits are malware which use system vulnerabilities to attack. Some exploits target well-known vulnerabilities found in a system, and some first find a vulnerability in the system and then attack accordingly.
• Backdoor:- Some malware try to find backdoors to steal user information. On Linux, they may try to obtain the list of all registered users to find a backdoor, or sometimes they try to register themselves as a legitimate user to get access to the system.
• Virus:- A virus is malware which, when executed, affects other files by inserting its code and infecting them. Viruses spread very fast and can infect a large server in a very short time.
• DDoS:- DDoS stands for Distributed Denial of Service. This type of attack is seen quite frequently against Linux servers. When an attack succeeds, the server becomes unresponsive.
• Keylogger:- A keylogger is a type of malware which records all the keys pressed by the victim and sends them to its command and control server.
• Digital currency mining malware:- This type of malware tries to gain access to a system and then uses its resources for mining digital currency. Such malware may use machine learning methods to analyze the victim's usage and, on that basis, schedule its resource consumption so that it does not get exposed.
• Dropper:- This type of executable malware contains another executable. When executed, it installs the embedded executable and runs it in parallel, so that even if the dropper gets detected, the actual malware remains in the system.
Some malware combine the above-listed types to perform their malicious activity. So, distinguishing a malicious file from a non-malicious one becomes a major task once a file enters the system.
The work in this thesis aims to detect malware by performing analysis on a large corpus of data to make our model robust to zero-day malware. A zero-day malware exploits a security vulnerability on the same day that the vulnerability becomes known to the public or to the vendor who created the software. Our model outperforms many anti-viruses as it does not depend on malware signatures; it uses both static and dynamic features of an executable for detection.
2.2 Existing Malware detection strategy
Malware analysis is the method of dissecting a binary file to understand how it works and then devising methods to identify it and other similar files. It aims to gain information about the actions performed by the malware and then to develop a method to neutralize its effect and protect our systems from further infection.
Malware analysis can be used both for detecting and for classifying malware. Malware detection means labeling an executable as benign or malicious; it is therefore the first stage of malware analysis. Once an executable is detected as malware, further classification based on malware type and family can be performed on it. Malware analysis can be done in two basic ways: static and dynamic analysis. Static analysis aims to analyze a binary without executing it, whereas in dynamic analysis a binary is executed inside a sandbox and analysis is performed on its behavior.
2.2.1 Static Analysis
Static analysis examines the static properties of a binary. Without executing the binary, an examination can be performed on its ELF header, embedded strings, metadata, disassembly, etc. This analysis is fast, as we do not have to execute the binary. But along with its
pros, it has some limitations too. There are many techniques that malware authors nowadays use to thwart static analysis. Some of these techniques are described below:
• Packing:- In this technique malware authors apply an encryption or compression algorithm to the original executable to create a packed executable which contains an unpacking stub. When the executable is run, the first thing loaded in memory is the unpacking stub. This stub then unpacks the packed payload and transfers control to the actual entry point of the executable.
• Metamorphism:- A metamorphic malware changes its code each time it gets executed, without changing its actual functionality. This can be achieved by replacing an instruction with a similar instruction having a different opcode, inserting garbage code, changing the order of subroutines, etc.
• Polymorphism:- This type of malware also changes its shape as well as its signature. It has two parts: a decryption routine and an encrypted malware body. Malware authors use a randomly generated key to encrypt the malware body. Once the malware is loaded into memory, the decryption routine decrypts the encrypted part to perform the malicious activity. After execution, it encrypts itself again so that it does not get discovered.
Because of these techniques, each time the hash of the malware is taken it yields a different value, which lets the malware bypass signature-based detection systems. Nor can static code analysis be performed on this type of malware, as the code is encrypted. These limitations of static analysis motivate the need for another type of analysis which can overcome them.
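The fragility of hash-based signatures is easy to demonstrate: changing even a single byte of a binary (as re-packing with a fresh key does) produces an entirely different digest. The byte strings below are synthetic stand-ins for real binaries, used only for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()

# Two "binaries" that differ by a single byte, e.g. the same payload
# after re-packing (synthetic data, not real malware).
original = b"\x7fELF" + b"\x00" * 60 + b"payload"
repacked = b"\x7fELF" + b"\x00" * 60 + b"payloae"  # one byte changed

h1, h2 = sha256_hex(original), sha256_hex(repacked)
# A signature database keyed on h1 will never match h2, even though
# the two files are functionally identical.
print(h1)
print(h2)
```

This is why polymorphic and metamorphic malware defeat signature databases: every generation of the same family hashes to a new, unseen value.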
2.2.2 Dynamic Analysis
In dynamic analysis, an executable is run in a controlled environment or virtual machine, and its behavior is observed to deduce whether it is malware or not. When an executable runs, we can track which files it accesses, which IP addresses it tries to connect to, which new files are created, etc. The main advantage of dynamic analysis is that it remains unaffected by polymorphic or metamorphic malware, as it does not depend on the static code of the malware. Despite this advantage, it has some limitations too. Some of them are listed below:
• Incomplete code coverage:- During dynamic analysis we are able to monitor only a single execution path, which leads to incomplete code coverage.
• Detection of Sandbox Environment:- Some malware can detect whether it is running
in a controlled environment. When this happens, the malware does not show its true
behavior.
• Risk to the host machine:- If there is a bug in the sandbox environment, malware may escape the isolation, and the host machine or other computers in the network may get damaged or infected.
2.3 Motivation For a new Approach
As we have seen in the earlier sections, both static and dynamic analysis have limitations. Static analysis can be thwarted once an encryption algorithm is used, while dynamic analysis suffers from the low-code-coverage problem. What if we could combine the feature sets of both approaches? The dynamic features can provide full insight when static analysis is thwarted by obfuscation. On the other hand, static analysis can provide a full overview of the executable when dynamic analysis suffers from the code-coverage issue. Thus the two act as complements to each other. Malware authors can use packers, obfuscation and polymorphism/metamorphism techniques to bypass file-format-based or signature-based analysis. They can make the malware perform additional actions, such as randomly accessing files or calling random system calls, to confuse dynamic analysis. But bypassing both techniques at once is a much tougher job. In this work we have used this hybrid approach, and the results are quite promising. In later chapters, we discuss the architecture and features of our model in further detail.
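The hybrid idea boils down to encoding each sample twice and concatenating the results. A minimal sketch, in which the feature vocabularies (section names for the static side, system calls for the dynamic side) are purely illustrative placeholders, not the actual feature set used in this thesis:

```python
# Fixed vocabularies decided in advance over the whole dataset
# (illustrative names only: "UPX0" is a packer section, the dynamic
# names are common Linux system calls).
STATIC_VOCAB = [".text", ".data", ".bss", "UPX0"]      # section names
DYNAMIC_VOCAB = ["open", "read", "write", "connect"]   # system calls

def encode(counts: dict, vocab: list) -> list:
    """Turn a {feature: count} dict into a fixed-length count vector."""
    return [counts.get(name, 0) for name in vocab]

def hybrid_vector(static_counts: dict, dynamic_counts: dict) -> list:
    """Concatenate the static and dynamic count vectors into one."""
    return encode(static_counts, STATIC_VOCAB) + encode(dynamic_counts, DYNAMIC_VOCAB)

sample = hybrid_vector({".text": 1, "UPX0": 2}, {"connect": 5, "read": 9})
print(sample)  # [1, 0, 0, 2, 0, 9, 0, 5]
```

A classifier trained on such concatenated vectors sees both views at once, so an attacker must evade the static and the dynamic features simultaneously.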
2.4 Contribution
• To the best of our knowledge and literature survey, this is the first time hybrid analysis has been performed on Linux binaries using such a large dataset. Most of the previous works, which we survey in the next chapter, are interesting but are based on experiments on much smaller datasets.
• We have used our new hybrid approach to detect zero-day malware, and the results are quite promising.
• We have created an automated system to deal with a large number of files which would be difficult to analyze manually.
CHAPTER 3

PAST WORK
A large variety of malware is being used to attack critical infrastructure, steal private data, carry out financial fraud, etc. To protect themselves, big corporations and government organizations spend huge amounts of money on anti-virus software or on creating their own malware detection systems. Most of these anti-virus products and detection systems are based on signature-based or anomaly-based detection. A signature is usually a hash that uniquely identifies a specific malware. The signature-based technique uses the signatures of known malware; these signatures are developed by anti-virus companies to capture threats. This technique is efficient, fast and easy to deploy, but once an unseen or new malware enters the system, it fails. Only after that malware has affected numerous systems and analysts have been able to generate its signature can signature-based detection catch it. In the anomaly-based detection technique, anti-virus companies form rules about which actions they consider safe. If any of the rules is broken by a process, it is labeled as malicious. This technique is capable of capturing new malware, but it has a very high false-alarm rate.
Another method which is now gaining popularity is heuristic-based detection. In this technique, static, dynamic or behavior-based features of a dataset containing both benign and malicious files are used to train a machine learning classifier. In this chapter, we survey some of the works that have used these heuristic techniques: Section 3.1 discusses static analysis based approaches and Section 3.2 dynamic analysis based ones.
3.1 Static analysis approaches
In this section, we discuss some of the past works which have used the static analysis approach.
– In Shahzad, F. [15], the authors used the Executable and Linkable Format (ELF) for analysis and extracted 383 features from the ELF header. They used information gain as the feature selection algorithm and four well-known supervised machine learning algorithms, C4.5 Rules, PART, RIPPER and the J48 decision tree, for classification. Their dataset contained 709 benign executables collected from the Linux platform and 709 malware executables downloaded from VX Heavens [10] and Offensive Computing [5]. They reported nearly 99% detection accuracy with a false-alarm rate of less than 0.1.
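Information gain, the feature selection criterion used in the work above, measures how much knowing a feature's value reduces the entropy of the malware/benign label. The following is a generic sketch of the computation (not the cited authors' implementation); the example counts are invented for illustration.

```python
import math

def entropy(pos: int, neg: int) -> float:
    """Shannon entropy (in bits) of a two-class split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(split: dict) -> float:
    """Information gain of a binary feature.

    `split` maps each feature value (True/False) to a
    (malware_count, benign_count) pair for that partition.
    """
    total = sum(m + b for m, b in split.values())
    total_m = sum(m for m, _ in split.values())
    total_b = sum(b for _, b in split.values())
    # Gain = H(labels) - sum over partitions of weighted H(partition).
    gain = entropy(total_m, total_b)
    for m, b in split.values():
        gain -= (m + b) / total * entropy(m, b)
    return gain

# A feature present in 90 malware / 10 benign samples and absent in
# 10 malware / 90 benign samples is highly discriminative:
print(information_gain({True: (90, 10), False: (10, 90)}))
```

Features are then ranked by this score, and only the top-ranked ones are kept for the classifier.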
– Jinrong Bai et al. [12] proposed a new malware detection technique in which they extracted system calls from the symbol table of Linux executables. Out of many system calls, they selected 100 as features. Their method obtained an accuracy of 98% in malware detection. Their dataset contains 756 benign executables collected from Linux systems and 763 malware executables from VX Heavens.
3.2 Dynamic analysis approaches
In this section, we discuss some of the past works which have used dynamic analysis.
– Ashmita, K. et al. [13] proposed an approach based on system call features. They used 'strace' [8] to trace all the system calls of executables running in a controlled environment. The authors used a two-step correlation-based feature reduction: they first calculated feature-class correlation using information gain and entropy to rank the features, and in the next step removed redundant features by calculating feature-feature correlation. They used three supervised machine learning algorithms, J48, Random Forest and AdaBoost, for classification, with a feature vector of length 27. The authors used 668 files in their dataset, of which 442 were benign and 226 were malware executables. With this approach, they reported an accuracy of 99.40%.
– Shahzad, F., Bhatti et al. [14], [16] proposed the concept of genetic footprints, in which information mined from the Process Control Block (PCB) of the kernel is used to model the runtime behavior of a process. In this approach, the authors selected 16 out of the 118 available parameters of the task_struct for each running process; to decide which parameters to select, they claim to have performed a forensic study. They believe these parameters define the semantics and the behavior of the executing process, and call the selected parameters the genetic footprint of the process. The authors then generated dumps of all these parameters for 15 seconds at a resolution of 100 ms. All the instances of benign and malicious processes were classified using an RBF network, SVM, J48 decision tree and a propositional rule learner (JRip) in the Weka environment. The authors analyzed their results and shortlisted the J48 and JRip classifiers as having less class noise than the others. They also compared their work with other existing system-call-based solutions and discussed its robustness to evasion, including access to the task_struct for modification. They used a dataset of 105 benign and 114 malware processes and reported a 96 percent detection rate with a 0 percent false-alarm rate.
3.3 Drawback of past works:
– All of the past works seen in this literature survey used very small sample sizes, and hence their reported accuracy or false-positive rates may not reflect the real power of their malware classification methods.
– In the dynamic cases, they used a very restricted number of features.
– They do not handle zero-day malware at all.
– They do not take advantage of using both static and dynamic features.
CHAPTER 4

ANALYSIS INFRASTRUCTURE AND FEATURE EXTRACTION
In this chapter, we discuss the architecture of our detection system and the various feature engineering and modelling techniques we used to achieve good detection accuracy. Our work is based on 32-bit ELF executables but can be extended to the 64-bit version as well.
4.1 Analysis infrastructure
Figure 4.1 illustrates our analysis infrastructure. It has three phases: data generation, feature extraction and data modelling.
4.1.1 Data Generation
In this section, we describe how we generated static and dynamic reports from executables for further analysis in the subsequent phases. For static reports, we use:
– GNU strings:- strings is one of the utilities in Binutils. When run on a file, it searches for sequences of printable ASCII characters terminated by an unprintable character. Strings in an executable can be useful, as they sometimes give an overview of what the file is going to do: for example, if a malware tries to access /etc/passwd, we may see this path printed in the strings output, or if an executable tries to make a TCP connection, we may see an IP address embedded in the executable in ASCII.
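The core of what strings does can be sketched in a few lines: scan the raw bytes for runs of printable ASCII of some minimum length (GNU strings defaults to 4). This is a simplified sketch, not the real tool, and the sample blob is synthetic.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least `min_len` printable ASCII characters,
    similar in spirit to GNU strings (simplified sketch)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# A synthetic binary blob with two suspicious-looking strings in it;
# short runs like "ab" fall below the length threshold and are dropped.
blob = b"\x00\x01/etc/passwd\x00GET /\x7f\x03ab\x00192.168.0.1\x00"
print(extract_strings(blob))  # ['/etc/passwd', 'GET /', '192.168.0.1']
```

In our pipeline the strings output of each sample is one of the raw static reports fed to feature extraction.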
FIGURE 4.1. Architecture of our hybrid malware detection system
– readelf:- readelf [6] is a Unix binary utility that displays structural information about one or more ELF files. The ELF file format contains a lot of information that can be used in the detection of malware. It consists of the ELF header followed by the file data. More about the ELF format is discussed in later sections.
FIGURE 4.2. ELF file format
– Limon sandbox:- Limon [4] is a tool which allows us to run an executable in a sandboxed/controlled environment and gives us a report about what it did during its runtime. The main components of the Limon sandbox are a host machine which manages a guest machine. We used Ubuntu 16.04 as our host machine and a 32-bit Ubuntu 14.04 machine as our guest. To get a full picture of a file, it is executed with full privileges on the guest machine.
To run an executable in the Limon sandbox, its path is given on the command line. Each analysis is performed in a fresh virtual machine: while setting up the virtual machine, a snapshot is taken so that after execution of the file Limon can revert to it. At the end of the execution, the sandbox returns a text file containing the full trace of system calls and userspace functions. We used the default setting of 60 seconds for which a file is monitored. The architecture of the Limon sandbox is shown in Figure 4.3.
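The trace file returned by the sandbox is strace-style output, one system call per line. A minimal sketch of turning such a trace into per-syscall frequency counts, the kind of dynamic feature summarized later in Figures 4.11 and 4.12 (the sample lines below are illustrative, and real strace output has more variants, e.g. signals and resumed calls):

```python
import re
from collections import Counter

# An strace line normally starts with the syscall name followed by
# an opening parenthesis, e.g.: open("/etc/passwd", O_RDONLY) = 3
SYSCALL_LINE = re.compile(r"^(\w+)\(")

def syscall_counts(trace: str) -> Counter:
    """Count how often each system call appears in a strace-style trace."""
    counts = Counter()
    for line in trace.splitlines():
        m = SYSCALL_LINE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

trace = """\
open("/etc/passwd", O_RDONLY) = 3
read(3, "root:x:0:0"..., 4096) = 1024
read(3, "", 4096) = 0
close(3) = 0
"""
print(syscall_counts(trace))
```

The resulting counts per sample can then be encoded against a fixed syscall vocabulary to give a fixed-length dynamic feature vector.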
FIGURE 4.3. Limon sandbox architecture
4.1.2 Feature Extraction
As we use a hybrid approach, we first extract the static and dynamic features of executables separately and then integrate both for use in our model. In the following sections, we describe the various static and dynamic features we have used in our model.
4.1.2.1 Static feature vector
Static features are extracted from the ELF file format of both malware and benign executables. Before describing the features, we first briefly introduce the ELF format.
– Executable and Linkable Format
ELF (Executable and Linkable Format) is the standard binary file format on Unix and Unix-like systems. ELF binaries on Linux include executables, shared libraries, object files and core dumps. The ELF file format has three major parts:
* ELF header
* Segments
* Sections
FIGURE 4.4. ELF Layout in disk
Each of these parts plays an important role in the loading and linking process. Let us now look at the role and structure of each component one by one:
* ELF header
The ELF header is a data structure which gives information about the organization of the file. The various fields in the ELF header and their definitions are given in the table below:
Table 4.1: Fields in ELF header

e_ident: identification flags that help in decoding and interpreting the file content, e.g. the ELF magic, file class, file version
e_type: the type of binary, e.g. executable, shared library
e_machine: the architecture of the file, e.g. x86, MIPS
e_version: the version of the object file; set to 1 for the original version of ELF
e_entry: address of the process entry point
e_phoff: offset of the start of the program header table
e_shoff: offset of the start of the section header table
e_flags: processor-specific flags
e_ehsize: size of the ELF header
e_phentsize: size of a program header table entry
e_phnum: number of entries in the program header table
e_shentsize: size of a section header table entry
e_shnum: number of entries in the section header table
e_shstrndx: index of the section header table entry which contains the names of all sections
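These fields can be read directly from the first 52 bytes of a 32-bit little-endian ELF file. A sketch using Python's struct module; the field offsets follow the standard ELF32 layout, and the synthetic header built at the end is illustrative only (a full parser would also honour EI_CLASS and EI_DATA in e_ident).

```python
import struct

# The 13 fields that follow the 16-byte e_ident array, in file order.
ELF32_FIELDS = ("e_type", "e_machine", "e_version", "e_entry", "e_phoff",
                "e_shoff", "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
                "e_shentsize", "e_shnum", "e_shstrndx")

def parse_elf32_header(data: bytes) -> dict:
    """Parse a 32-bit little-endian ELF header into a field dict."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # 2 halfwords, 5 words, 6 halfwords, starting right after e_ident.
    values = struct.unpack_from("<HHIIIIIHHHHHH", data, 16)
    header = dict(zip(ELF32_FIELDS, values))
    header["e_ident"] = data[:16]
    return header

# A minimal synthetic header: ET_EXEC (2), EM_386 (3), version 1,
# entry point 0x8048000, one program header of 32 bytes at offset 52.
ident = b"\x7fELF\x01\x01\x01" + b"\x00" * 9
body = struct.pack("<HHIIIIIHHHHHH", 2, 3, 1, 0x8048000, 52, 0, 0,
                   52, 32, 1, 40, 0, 0)
hdr = parse_elf32_header(ident + body)
print(hdr["e_type"], hex(hdr["e_entry"]))  # 2 0x8048000
```

Extracting these fields per sample is exactly what produces header-based static features such as those compared in Table 4.4.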
* Sections
All the information required during the linking process to turn a target object file into a working executable resides in sections. Sections are needed only at link time; they are of no use at run time. The ELF header points to a section header table which contains information about each section present in the file. This table contains a number of section headers, one for each section. The fields of a section header are described in Table 4.2:
Table 4.2: Fields in the section header

sh_name: an offset into the .shstrtab section, pointing at the string which is the section's name
sh_type: the type of the section, e.g. program data, symbol table, string table
sh_flags: the attributes of the section
sh_addr: for sections which are loaded into memory, their virtual address
sh_offset: the offset of the section in the file image
sh_size: the size of the section
sh_link: the section index of another associated section
sh_info: additional information about the section
sh_addralign: the required alignment of the section
sh_entsize: for sections which hold a table of fixed-size entries, the size of each entry
There are a number of sections, each playing a different role in the linking process. Let us look at some of them:
· .text section: contains the user's executable code.
· .data section: contains all the initialized data.
· .rodata section: contains read-only initialized data.
· .bss section: contains all the uninitialized data.
· .got section: for dynamic binaries, this section (the Global Offset Table) contains the addresses of all the variables which are relocated upon loading.
· .got.plt section: the GOT entries assigned to dynamically linked functions.
· .dynamic section: for dynamic binaries, this section contains information about dynamic linking, used by the runtime linker.
· .dynsym section: the runtime symbol table.
· .dynstr section: contains the null-terminated strings that are the names of the dynamic symbols.
· .symtab section: the compile-time symbol table; .dynsym is a subset of this section.
· .strtab section: contains the names of the symbols in the symbol table.
· .shstrtab section: contains the names of the sections.
· .rela.dyn section: the runtime relocation table.
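To make this machinery concrete, resolving section names means following e_shoff to the section header table and looking each sh_name offset up in .shstrtab. A minimal illustrative sketch for 64-bit little-endian ELF images (again, not our production extractor):

```python
import struct

def section_names(data: bytes) -> list:
    """List section names from a 64-bit little-endian ELF image."""
    e_shoff, = struct.unpack_from("<Q", data, 0x28)   # start of section header table
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<3H", data, 0x3A)
    # sh_offset of .shstrtab (sh_offset sits at offset 0x18 in a 64-bit section header)
    shstr_off, = struct.unpack_from("<Q", data, e_shoff + e_shstrndx * e_shentsize + 0x18)
    names = []
    for i in range(e_shnum):
        sh_name, = struct.unpack_from("<I", data, e_shoff + i * e_shentsize)
        end = data.index(b"\x00", shstr_off + sh_name)  # names are null-terminated
        names.append(data[shstr_off + sh_name:end].decode())
    return names
```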
* Segments
Segments are described by program headers. In the execution view, the ELF file is broken into suitable chunks, known as segments, which get loaded into memory. Like the section header table, there is also a program header table; the program header table is optional in the linking view, while the section header table is optional in the runtime view. The program header table contains program headers which give information about the various segments present in the file image. The fields of a program header are defined in the table below:
Table 4.3: Fields in the program (segment) header

p_type: the type of the segment
p_flags: segment-related flags
p_offset: the offset of the segment in the file
p_vaddr: the virtual address of the segment in memory
p_paddr: the physical address of the segment
p_filesz: the size of the segment in the file
p_memsz: the size of the segment in memory
p_align: the required alignment of the segment
There are a number of segment types; some of them are described below:
· NULL: an unused segment entry.
· LOAD: a segment which gets loaded into memory; all other segments are mapped within the memory range of one of the LOAD segments.
· INTERP: the segment to which the .interp section gets mapped.
· DYNAMIC: essentially the .dynamic section in memory.
We have just looked at an overview of the ELF file format. Now let us go back to our static feature extraction, which uses all the information described above.
– ELF structure feature set
In this section we discuss the features we have extracted from the ELF structure.
* ELF Header
This gives us information about the organisation of the ELF file. Out of the various fields present in the ELF header, we picked seven for our feature set. A statistical comparison of these seven features between malware and benign files is shown in Table 4.4:
Table 4.4: Mean value comparison of different fields in the ELF header

Feature                              Mean for benign   Mean for malware
Number of section headers            29.647            97.396
Size of ELF header                   63.466            52.202
Number of program headers            8.921             4.222
Start of section header table        170400.26         346456.475
Start of program header table        63.512            52.189
Size of program headers              55.025            32.404
Section header string table index    27.729            94.977
* Section Header Table
The basic structure of the section header table was discussed in the previous section. From the section headers, we have used the section name and section type in our feature list. The frequency distribution of the various sections in benign and malware files is shown below:
FIGURE 4.5. Frequency distribution of various sections
FIGURE 4.6. Frequency distribution of various section types
* Program Header Table
The structure of the program header table was discussed under segments in the previous section. Out of its various fields, we have used the segment type in our feature list. A comparison of the segment types for benign and malware files is shown below:
FIGURE 4.7. Frequency distribution of various segment types
* Dynamic section
The runtime linker uses this section to find all the information necessary for dynamic linking and relocation. Each entry in the Dynamic section has two fields, 'tag' and 'value', and the number of entries is not fixed. In our work, we have used the contents of the 'tag' field in our feature list.
* Symbol table
The symbol table contains a large amount of data needed to link or debug files. A symbol table entry has five fields: name, value, info, size and section header index. We categorized the symbols according to their 'info' field, and the objects and functions according to their scope and info. As features, we used the 14 categories thus created and the count of dynamic symbols. The frequency distribution of these features for benign and malware files is shown below:
FIGURE 4.8. Frequency distribution of various symbol table features
– Strings based feature extraction
Strings extracted from a file can be informative, but most of the strings we get come from the file structure: names of objects and functions from the symbol tables, arguments of functions, or garbage values.
FIGURE 4.9. Example of GNU string output
All of these we are already using as features in one way or another, so reusing them would introduce redundancy into our feature set. Instead, we use frequency bins over the lengths of the extracted strings as features.
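The string-length binning can be sketched as follows; the minimum length of 4 mirrors the GNU strings default, while the bin edges here are hypothetical choices for illustration:

```python
import re
from collections import Counter

def string_length_bins(data: bytes, min_len: int = 4, bin_edges=(8, 16, 32, 64)):
    """Emulate GNU `strings` (runs of printable ASCII of length >= min_len),
    then count the extracted strings into length bins."""
    runs = re.findall(rb"[ -~]{%d,}" % min_len, data)
    bins = Counter()
    for s in runs:
        for edge in bin_edges:
            if len(s) <= edge:
                bins["<=%d" % edge] += 1
                break
        else:
            bins[">%d" % bin_edges[-1]] += 1  # longer than the largest edge
    return bins
```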
4.1.2.2 Dynamic Feature Extraction
The runtime-behaviour based features are extracted from the reports generated for each file by the Limon sandbox. The full Limon report is a text file which contains static analysis (structure information), ssdeep [7] (fuzzy-hash comparison with other reports), dynamic analysis (system call trace) and network analysis (if the file engaged in some network activity).
Most malware authors now use polymorphism and metamorphism techniques, which result in a single file having multiple signatures. But when the program is loaded into memory, it is decrypted to its original form in order to act. Here ssdeep comes in: it takes fuzzy hashes of the loaded binaries and compares them with those of other files. This lets us remove files which have different signatures but the same fuzzy hash, reducing redundancy in our dataset.
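The deduplication step can be sketched as below; fuzzy_compare is a stand-in for ssdeep's comparison function (which returns a 0-100 similarity score), and the threshold of 90 is a hypothetical choice for illustration:

```python
def deduplicate(samples, fuzzy_compare, threshold=90):
    """Keep one representative per group of near-identical fuzzy hashes.
    samples: iterable of (name, fuzzy_hash) pairs."""
    kept = []
    for name, fhash in samples:
        # keep the sample only if it is not similar to anything already kept
        if all(fuzzy_compare(fhash, kept_hash) < threshold for _, kept_hash in kept):
            kept.append((name, fhash))
    return kept
```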
For this work, we have used the system call trace and the arguments of the system calls in our feature list. Let us look at them one by one.
– System Calls
System calls give us information about what a process wants to perform, in other words the behaviour of the process. The Limon sandbox uses 'strace' to get a full system call trace of a process and of its child processes. An example of strace output is shown below:
FIGURE 4.10. Example of strace output
The output of 'strace', as seen in the picture above, contains the system calls, their arguments and their return values. We have created an architecture through which we extract only the names of the system calls from the strace report.
Linux uses a fixed set of system calls. We have put all of them in our feature list and check, for each file, which system calls it uses. A statistical comparison of the system calls observed for benign and malware executables in our dataset is shown in Fig. 4.11 and Fig. 4.12.
FIGURE 4.11. 20 most frequent system calls of benign
FIGURE 4.12. 20 most frequent system calls of malware
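The syscall-name extraction can be sketched as below; the regular expression is an illustrative approximation of strace's line format, not the exact parser we built:

```python
import re
from collections import Counter

# Illustrative approximation of an strace line: an optional "[pid N]" prefix,
# then the syscall name immediately followed by "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def syscall_counts(strace_lines):
    """Count system-call names in strace output lines; lines that do not
    start with a syscall (signals, exit notices) are skipped."""
    counts = Counter()
    for line in strace_lines:
        m = SYSCALL_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts
```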
– File system based features
* proc and sysfs filesystems
proc and sysfs are virtual filesystems which expose runtime system information: information on processes, system and hardware configuration, and on the kernel subsystems and kernel drivers. A comparison of some of the proc and sys files accessed by malware and benign files is shown below.
FIGURE 4.13. 7 most frequent proc files accessed by benign
FIGURE 4.14. 7 most frequent proc files accessed by malware
FIGURE 4.15. 7 most frequent sys files accessed by benign
FIGURE 4.16. 7 most frequent sys files accessed by malware
From our dataset we observed that a large portion of the malware accesses '/proc/net/route', i.e. the system routing table, to get the list of all active network interfaces. We also find that they access '/proc/net/tcp' and '/proc/net/dev' to get information about active TCP sockets and about sent and received packets, respectively. In sysfs, we saw the malware accessing '/sys/class/net/' to get the length of the transmission queue. This information is very important for performing a DDoS attack.
Some of the sys and proc files, such as '/proc/cpuinfo', '/proc/sysinfo' and '/sys/class/dmi/id/product_name', are used by malware authors for VM detection. In our dataset, we observed these files being accessed considerably more frequently by malware.
* etc filesystem
The /etc directory contains all the system configuration files: system configuration tables, configuration files which can force a service to start or stop, configuration files of installed programs, and configuration files which record allowed or restricted users, permitted IPs, etc.
FIGURE 4.17. 7 most frequent etc files accessed by benign
FIGURE 4.18. 7 most frequent etc files accessed by malware
From our malware dataset we observed that network configuration files like '/etc/resolv.conf' and '/etc/hosts' are accessed more frequently. We also observed that a chunk of the malware accesses the '/etc/passwd' file, which gives information about every registered user. Flooders (a malware type) use this information to find a backdoor account. They also try to edit '/etc/passwd' and '/etc/shadow' to add a new user.
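A sketch of how such file-access features can be pulled from a trace; the open/openat pattern and the prefix filter are illustrative assumptions, not our exact parser:

```python
import re

# Illustrative pattern for open()/openat() lines in an strace log.
OPEN_RE = re.compile(r'open(?:at)?\([^"]*"([^"]+)"')

def accessed_paths(strace_text, prefixes=("/proc/", "/sys/", "/etc/")):
    """Collect opened file paths, filtered to the filesystems we use as features."""
    return sorted({p for p in OPEN_RE.findall(strace_text) if p.startswith(prefixes)})
```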
* Shell commands
The shell acts as an interface between the user and the operating system; through commands we can use the services of the OS. In our dataset, 16 percent of the malware executed at least one external command, while for the benign files the percentage is quite low, nearly 3-4 percent. In total we found 131 unique commands in our dataset. Commands like cp, netstat, iptables, touch and file are the ones most frequently seen being executed by malware. Some of the malware tried to execute the system 'reboot' command, and some executed the 'ufw' command, which can be used to alter the network firewall. In the benign files not many commands were found to be executed; commands like file, grep and basename are the ones mostly seen.
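Executed external commands can be recovered from the execve lines of the trace; a hedged sketch (the regular expression is an illustrative approximation):

```python
import re

EXECVE_RE = re.compile(r'execve\("([^"]+)"')

def executed_commands(strace_text):
    """Return the set of external command basenames seen in execve() calls."""
    return {path.rsplit("/", 1)[-1] for path in EXECVE_RE.findall(strace_text)}
```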
4.1.3 Machine learning classifier
We have used the Python-based machine learning library scikit-learn to evaluate the efficiency of our hybrid approach. The classification algorithms we used are described below:
– KNN
KNN stands for K-Nearest Neighbours. In this algorithm there is a large number of labelled data points in the feature space; when an unlabelled data point arrives, it is labelled by looking at its K nearest neighbours and assigning the label held by the majority of them. The distance generally used is the Euclidean distance, i.e. the square root of the sum of the squared differences between each feature of the two data points.
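The idea can be sketched in a few lines of plain Python (a toy illustration using math.dist for the Euclidean distance, not our scikit-learn setup):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority
    label among the k nearest neighbours of `query`."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```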
– Decision Tree
The decision tree is one of the most intuitive and popular methods of data mining: it provides explicit classification rules and copes well with heterogeneous data, missing data and nonlinear effects. It uses information gain to select as root the node which gives the highest information gain, and proceeds similarly down to the leaf nodes where the decision is made.
– Random Forest
Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it somewhat random. The forest it builds is an ensemble of decision trees, usually trained with the "bagging" method, whose general idea is that a combination of learning models improves the overall result. One big advantage of random forest is that it can be used for both classification and regression problems. The algorithm brings extra randomness into the model while growing the trees: instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of the features. This creates wide diversity among the trees, which generally results in a better model.
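A minimal scikit-learn sketch of this training setup, with a synthetic stand-in for our feature matrix (the real features, split and hyperparameters differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (samples x features) matrix and benign/malware labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = [KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0),
               RandomForestClassifier(n_estimators=100, random_state=0)]
scores = {}
for clf in classifiers:
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = clf.score(X_te, y_te)  # held-out accuracy
```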
4.2 Summary
In this chapter we saw the architecture of our model, the types of features we used and how they are extracted, and the different machine learning classifiers used in our model for the detection of malware. In the next chapter we discuss the dataset we used and present the results of our model on it.
CHAPTER 5. RESULT AND DISCUSSION
5.1 Dataset
To make our model robust, the first thing we needed was a large corpus of data. Most previous authors used very small amounts of data, so one of the major challenges we faced was collecting a large dataset. For malware, we collected our data from VXHeavens, VirusTotal and Detux.org [2]. For benign files, we scraped the executables from the /bin, /sbin and /usr/bin directories of a fresh Linux operating system, and we downloaded some open-source C/C++ projects for the Intel architecture and compiled them on our system to get more benign executables. For the final analysis, we used 7717 malware and 2265 benign executables.
5.2 Evaluation Metric
We have used several metrics to evaluate the performance of the classification models. Below is a brief description of each.
5.2.1 Confusion Matrix
A confusion matrix (Kohavi and Provost, 1998) contains information about the actual and predicted classifications made by a classification system. The performance of such systems is commonly evaluated using the data in this matrix. The following table shows the confusion matrix for a two-class classifier.
The entries in the confusion matrix have the following meaning in the context of our study:
– True positive: samples which are correctly predicted as benign
– True negative: samples which are correctly predicted as malware
– False positive: samples which are malware but predicted as benign
– False negative: samples which are benign but predicted as malware
Table 5.1: Confusion matrix for a two-class classifier

                   Predicted positive   Predicted negative
Actual positive          #TP                  #FN
Actual negative          #FP                  #TN
5.2.2 Additional metrics
– TPR: the True Positive Rate, also known as recall, is defined as the ratio of the number of true positives to the total number of positive samples:

TPR = TP / (TP + FN)

– FPR: the False Positive Rate is defined as the ratio of the number of false positives to the total number of negative samples:

FPR = FP / (FP + TN)

– Precision: precision (PV) is the proportion of the predicted positive cases that are correct:

PV = TP / (TP + FP)

– F score: when the dataset is imbalanced, the F score is used to measure how correct the model is. It is calculated as the weighted harmonic mean of precision and recall:

Fscore = (2 * TPR * PV) / (TPR + PV)
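These definitions translate directly into code; a small helper sketch:

```python
def classification_metrics(tp, fn, fp, tn):
    """TPR (recall), FPR, precision and F-score from confusion-matrix counts."""
    tpr = tp / (tp + fn)                               # TP / (TP + FN)
    fpr = fp / (fp + tn)                               # FP / (FP + TN)
    precision = tp / (tp + fp)                         # TP / (TP + FP)
    f_score = 2 * tpr * precision / (tpr + precision)  # harmonic mean
    return tpr, fpr, precision, f_score
```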
5.3 Training and Testing
For our experiments we used Ubuntu 16.04 with 32 GB RAM and an Intel i7 octa-core processor. We used 70% of the data for training and 30% for testing. To minimize the risk of overfitting and to get a generalized result, we used 10-fold cross-validation.
5.4 Result
As already described, we used three machine learning classifiers and checked our model's efficiency with each of them. The results we obtained are highly promising. Tables 5.2 and 5.3 show the results achieved on our dataset for static-only features and for the hybrid approach, respectively.
Table 5.2: Test results on the static feature set

          --------- KNN ---------   ----- Decision Tree ----   ----- Random Forest ----
Class     TPR    FPR    Pr     FM   TPR    FPR    Pr     FM    TPR    FPR    Pr     FM
Benign    0.888  0.067  0.757  0.817  0.978  0.017  0.937  0.957  0.982  0.014  0.950  0.966
Malware   0.932  0.111  0.972  0.951  0.982  0.021  0.994  0.988  0.986  0.017  0.995  0.990
Average   0.922  0.101  0.923  0.921  0.981  0.020  0.981  0.981  0.985  0.016  0.985  0.985
Table 5.3: Test results on the hybrid feature set

          --------- KNN ---------   ----- Decision Tree ----   ----- Random Forest ----
Class     TPR    FPR    Pr     FM   TPR    FPR    Pr     FM    TPR    FPR    Pr     FM
Benign    0.914  0.062  0.777  0.840  0.983  0.010  0.963  0.973  0.989  0.006  0.976  0.982
Malware   0.938  0.085  0.978  0.958  0.989  0.016  0.995  0.992  0.993  0.010  0.997  0.995
Average   0.932  0.079  0.932  0.9314  0.987  0.014  0.987  0.987  0.992  0.009  0.992  0.992
As Tables 5.2 and 5.3 show, Random Forest gives the best detection accuracy. We can also see that the detection efficiency increases in all three models (KNN, Decision Tree, Random Forest) when we move from the static feature set to the hybrid feature set. The best weighted-average F-measure, 0.992, is obtained with Random Forest, which is quite good considering that the TPR for benign is low in comparison with that for malware. The main goal of our work was a high TPR for malware while not misclassifying too many benign files as malware. Figure 5.1 depicts the confusion matrix for the best detection accuracy, obtained with Random Forest.
Confusion matrix statistics:
* malware/malware = 99.69% ( 2313/2320 )
* malware/benign = 0.31% ( 7/2320 )
* benign/benign = 97.62% ( 659/675 )
* benign/malware = 2.37% ( 16/675 )
FIGURE 5.1. Confusion matrix result
It can be observed from the confusion matrix that the false negatives for malware are quite low, which also explains the high precision value for malware, whereas the false negatives for benign are quite high compared to malware, which also explains the lower precision value for benign.
5.5 Comparison To Existing Approaches
In this section we compare our work with other work on Linux malware analysis. To the best of our knowledge, all prior approaches are either purely static or purely dynamic; our work is the first to use a hybrid approach, i.e. one integrating static and dynamic features.
Table 5.4 shows the comparison. As the table makes clear, most prior work performed its analysis on a very small dataset; in our work we used a large corpus of both malware and benign files to make our model robust. Shahzad, F. performed analysis using fields of the static ELF structure with a 99% detection accuracy, but since that approach is purely static, they had to reject samples with forged headers. Ashmita, K. et al. (2014) used a dynamic approach in which they analyzed system calls. They reported a high detection accuracy of 99.40%, but their dataset had only 226 malware samples and their number of features was very small. Our model achieves a comparable average detection accuracy of 99.14%, and the size of our dataset is far larger, which makes our model robust.
Table 5.4: Previous works on Linux malware analysis

Authors                      Features   Accuracy   Dataset                      Type of features
Shahzad, F. (2011)           383        99%        709 benign, 709 malware      Static: ELF structure
Jinrong Bai et al. (2012)    100        98%        756 benign, 763 malware      Static: symbol table
Ashmita, K. et al. (2014)    27         99.40%     442 benign, 226 malware      Dynamic: system calls
Shahzad, F., Bhatti (2013)   16         96%        105 benign, 114 malware      Dynamic: process control block
Ours                         115260     99.14%     2265 benign, 7717 malware    Static: ELF header + strings; Dynamic: system calls + file systems + shell commands
Conclusion: We have used a new approach to Linux malware analysis, combining the traditional static and dynamic ones. Our model has shown very good results, and we used a large dataset to demonstrate its robustness.
CHAPTER 6. SCOPE AND FUTURE WORK
6.1 Supporting Multiple architecture
During the collection of malware samples, we came across various ELF files targeting different architectures. Due to the limited scope of our work, we analyzed only the files built for the Intel architecture, so many files remain un-analyzed because of this limitation. In the future, our work can be extended to files from other architectures.
6.2 Analysis on different file format
In this work our main focus was the ELF file format, but some malware uses other file types, like Perl, Python, shell, Bash and PHP scripts, to perform its malicious activity. The Limon sandbox, which we used in this work for dynamic analysis, is capable of producing runtime reports for these files as well. In the future we can add a module to analyze these file types too.
6.3 Multi-path execution of files
Currently the Limon sandbox gives us the report of a single execution path of an executable file. This is a limitation of the dynamic approach: we are unable to observe all possible execution paths of the malware and thus cover its complete runtime behaviour. In the future we can add modules to our model so that it generates more comprehensive reports.
APPENDIX A
Code base for Linux malware detection: https://github.com/Anmol33/M.Tech_thesis.git
BIBLIOGRAPHY
[1] AV-TEST security report.
https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf
[2] Detux.org.
https://detux.org/
[3] ELF format.
http://www.skyfree.org/linux/references/ELF_Format.pdf
[4] Limon sandbox.
https://github.com/monnappa22/Limon
[5] Offensive Computing.
http://www.offensivecomputing.net/
[6] readelf tool.
https://sourceware.org/binutils/docs/binutils/readelf.html
[7] ssdeep (fuzzy hash).
https://ssdeep-project.github.io/ssdeep/index.html
[8] strace tool.
https://strace.io/
[9] VirusTotal statistics.
https://www.virustotal.com/en/statistics/
[10] VX Heaven.
http://vx.netlux.org/
[11] WatchGuard internet security report.
https://media.scmagazine.com/documents/306/wg-threat-reportq1-2017_76417.pdf
[12] S. M. JINRONG BAI, YANRONG YANG AND Y. MA, Malware detection through
mining symbol table of linux executables., Information Technology Journal,
(2012).
[13] A. K.A AND V. P, Linux malware detection using non-parametric statistical
methods, Chakraborty R.S., Matyas V., Schaumont P. (eds) Security, Privacy,
and Applied Cryptography Engineering. SPACE, (2014).
[14] B. S. S. M. SHAHZAD, F. AND M. FAROOQ, In-execution malware detection
using task structures of linux process, IEEE International Conference on
Communication, pp. 1–6, (2011).
[15] F. SHAHZAD AND M. FAROOQ, Elf-miner: using structural knowledge and data
mining methods to detect new (linux) malicious executables, Knowledge and
Information Systems, 30 (2012), pp. 589–612.
[16] S. M. SHAHZAD, F. AND M. FAROOQ, In-execution dynamic malware analysis
and detection by mining information in process control blocks of linux os, Inf.
Sci. 231, 45–63, (2013).