network based intelligent malware detection...malmenator network based intelligent malware detection...

Malmenator

Network Based Intelligent Malware Detection

Final Year Project

Interim Report

February 2, 2020

David B. Han (3035344211)Piyush Jha (3035342691)

Supervisor: Dr. Dirk SchniedersUniversity of Hong Kong

github.com/piy0999/Malmenator

github.com/piy0999/Malmenator

Abstract

Strong security measures in the digital world are becoming increasingly important as indicated

by the meteoric rise of spending on cybersecurity as well as the increased losses from successful

cyberattacks. One important research area in cybersecurity is network anomaly detection, which is

implemented in network-based intrusion detection systems (NIDSs) to identify and stop malicious

network activity before it causes damage. Unfortunately, most of the open source NIDSs utilize

a purely rules-based approach to detect anomalies, which render them vulnerable to new exploits.

On the other hand, many novel machine learning approaches have sub optimal performance due

to challenges in their lack of interpretability as well as in the absence of large and representative

datasets for training. Malmenator takes a two step approach to create a stronger anomaly detection

engine. It first evaluates publicly available network traffic datasets and compiles a comprehensive

dataset for evaluation. Malmenator then proceeds to implement a new hybrid approach that utilizes

both machine learning and rules-based methods to create a custom NIDS. All of these features are

then combined into a portable hardware component that functions both as an NIDS and a WiFi

router. Malmenator has already implemented a basic NIDS on a WiFi router constructed out of a

Raspberry Pi and has also researched and selected a comprehensive dataset, CIDDS-001, for building

the anomaly detection model. Malmenator has just begun researching different methodologies for

implementing the network anomaly detection model and is making decent progress in accordance

with our projected timeline.

i

Acknowledgements

We would like to thank our supervisor, Dr. Dirk Schnieders, for his support and advice regarding the

usage of machine learning in our project. His input allowed us to appropriately define our project

scope and focus our research in a direction with the most impact. Furthermore, we would like to

extend a special token of gratitude to professors from the Centre of Applied English Studies at the

University of Hong Kong in helping us improve our writing skills to express the work done during

the project. Lastly, we would like to thank the Department of Computer Science at The University

of Hong Kong for reimbursing the costs incurred for buying various equipment during the project

and making the project feasible.

ii

Contents

Abstract i

Acknowledgements ii

List of Figures vi

List of Tables vii

Abbreviations viii

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Report organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 A Primer on Cybersecurity 3

2.1 What are Cyberattacks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Economic Impacts of Malware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Categories of Cybersecurity Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.4 Ethics of Malware Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.5 Challenges of Malware Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Project background 8

3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.1.1 What is an NIDSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

iii

3.2 NIDSs vs network-based intrusion prevention systems (NIPSs) . . . . . . . . . . . . 9

3.3 NIDS implementation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.4 Overview of NIDS techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.4.1 Signature-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.4.2 Anomaly-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.5 Packet-based and flow-based data formats . . . . . . . . . . . . . . . . . . . . . . . . 11

3.5.1 NetFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.6 Nfdump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Previous works 14

4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.2 Network traffic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.2.1 Methods for network traffic dataset generation . . . . . . . . . . . . . . . . . 14

4.2.2 Old reference datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.2.3 Key datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.2.3.1 CIDDS-001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.2.3.2 UGR 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.2.3.3 CICIDS 2017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.2.4 Other relevant datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.2.4.1 UNSW-NB15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2.4.2 MAWILab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2.4.3 UNB ISCX 2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2.4.4 CTU-13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2.5 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.3 Machine learning in network anomaly detection . . . . . . . . . . . . . . . . . . . . . 18

4.3.1 Classification methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.3.2 Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.3.3 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.3.4 Constraints of machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

iv

5 Methodology 21

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.2 Network scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.2.1 Raspberry Pi hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.2.2 Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.2.3 Utilizing nfdump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.3 Dataset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.3.1 CIDDS-001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.4 Network anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.4.1 Hybrid network anomaly detection approach . . . . . . . . . . . . . . . . . . 24

5.4.2 Model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.5 Web interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.5.1 Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.5.2 Web architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6 Discussion and remarks 27

6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6.2 Current Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6.3 Discussion of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.3.1 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.4 Preliminary anomaly detection model results . . . . . . . . . . . . . . . . . . . . . . 30

6.5 Project timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6.6 Engineering Risk mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.7 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.7.1 Steep learning curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.7.2 Scope identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.7.3 Proprietary information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Conclusion 33

Bibliography 34

v

List of Figures

2.1 Estimating financial losses to cyberattacks . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1 Inline and passive implementations of an NIDS . . . . . . . . . . . . . . . . . . . . . 10

3.2 Overview of packet-based headers by protocol . . . . . . . . . . . . . . . . . . . . . . 12

4.1 Taxonomy of network anomaly detection methods . . . . . . . . . . . . . . . . . . . 18

5.1 Network setup with the Malmenator network scanner . . . . . . . . . . . . . . . . . . 22

5.2 Malmenator web architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

vi

List of Tables

6.1 Progress Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.2 Planned project timetable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

vii

Abbreviations

ANN artificial neural network

DDoS distributed denial-of-service

ELK Elasticsearch Logstash Kibana

ICMP Internet Control Message Protocol

IDS intrusion detection system

IP Internet Protocol

KNN K-nearest neighbor

LAN local area network

NIDPS network-based intrusion detection and prevention system

NIDS network-based intrusion detection system

NIPS network-based intrusion prevention system

SSH secure shell

SVM support vector machine

TCP Transmission Control Protocol

UDP User Datagram Protocol

WLAN wireless local area network

viii

Chapter 1

Introduction

1.1 Motivation

Over the past decade, spending on cybersecurity has been consistently increasing year over year, yet

the losses due to cyberattacks only continue to grow [1]. With the scarcity of investment in network

security research and the lack of effective open source network anomaly detection methods, this

research project aims at empowering open source network security technology [2]. Additionally, the

Malmenator project aims to create an effective, affordable, and portable NIDS that can be adapted

to any home or small office environment. By making cybersecurity more readily accessible, we aim

to increase trust in a society that is undergoing an enormous digital transformation.

1.2 Problem formulation

From an industry perspective, network security solutions for individuals are inaccessible and

expensive. Due to the confidential nature of data breaches and network traffic, individual parties

such as governments and corporations have very little cooperation compared to other areas of

computer science research [3]. This is especially true for network cybersecurity as network traffic

data often contains highly sensitive information. Currently, most network security solutions are

tailored towards large enterprises that can afford the implementation costs, while most individuals

and small groups leave network security to their Internet Service Providers.

From an academic perspective, network anomaly detection faces two large challenges. The

foremost problem is the lack of a single widely accepted network traffic dataset. Without a

1

comprehensive and representative dataset, the second problem is reached - appropriately

developing, testing, and deploying anomaly-based detection techniques using standardized norms

and metrics. Malmenator aims to bridge the industry gap through a research based project, and in

doing so, target the two large issues that plague research into network security - the lack of a

comprehensive dataset for NIDS evaluation and the absence of powerful open source hybrid

detection tools.

1.3 Contributions

There are several contributions that Malmenator hopes to make upon its completion. First,

Malmenator hopes to contribute to the open source community in its development of new anomaly

detection methodologies built on Netflow data. Additionally, Malmenator aims to shed light on the

difficulties of testing NIDSs with both simulated and real datasets and provide a more

comprehensive solution to this issue. Lastly, Malmenator would like to introduce novel

combinations of malicious network detection methodologies that have real-world and commercial

implications.

1.4 Report organization

The remaining contents of this report are as follows. This report begins by providing a brief

introduction to the cybersecurity industry as a whole. It then gives an introduction to NIDSs and

NetFlow data format before proceeding to summarize previous works related to the major aspects

of the Malmenator project, namely network traffic dataset creation and network anomaly

detection. For traffic dataset creation, this report looks at the different types of network data

formats before highlighting the key dataset Malmenator will be using for model training and

evaluation in Malmenator. For network anomaly detection, this report first summarizes the

different high-level machine learning approaches that have been explored by various researchers,

and then proceeds to the project methodology and describes the steps by which Malmenator will

implement its hardware, analyze its dataset, detect anomalies, and visualize its system. Lastly, this

report concludes by discussion Malmenator’s project progress, challenges, and timeline.

2

Chapter 2

A Primer on Cybersecurity

This chapter offers an overview of the technology and economic conditions of various aspects of the

the cybersecurity industry. Reading this chapter useful for understanding the basics of cybersecurity

for the unfamiliar and may be skipped if one wishes to delve directly into details of the project. This

chapter first describes the nature of cyberattacks before giving an overview on various high level

aspects of cybersecurity including its economic impacts, ethics, and challenges.

2.1 What are Cyberattacks?

A cyberattack is an intentional and malicious attempt by an individual or organization to disrupt

or penetrate the information system of another individual or organization. There are many forms of

cyberattacks, some of common forms are described below. Note that the list below is non-exhaustive;

cyberattacks are constantly evolving and adapting to defensive measures as they are deployed.

Malware: Malware is a combination of the words mal icious software. Malware breaches a

network through a vulnerability or exploit, and usually occurs when a user clicks a dangerous

link or email attachment that installs the malware. Once activated, malware can perform

any of a number of actions depending on the malware type. In the case of ransomware, the

malware can block access to key files until a fee is paid. In the case of spyware, the malware

can secretly obtain and transmit information from the system. In the case of a backdoor, the

malware can allow hackers to gain remote access to the computer. There are a large number

of malware variations, and new forms are discovered every day.

3

Man-in-the-middle attack : Man-in-the-middle attacks occur when attackers insert themselves

into a a communication line. Thus, instead of one party communicating directly with the other

party, all communication traffic is unknowingly sent through the ”man-in-the-middle”, which

enables the attacker to monitor and steal data.

Denial-of-service attack : Denial-of-service attacks overwhelms system networks with a huge

amount of traffic to exhaust their resources and bandwidth to cause the system to be unable

to fulfill legitimate attacks. In the event that this attack is simultaneously launched from a

number of compromised devices, the attack is known as a distributed-denial-of-service (DDoS)

attack.

Zero-day exploit : Zero-day exploits occur when a network vulnerability is discovered and

announced, but before an update has been released to patch the problem. Attackers target

the disclosed vulnerability during the window of time before that patch is released.

Each form of cyberattack takes advantage of different system vulnerabilities and have a variety of

different risks. Furthermore, each form can be further subdivided into various types and families as

is evident in the malware section. In this project, we focus on forms of attack that can be detected

using network traffic patterns, which includes denial-of-service attacks, man-in-the-middle attacks,

and certain types of malware such as backdoors.

2.2 Economic Impacts of Malware

Cyberattacks from malware are an increasingly large problem, and companies and governments

are spending more year over year in order to properly protect themselves. From 2012 to 2018,

average annual cybersecurity expenditures per employee doubled from $584 USD to $1,178 [4]. In

2019, global spending on cybersecurity initiative is expected to exceed $100 billion USD [5]. These

numbers are only expected to continue to rise as the financial incentives to engage in malicious

attacks only continue to rise.

In 2018, a conservative estimate of financial losses caused by cyberattacks is at least $45 billion

USD across millions of reported attacks such as DDoS and ransomware attacks [3]. Including the

financial losses due to data breaches and the loss of more than 2 billion consumer data records, this

number rises to a staggering $654 billion USD for US corporations in 2018 alone [2]. Cyberattacks

to U.S. based financial services organizations in Q1 of 2019 alone cost more than $6.2 billion USD,

a sharp rise from just $8 million USD in Q1 of 2018.

4

It must be noted that these costs are general estimates as the exact loss of value can be difficult

to calculate. There is no centralized data set on the costs of cyberattacks and data breaches. Thus,

many statistics in the cybersecurity industry come from surveys, which can suffer from

non-representative and inaccurate reporting. Often, firms are reluctant to report negative

information which may cause these statistics to bias downwards due to under-reporting. The

White House published a report in 2018 containing figure 2.1 depicting the various economic

impacts of cyberattacks and their respective difficulties in quantifying cost [1]. It can be seen that

certain losses such as court fees and forensic costs are easy to justify, but damage to reputation

and loss of IP are difficult to quantify. Overall, cybersecurity breaches impact organizations across

the world in various ways and magnitudes.

Figure 2.1: Estimating financial losses to cyberattacks

2.3 Categories of Cybersecurity Solutions

Cybersecurity efforts are growing rapidly in order to respond to the quickly shifting landscape of

malware attacks. Defensive solutions can take on a variety of forms and roles including anti-virus

software, identity and access management tools, intrusion detection system (IDS), and others. The

uses of most of these follow directly from their naming convention. Anti-virus software are the bread

and butter of the cybersecurity industry and include software such as Windows Defender, Norton

Antivirus, and Avast Antivirus. These tools can scan system files and downloads to check for virus

5

signatures that are then matched to a trusted database. Identity management tools include solutions

like 2-factor authentication where users must confirm their identity through a verification code sent

to an email or mobile device. These solutions enable organizations to more stringently verify a user’s

online identity before granting access to sensitive information. IDSs are used to monitor network

traffic for suspicious activity and issue alerts when such activity is discovered. Appropriate actions

can then be taken such as blocking traffic from suspicious IP addresses or discarding undesirable

packets. IDSs are at the heart of the Malmenator project.

2.4 Ethics of Malware Research

In order to defend against malware attacks, researchers first need to understand how malicious

code works. Often, this requires preemptively creating malware in order to find vulnerabilities and

appropriate solutions. Thus, not all malware is created with malicious intentions. People in the

cybersecurity industry can be categorized into one of four categories: white hat, black hat, grey hat,

and red hat. The definitions of these types of hackers are listed below:

White Hat : White hat hackers are people who create malware and attempt to break into

computer systems for a good cause. These people could work for a cybersecurity firm, or could

be a professional penetration testing consultant.

Black Hat : Black hat hackers create malware for malicious causes in order to extort individuals

and corporations for personal incentives such as money or power. These people are the ones

responsible for much of the malware that cause tremendous economic losses.

Grey Hat : Grey hat hackers do a mix of white hat and black hat activities and dabble in using

their knowledge and skills for both good and bad depending on the circumstances.

Red Hat : Red hat hackers are similar to black hat hackers, except they are employed by a

government to initiate attacks on foreign powers.

The key takeaway from this section is that ethics in malware research is not always

straightforward. However, publicly published academic research is generally a white hat effort, this

paper included.

6

2.5 Challenges of Malware Research

Performing substantive literature review to understand the cutting edge technology in malware

research is a difficult task because there is little incentive to publicly publish anti-malware

techniques. Any research that is published is also available to attackers who can choose to exploit

other vulnerabilities in the system. Thus, the papers published by the top cybersecurity firms and

government organizations are usually either outdated or extremely abstract without precise

implementation and performance details. A cybersecurity report published by the Royal Society

highlighted cybersecurity’s distinct characteristics including having multidisciplinary, global, and

cross-sectoral interest, which cause research to take place across academic, commercial, and

government sectors that further adds to these difficulties [6]. Information sharing across these

fields is not transparent, and many corporations are hesitant to share their vulnerabilities in

academic or government research as that may impact their reputation and harm their business.

Thus, much of the recently published academic research in malware detection have challenges in

practical implementation that must be taken into account.

7

Chapter 3

Project background

3.1 Overview

This section provides an introduction to the concept of NIDSs. Given that cybersecurity is an

extremely niche field in Computer Science, the information in the background is critical for

understanding the relevant tools and technologies that the Malmenator project is based on. This

section begins with a high level overview of what NIDSs are before delving into the details of their

implementation and categorization.

3.1.1 What is an NIDSs

NIDSs monitor network traffic in order to detect when an unauthorized intrusion is being carried

out by hostile entities by providing some or all of the following:

• Monitoring the condition of routers, firewalls, and servers

• Providing system admins a way to tune, organize and understand relevant operating system

audit trails and other logs that are often otherwise difficult to track or parse

• Including an extensive attack signature database against which information from the system

can be matched

• Recognizing and reporting when the IDS detects that data files have been altered

• Generating an alarm and notifying that security has been breached.

8

NIDSs offer organizations a number of benefits, starting with the ability to identify security

incidents. NIDSs can be used to help analyze the quantity and types of attacks, and organizations

can use this information to change their security systems or implement more effective controls.

NIDSs can also help companies identify bugs or problems with their network device configurations.

These metrics can then be used to assess future risks.

NIDSs can also help the enterprise attain regulatory compliance. An IDS gives companies greater

visibility across their networks, making it easier to meet security regulations. Additionally, businesses

can use their NIDSs logs as part of the documentation to show they are meeting certain compliance

requirements.

Although NIDSs monitor networks for potentially malicious activity, they are also prone to

false alarms (false positives). Consequently, organizations need to fine-tune their NIDS products

when they first install them. That means properly configuring their intrusion detection systems to

recognize what normal traffic on their network looks like compared to potentially malicious activity.

3.2 NIDSs vs NIPSs

NIDSs are used to monitor and analyze network traffic in order to identify suspicious activity and

alert system administrators in the event of an attack. A NIPS is similar to an NIDS, but differs in that

an NIPS can also be configured to automatically block potential threats without the intervention of

a system administrator. Historically, NIDSs were tailored to process network data more thoroughly

and rapidly than NIPSs, but with the advent of increased processing power, the line between the

two has become blurred. Today, most NIDSs provide configurations to allow for their capabilities

to extend into the territory of NIPSs. Hence, the term network-based intrusion detection and

prevention system (NIDPS) was coined. network-based intrusion detection and prevention systems

(NIDPSs) are systems that combine the capabilities of NIDSs and NIPSs. Although the prevention

aspect of NIDPSs are critical is dealing with certain aspects of cybersecurity, Malmenator focuses

on NIDSs implementation to allow for a more focused project. Further research should be done to

determine and implement the optimal reaction to different forms of cyberattacks.

3.3 NIDS implementation strategies

NIDSs are deployed at strategic points within a system network where it can best capture traffic

to and from all devices on the network. This usually involves being deployed directly within or in

9

parallel to a router, switch, or access point in a network. Figure 3.1 depicts two implementation

Figure 3.1: Inline and passive implementations of an NIDS

Inline (left) and passive (right) implementation of a NIDS as recommended by the National Institute ofStandards and Technology, U.S. Department of Commerce [7]. In the inline implementation, all networktraffic must pass through the NIDS which may throttle network speeds. In the passive implementation, acopy of all network traffic is send to the NIDS for processing, enabling network speeds to remain high.

methods of an NIDS. Inline implementation has the benefit of being able to directly respond to

attacks by blocking network traffic. On the other hand, passive implementation must rely on a

separate tool such as a firewall to secure traffic. Malmenator utilizes inline implementations of

NIDSs since Malmenator focuses not only on detecting, but also on preventing network attacks.

Note that in Figure 3.1, the NIDS is depicted as a separate hardware component. In some inline

implementations, such as with Malmenator, the NIDS can be set up directly within the router.

3.4 Overview of NIDS techniques

This section describes the high-level methodologies that are employed in NIDSs to detect malicious

traffic, namely signature-based detection and anomaly-based detection. These methodologies can

be combined in a hybrid approach to balance each other’s weaknesses [8].

10

3.4.1 Signature-based detection

Also known as misuse-based techniques, signature-based techniques refer to the detection of attacks

by looking for specific patterns within traffic data. The detected patterns are known as signatures,

which are then matched against a trusted database to check if they have previously appeared in any

malicious attacks. Signature-based techniques are effective for detecting previously known attack

types with high accuracy and without raising a large number of false alarms, but their efficacy is

only as good as the signature database [9]. Signature-based techniques rarely detect novel attacks

(zero-day attacks) whose signature is not already inside the database. For example, one of the most

popular NIDS, Snort, conservatively captures 8.2% of zero-day attacks [10]. Overall, this technique

is computationally fast, but not highly adaptable [11].

3.4.2 Anomaly-based detection

To solve the shortcomings of signature-based techniques in novel attacks, anomaly-based techniques

model normal network behavior and identify deviations from the norm, enabling them to detect

novel attacks whose signatures may be previously unknown. However, anomaly-based detection

techniques may suffer from false positives - normal activity not yet seen before may be incorrectly

classified as malicious. Furthermore, anomaly-based detection is a time consuming process in both

training and execution, and many implementations suffer from excessive delay during the detection

process that degrades their performance [12]. Anomaly-based detection is an area of ongoing and

active research, and is one of the areas of focus for the Malmenator project.

3.5 Packet-based and flow-based data formats

Network traffic data is formatted in one of two ways: packet-based or flow-based. Packet-based

data is captured in PCAP format and contains both metadata and payload information for each

network packet. Metadata information for each packet depends on the transport protocol, and their

differences are highlighted in Figure 3.2. There are a number of different protocols, but the most

important ones are IP, TCP, UDP, and ICMP as they constitute the majority of internet traffic

and are the core of packet-based datasets [13]. It is important to critically evaluate the impact of

various transfer protocol data types on models built using packet-based data.

Flow-based data is more compact compared to packet-based data, mainly containing metadata

about network connections. Flow-based data aggregates packets within similar properties in a time

11

time frame into one flow and discard payload information [13]. As packet-based data is more detailed

and thorough, packet-based data can be converted into flow-based data but not vice versa.

Figure 3.2: Overview of packet-based headers by protocol

Packet header formats for the IP, TCP, UDP, and ICMP transport protocols [13]. Each segment of theheader is 32 bits in lengths. Note that there may be multiple data segments in a network packet. Packetdata information can be used in payload analysis for anomaly detection, which is another branch of networkanomaly detection.

3.5.1 NetFlow

There are a number of variations of flow based network data including sflow and jflow, but the

most widely used format of network flow data is called NetFlow. NetFlow was created by Cisco

12

and defines a flow as a set of packets that have a common combination of key-fields in the packet.

This includes information such as source and destination IP addresses and port numbers, protocol

type, signature byte and logical interface. A packet is sorted into a flow record if it matches the

combination of key-fields listed above [14].

There are a number of different version of NetFlow, the most widely used being V4, V6, V9, and

V10. At the time of writing, V10 is still relatively new the market and the most widely implemented

is V9, which we will be using throughout the remainder of this project. The main differentiator

between versions is the underlying methodology in which the packets are gathered and sorted - not

in the flows themselves. Thus, research done on V9 should be widely applicable to NetFlow data

of both newer and older versions as the high level concept behind NetFlow data remains unchanged

despite underlying engineering changes [14].

3.6 Nfdump

Nfdump is a library contained a suite of netflow collector tools which includes nfcapd, nfdump,

nfreplay, and others. However the entire suite of tools is often referred to simply as Nfdump. These

tools are open source and can be configured from the command line to capture (via nfcapd) and

export (via nfdump) netflow data. Importantly, netflow captured and stored by this suite of tools

can be rebroadcast to a different port or machine using nfreplay. This suite of open source tools is

integral to the data pipeline of Malmenator.

3.7 Summary

In this chapter, the concept of NIDS and how they are used to detect and protect against malicious

network data was introduced. Furthermore, inline and passive NIDS implementation techniques

were covered, and a high level overview of their detection techniques, namely signature and anomaly

based techniques, was also provided. Additionally, nfdump, a leading open source netflow library

that Malmenator is built upon, was introduced. Building upon this knowledge, the following chapter

will explore more advanced information and recent research pertaining to Malmenator.

13

Chapter 4

Previous works

4.1 Overview

This chapter provides an in depth look into the context of the Malmenator project through a review

of relevant literature. It first looks at research in the areas of network traffic dataset generation

before proceeding with various methodologies in network anomaly detection.

4.2 Network traffic datasets

One of the major research challenges in training and evaluating an NIDS is the unavailability of

a single comprehensive network traffic dataset which that reflects modern network traffic scenarios

[8]. Since the effectiveness of an NIDS is evaluated based on its performance against data set that

contains normal and abnormal behaviors, it is critically important to use a combination of datasets

that feature different attack behaviours to serve as a metric for evaluation.

4.2.1 Methods for network traffic dataset generation

Testing NIDSs against realistic data has always been a challenge in the network security field as

such data is sensitive and is not publicly available [15]. To date there are three main methods by

which network data is generated and NIDSs are evaluated:

1. Replaying real network attack traffic from publicly available datasets

2. Generating artificial attack traffic from tools such as curl-loader

14

3. Testbed design strategies, which can be further broken down into three methods:

(a) Simulation: all entities of the network are simulated

(b) Emulation: attackers and targets are physically implemented, while the network structure

is simulated with software

(c) Physical Representation: all entities of the network are physically implemented

Each method has their own benefits and drawbacks that are more fully discussed in [15]. It important

to note that testing NIDSs is not a straightforward task, and that it is recommended to use a number

of datasets for comprehensive evaluation [13]. In the network security field, it is important to be

aware of the means by which datasets are generated as network dataset generation is a huge research

area itself.

4.2.2 Old reference datasets

The most famous and classical benchmark network datasets are the KDD CUP 99, DARPA 1998, and

DARPA 1999 datasets, were all gathered from simulated environments [16] [17] [18]. Unfortunately,

since their inception nearly two decades ago, a number of studies have shown that evaluating an

NIDS on these datasets is ineffective as they do not reflect realistic network data [19] [20] [21]

[22]. These datasets have not been able to accurately reflect the changing landscape of network

cybersecurity.

To solve these issues, the NSL-KDD was created in 2009 to enhance the KDD CUP 99 dataset by

refining the KDD CUP 99 data set and creating more sophisticated subsets [23]. Unfortunately, the

underlying network traffic of the NSL-KDD dataset dates back to two decades ago, which renders

it a poor representation of a modern low foot print attack environment [24].

4.2.3 Key datasets

4.2.3.1 CIDDS-001

The CIDDS-001 dataset was captured within an emulated environment of a small business in 2017

and is formatted as a flow-based network traffic. Normal and malicious behaviours were generated

via python scripts [25] [26].

This dataset also contains 14 features and captures over a million network flows which make

it reasonably comprehensive. This dataset has also been cited by more than 50 researchers as of

15

December 2019 which adds to its credibility. Its relatively recent release, popularly cited creation

methodology, comprehensiveness, and the flow-based format makes it a top choice for this project.

4.2.3.2 UGR 16

UGR ’16 is another public flow-based dataset which boasts 16,900 million unidirectional flows. This

dataset was captured in 2016 for 4 months and aims to capture the behaviour of the network traffic

using an internet connection provided by an internet service provider. Some attacks were explicitly

run against the system, while other attacks were later identified and labeled as such [27]. Each flow

is labelled and categorised as normal, background, or attack, and includes a variety of attacks such

as DoS attacks, botnet, and port scans.

A potential issue that might limit its use in this project comes with the fact that most of the

traffic in this dataset is labelled as ”background” which could be a normal set of traffic flow or an

attack scenario and the labelling does not differentiate between the two scenarios. This might cause

the accuracy of a model trained on this dataset to drop. However, the comprehensiveness of this

flow-based data still makes it a trustable choice for this project.

4.2.3.3 CICIDS 2017

The CICIDS 2017 dataset was also created using an emulated environment and contains network

traffic in both packet and flow-based formats. It comprises of 5 days of network traffic generated

in an emulated environment and comes in both bidirectional flow-based format and the standard

packet-based format. It also consists of more than 80 features for each flow and enriches the data

about IP addresses and cyberattacks by providing extra metadata for both of them. Normal user

behavior was executed via scripts, and the attacks have a wide range, including SSH brute force,

DDoS, heartbleed, and botnet attacks. CICIDS 2017 is publicly available [28].

A major advantage of this dataset is that it comprises of a wide variety of cyberattack scenarios

such as SSH brute force, heartbleed, botnet, DoS, web and infiltration attacks. This dataset’s

superiority in capturing the network traffic behaviour for a broad range of cyberattacks makes it a

prime choice for its utilization in this project.

4.2.4 Other relevant datasets

The following subsection briefly summarizes some other datasets that have been the subject of some

research and may be explored at a later time.

16

4.2.4.1 UNSW-NB15

The UNSW-NB15 dataset was created using the IXIA Perfect Storm tool in an emulated environment

that is available in both packet-based and flow-based formats. It contains nine different families of

attacks including backdoors and DDoS attacks [24].

4.2.4.2 MAWILab

The MAWILab repository contains real network traffic captured between two systems in the USA

and Japan. Since 2007, a 15-minute trace of packet-based data every day and labeled using anomaly

detection methods [29].

4.2.4.3 UNB ISCX 2012

The ISCX dataset was created by emulating a network environment for one week to create a dataset

containing both normal and malicious network behaviours and contains variety of attacks such as

SSH brute force and DDoS attacks [30]. The dataset exists in both a packet-based and bi-directional

flow based format.

4.2.4.4 CTU-13

The CTU-13 dataset is available in both packet-based and flow-based formats and was captured in a

university network. Since the data is captured is real data, the authors of the dataset utilized their

own anomaly detection methodologies to label the data [31].

4.2.5 General Remarks

According to the most recent survey on network datasets published earlier this year [13], the

CIDDS-001, CICIDS 2017, UNSW-NB15, and UGR16 data sets the the most ideal for uses across

generalized applications. UGR16 has a huge volume, CIDDS-001 contains detailed metadata for

deeper analysis, and CICIDS 2017 and UNSW-NB15 have large variations in attack scenarios [13].

Other datasets mentioned in this section are more appropriately utilized in specialized cases and

for certain evaluation scenarios where they may excel.

17

4.3 Machine learning in network anomaly detection

The anomaly detection problem boils down to the task of labeling a data point as being either a

normal point or an outlier (i.e. harmless or malicious network traffic in this case). The following

subsections each cover a different machine learning based approach to anomaly-based detection

that this project will explore and implement. As for signature-based techniques, the areas for

Figure 4.1: Taxonomy of network anomaly detection methods

Categorization of the four main approaches to network anomaly detection as shown in [32]. Note that thislist is by no means exhaustive. Research is being done on the applications of new algorithms in anomalydetection such as deep learning, genetic algorithms, and artificial immune systems.

improvement generally rely on having more comprehensive knowledge databases or faster signature

matching algorithms, both of which are outside the scope of this project.

An overview for network anomaly detection techniques is depicted in figure 4.1. Note that an

exhaustive and in depth discussion on the various subcategories of each methodology is not discussed

in this report, but are discussed at length in [12] [8] [32] [9]. This section will center its discussion

on the highest level of the taxonomy tree, namely classification, statistical, and clustering as general

directions for exploration for the project. Models based on information theory also lie outside the

scope of the project as they do not directly fall under the umbrella of machine learning.

4.3.1 Classification methods

Classification methods are a subsection of supervised machine learning categorize data points into

classes (normal and malicious) by building a baseline profile based on normal network traffic

18

features in a set of labeled training data. There is a large variety of classification methods, but the

most popular for network anomaly detection are support vector machines (SVMs), K-nearest

neighbors (KNNs), and artificial neural networks (ANNs) [33]. SVMs are of special note because

this classification technique has been widely used to much success not only in network anomaly

detection, but in the machine learning field as a whole [34] [35].

Other classification techniques, including fuzzy logic, regression models, and decision trees have

also been explored to some success [36] [37] [9]. However, classification techniques consume

comparatively more resources than their statistical counterparts, which cause practical

implementation challenges for their usage in NIDPSs. Furthermore, classification techniques are

also hindered by the reliance on successfully modeling baseline profiles for standard network traffic,

and is heavily influenced by the datasets they can be trained on. To date, a large number

classification methods have been evaluated using outdated datasets such as KDD CUP 99

(described in 4.2.2), and their performance would be worse when evaluated on newer datasets [12].

4.3.2 Statistical methods

Statistical methods fit a statistical model to the dataset and apply an inference test to determine

whether a new data point belongs to the normal dataset or if it is an anomaly. Statistical models

tend to be more straightforward to compute and test compared to classification models since they

are closely linked to statistical tests. For example, a successfully implemented statistical model

for network anomaly detection was based on a distance measure from the norm derived from the

chi-square test statistic [38]. Since statistical methods are relatively lightweight, they can be used

to great success in hybrid network anomaly detection methods on real-time traffic [39].

4.3.3 Clustering methods

Clustering methods are unsupervised machine learning methods which do not require prelabeled

data. Its aim is to discover patterns within data and extract rules for assigning data points to

groups based on characteristics such as distance or probability measure. Outliers or sparse clusters

are then retrospectively identified and labeled as anomalous. Clustering methods generally rely on

three key assumptions: (1) normal data falls into clusters while attacks do not, (2) normal data

clusters are closer to the cluster centroid, while attack clusters are far away, and (3) normal data

forms large and dense clusters, while attack data forms small and sparse ones [32] [12].

Clustering techniques are advantageous because they do not require prelabeled data. Since

19

there is no single comprehensive and correctly labeled benchmark dataset, clustering techniques

can proceed where traditional classification or statistical techniques may fail. Furthermore,

clustering techniques decrease computational complexity and generally have better performance

than classification methods [32]. On the other hand, clustering falls short in their high false

positive rate and their inability to detect attacks that conceal themselves in a normal cluster [9].

4.3.4 Constraints of machine learning

Despite the huge variety of network anomaly detection techniques being researched, all machine

learning methodologies must keep several key points in mind in order for to be functional in a real

world implementation. According to one of the world’s leading cybersecurity firms, Kaspersky Labs,

network anomaly detection models must: (1) be interpretable, (2) have relatively few false positives,

(3) adapt to counteractions, and (4) be trained on a large and comprehensive dataset [40]. These

key points are touted not only in industry, but also in academic research environments [41].

4.4 Summary

This section first explored methods for formatting and creating network data before proceeding

to highlight several recent and comprehensive datasets. Lastly, this section discussed classification,

statistical, and clustering techniques for applying machine learning to the network anomaly detection

problem as well as their constraints. With ample background knowledge of the topic and relevant

research, the next section will detail how Malmenator builds upon state-of-the-art technology to

create a viable product.

20

Chapter 5

Methodology

5.1 Overview

This chapter covers in detail the steps required to complete the project. The project consists of

two phases over two corresponding semesters. The first phase focuses on completing the basic NIDS

hardware implementation, processing selected datasets, and implementing a basic web interface

for reporting. The second phase focuses on network anomaly detection, real-time network traffic

evaluation and web interface completion to unify both phases into a single web platform.

5.2 Network scanner

This section describes the process of building a NIDS hardware device on Raspberry Pi to monitor

network traffic. This will be completed in phase one of the project.

5.2.1 Raspberry Pi hardware

Raspberry Pis are low cost computers that can be customized for a variety of purposes such as web

servers, personal computers, or in our case a hybrid WiFi Router and NIDS device. The recent

release of the Raspberry Pi 4 Model B expanded its processing power from 1GB to 4GB of RAM,

making it an ideal tool to use to build Malmenator. Raspberry Pis can run a variety of open source

operating systems, but Raspbian OS’s latest September 2019 distribution made it the ideal choice

to use in Malmenator for stability and compatibility reasons [42].

21

5.2.2 Hardware setup

The Raspberry Pi was used as illustrated in Figure 5.1, where it acted as a NIDS integrated into a

router, similar to the inline NIDS implementation described in section 3.3. The NIDS device was

connected to the internet through a wired Ethernet connection into the HKU local area network

(LAN). From there, the device was configured to act as a wireless router by broadcasting a wireless

local area network (WLAN) and routing all connections through the Ethernet port by implementing

the iptables library for Linux and intensively modifying a number of network configuration files.

To broadcast a wireless network, an external wireless USB adapter was originally utilized before

discovering that Raspberry Pi’s inbuilt wireless card could similarly be utilized to accomplish the

same task. Further customization was performed to ensure that the Raspberry Pi’s internal internet

usage was not impacted and that multiple devices could connect to the wireless network without

interference. By accomplishing this setup, all network traffic sent and received from devices on the

wireless network can be analyzed by the NIDS.

Figure 5.1: Network setup with the Malmenator network scanner

Depiction of the utilization of the Raspberry Pi with integrated NIDS features. The Raspberry Pi is directlyconnected to the internet through an ethernet channel and functions as a WiFi Router.

5.2.3 Utilizing nfdump

Nfdump tools (described in section 3.6), was installed on the Raspberry Pi as the foundation NIDS

upon which the rest of the project will be built. Preliminary trials on light network traffic revealed

that nfcapd was able to capture 100% of traffic that went through the Raspberry Pi.

Data captured by nfcapd and exported by nfdump will be periodically exported into an instance

of Elasticsearch on AWS. Malmenator’s future works in network anomaly detection algorithms will

be implemented on top of this system and deployed for testing and evaluation.

22

5.3 Dataset selection

This section outlines the method for compiling Malmenator’s real-time network dataset for testing

and evaluating the network anomaly detection system and also for contributing to the open source

community. This part of the project will be completed in phase one, and is crucial for the overall

success of Malmenator since many existing network datasets suffer from inaccurate labelling and

poor attack diversity [12].

5.3.1 CIDDS-001

Determining a suitable training dataset for anomaly based detection was a two phase process. First,

it required looking into state of the art network traffic capturing technology. From there, it required

datasets that capture similar features to train the models on.

In accordance with the recommendations laid out in [13] and after careful deliberation and

debate with both advisor and another professor specializing in cybersecurity, we opted to leverage

the CIDDS-001 dataset. CIDDS-001 data set was formed in 2017 by capturing NetFlow data in an

emulated small business environment. It contains four weeks of unidirectional flow-based network

traffic and encompasses an external server which was attacked in the internet [25]. In contrast

to honeypots (dummy malware traps) this server was also regularly used by the clients from the

emulated environment. The CIDDS-001 data set is publicly available and contains SSH brute force,

DoS and port scan attacks as well as several attacks captured from the wild [26].

CIDDS-001 contains detailed metadata for deeper analysis, and has a number of advantages of

other datasets including high accuracy for its labeled NetFlow entries, large volumes of data collected

over a long period, and recency and relevancy. Detailed analysis and high level information about the

dataset has already been done by its creators and be found in either its technical report describing

its features [25] or its research paper describing its generation techniques [26].

5.4 Network anomaly detection

This project takes a hybrid approach to network anomaly detection using both signature-based and

anomaly-based detection methods (introduced in section 3.4). This portion of the project will be

completed in phase two only after the selection and processing of datasets have been completed.

23

5.4.1 Hybrid network anomaly detection approach

For the signature-based portion, the most effective open source rules will be retained and utilized in

Malmenator such as utilizing a list of known untrusted IPs or geolocation. Much of the work in this

section will arise from proposing and evaluating a novel ensemble of algorithms for evaluating network

traffic data among those referred to in section 4.3 on anomaly-based methods. Work will begin by

implementing standard versions of various statistical, classification, and clustering techniques with

low training times on the dataset to see which methods have the best immediate results. From

there, further research will be done on the more promising techniques. The selection of the final

machine learning approach will be subject to further experiments and will be primarily evaluated

on interpretability, real-time processing speeds, accuracy, and a low false positive rate as justified in

section 4.3.4.

5.4.2 Model evaluation

The most critical aspect of creating an anomaly detection model is its evaluation - this is doubly

true for the cybersecurity space for the reasons listed in the introduction. Thus, Malmenator has a

3-fold approach to ensure its network anomaly detection model is competitive and reliable.

1. Comparing to prior CIDDS-001 research

2. Comparing to commercial netflow analyzers such as ManageEngine and SolarWinds

3. Comparing our models performance on other NetFlow datasets as mentioned in 4.2.4 to

research done on those datasets

Having a 3-fold approach to model evaluation allows Malmenator to ensure consistency and

quality in its performance.

5.5 Web interface

This section gives an overview of the implementation of Malmenator’s web interface for interacting

with the NIDS. The web interface will be continuously refined throughout the course of the project.

24

5.5.1 Dashboard

The main functionality of the web interface will be to monitor and control aspects of the Raspberry

Pi NIDS. This includes features such as viewing the different devices connected to the network as

well as seeing live updates and data analytics on the traffic flow. The network data analytics include

network anomaly alerts, IP address origins, and network speeds.

5.5.2 Web architecture

The web interface is centered around the Elasticsearch Logstash Kibana (ELK) stack, three open

source tools that drive thousand of data analytics projects across the world. Elasticsearch is a

powerful search and analytics engine. Logstash is a serverside data processing pipeline for ingesting

data from multiple sources to send to Elasticsearch, and Kibana is used to visualize data with charts

and graphs [43].

Figure 5.2: Malmenator web architecture

The overall web architecture. The blue lines denote standard internet traffic generated by users, while theyellow line indicates the NetFlow log messages that are sent to the cloud server. Note that the networkanomaly detection model may be deployed on the cloud or within the Raspberry Pi depending on processingspeeds.

25

The ELK-powered web application will be hosted on Microsoft Azure Cloud Services as illustrated

in Figure 5.2. Elasticsearch will be hosted on the cloud, and the NIDS will feed information into the

Elasticsearch instance via Logstash. Kibana will then be responsible for monitoring and displaying

data to the end users. This enables us to use a modular approach in constructing the web interface

which has advantages based on its agile software development methodology

5.6 Summary

This chapter first the detailed roadmap to the Malemenator project and began by covering the

hardware NIDS setup using nfdump and Raspberry Pi. The chapter proceeded to detail dataset

selection and network anomaly detection approaches before finally providing an overview of the web

architecture behind the web interface to control the NIDS. Armed with a roadmap to the project, the

next and final section will discuss and evaluate Malmenator’s overarching progress and challenges.

26

Chapter 6

Discussion and remarks

6.1 Overview

This chapter focuses on the current status and planned timeline of the project as well as self-

evaluation regarding Malmenator’s progress.

6.2 Current Status

The project since its inception in August 2019 has been consistently making progress towards

achieving its goals. Based on the methodology, there are 4 key components in this research-based

project. These are Network Scanner, Dataset Curation, Hybrid Network Anomaly Detection

Model and Web Architecture. This section describes the current progress of the project and

objectively evaluates the current standing of the project in terms of its success in delivering the

key components.

The project has finished working on all aspects of the Network Scanner. The Raspberry Pi

hardware has been successfully configured so that it can connect to any router and broadcast a Wifi

network. Moreover, nfdump has been installed and configured on the Raspberry Pi hardware and

it is able to scan network packets flowing through the system. This part of the project was finished

in October 2019.

During November 2019, dataset curation has also been completed following our extensive research

on public network flow datasets. This includes selecting public datasets as described for which the

CIDDS-001, UGR’16 and CICIDS-2017 datasets have already been extracted and analyzed. Further

27

work on analyzing and extracting more public datasets will be carried in accordance with the results

obtained and the time constraints.

Starting in December 2019 the work for the Hybrid Network Anomaly Detection Model has been

started and is currently in progress. The team is currently trying to build the initial version for

an anomaly detection model based on the analyzed CIDDS-001 public dataset. This also involves

trying to train and test the anomaly detection model on the CIDDS-001 dataset using several basic

classification-based methods. Further testing with other types of anomaly detection methods is also

simultaneously being done by the team.

As of February 2019, our progress seems to be consistent with our planned project timetable

provided in table 6.2 and we have been able to complete the deliverables due on September 29,

October 10 and December 10 successfully. Based on our progress we are hopeful about meeting

the deadlines for the remaining deliverables and achieving the goals set out for this research-based

project. A summary of our current progress based on the 4 key components of the project detailed

in section 5 can be found in the following table.

Table 6.1: Progress Evaluation

ComponentName

Description Status

Network Scanner Hardware and software for logging network traffic flow Finished

Dataset Curation Researching, extracting and analyzing public networktraffic datasets for model building

Finished

Network AnomalyDetection Model

Building, training and testing the model with differentmachine learning methods

Ongoing

Web Architecture Setup Elasticsearch, kibana and the data pipeline fromnetwork scanner to elasticsearch

Ongoing

6.3 Discussion of Findings

The Wifi network broadcasted by the Raspberry Pi has been tested by connecting several mobile

devices and personal computers which were able to connect using the Pi’s Wifi network. This has

reduced a major risk in terms of feasibility of the project because the new model of Raspberry Pi

did not have any resources describing the setup for turning the Raspberry Pi into a router that can

broadcast a network along with an internet connection successfully and work with multiple devices

28

simultaneously.

Another aspect finished includes a running nfdump on the Raspberry Pi. We found that

nfdump functions very efficiently without causing any lags on the Raspberry Pi’s normal

functioning. Moreover, it was able to capture 100% of data packets that passed through the

Raspberry Pi’s network during a 10-minute test run.

This result confirms the expected behaviours because no alerts should be generated on a malware

and cyberattack free network as even the connected devices on the network did not have any malware

during the test run. It was also observed that nfdump can log each network packet with 7 key

parameters such as the packet size, receiver and sender’s IP addresses, receiver and sender’s port,

packet transfer protocol and the network packet’s content in byte format.

Moreover, exploration and analysis of the CIDDS-001 public data resulted in confirmation that

the 7 parameters captured by nfdump for each data packet are also available in the public dataset

and this makes it possible for us to successfully train the network anomaly detection model on

the CIDDS-001 dataset while evaluating and running the model on the network scanner using the

parameters logged by nfdump.

This enables our project to analyze all the network packets flowing in and out of the system in

approximately real-time. Besides, the ability to capture the parameters in the curated datasets from

the Network Scanner is significantly helping with the development of the hybrid network anomaly

detection model, one of the key goals of Malmenator.

6.3.1 Future Works

While the current progress is promising, there are still a wide array of tasks that are yet to be fulfilled

for a satisfactory success of this project. These include working on further analyzing the UGR16

and CICIDS-2017 public datasets to incorporate them in building the network anomaly detection

model, besides the work done on CIDDS-001.

This further curation of data will enhance the comprehensiveness of the existing curated datasets

and support better training and evaluation of the performance of the hybrid network anomaly

detection model.

Following that, another task is to keep working on the hybrid network anomaly detection model

which will require significant time and experimentation with different machine learning algorithms.

The experimentation is required to discover which set of algorithms fit the curated datasets

effectively.

29

Moreover, the experimentation of the accuracy of the model with different algorithms will also

help in creating a more comprehensive evaluation matrix to compare and contrast our methods with

other research on network anomaly detection model.

And finally, we need to continue developing our web infrastructure which will act as the end

deliverable for the users of the research project to interact with the project and perform operations

on the project using a web-based dashboard.

6.4 Preliminary anomaly detection model results

The initial anomaly detection model was made by utilising two approaches based on the CIDDS-001

dataset as described in section 5.3.1.

Approach 1: An initial application involved the usage of a one class SVM model utilising

numeric and categorical fields from the netflow data with features including netflow duration, source

port, Destination Port, Number of Packets, Number of Flows and the labels include a single boolean

information on whether the flow is anomalous or not. For this initial approach the model considers

the attacker and victim labelled flow as anomalous and the normal labelled flow as non-anomalous.

In this way, out of the 8.45 million flows, 7.01 million were anomalous and 1.44 million were non-

anomalous. The One Class SVM model trained on this dataset produced fair results on testing data

with a false positive rate of 10.88%. Correctly classifying almost 89.12% of network flows.

Approach 2: This involved the use of KMeans unsupervised learning on the network flow data

as a time series. The KMeans model utilizes the date first seen feature in addition to the features

utilized by the one class SVM approach. Adding this feature makes it a time series anomaly detection

problem on which the KMeans model was trained. The results on training data only involved 0.637%

of incorrect prediction and a near 99% correct prediction rate. However, as the predictions were

done on the training data this might not be the precise indication of the model’s actual performance

which is susceptible to over-fitting. This measurement will be corrected in the future versions of the

model.

6.5 Project timeline

As mentioned in 5.1, Malmenator consists of two phases. In phase one, we aim to setup the NIDS

hardware and generate a comprehensive dataset for evaluation. General statistical analysis on the

datasets and the creation of a basic web interface for Malmenator will also be completed. In phase

30

two, we aim to train, evaluate and deploy a hybrid network anomaly detection technique using

customized hardware.

Malmenator’s project schedule can be found in table 6.2. Phase one will be completed by the

end of December and phase two will be completed by the end of March.

Table 6.2: Planned project timetable

Date Description Deliverable

Aug 22 Initial meeting with supervisor

Sep 20 Second meeting with supervisor

Sep 29 Project proposal and website Project plan and website

Oct 31 Setup nfdump and Raspberry Pi router Hardware setup completed

Nov 30 Dataset compilation Network dataset processed

Dec 31 Web UI hosted on the cloud ELK established

Jan 24 First presentation

Feb 2 Preliminary implementation and interim report Interim report

Feb 28 Network classification model Fully trained classifier

Mar 30 Unified network scanner, classifier, and web UI Functional product

April 1 Project in production

Apr 19 Finalized implementation and final report Final Report

Apr 20-24 Final presentation

May 5 Project exhibition

June 3 Project competition

6.6 Engineering Risk mitigation

There are two main approaches Malmenator uses to reduce the risk of wasting time or failing to

attain desirable results. First, our team is focused on coding the skeleton framework for the NIDS

and web interface first without spending time on the details that may be refined later. This can be

seen in the deliverables for phase one of Malemenator. Second, our team spends a large amount of

time researching previous related works to ensure that our research direction continues to remain

promising. By taking this approach, our team does not waste time coding technology that will not

be implemented and enables our knowledge base to grow at a steady pace without rushing to acquire

31

experimental results.

6.7 Challenges

6.7.1 Steep learning curve

There have been several major challenges posed in the way of the Malmenator project, the first and

foremost being our team’s lack of prior knowledge in the cybersecurity field. This directly resulted in

us setting unrealistic expectations for our project scope which required us to reevaluate our decision

and narrow our scope several times. This has resulted in us modifying our initial methodology

several times to more accurately reflect feasibility and utility based on our research.

6.7.2 Scope identification

Cybersecurity is a huge domain in itself, entailing specialisations from malware analysis to

penetration testing. This made it harder to limit the scope of the project as the team started with

a broad set of goals including cleaning of malware from a network as well as malware file

classification. A reason for this can be attributed to the lack of prior knowledge of the team in the

field of cybersecurity. The team was eventually able to identify key problems and the need for

improvement as well as innovation in the niche of network security and solved this challenge by

limiting the scope of research to this field by working on identifying anomalies/cyberattacks on a

network level.

6.7.3 Proprietary information

A second challenge comes from the inherent nature of the cybersecurity field. Unlike other fields

such as computer vision or natural language processing, there are huge barriers in navigating and

researching the field due to a lack of transparency and a general obfuscation of materials. For

example, a universally used benchmark dataset does not exist due to the privacy and legal concerns

over data sharing (in contrast with ImageNet for image recognition problems) as well as the ever

changing nature of cyberattacks. Furthermore, cybersecurity is a field that is prone to industry

research being far ahead of academic research since academic research is accessible by malicious

attackers. Lastly, experimenting with live viruses and malware pose serious safety hazards that

require additional safety precautions, which further slows progress.

32

Conclusion

Malmenator is research based project that aims to deliver a powerful and adaptable network security

tool to individuals and small organizations. By exploring powerful and comprehensive anomaly

detection techniques from multiple approaches, our project thoroughly analyzes network traffic data

for suspicious and unwanted activity. Malmenator’s research component will shed new light on the

efficacy of various datasets in evaluating NIDSs and propose novel approaches to detect network

anomalies. To date, the NIDS hardware has already been implemented on a heavily configured

Raspberry Pi using a modular open source NIDS software, nfdump. Furthermore, research into the

appropriate dataset for model creation has been completed, leading us to leverage the CIDDS-001

dataset for building and evaluating a hybrid technique for network anomaly detection. Over the

next month, our team will explore a variety of machine learning models on the dataset to determine

which class of techniques work most effectively.

33

Bibliography

[1] The White House, “The cost of malicious cyber activity to the u.s. economy,” tech. rep., The

Council of Economic Advisors, Feb 2018.

[2] Forgerock, “U.s. consumer data breach report 2019: Personally identifiable information targeted

in breaches that impact billions of records,” tech. rep., Forgerock, 2019.

[3] The Internet Society, “2018 cyber incident and breach trends report,” tech. rep., The Internet

Society, Jul 2019.

[4] A. Asen, W. Bohmayr, S. Deutscher, M. Gonzalez, and D. Mkrtchian, “Are you spending

enough on cybersecurity?,” tech. rep., Boston Consulting Group, Feb 2019.

[5] International Data Corporation, “Worldwide semiannual security spending guide,” tech. rep.,

International Data Corporation, Mar 2019.

[6] The Royal Society, “Progress and research in cybersecurity: Supporting a resilient and

trustworthy system for the uk,” tech. rep., The Royal Society, Jul 2016.

[7] K. A. Scarfone and P. M. Mell, “Guide to intrusion detection and prevention systems (idps),”

tech. rep., National Institute of Standards and Technology, U.S. Department of Commerce, Feb

2007.

[8] G. Fernandes, J. J. P. C. Rodrigues, L. F. Carvalho, J. F. Al-Muhtadi, and M. L. Proena,

“A comprehensive survey on network anomaly detection,” Telecommunication Systems, vol. 70,

p. 447489, Jul 2018.

[9] J. K. K. M. H. Bhuyan, D. K. Bhattacharyya, “Network anomaly detection: Methods, systems

and tools,” IEEE Communications Surveys Tutorials, vol. 16, pp. 303–336, Jan 2014.

34

[10] H. Holm, “Signature based intrusion detection for zero-day attacks: (not) a closed chapter?,”

2014 47th Hawaii International Conference on System Sciences, Jan 2014.

[11] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A

comprehensive review,” Journal of Network and Computer Applications, vol. 36, p. 1624, Jan

2013.

[12] N. Moustafa, J. Hu, and J. Slay, “A holistic review of network anomaly detection systems: A

comprehensive survey,” Journal of Network and Computer Applications, vol. 128, p. 3355, Feb

2019.

[13] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, “A survey of network-based

intrusion detection data sets,” Computers & Security, vol. 86, p. 147167, 2019.

[14] Cisco, “Netflow version 9 flow-record format,” tech. rep., Cisco, May 2011.

[15] I. Sharafaldin, A. Gharib, A. H. Lashkari, and A. A. Ghorbani, “Towards a reliable intrusion

detection benchmark dataset,” Software Networking, vol. 2017, p. 177200, Jul 2017.

[16] D. Dua and C. Graff, “UCI machine learning repository,” 2017.

[17] R. Lippmann, D. Fried, I. Graf, J. Haines, K. Kendall, D. Mcclung, D. Weber, S. Webster,

D. Wyschogrod, R. Cunningham, and et al., “Evaluating intrusion detection systems: the 1998

darpa off-line intrusion detection evaluation,” Proceedings DARPA Information Survivability

Conference and Exposition, Jan 2000.

[18] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 darpa off-line

intrusion detection evaluation,” Computer Networks, vol. 34, no. 4, p. 579595, 2000.

[19] P. Gogoi, M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Packet and flow based network

intrusion dataset,” Communications in Computer and Information Science Contemporary

Computing, p. 322334, 2012.

[20] M. V. Mahoney and P. K. Chan, “An analysis of the 1999 darpa/lincoln laboratory evaluation

data for network anomaly detection,” Recent Advances in Intrusion Detection, p. 220237, 2003.

[21] J. Mchugh, “Testing intrusion detection systems: a critique of the 1998 and 1999 darpa

intrusion detection system evaluations as performed by lincoln laboratory,” ACM Transactions

on Information and System Security, vol. 3, p. 262294, Nov 2000.

35

[22] A. Vasudevan, E. Harshini, and S. Selvakumar, “Ssenet-2011: A network intrusion detection

system dataset and its comparison with kdd cup 99 dataset,” 2011 Second Asian Himalayas

International Conference on Internet (AH-ICI), 2011.

[23] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the kdd

cup 99 data set,” IEEE Symposium on Computational Intelligence for Security and Defense

Applications, 2009.

[24] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection

systems (unsw-nb15 network data set),” Military Communications and Information Systems

Conference (MilCIS), Nov 2015.

[25] M. Ring, S. Wunderlich, D. Grdl, D. Landes, and A. Hotho, “Flow-based benchmark data sets

for intrusion detection,” in Proceedings of the 16th European Conference on Cyber Warfare and

Security (ECCWS), pp. 361–369, ACPI, 2017.

[26] M. Ring, S. Wunderlich, D. Grdl, D. Landes, and A. Hotho, “Creation of flow-based data sets

for intrusion detection,” Journal of Information Warfare, vol. 16, pp. 40–53, 2017.

[27] G. Maci-Fernndez, J. Camacho, R. Magn-Carrin, P. Garca-Teodoro, and R. Thern, “Ugr16: A

new dataset for the evaluation of cyclostationarity-based network idss,” Computers & Security,

vol. 73, p. 411424, 2018.

[28] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion

detection dataset and intrusion traffic characterization,” Proceedings of the 4th International

Conference on Information Systems Security and Privacy, Jan 2018.

[29] R. Fontugne, P. Borgnat, P. Abry, and K. Fukuda, “Mawilab,” Proceedings of the 2018

Workshop on Big Data Analytics and Machine Learning for Data Communication Networks,

2010.

[30] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic

approach to generate benchmark datasets for intrusion detection,” Computers & Security,

vol. 31, p. 357374, May 2012.

[31] S. Garca, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of botnet detection

methods,” Computers & Security, vol. 45, p. 100123, Sep 2014.

36

[32] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly detection techniques,”

Journal of Network and Computer Applications, vol. 60, p. 1931, Jan 2016.

[33] S. Dua and X. Du, Data Mining and Machine Learning in Cybersecurity. Auerbach Publications,

2016.

[34] C. Wagner, J. Franois, R. State, and T. Engel, “Machine learning approach for ip-flow record

anomaly detection,” International Conference on Research in Networking, p. 2839, 2011.

[35] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion detection system using a

filter-based feature selection algorithm,” IEEE Transactions on Computers, vol. 65, p. 29862998,

Oct 2016.

[36] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection,” ACM Computing Surveys,

vol. 41, p. 158, Sep 2009.

[37] I. Corona, G. Giacinto, and F. Roli, “Adversarial attacks against intrusion detection systems:

Taxonomy, solutions and open issues,” Information Sciences, vol. 239, p. 201225, Aug 2013.

[38] N. Ye and Q. Chen, “An anomaly detection technique based on a chi-square statistic

for detecting intrusions into information systems,” Quality and Reliability Engineering

International, vol. 17, no. 2, p. 105112, 2001.

[39] C. Manikopoulos and S. Papavassiliou, “Network intrusion and fault detection: a statistical

anomaly approach,” IEEE Communications Magazine, vol. 40, pp. 76–82, Oct 2002.

[40] Kaspersky Lab, “Machine learning in malware detection,” tech. rep., Kaspersky, 2019.

[41] S. Saad, W. Briguglio, and H. Elmiligi, “The curious case of machine learning in malware

detection,” Proceedings of the 5th International Conference on Information Systems Security

and Privacy, Feb 2019.

[42] Raspberry Pi Foundation, “Raspberry pi foundation,” Sep 2019.

[43] C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide. O’Reilly Media, Inc., 1st ed.,

2015.

37

AbstractAcknowledgementsList of FiguresList of TablesAbbreviationsIntroductionMotivationProblem formulationContributionsReport organization

A Primer on CybersecurityWhat are Cyberattacks?Economic Impacts of MalwareCategories of Cybersecurity SolutionsEthics of Malware ResearchChallenges of Malware Research

Project backgroundOverviewWhat is an nids

nids vs nipsnids implementation strategiesOverview of nids techniquesSignature-based detectionAnomaly-based detection

Packet-based and flow-based data formatsNetFlow

NfdumpSummary

Previous worksOverviewNetwork traffic datasetsMethods for network traffic dataset generation Old reference datasetsKey datasetsCIDDS-001UGR â•Ÿ16CICIDS 2017

Other relevant datasetsUNSW-NB15MAWILabUNB ISCX 2012CTU-13

General Remarks

Machine learning in network anomaly detectionClassification methodsStatistical methodsClustering methodsConstraints of machine learning

Summary

MethodologyOverviewNetwork scannerRaspberry Pi hardwareHardware setupUtilizing nfdump

Dataset selectionCIDDS-001

Network anomaly detectionHybrid network anomaly detection approachModel evaluation

Web interfaceDashboardWeb architecture

Summary

Discussion and remarksOverviewCurrent StatusDiscussion of FindingsFuture Works

Preliminary anomaly detection model resultsProject timelineEngineering Risk mitigationChallengesSteep learning curveScope identificationProprietary information

ConclusionBibliography

network based intelligent malware detection...malmenator network based intelligent malware detection...

Documents