network based intelligent malware detection...malmenator network based intelligent malware detection...
TRANSCRIPT
-
Malmenator
Network Based Intelligent Malware Detection
Final Year Project
Interim Report
February 2, 2020
David B. Han (3035344211)Piyush Jha (3035342691)
Supervisor: Dr. Dirk SchniedersUniversity of Hong Kong
github.com/piy0999/Malmenator
github.com/piy0999/Malmenator
-
Abstract
Strong security measures in the digital world are becoming increasingly important as indicated
by the meteoric rise of spending on cybersecurity as well as the increased losses from successful
cyberattacks. One important research area in cybersecurity is network anomaly detection, which is
implemented in network-based intrusion detection systems (NIDSs) to identify and stop malicious
network activity before it causes damage. Unfortunately, most of the open source NIDSs utilize
a purely rules-based approach to detect anomalies, which render them vulnerable to new exploits.
On the other hand, many novel machine learning approaches have sub optimal performance due
to challenges in their lack of interpretability as well as in the absence of large and representative
datasets for training. Malmenator takes a two step approach to create a stronger anomaly detection
engine. It first evaluates publicly available network traffic datasets and compiles a comprehensive
dataset for evaluation. Malmenator then proceeds to implement a new hybrid approach that utilizes
both machine learning and rules-based methods to create a custom NIDS. All of these features are
then combined into a portable hardware component that functions both as an NIDS and a WiFi
router. Malmenator has already implemented a basic NIDS on a WiFi router constructed out of a
Raspberry Pi and has also researched and selected a comprehensive dataset, CIDDS-001, for building
the anomaly detection model. Malmenator has just begun researching different methodologies for
implementing the network anomaly detection model and is making decent progress in accordance
with our projected timeline.
i
-
Acknowledgements
We would like to thank our supervisor, Dr. Dirk Schnieders, for his support and advice regarding the
usage of machine learning in our project. His input allowed us to appropriately define our project
scope and focus our research in a direction with the most impact. Furthermore, we would like to
extend a special token of gratitude to professors from the Centre of Applied English Studies at the
University of Hong Kong in helping us improve our writing skills to express the work done during
the project. Lastly, we would like to thank the Department of Computer Science at The University
of Hong Kong for reimbursing the costs incurred for buying various equipment during the project
and making the project feasible.
ii
-
Contents
Abstract i
Acknowledgements ii
List of Figures vi
List of Tables vii
Abbreviations viii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Report organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 A Primer on Cybersecurity 3
2.1 What are Cyberattacks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Economic Impacts of Malware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Categories of Cybersecurity Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Ethics of Malware Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Challenges of Malware Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Project background 8
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 What is an NIDSs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
iii
-
3.2 NIDSs vs network-based intrusion prevention systems (NIPSs) . . . . . . . . . . . . 9
3.3 NIDS implementation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Overview of NIDS techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.1 Signature-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4.2 Anomaly-based detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5 Packet-based and flow-based data formats . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5.1 NetFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Nfdump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Previous works 14
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Network traffic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.1 Methods for network traffic dataset generation . . . . . . . . . . . . . . . . . 14
4.2.2 Old reference datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.3 Key datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.3.1 CIDDS-001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2.3.2 UGR 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.3.3 CICIDS 2017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.4 Other relevant datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.4.1 UNSW-NB15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.4.2 MAWILab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.4.3 UNB ISCX 2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.4.4 CTU-13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.5 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Machine learning in network anomaly detection . . . . . . . . . . . . . . . . . . . . . 18
4.3.1 Classification methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3.2 Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.3 Clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.4 Constraints of machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
iv
-
5 Methodology 21
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Network scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.1 Raspberry Pi hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2.2 Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2.3 Utilizing nfdump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Dataset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3.1 CIDDS-001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Network anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4.1 Hybrid network anomaly detection approach . . . . . . . . . . . . . . . . . . 24
5.4.2 Model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.5 Web interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.5.1 Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5.2 Web architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Discussion and remarks 27
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.2 Current Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.3 Discussion of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.3.1 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.4 Preliminary anomaly detection model results . . . . . . . . . . . . . . . . . . . . . . 30
6.5 Project timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.6 Engineering Risk mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.7 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.7.1 Steep learning curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.7.2 Scope identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.7.3 Proprietary information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Conclusion 33
Bibliography 34
v
-
List of Figures
2.1 Estimating financial losses to cyberattacks . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 Inline and passive implementations of an NIDS . . . . . . . . . . . . . . . . . . . . . 10
3.2 Overview of packet-based headers by protocol . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Taxonomy of network anomaly detection methods . . . . . . . . . . . . . . . . . . . 18
5.1 Network setup with the Malmenator network scanner . . . . . . . . . . . . . . . . . . 22
5.2 Malmenator web architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
vi
-
List of Tables
6.1 Progress Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Planned project timetable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
vii
-
Abbreviations
ANN artificial neural network
DDoS distributed denial-of-service
ELK Elasticsearch Logstash Kibana
ICMP Internet Control Message Protocol
IDS intrusion detection system
IP Internet Protocol
KNN K-nearest neighbor
LAN local area network
NIDPS network-based intrusion detection and prevention system
NIDS network-based intrusion detection system
NIPS network-based intrusion prevention system
SSH secure shell
SVM support vector machine
TCP Transmission Control Protocol
UDP User Datagram Protocol
WLAN wireless local area network
viii
-
Chapter 1
Introduction
1.1 Motivation
Over the past decade, spending on cybersecurity has been consistently increasing year over year, yet
the losses due to cyberattacks only continue to grow [1]. With the scarcity of investment in network
security research and the lack of effective open source network anomaly detection methods, this
research project aims at empowering open source network security technology [2]. Additionally, the
Malmenator project aims to create an effective, affordable, and portable NIDS that can be adapted
to any home or small office environment. By making cybersecurity more readily accessible, we aim
to increase trust in a society that is undergoing an enormous digital transformation.
1.2 Problem formulation
From an industry perspective, network security solutions for individuals are inaccessible and
expensive. Due to the confidential nature of data breaches and network traffic, individual parties
such as governments and corporations have very little cooperation compared to other areas of
computer science research [3]. This is especially true for network cybersecurity as network traffic
data often contains highly sensitive information. Currently, most network security solutions are
tailored towards large enterprises that can afford the implementation costs, while most individuals
and small groups leave network security to their Internet Service Providers.
From an academic perspective, network anomaly detection faces two large challenges. The
foremost problem is the lack of a single widely accepted network traffic dataset. Without a
1
-
comprehensive and representative dataset, the second problem is reached - appropriately
developing, testing, and deploying anomaly-based detection techniques using standardized norms
and metrics. Malmenator aims to bridge the industry gap through a research based project, and in
doing so, target the two large issues that plague research into network security - the lack of a
comprehensive dataset for NIDS evaluation and the absence of powerful open source hybrid
detection tools.
1.3 Contributions
There are several contributions that Malmenator hopes to make upon its completion. First,
Malmenator hopes to contribute to the open source community in its development of new anomaly
detection methodologies built on Netflow data. Additionally, Malmenator aims to shed light on the
difficulties of testing NIDSs with both simulated and real datasets and provide a more
comprehensive solution to this issue. Lastly, Malmenator would like to introduce novel
combinations of malicious network detection methodologies that have real-world and commercial
implications.
1.4 Report organization
The remaining contents of this report are as follows. This report begins by providing a brief
introduction to the cybersecurity industry as a whole. It then gives an introduction to NIDSs and
NetFlow data format before proceeding to summarize previous works related to the major aspects
of the Malmenator project, namely network traffic dataset creation and network anomaly
detection. For traffic dataset creation, this report looks at the different types of network data
formats before highlighting the key dataset Malmenator will be using for model training and
evaluation in Malmenator. For network anomaly detection, this report first summarizes the
different high-level machine learning approaches that have been explored by various researchers,
and then proceeds to the project methodology and describes the steps by which Malmenator will
implement its hardware, analyze its dataset, detect anomalies, and visualize its system. Lastly, this
report concludes by discussion Malmenator’s project progress, challenges, and timeline.
2
-
Chapter 2
A Primer on Cybersecurity
This chapter offers an overview of the technology and economic conditions of various aspects of the
the cybersecurity industry. Reading this chapter useful for understanding the basics of cybersecurity
for the unfamiliar and may be skipped if one wishes to delve directly into details of the project. This
chapter first describes the nature of cyberattacks before giving an overview on various high level
aspects of cybersecurity including its economic impacts, ethics, and challenges.
2.1 What are Cyberattacks?
A cyberattack is an intentional and malicious attempt by an individual or organization to disrupt
or penetrate the information system of another individual or organization. There are many forms of
cyberattacks, some of common forms are described below. Note that the list below is non-exhaustive;
cyberattacks are constantly evolving and adapting to defensive measures as they are deployed.
Malware: Malware is a combination of the words mal icious software. Malware breaches a
network through a vulnerability or exploit, and usually occurs when a user clicks a dangerous
link or email attachment that installs the malware. Once activated, malware can perform
any of a number of actions depending on the malware type. In the case of ransomware, the
malware can block access to key files until a fee is paid. In the case of spyware, the malware
can secretly obtain and transmit information from the system. In the case of a backdoor, the
malware can allow hackers to gain remote access to the computer. There are a large number
of malware variations, and new forms are discovered every day.
3
-
Man-in-the-middle attack : Man-in-the-middle attacks occur when attackers insert themselves
into a a communication line. Thus, instead of one party communicating directly with the other
party, all communication traffic is unknowingly sent through the ”man-in-the-middle”, which
enables the attacker to monitor and steal data.
Denial-of-service attack : Denial-of-service attacks overwhelms system networks with a huge
amount of traffic to exhaust their resources and bandwidth to cause the system to be unable
to fulfill legitimate attacks. In the event that this attack is simultaneously launched from a
number of compromised devices, the attack is known as a distributed-denial-of-service (DDoS)
attack.
Zero-day exploit : Zero-day exploits occur when a network vulnerability is discovered and
announced, but before an update has been released to patch the problem. Attackers target
the disclosed vulnerability during the window of time before that patch is released.
Each form of cyberattack takes advantage of different system vulnerabilities and have a variety of
different risks. Furthermore, each form can be further subdivided into various types and families as
is evident in the malware section. In this project, we focus on forms of attack that can be detected
using network traffic patterns, which includes denial-of-service attacks, man-in-the-middle attacks,
and certain types of malware such as backdoors.
2.2 Economic Impacts of Malware
Cyberattacks from malware are an increasingly large problem, and companies and governments
are spending more year over year in order to properly protect themselves. From 2012 to 2018,
average annual cybersecurity expenditures per employee doubled from $584 USD to $1,178 [4]. In
2019, global spending on cybersecurity initiative is expected to exceed $100 billion USD [5]. These
numbers are only expected to continue to rise as the financial incentives to engage in malicious
attacks only continue to rise.
In 2018, a conservative estimate of financial losses caused by cyberattacks is at least $45 billion
USD across millions of reported attacks such as DDoS and ransomware attacks [3]. Including the
financial losses due to data breaches and the loss of more than 2 billion consumer data records, this
number rises to a staggering $654 billion USD for US corporations in 2018 alone [2]. Cyberattacks
to U.S. based financial services organizations in Q1 of 2019 alone cost more than $6.2 billion USD,
a sharp rise from just $8 million USD in Q1 of 2018.
4
-
It must be noted that these costs are general estimates as the exact loss of value can be difficult
to calculate. There is no centralized data set on the costs of cyberattacks and data breaches. Thus,
many statistics in the cybersecurity industry come from surveys, which can suffer from
non-representative and inaccurate reporting. Often, firms are reluctant to report negative
information which may cause these statistics to bias downwards due to under-reporting. The
White House published a report in 2018 containing figure 2.1 depicting the various economic
impacts of cyberattacks and their respective difficulties in quantifying cost [1]. It can be seen that
certain losses such as court fees and forensic costs are easy to justify, but damage to reputation
and loss of IP are difficult to quantify. Overall, cybersecurity breaches impact organizations across
the world in various ways and magnitudes.
Figure 2.1: Estimating financial losses to cyberattacks
2.3 Categories of Cybersecurity Solutions
Cybersecurity efforts are growing rapidly in order to respond to the quickly shifting landscape of
malware attacks. Defensive solutions can take on a variety of forms and roles including anti-virus
software, identity and access management tools, intrusion detection system (IDS), and others. The
uses of most of these follow directly from their naming convention. Anti-virus software are the bread
and butter of the cybersecurity industry and include software such as Windows Defender, Norton
Antivirus, and Avast Antivirus. These tools can scan system files and downloads to check for virus
5
-
signatures that are then matched to a trusted database. Identity management tools include solutions
like 2-factor authentication where users must confirm their identity through a verification code sent
to an email or mobile device. These solutions enable organizations to more stringently verify a user’s
online identity before granting access to sensitive information. IDSs are used to monitor network
traffic for suspicious activity and issue alerts when such activity is discovered. Appropriate actions
can then be taken such as blocking traffic from suspicious IP addresses or discarding undesirable
packets. IDSs are at the heart of the Malmenator project.
2.4 Ethics of Malware Research
In order to defend against malware attacks, researchers first need to understand how malicious
code works. Often, this requires preemptively creating malware in order to find vulnerabilities and
appropriate solutions. Thus, not all malware is created with malicious intentions. People in the
cybersecurity industry can be categorized into one of four categories: white hat, black hat, grey hat,
and red hat. The definitions of these types of hackers are listed below:
White Hat : White hat hackers are people who create malware and attempt to break into
computer systems for a good cause. These people could work for a cybersecurity firm, or could
be a professional penetration testing consultant.
Black Hat : Black hat hackers create malware for malicious causes in order to extort individuals
and corporations for personal incentives such as money or power. These people are the ones
responsible for much of the malware that cause tremendous economic losses.
Grey Hat : Grey hat hackers do a mix of white hat and black hat activities and dabble in using
their knowledge and skills for both good and bad depending on the circumstances.
Red Hat : Red hat hackers are similar to black hat hackers, except they are employed by a
government to initiate attacks on foreign powers.
The key takeaway from this section is that ethics in malware research is not always
straightforward. However, publicly published academic research is generally a white hat effort, this
paper included.
6
-
2.5 Challenges of Malware Research
Performing substantive literature review to understand the cutting edge technology in malware
research is a difficult task because there is little incentive to publicly publish anti-malware
techniques. Any research that is published is also available to attackers who can choose to exploit
other vulnerabilities in the system. Thus, the papers published by the top cybersecurity firms and
government organizations are usually either outdated or extremely abstract without precise
implementation and performance details. A cybersecurity report published by the Royal Society
highlighted cybersecurity’s distinct characteristics including having multidisciplinary, global, and
cross-sectoral interest, which cause research to take place across academic, commercial, and
government sectors that further adds to these difficulties [6]. Information sharing across these
fields is not transparent, and many corporations are hesitant to share their vulnerabilities in
academic or government research as that may impact their reputation and harm their business.
Thus, much of the recently published academic research in malware detection have challenges in
practical implementation that must be taken into account.
7
-
Chapter 3
Project background
3.1 Overview
This section provides an introduction to the concept of NIDSs. Given that cybersecurity is an
extremely niche field in Computer Science, the information in the background is critical for
understanding the relevant tools and technologies that the Malmenator project is based on. This
section begins with a high level overview of what NIDSs are before delving into the details of their
implementation and categorization.
3.1.1 What is an NIDSs
NIDSs monitor network traffic in order to detect when an unauthorized intrusion is being carried
out by hostile entities by providing some or all of the following:
• Monitoring the condition of routers, firewalls, and servers
• Providing system admins a way to tune, organize and understand relevant operating system
audit trails and other logs that are often otherwise difficult to track or parse
• Including an extensive attack signature database against which information from the system
can be matched
• Recognizing and reporting when the IDS detects that data files have been altered
• Generating an alarm and notifying that security has been breached.
8
-
NIDSs offer organizations a number of benefits, starting with the ability to identify security
incidents. NIDSs can be used to help analyze the quantity and types of attacks, and organizations
can use this information to change their security systems or implement more effective controls.
NIDSs can also help companies identify bugs or problems with their network device configurations.
These metrics can then be used to assess future risks.
NIDSs can also help the enterprise attain regulatory compliance. An IDS gives companies greater
visibility across their networks, making it easier to meet security regulations. Additionally, businesses
can use their NIDSs logs as part of the documentation to show they are meeting certain compliance
requirements.
Although NIDSs monitor networks for potentially malicious activity, they are also prone to
false alarms (false positives). Consequently, organizations need to fine-tune their NIDS products
when they first install them. That means properly configuring their intrusion detection systems to
recognize what normal traffic on their network looks like compared to potentially malicious activity.
3.2 NIDSs vs NIPSs
NIDSs are used to monitor and analyze network traffic in order to identify suspicious activity and
alert system administrators in the event of an attack. A NIPS is similar to an NIDS, but differs in that
an NIPS can also be configured to automatically block potential threats without the intervention of
a system administrator. Historically, NIDSs were tailored to process network data more thoroughly
and rapidly than NIPSs, but with the advent of increased processing power, the line between the
two has become blurred. Today, most NIDSs provide configurations to allow for their capabilities
to extend into the territory of NIPSs. Hence, the term network-based intrusion detection and
prevention system (NIDPS) was coined. network-based intrusion detection and prevention systems
(NIDPSs) are systems that combine the capabilities of NIDSs and NIPSs. Although the prevention
aspect of NIDPSs are critical is dealing with certain aspects of cybersecurity, Malmenator focuses
on NIDSs implementation to allow for a more focused project. Further research should be done to
determine and implement the optimal reaction to different forms of cyberattacks.
3.3 NIDS implementation strategies
NIDSs are deployed at strategic points within a system network where it can best capture traffic
to and from all devices on the network. This usually involves being deployed directly within or in
9
-
parallel to a router, switch, or access point in a network. Figure 3.1 depicts two implementation
Figure 3.1: Inline and passive implementations of an NIDS
Inline (left) and passive (right) implementation of a NIDS as recommended by the National Institute ofStandards and Technology, U.S. Department of Commerce [7]. In the inline implementation, all networktraffic must pass through the NIDS which may throttle network speeds. In the passive implementation, acopy of all network traffic is send to the NIDS for processing, enabling network speeds to remain high.
methods of an NIDS. Inline implementation has the benefit of being able to directly respond to
attacks by blocking network traffic. On the other hand, passive implementation must rely on a
separate tool such as a firewall to secure traffic. Malmenator utilizes inline implementations of
NIDSs since Malmenator focuses not only on detecting, but also on preventing network attacks.
Note that in Figure 3.1, the NIDS is depicted as a separate hardware component. In some inline
implementations, such as with Malmenator, the NIDS can be set up directly within the router.
3.4 Overview of NIDS techniques
This section describes the high-level methodologies that are employed in NIDSs to detect malicious
traffic, namely signature-based detection and anomaly-based detection. These methodologies can
be combined in a hybrid approach to balance each other’s weaknesses [8].
10
-
3.4.1 Signature-based detection
Also known as misuse-based techniques, signature-based techniques refer to the detection of attacks
by looking for specific patterns within traffic data. The detected patterns are known as signatures,
which are then matched against a trusted database to check if they have previously appeared in any
malicious attacks. Signature-based techniques are effective for detecting previously known attack
types with high accuracy and without raising a large number of false alarms, but their efficacy is
only as good as the signature database [9]. Signature-based techniques rarely detect novel attacks
(zero-day attacks) whose signature is not already inside the database. For example, one of the most
popular NIDS, Snort, conservatively captures 8.2% of zero-day attacks [10]. Overall, this technique
is computationally fast, but not highly adaptable [11].
3.4.2 Anomaly-based detection
To solve the shortcomings of signature-based techniques in novel attacks, anomaly-based techniques
model normal network behavior and identify deviations from the norm, enabling them to detect
novel attacks whose signatures may be previously unknown. However, anomaly-based detection
techniques may suffer from false positives - normal activity not yet seen before may be incorrectly
classified as malicious. Furthermore, anomaly-based detection is a time consuming process in both
training and execution, and many implementations suffer from excessive delay during the detection
process that degrades their performance [12]. Anomaly-based detection is an area of ongoing and
active research, and is one of the areas of focus for the Malmenator project.
3.5 Packet-based and flow-based data formats
Network traffic data is formatted in one of two ways: packet-based or flow-based. Packet-based
data is captured in PCAP format and contains both metadata and payload information for each
network packet. Metadata information for each packet depends on the transport protocol, and their
differences are highlighted in Figure 3.2. There are a number of different protocols, but the most
important ones are IP, TCP, UDP, and ICMP as they constitute the majority of internet traffic
and are the core of packet-based datasets [13]. It is important to critically evaluate the impact of
various transfer protocol data types on models built using packet-based data.
Flow-based data is more compact compared to packet-based data, mainly containing metadata
about network connections. Flow-based data aggregates packets within similar properties in a time
11
-
time frame into one flow and discard payload information [13]. As packet-based data is more detailed
and thorough, packet-based data can be converted into flow-based data but not vice versa.
Figure 3.2: Overview of packet-based headers by protocol
Packet header formats for the IP, TCP, UDP, and ICMP transport protocols [13]. Each segment of theheader is 32 bits in lengths. Note that there may be multiple data segments in a network packet. Packetdata information can be used in payload analysis for anomaly detection, which is another branch of networkanomaly detection.
3.5.1 NetFlow
There are a number of variations of flow based network data including sflow and jflow, but the
most widely used format of network flow data is called NetFlow. NetFlow was created by Cisco
12
-
and defines a flow as a set of packets that have a common combination of key-fields in the packet.
This includes information such as source and destination IP addresses and port numbers, protocol
type, signature byte and logical interface. A packet is sorted into a flow record if it matches the
combination of key-fields listed above [14].
There are a number of different version of NetFlow, the most widely used being V4, V6, V9, and
V10. At the time of writing, V10 is still relatively new the market and the most widely implemented
is V9, which we will be using throughout the remainder of this project. The main differentiator
between versions is the underlying methodology in which the packets are gathered and sorted - not
in the flows themselves. Thus, research done on V9 should be widely applicable to NetFlow data
of both newer and older versions as the high level concept behind NetFlow data remains unchanged
despite underlying engineering changes [14].
3.6 Nfdump
Nfdump is a library contained a suite of netflow collector tools which includes nfcapd, nfdump,
nfreplay, and others. However the entire suite of tools is often referred to simply as Nfdump. These
tools are open source and can be configured from the command line to capture (via nfcapd) and
export (via nfdump) netflow data. Importantly, netflow captured and stored by this suite of tools
can be rebroadcast to a different port or machine using nfreplay. This suite of open source tools is
integral to the data pipeline of Malmenator.
3.7 Summary
In this chapter, the concept of NIDS and how they are used to detect and protect against malicious
network data was introduced. Furthermore, inline and passive NIDS implementation techniques
were covered, and a high level overview of their detection techniques, namely signature and anomaly
based techniques, was also provided. Additionally, nfdump, a leading open source netflow library
that Malmenator is built upon, was introduced. Building upon this knowledge, the following chapter
will explore more advanced information and recent research pertaining to Malmenator.
13
-
Chapter 4
Previous works
4.1 Overview
This chapter provides an in depth look into the context of the Malmenator project through a review
of relevant literature. It first looks at research in the areas of network traffic dataset generation
before proceeding with various methodologies in network anomaly detection.
4.2 Network traffic datasets
One of the major research challenges in training and evaluating an NIDS is the unavailability of
a single comprehensive network traffic dataset which that reflects modern network traffic scenarios
[8]. Since the effectiveness of an NIDS is evaluated based on its performance against data set that
contains normal and abnormal behaviors, it is critically important to use a combination of datasets
that feature different attack behaviours to serve as a metric for evaluation.
4.2.1 Methods for network traffic dataset generation
Testing NIDSs against realistic data has always been a challenge in the network security field as
such data is sensitive and is not publicly available [15]. To date there are three main methods by
which network data is generated and NIDSs are evaluated:
1. Replaying real network attack traffic from publicly available datasets
2. Generating artificial attack traffic from tools such as curl-loader
14
-
3. Testbed design strategies, which can be further broken down into three methods:
(a) Simulation: all entities of the network are simulated
(b) Emulation: attackers and targets are physically implemented, while the network structure
is simulated with software
(c) Physical Representation: all entities of the network are physically implemented
Each method has their own benefits and drawbacks that are more fully discussed in [15]. It important
to note that testing NIDSs is not a straightforward task, and that it is recommended to use a number
of datasets for comprehensive evaluation [13]. In the network security field, it is important to be
aware of the means by which datasets are generated as network dataset generation is a huge research
area itself.
4.2.2 Old reference datasets
The most famous and classical benchmark network datasets are the KDD CUP 99, DARPA 1998, and
DARPA 1999 datasets, were all gathered from simulated environments [16] [17] [18]. Unfortunately,
since their inception nearly two decades ago, a number of studies have shown that evaluating an
NIDS on these datasets is ineffective as they do not reflect realistic network data [19] [20] [21]
[22]. These datasets have not been able to accurately reflect the changing landscape of network
cybersecurity.
To solve these issues, the NSL-KDD was created in 2009 to enhance the KDD CUP 99 dataset by
refining the KDD CUP 99 data set and creating more sophisticated subsets [23]. Unfortunately, the
underlying network traffic of the NSL-KDD dataset dates back to two decades ago, which renders
it a poor representation of a modern low foot print attack environment [24].
4.2.3 Key datasets
4.2.3.1 CIDDS-001
The CIDDS-001 dataset was captured within an emulated environment of a small business in 2017
and is formatted as a flow-based network traffic. Normal and malicious behaviours were generated
via python scripts [25] [26].
This dataset also contains 14 features and captures over a million network flows which make
it reasonably comprehensive. This dataset has also been cited by more than 50 researchers as of
15
-
December 2019 which adds to its credibility. Its relatively recent release, popularly cited creation
methodology, comprehensiveness, and the flow-based format makes it a top choice for this project.
4.2.3.2 UGR 16
UGR ’16 is another public flow-based dataset which boasts 16,900 million unidirectional flows. This
dataset was captured in 2016 for 4 months and aims to capture the behaviour of the network traffic
using an internet connection provided by an internet service provider. Some attacks were explicitly
run against the system, while other attacks were later identified and labeled as such [27]. Each flow
is labelled and categorised as normal, background, or attack, and includes a variety of attacks such
as DoS attacks, botnet, and port scans.
A potential issue that might limit its use in this project comes with the fact that most of the
traffic in this dataset is labelled as ”background” which could be a normal set of traffic flow or an
attack scenario and the labelling does not differentiate between the two scenarios. This might cause
the accuracy of a model trained on this dataset to drop. However, the comprehensiveness of this
flow-based data still makes it a trustable choice for this project.
4.2.3.3 CICIDS 2017
The CICIDS 2017 dataset was also created using an emulated environment and contains network
traffic in both packet and flow-based formats. It comprises of 5 days of network traffic generated
in an emulated environment and comes in both bidirectional flow-based format and the standard
packet-based format. It also consists of more than 80 features for each flow and enriches the data
about IP addresses and cyberattacks by providing extra metadata for both of them. Normal user
behavior was executed via scripts, and the attacks have a wide range, including SSH brute force,
DDoS, heartbleed, and botnet attacks. CICIDS 2017 is publicly available [28].
A major advantage of this dataset is that it comprises of a wide variety of cyberattack scenarios
such as SSH brute force, heartbleed, botnet, DoS, web and infiltration attacks. This dataset’s
superiority in capturing the network traffic behaviour for a broad range of cyberattacks makes it a
prime choice for its utilization in this project.
4.2.4 Other relevant datasets
The following subsection briefly summarizes some other datasets that have been the subject of some
research and may be explored at a later time.
16
-
4.2.4.1 UNSW-NB15
The UNSW-NB15 dataset was created using the IXIA Perfect Storm tool in an emulated environment
that is available in both packet-based and flow-based formats. It contains nine different families of
attacks including backdoors and DDoS attacks [24].
4.2.4.2 MAWILab
The MAWILab repository contains real network traffic captured between two systems in the USA
and Japan. Since 2007, a 15-minute trace of packet-based data every day and labeled using anomaly
detection methods [29].
4.2.4.3 UNB ISCX 2012
The ISCX dataset was created by emulating a network environment for one week to create a dataset
containing both normal and malicious network behaviours and contains variety of attacks such as
SSH brute force and DDoS attacks [30]. The dataset exists in both a packet-based and bi-directional
flow based format.
4.2.4.4 CTU-13
The CTU-13 dataset is available in both packet-based and flow-based formats and was captured in a
university network. Since the data is captured is real data, the authors of the dataset utilized their
own anomaly detection methodologies to label the data [31].
4.2.5 General Remarks
According to the most recent survey on network datasets published earlier this year [13], the
CIDDS-001, CICIDS 2017, UNSW-NB15, and UGR16 data sets the the most ideal for uses across
generalized applications. UGR16 has a huge volume, CIDDS-001 contains detailed metadata for
deeper analysis, and CICIDS 2017 and UNSW-NB15 have large variations in attack scenarios [13].
Other datasets mentioned in this section are more appropriately utilized in specialized cases and
for certain evaluation scenarios where they may excel.
17
-
4.3 Machine learning in network anomaly detection
The anomaly detection problem boils down to the task of labeling a data point as being either a
normal point or an outlier (i.e. harmless or malicious network traffic in this case). The following
subsections each cover a different machine learning based approach to anomaly-based detection
that this project will explore and implement. As for signature-based techniques, the areas for
Figure 4.1: Taxonomy of network anomaly detection methods
Categorization of the four main approaches to network anomaly detection as shown in [32]. Note that thislist is by no means exhaustive. Research is being done on the applications of new algorithms in anomalydetection such as deep learning, genetic algorithms, and artificial immune systems.
improvement generally rely on having more comprehensive knowledge databases or faster signature
matching algorithms, both of which are outside the scope of this project.
An overview for network anomaly detection techniques is depicted in figure 4.1. Note that an
exhaustive and in depth discussion on the various subcategories of each methodology is not discussed
in this report, but are discussed at length in [12] [8] [32] [9]. This section will center its discussion
on the highest level of the taxonomy tree, namely classification, statistical, and clustering as general
directions for exploration for the project. Models based on information theory also lie outside the
scope of the project as they do not directly fall under the umbrella of machine learning.
4.3.1 Classification methods
Classification methods are a subsection of supervised machine learning categorize data points into
classes (normal and malicious) by building a baseline profile based on normal network traffic
18
-
features in a set of labeled training data. There is a large variety of classification methods, but the
most popular for network anomaly detection are support vector machines (SVMs), K-nearest
neighbors (KNNs), and artificial neural networks (ANNs) [33]. SVMs are of special note because
this classification technique has been widely used to much success not only in network anomaly
detection, but in the machine learning field as a whole [34] [35].
Other classification techniques, including fuzzy logic, regression models, and decision trees have
also been explored to some success [36] [37] [9]. However, classification techniques consume
comparatively more resources than their statistical counterparts, which cause practical
implementation challenges for their usage in NIDPSs. Furthermore, classification techniques are
also hindered by the reliance on successfully modeling baseline profiles for standard network traffic,
and is heavily influenced by the datasets they can be trained on. To date, a large number
classification methods have been evaluated using outdated datasets such as KDD CUP 99
(described in 4.2.2), and their performance would be worse when evaluated on newer datasets [12].
4.3.2 Statistical methods
Statistical methods fit a statistical model to the dataset and apply an inference test to determine
whether a new data point belongs to the normal dataset or if it is an anomaly. Statistical models
tend to be more straightforward to compute and test compared to classification models since they
are closely linked to statistical tests. For example, a successfully implemented statistical model
for network anomaly detection was based on a distance measure from the norm derived from the
chi-square test statistic [38]. Since statistical methods are relatively lightweight, they can be used
to great success in hybrid network anomaly detection methods on real-time traffic [39].
4.3.3 Clustering methods
Clustering methods are unsupervised machine learning methods which do not require prelabeled
data. Its aim is to discover patterns within data and extract rules for assigning data points to
groups based on characteristics such as distance or probability measure. Outliers or sparse clusters
are then retrospectively identified and labeled as anomalous. Clustering methods generally rely on
three key assumptions: (1) normal data falls into clusters while attacks do not, (2) normal data
clusters are closer to the cluster centroid, while attack clusters are far away, and (3) normal data
forms large and dense clusters, while attack data forms small and sparse ones [32] [12].
Clustering techniques are advantageous because they do not require prelabeled data. Since
19
-
there is no single comprehensive and correctly labeled benchmark dataset, clustering techniques
can proceed where traditional classification or statistical techniques may fail. Furthermore,
clustering techniques decrease computational complexity and generally have better performance
than classification methods [32]. On the other hand, clustering falls short in their high false
positive rate and their inability to detect attacks that conceal themselves in a normal cluster [9].
4.3.4 Constraints of machine learning
Despite the huge variety of network anomaly detection techniques being researched, all machine
learning methodologies must keep several key points in mind in order for to be functional in a real
world implementation. According to one of the world’s leading cybersecurity firms, Kaspersky Labs,
network anomaly detection models must: (1) be interpretable, (2) have relatively few false positives,
(3) adapt to counteractions, and (4) be trained on a large and comprehensive dataset [40]. These
key points are touted not only in industry, but also in academic research environments [41].
4.4 Summary
This section first explored methods for formatting and creating network data before proceeding
to highlight several recent and comprehensive datasets. Lastly, this section discussed classification,
statistical, and clustering techniques for applying machine learning to the network anomaly detection
problem as well as their constraints. With ample background knowledge of the topic and relevant
research, the next section will detail how Malmenator builds upon state-of-the-art technology to
create a viable product.
20
-
Chapter 5
Methodology
5.1 Overview
This chapter covers in detail the steps required to complete the project. The project consists of
two phases over two corresponding semesters. The first phase focuses on completing the basic NIDS
hardware implementation, processing selected datasets, and implementing a basic web interface
for reporting. The second phase focuses on network anomaly detection, real-time network traffic
evaluation and web interface completion to unify both phases into a single web platform.
5.2 Network scanner
This section describes the process of building a NIDS hardware device on Raspberry Pi to monitor
network traffic. This will be completed in phase one of the project.
5.2.1 Raspberry Pi hardware
Raspberry Pis are low cost computers that can be customized for a variety of purposes such as web
servers, personal computers, or in our case a hybrid WiFi Router and NIDS device. The recent
release of the Raspberry Pi 4 Model B expanded its processing power from 1GB to 4GB of RAM,
making it an ideal tool to use to build Malmenator. Raspberry Pis can run a variety of open source
operating systems, but Raspbian OS’s latest September 2019 distribution made it the ideal choice
to use in Malmenator for stability and compatibility reasons [42].
21
-
5.2.2 Hardware setup
The Raspberry Pi was used as illustrated in Figure 5.1, where it acted as a NIDS integrated into a
router, similar to the inline NIDS implementation described in section 3.3. The NIDS device was
connected to the internet through a wired Ethernet connection into the HKU local area network
(LAN). From there, the device was configured to act as a wireless router by broadcasting a wireless
local area network (WLAN) and routing all connections through the Ethernet port by implementing
the iptables library for Linux and intensively modifying a number of network configuration files.
To broadcast a wireless network, an external wireless USB adapter was originally utilized before
discovering that Raspberry Pi’s inbuilt wireless card could similarly be utilized to accomplish the
same task. Further customization was performed to ensure that the Raspberry Pi’s internal internet
usage was not impacted and that multiple devices could connect to the wireless network without
interference. By accomplishing this setup, all network traffic sent and received from devices on the
wireless network can be analyzed by the NIDS.
Figure 5.1: Network setup with the Malmenator network scanner
Depiction of the utilization of the Raspberry Pi with integrated NIDS features. The Raspberry Pi is directlyconnected to the internet through an ethernet channel and functions as a WiFi Router.
5.2.3 Utilizing nfdump
Nfdump tools (described in section 3.6), was installed on the Raspberry Pi as the foundation NIDS
upon which the rest of the project will be built. Preliminary trials on light network traffic revealed
that nfcapd was able to capture 100% of traffic that went through the Raspberry Pi.
Data captured by nfcapd and exported by nfdump will be periodically exported into an instance
of Elasticsearch on AWS. Malmenator’s future works in network anomaly detection algorithms will
be implemented on top of this system and deployed for testing and evaluation.
22
-
5.3 Dataset selection
This section outlines the method for compiling Malmenator’s real-time network dataset for testing
and evaluating the network anomaly detection system and also for contributing to the open source
community. This part of the project will be completed in phase one, and is crucial for the overall
success of Malmenator since many existing network datasets suffer from inaccurate labelling and
poor attack diversity [12].
5.3.1 CIDDS-001
Determining a suitable training dataset for anomaly based detection was a two phase process. First,
it required looking into state of the art network traffic capturing technology. From there, it required
datasets that capture similar features to train the models on.
In accordance with the recommendations laid out in [13] and after careful deliberation and
debate with both advisor and another professor specializing in cybersecurity, we opted to leverage
the CIDDS-001 dataset. CIDDS-001 data set was formed in 2017 by capturing NetFlow data in an
emulated small business environment. It contains four weeks of unidirectional flow-based network
traffic and encompasses an external server which was attacked in the internet [25]. In contrast
to honeypots (dummy malware traps) this server was also regularly used by the clients from the
emulated environment. The CIDDS-001 data set is publicly available and contains SSH brute force,
DoS and port scan attacks as well as several attacks captured from the wild [26].
CIDDS-001 contains detailed metadata for deeper analysis, and has a number of advantages of
other datasets including high accuracy for its labeled NetFlow entries, large volumes of data collected
over a long period, and recency and relevancy. Detailed analysis and high level information about the
dataset has already been done by its creators and be found in either its technical report describing
its features [25] or its research paper describing its generation techniques [26].
5.4 Network anomaly detection
This project takes a hybrid approach to network anomaly detection using both signature-based and
anomaly-based detection methods (introduced in section 3.4). This portion of the project will be
completed in phase two only after the selection and processing of datasets have been completed.
23
-
5.4.1 Hybrid network anomaly detection approach
For the signature-based portion, the most effective open source rules will be retained and utilized in
Malmenator such as utilizing a list of known untrusted IPs or geolocation. Much of the work in this
section will arise from proposing and evaluating a novel ensemble of algorithms for evaluating network
traffic data among those referred to in section 4.3 on anomaly-based methods. Work will begin by
implementing standard versions of various statistical, classification, and clustering techniques with
low training times on the dataset to see which methods have the best immediate results. From
there, further research will be done on the more promising techniques. The selection of the final
machine learning approach will be subject to further experiments and will be primarily evaluated
on interpretability, real-time processing speeds, accuracy, and a low false positive rate as justified in
section 4.3.4.
5.4.2 Model evaluation
The most critical aspect of creating an anomaly detection model is its evaluation - this is doubly
true for the cybersecurity space for the reasons listed in the introduction. Thus, Malmenator has a
3-fold approach to ensure its network anomaly detection model is competitive and reliable.
1. Comparing to prior CIDDS-001 research
2. Comparing to commercial netflow analyzers such as ManageEngine and SolarWinds
3. Comparing our models performance on other NetFlow datasets as mentioned in 4.2.4 to
research done on those datasets
Having a 3-fold approach to model evaluation allows Malmenator to ensure consistency and
quality in its performance.
5.5 Web interface
This section gives an overview of the implementation of Malmenator’s web interface for interacting
with the NIDS. The web interface will be continuously refined throughout the course of the project.
24
-
5.5.1 Dashboard
The main functionality of the web interface will be to monitor and control aspects of the Raspberry
Pi NIDS. This includes features such as viewing the different devices connected to the network as
well as seeing live updates and data analytics on the traffic flow. The network data analytics include
network anomaly alerts, IP address origins, and network speeds.
5.5.2 Web architecture
The web interface is centered around the Elasticsearch Logstash Kibana (ELK) stack, three open
source tools that drive thousand of data analytics projects across the world. Elasticsearch is a
powerful search and analytics engine. Logstash is a serverside data processing pipeline for ingesting
data from multiple sources to send to Elasticsearch, and Kibana is used to visualize data with charts
and graphs [43].
Figure 5.2: Malmenator web architecture
The overall web architecture. The blue lines denote standard internet traffic generated by users, while theyellow line indicates the NetFlow log messages that are sent to the cloud server. Note that the networkanomaly detection model may be deployed on the cloud or within the Raspberry Pi depending on processingspeeds.
25
-
The ELK-powered web application will be hosted on Microsoft Azure Cloud Services as illustrated
in Figure 5.2. Elasticsearch will be hosted on the cloud, and the NIDS will feed information into the
Elasticsearch instance via Logstash. Kibana will then be responsible for monitoring and displaying
data to the end users. This enables us to use a modular approach in constructing the web interface
which has advantages based on its agile software development methodology
5.6 Summary
This chapter first the detailed roadmap to the Malemenator project and began by covering the
hardware NIDS setup using nfdump and Raspberry Pi. The chapter proceeded to detail dataset
selection and network anomaly detection approaches before finally providing an overview of the web
architecture behind the web interface to control the NIDS. Armed with a roadmap to the project, the
next and final section will discuss and evaluate Malmenator’s overarching progress and challenges.
26
-
Chapter 6
Discussion and remarks
6.1 Overview
This chapter focuses on the current status and planned timeline of the project as well as self-
evaluation regarding Malmenator’s progress.
6.2 Current Status
The project since its inception in August 2019 has been consistently making progress towards
achieving its goals. Based on the methodology, there are 4 key components in this research-based
project. These are Network Scanner, Dataset Curation, Hybrid Network Anomaly Detection
Model and Web Architecture. This section describes the current progress of the project and
objectively evaluates the current standing of the project in terms of its success in delivering the
key components.
The project has finished working on all aspects of the Network Scanner. The Raspberry Pi
hardware has been successfully configured so that it can connect to any router and broadcast a Wifi
network. Moreover, nfdump has been installed and configured on the Raspberry Pi hardware and
it is able to scan network packets flowing through the system. This part of the project was finished
in October 2019.
During November 2019, dataset curation has also been completed following our extensive research
on public network flow datasets. This includes selecting public datasets as described for which the
CIDDS-001, UGR’16 and CICIDS-2017 datasets have already been extracted and analyzed. Further
27
-
work on analyzing and extracting more public datasets will be carried in accordance with the results
obtained and the time constraints.
Starting in December 2019 the work for the Hybrid Network Anomaly Detection Model has been
started and is currently in progress. The team is currently trying to build the initial version for
an anomaly detection model based on the analyzed CIDDS-001 public dataset. This also involves
trying to train and test the anomaly detection model on the CIDDS-001 dataset using several basic
classification-based methods. Further testing with other types of anomaly detection methods is also
simultaneously being done by the team.
As of February 2019, our progress seems to be consistent with our planned project timetable
provided in table 6.2 and we have been able to complete the deliverables due on September 29,
October 10 and December 10 successfully. Based on our progress we are hopeful about meeting
the deadlines for the remaining deliverables and achieving the goals set out for this research-based
project. A summary of our current progress based on the 4 key components of the project detailed
in section 5 can be found in the following table.
Table 6.1: Progress Evaluation
ComponentName
Description Status
Network Scanner Hardware and software for logging network traffic flow Finished
Dataset Curation Researching, extracting and analyzing public networktraffic datasets for model building
Finished
Network AnomalyDetection Model
Building, training and testing the model with differentmachine learning methods
Ongoing
Web Architecture Setup Elasticsearch, kibana and the data pipeline fromnetwork scanner to elasticsearch
Ongoing
6.3 Discussion of Findings
The Wifi network broadcasted by the Raspberry Pi has been tested by connecting several mobile
devices and personal computers which were able to connect using the Pi’s Wifi network. This has
reduced a major risk in terms of feasibility of the project because the new model of Raspberry Pi
did not have any resources describing the setup for turning the Raspberry Pi into a router that can
broadcast a network along with an internet connection successfully and work with multiple devices
28
-
simultaneously.
Another aspect finished includes a running nfdump on the Raspberry Pi. We found that
nfdump functions very efficiently without causing any lags on the Raspberry Pi’s normal
functioning. Moreover, it was able to capture 100% of data packets that passed through the
Raspberry Pi’s network during a 10-minute test run.
This result confirms the expected behaviours because no alerts should be generated on a malware
and cyberattack free network as even the connected devices on the network did not have any malware
during the test run. It was also observed that nfdump can log each network packet with 7 key
parameters such as the packet size, receiver and sender’s IP addresses, receiver and sender’s port,
packet transfer protocol and the network packet’s content in byte format.
Moreover, exploration and analysis of the CIDDS-001 public data resulted in confirmation that
the 7 parameters captured by nfdump for each data packet are also available in the public dataset
and this makes it possible for us to successfully train the network anomaly detection model on
the CIDDS-001 dataset while evaluating and running the model on the network scanner using the
parameters logged by nfdump.
This enables our project to analyze all the network packets flowing in and out of the system in
approximately real-time. Besides, the ability to capture the parameters in the curated datasets from
the Network Scanner is significantly helping with the development of the hybrid network anomaly
detection model, one of the key goals of Malmenator.
6.3.1 Future Works
While the current progress is promising, there are still a wide array of tasks that are yet to be fulfilled
for a satisfactory success of this project. These include working on further analyzing the UGR16
and CICIDS-2017 public datasets to incorporate them in building the network anomaly detection
model, besides the work done on CIDDS-001.
This further curation of data will enhance the comprehensiveness of the existing curated datasets
and support better training and evaluation of the performance of the hybrid network anomaly
detection model.
Following that, another task is to keep working on the hybrid network anomaly detection model
which will require significant time and experimentation with different machine learning algorithms.
The experimentation is required to discover which set of algorithms fit the curated datasets
effectively.
29
-
Moreover, the experimentation of the accuracy of the model with different algorithms will also
help in creating a more comprehensive evaluation matrix to compare and contrast our methods with
other research on network anomaly detection model.
And finally, we need to continue developing our web infrastructure which will act as the end
deliverable for the users of the research project to interact with the project and perform operations
on the project using a web-based dashboard.
6.4 Preliminary anomaly detection model results
The initial anomaly detection model was made by utilising two approaches based on the CIDDS-001
dataset as described in section 5.3.1.
Approach 1: An initial application involved the usage of a one class SVM model utilising
numeric and categorical fields from the netflow data with features including netflow duration, source
port, Destination Port, Number of Packets, Number of Flows and the labels include a single boolean
information on whether the flow is anomalous or not. For this initial approach the model considers
the attacker and victim labelled flow as anomalous and the normal labelled flow as non-anomalous.
In this way, out of the 8.45 million flows, 7.01 million were anomalous and 1.44 million were non-
anomalous. The One Class SVM model trained on this dataset produced fair results on testing data
with a false positive rate of 10.88%. Correctly classifying almost 89.12% of network flows.
Approach 2: This involved the use of KMeans unsupervised learning on the network flow data
as a time series. The KMeans model utilizes the date first seen feature in addition to the features
utilized by the one class SVM approach. Adding this feature makes it a time series anomaly detection
problem on which the KMeans model was trained. The results on training data only involved 0.637%
of incorrect prediction and a near 99% correct prediction rate. However, as the predictions were
done on the training data this might not be the precise indication of the model’s actual performance
which is susceptible to over-fitting. This measurement will be corrected in the future versions of the
model.
6.5 Project timeline
As mentioned in 5.1, Malmenator consists of two phases. In phase one, we aim to setup the NIDS
hardware and generate a comprehensive dataset for evaluation. General statistical analysis on the
datasets and the creation of a basic web interface for Malmenator will also be completed. In phase
30
-
two, we aim to train, evaluate and deploy a hybrid network anomaly detection technique using
customized hardware.
Malmenator’s project schedule can be found in table 6.2. Phase one will be completed by the
end of December and phase two will be completed by the end of March.
Table 6.2: Planned project timetable
Date Description Deliverable
Aug 22 Initial meeting with supervisor
Sep 20 Second meeting with supervisor
Sep 29 Project proposal and website Project plan and website
Oct 31 Setup nfdump and Raspberry Pi router Hardware setup completed
Nov 30 Dataset compilation Network dataset processed
Dec 31 Web UI hosted on the cloud ELK established
Jan 24 First presentation
Feb 2 Preliminary implementation and interim report Interim report
Feb 28 Network classification model Fully trained classifier
Mar 30 Unified network scanner, classifier, and web UI Functional product
April 1 Project in production
Apr 19 Finalized implementation and final report Final Report
Apr 20-24 Final presentation
May 5 Project exhibition
June 3 Project competition
6.6 Engineering Risk mitigation
There are two main approaches Malmenator uses to reduce the risk of wasting time or failing to
attain desirable results. First, our team is focused on coding the skeleton framework for the NIDS
and web interface first without spending time on the details that may be refined later. This can be
seen in the deliverables for phase one of Malemenator. Second, our team spends a large amount of
time researching previous related works to ensure that our research direction continues to remain
promising. By taking this approach, our team does not waste time coding technology that will not
be implemented and enables our knowledge base to grow at a steady pace without rushing to acquire
31
-
experimental results.
6.7 Challenges
6.7.1 Steep learning curve
There have been several major challenges posed in the way of the Malmenator project, the first and
foremost being our team’s lack of prior knowledge in the cybersecurity field. This directly resulted in
us setting unrealistic expectations for our project scope which required us to reevaluate our decision
and narrow our scope several times. This has resulted in us modifying our initial methodology
several times to more accurately reflect feasibility and utility based on our research.
6.7.2 Scope identification
Cybersecurity is a huge domain in itself, entailing specialisations from malware analysis to
penetration testing. This made it harder to limit the scope of the project as the team started with
a broad set of goals including cleaning of malware from a network as well as malware file
classification. A reason for this can be attributed to the lack of prior knowledge of the team in the
field of cybersecurity. The team was eventually able to identify key problems and the need for
improvement as well as innovation in the niche of network security and solved this challenge by
limiting the scope of research to this field by working on identifying anomalies/cyberattacks on a
network level.
6.7.3 Proprietary information
A second challenge comes from the inherent nature of the cybersecurity field. Unlike other fields
such as computer vision or natural language processing, there are huge barriers in navigating and
researching the field due to a lack of transparency and a general obfuscation of materials. For
example, a universally used benchmark dataset does not exist due to the privacy and legal concerns
over data sharing (in contrast with ImageNet for image recognition problems) as well as the ever
changing nature of cyberattacks. Furthermore, cybersecurity is a field that is prone to industry
research being far ahead of academic research since academic research is accessible by malicious
attackers. Lastly, experimenting with live viruses and malware pose serious safety hazards that
require additional safety precautions, which further slows progress.
32
-
Conclusion
Malmenator is research based project that aims to deliver a powerful and adaptable network security
tool to individuals and small organizations. By exploring powerful and comprehensive anomaly
detection techniques from multiple approaches, our project thoroughly analyzes network traffic data
for suspicious and unwanted activity. Malmenator’s research component will shed new light on the
efficacy of various datasets in evaluating NIDSs and propose novel approaches to detect network
anomalies. To date, the NIDS hardware has already been implemented on a heavily configured
Raspberry Pi using a modular open source NIDS software, nfdump. Furthermore, research into the
appropriate dataset for model creation has been completed, leading us to leverage the CIDDS-001
dataset for building and evaluating a hybrid technique for network anomaly detection. Over the
next month, our team will explore a variety of machine learning models on the dataset to determine
which class of techniques work most effectively.
33
-
Bibliography
[1] The White House, “The cost of malicious cyber activity to the u.s. economy,” tech. rep., The
Council of Economic Advisors, Feb 2018.
[2] Forgerock, “U.s. consumer data breach report 2019: Personally identifiable information targeted
in breaches that impact billions of records,” tech. rep., Forgerock, 2019.
[3] The Internet Society, “2018 cyber incident and breach trends report,” tech. rep., The Internet
Society, Jul 2019.
[4] A. Asen, W. Bohmayr, S. Deutscher, M. Gonzalez, and D. Mkrtchian, “Are you spending
enough on cybersecurity?,” tech. rep., Boston Consulting Group, Feb 2019.
[5] International Data Corporation, “Worldwide semiannual security spending guide,” tech. rep.,
International Data Corporation, Mar 2019.
[6] The Royal Society, “Progress and research in cybersecurity: Supporting a resilient and
trustworthy system for the uk,” tech. rep., The Royal Society, Jul 2016.
[7] K. A. Scarfone and P. M. Mell, “Guide to intrusion detection and prevention systems (idps),”
tech. rep., National Institute of Standards and Technology, U.S. Department of Commerce, Feb
2007.
[8] G. Fernandes, J. J. P. C. Rodrigues, L. F. Carvalho, J. F. Al-Muhtadi, and M. L. Proena,
“A comprehensive survey on network anomaly detection,” Telecommunication Systems, vol. 70,
p. 447489, Jul 2018.
[9] J. K. K. M. H. Bhuyan, D. K. Bhattacharyya, “Network anomaly detection: Methods, systems
and tools,” IEEE Communications Surveys Tutorials, vol. 16, pp. 303–336, Jan 2014.
34
-
[10] H. Holm, “Signature based intrusion detection for zero-day attacks: (not) a closed chapter?,”
2014 47th Hawaii International Conference on System Sciences, Jan 2014.
[11] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A
comprehensive review,” Journal of Network and Computer Applications, vol. 36, p. 1624, Jan
2013.
[12] N. Moustafa, J. Hu, and J. Slay, “A holistic review of network anomaly detection systems: A
comprehensive survey,” Journal of Network and Computer Applications, vol. 128, p. 3355, Feb
2019.
[13] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, “A survey of network-based
intrusion detection data sets,” Computers & Security, vol. 86, p. 147167, 2019.
[14] Cisco, “Netflow version 9 flow-record format,” tech. rep., Cisco, May 2011.
[15] I. Sharafaldin, A. Gharib, A. H. Lashkari, and A. A. Ghorbani, “Towards a reliable intrusion
detection benchmark dataset,” Software Networking, vol. 2017, p. 177200, Jul 2017.
[16] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[17] R. Lippmann, D. Fried, I. Graf, J. Haines, K. Kendall, D. Mcclung, D. Weber, S. Webster,
D. Wyschogrod, R. Cunningham, and et al., “Evaluating intrusion detection systems: the 1998
darpa off-line intrusion detection evaluation,” Proceedings DARPA Information Survivability
Conference and Exposition, Jan 2000.
[18] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 darpa off-line
intrusion detection evaluation,” Computer Networks, vol. 34, no. 4, p. 579595, 2000.
[19] P. Gogoi, M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Packet and flow based network
intrusion dataset,” Communications in Computer and Information Science Contemporary
Computing, p. 322334, 2012.
[20] M. V. Mahoney and P. K. Chan, “An analysis of the 1999 darpa/lincoln laboratory evaluation
data for network anomaly detection,” Recent Advances in Intrusion Detection, p. 220237, 2003.
[21] J. Mchugh, “Testing intrusion detection systems: a critique of the 1998 and 1999 darpa
intrusion detection system evaluations as performed by lincoln laboratory,” ACM Transactions
on Information and System Security, vol. 3, p. 262294, Nov 2000.
35
-
[22] A. Vasudevan, E. Harshini, and S. Selvakumar, “Ssenet-2011: A network intrusion detection
system dataset and its comparison with kdd cup 99 dataset,” 2011 Second Asian Himalayas
International Conference on Internet (AH-ICI), 2011.
[23] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the kdd
cup 99 data set,” IEEE Symposium on Computational Intelligence for Security and Defense
Applications, 2009.
[24] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection
systems (unsw-nb15 network data set),” Military Communications and Information Systems
Conference (MilCIS), Nov 2015.
[25] M. Ring, S. Wunderlich, D. Grdl, D. Landes, and A. Hotho, “Flow-based benchmark data sets
for intrusion detection,” in Proceedings of the 16th European Conference on Cyber Warfare and
Security (ECCWS), pp. 361–369, ACPI, 2017.
[26] M. Ring, S. Wunderlich, D. Grdl, D. Landes, and A. Hotho, “Creation of flow-based data sets
for intrusion detection,” Journal of Information Warfare, vol. 16, pp. 40–53, 2017.
[27] G. Maci-Fernndez, J. Camacho, R. Magn-Carrin, P. Garca-Teodoro, and R. Thern, “Ugr16: A
new dataset for the evaluation of cyclostationarity-based network idss,” Computers & Security,
vol. 73, p. 411424, 2018.
[28] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion
detection dataset and intrusion traffic characterization,” Proceedings of the 4th International
Conference on Information Systems Security and Privacy, Jan 2018.
[29] R. Fontugne, P. Borgnat, P. Abry, and K. Fukuda, “Mawilab,” Proceedings of the 2018
Workshop on Big Data Analytics and Machine Learning for Data Communication Networks,
2010.
[30] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic
approach to generate benchmark datasets for intrusion detection,” Computers & Security,
vol. 31, p. 357374, May 2012.
[31] S. Garca, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of botnet detection
methods,” Computers & Security, vol. 45, p. 100123, Sep 2014.
36
-
[32] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly detection techniques,”
Journal of Network and Computer Applications, vol. 60, p. 1931, Jan 2016.
[33] S. Dua and X. Du, Data Mining and Machine Learning in Cybersecurity. Auerbach Publications,
2016.
[34] C. Wagner, J. Franois, R. State, and T. Engel, “Machine learning approach for ip-flow record
anomaly detection,” International Conference on Research in Networking, p. 2839, 2011.
[35] M. A. Ambusaidi, X. He, P. Nanda, and Z. Tan, “Building an intrusion detection system using a
filter-based feature selection algorithm,” IEEE Transactions on Computers, vol. 65, p. 29862998,
Oct 2016.
[36] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection,” ACM Computing Surveys,
vol. 41, p. 158, Sep 2009.
[37] I. Corona, G. Giacinto, and F. Roli, “Adversarial attacks against intrusion detection systems:
Taxonomy, solutions and open issues,” Information Sciences, vol. 239, p. 201225, Aug 2013.
[38] N. Ye and Q. Chen, “An anomaly detection technique based on a chi-square statistic
for detecting intrusions into information systems,” Quality and Reliability Engineering
International, vol. 17, no. 2, p. 105112, 2001.
[39] C. Manikopoulos and S. Papavassiliou, “Network intrusion and fault detection: a statistical
anomaly approach,” IEEE Communications Magazine, vol. 40, pp. 76–82, Oct 2002.
[40] Kaspersky Lab, “Machine learning in malware detection,” tech. rep., Kaspersky, 2019.
[41] S. Saad, W. Briguglio, and H. Elmiligi, “The curious case of machine learning in malware
detection,” Proceedings of the 5th International Conference on Information Systems Security
and Privacy, Feb 2019.
[42] Raspberry Pi Foundation, “Raspberry pi foundation,” Sep 2019.
[43] C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide. O’Reilly Media, Inc., 1st ed.,
2015.
37
AbstractAcknowledgementsList of FiguresList of TablesAbbreviationsIntroductionMotivationProblem formulationContributionsReport organization
A Primer on CybersecurityWhat are Cyberattacks?Economic Impacts of MalwareCategories of Cybersecurity SolutionsEthics of Malware ResearchChallenges of Malware Research
Project backgroundOverviewWhat is an nids
nids vs nipsnids implementation strategiesOverview of nids techniquesSignature-based detectionAnomaly-based detection
Packet-based and flow-based data formatsNetFlow
NfdumpSummary
Previous worksOverviewNetwork traffic datasetsMethods for network traffic dataset generation Old reference datasetsKey datasetsCIDDS-001UGR â•Ÿ16CICIDS 2017
Other relevant datasetsUNSW-NB15MAWILabUNB ISCX 2012CTU-13
General Remarks
Machine learning in network anomaly detectionClassification methodsStatistical methodsClustering methodsConstraints of machine learning
Summary
MethodologyOverviewNetwork scannerRaspberry Pi hardwareHardware setupUtilizing nfdump
Dataset selectionCIDDS-001
Network anomaly detectionHybrid network anomaly detection approachModel evaluation
Web interfaceDashboardWeb architecture
Summary
Discussion and remarksOverviewCurrent StatusDiscussion of FindingsFuture Works
Preliminary anomaly detection model resultsProject timelineEngineering Risk mitigationChallengesSteep learning curveScope identificationProprietary information
ConclusionBibliography