
A Security Platform Using Software Defined Infrastructure

by

Mohammad-Sina Tavoosi-Monfared

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Electrical and Computer Engineering

University of Toronto

© Copyright by Mohammad-Sina Tavoosi-Monfared 2016


Abstract

A Security Platform for Software Defined Infrastructure

Master’s of Applied Science

Electrical and Computer Engineering

University of Toronto

2016

In this work, we designed an architecture for cloud network security, leveraging Software Defined Infrastructure (SDI), which enables centralized management of compute and networking resources. We show that, utilizing SDI's service chaining and its Software Defined Networking approach, network security functions such as intrusion detection and prevention, as well as distributed firewalls, can be realized as services in the cloud, as modeled in Network Function Virtualization. In our platform, protective resources are located as close as possible to the entity being protected. Furthermore, we discuss the design of user-friendly interfaces for these services, where user traffic flows are associated with Enhanced Security Profiles, together forming Enhanced Security Groups. We also discuss our proof-of-concept implemented on the SAVI testbed.


Acknowledgments

I would like to thank my family, especially my parents, for all the help and guidance they have

provided me up to this stage in my life.

I would like to thank my advisor, Professor Alberto Leon-Garcia, for his kind supervision and

guidance.

I am very thankful to the SAVI team, in particular Thomas Lin and Hadi Bannazadeh, for all that they taught me, as well as the resources they provided, which enabled me to implement my architecture on the SAVI testbed.

Last, but not least, I am thankful to the University of Toronto, and the city of Toronto as a whole, for being an amazing community. For the harmony of its diverse cultures, I am, and forever will be, a proud Torontonian.


Table of Contents

Acknowledgments
List of Tables
List of Figures
List of Acronyms
Chapter 1 Introduction and Problem Statement
  1.1 Motivation
    1.1.1 Cloud Security Needs
    1.1.2 Categorization of Cloud Security Mechanisms
    1.1.3 Network Security Paradigm Shift
  1.2 Problem Statement
    1.2.1 Interoperable NIDPS Platform
    1.2.2 Platform Scalability and Component Coordination
    1.2.3 Scope of Research
Chapter 2 Background and Related Work
  2.1 Background
    2.1.1 Software Defined Networking (SDN)
    2.1.2 Network Function Virtualization (NFV)
  2.2 Software Defined Infrastructure
    2.2.1 Conceptual Architecture
    2.2.2 Opportunities and Capabilities in Providing Security
  2.3 Related Work
    2.3.1 IDS Frameworks for the Cloud
    2.3.2 IDS Interface Design
Chapter 3 Overview of Hybrid Security Platform
  3.1 High-Level Design
    3.1.1 Design Requirements
    3.1.2 High-Level Design Components
  3.2 Design Considerations
    3.2.1 Design for Scalability
    3.2.2 Design for Testability
    3.2.3 Design for Extensibility
    3.2.4 Design for Cost Management
    3.2.5 Design for Self-Protection
  3.3 Interoperable IDS API
    3.3.1 Integration with Analytics-based Detection
  3.4 Distributed Mitigation System
Chapter 4 Software Architecture and Implementation
  4.1 Deployment Architectures
  4.2 SDI Enabler
  4.3 Enhanced Security Groups
  4.4 Prototype Implementation
    4.4.1 IDS Appliances
    4.4.2 Component Placement
    4.4.3 Configuration API
    4.4.4 Master Coordination Agent
    4.4.5 Software Configuration Agents
    4.4.6 Load Balancer
    4.4.7 Auto-scaling
    4.4.8 Web User Interface
Chapter 5 Testing and Evaluation
  5.1 Functional Verification
  5.2 Testing Methodology
    5.2.1 Parameters of Interest
  5.3 Test for Detection Time
  5.4 Test for Relative Delay Measurement
  5.5 Test for Scalability
  5.6 Tests for Detection Accuracy and Information Integrity
Chapter 6 Conclusion
  6.1 Overall Evaluation
  6.2 Future Work
  6.3 Contribution
References


List of Tables

Table 5-1 – Chosen Testing Parameters

Table 5-2 – Summary of the Experiment Measuring the Detection Time of Different Security Sensitivities

Table 5-3 – Round Trip Times (RTTs) for the three VA Scenarios

Table 5-4 – Round Trip Times (RTTs) for the two PA Scenarios


List of Figures

Figure 1-1 – Typical DMZ Style Firewall Deployment

Figure 2-1 – Conventional Switches vs. SDN-based Switches

Figure 2-2 – SDI Components

Figure 2-3 – Realization of Security Module in SDI

Figure 3-1 – Attack from Outside Scenario

Figure 3-2 – Attack from Inside Scenario

Figure 3-3 – General Resource Coordination Architecture

Figure 3-4 – High-Level Components

Figure 4-1 – Flat Deployment / Communication Architecture

Figure 4-2 – Hierarchical Deployment / Communication Architecture

Figure 4-3 – VA Implementation Using Virtual Machine including Snort and OVS – Note the direction of traffic

Figure 4-4 – Detailed Components of Master Agent

Figure 4-5 – Distributed Load Balancing Scheme

Figure 4-6 – Distributed Load Balancing Scheme

Figure 4-7 – Sign-in Page of the GUI

Figure 4-8 – VM List Page of the GUI

Figure 4-9 – ESP Page of the GUI

Figure 4-10 – Chaining Page of the GUI


Figure 5-1 – Verification of Chaining Using Ping and TCPDUMP

Figure 5-2 – Test using Netcat utility (the packet containing the "attack" string from samplevm4 to samplevm3 is dropped)

Figure 5-3 – VA Test Using a Low Sensitivity DoS Threshold

Figure 5-4 – VA Test Using a Medium Sensitivity DoS Threshold

Figure 5-5 – VA Test Using a High Sensitivity DoS Threshold

Figure 5-6 – PA Test - Low Sensitivity

Figure 5-7 – PA Test - Medium Sensitivity

Figure 5-8 – PA Test - High Sensitivity

Figure 5-9 – Total BW against time for scenario A

Figure 5-10 – Total resource number scaling as the BW sum increases in a step-like manner

Figure 5-11 – Number of Low/Small IDS Resources vs. the total inspection bandwidth over time

Figure 5-12 – IDS-1 and 2 Resource Utilizations and Bandwidths throughout the experiment

Figure 5-13 – IDS-3 and 4 Resource Utilizations and Bandwidths throughout the experiment

Figure 5-14 – Expected growth rate of packet loss for a given IDS

Figure 5-15 – The attack used in the second scenario to measure the packet loss

Figure 5-16 – Iperf test, first without an attack, and then under an attack scenario


List of Acronyms

API – Application Program Interface

AD – Anomaly Detection

ASIC – Application Specific Integrated Circuit

BW – Bandwidth

CAPEX – Capital Expenses

CDN – Content Distribution Network

CPU – Central Processing Unit

DAQ – Data Acquisition

DB – Database

DDoS – Distributed Denial of Service

DST – Destination

DMZ – Demilitarized Zone

DNS – Domain Name System

DoS – Denial of Service

DPI – Deep Packet Inspection

ESG – Enhanced Security Group

ESP – Enhanced Security Profile

FPGA – Field Programmable Gate Array

FW – Firewall


GDP – Gross Domestic Product

GUI – Graphical User Interface

HIDS – Host-based Intrusion Detection System

HTTP – Hypertext Transfer Protocol

HTTPS – HTTP Secure

IAM – Identity and Access Management

IDS – Intrusion Detection System

IDMEF – Intrusion Detection Message Exchange Format

IETF – Internet Engineering Task Force

IP – Internet Protocol

IPS – Intrusion Prevention System

IT – Information Technology

LB – Load Balancer

LTE – Long Term Evolution

M&M – Monitoring and Measurement

MAC – Media Access Control

NetFPGA – Network Field Programmable Gate Array

NIDS – Network Intrusion Detection System

NIPS – Network Intrusion Prevention System

NIDPS – Network Intrusion Detection and Prevention System


NIST – National Institute of Standards and Technology

NFV – Network Function Virtualization

OPEX – Operational Expenses

OVS – Open vSwitch

PA – Physical Appliance

QoS – Quality of Service

RC – Resource Controller

REST (RESTful) – Representational State Transfer (based)

RMS – Resource Management System

ROC – Rate of Change

RTT – Round Trip Time

SAVI – Smart Applications on Virtual Infrastructure

SDI – Software Defined Infrastructure

SDN – Software Defined Networking

SQL – Structured Query Language

SRC – Source

SSH – Secure Shell

SW – Software

TCAM – Ternary Content Addressable Memory

TCO – Total Cost of Ownership


TCP – Transmission Control Protocol

TLS – Transport Layer Security

UI – User Interface

VA – Virtual Appliance

VM – Virtual Machine

VNF – Virtualized Network Function

VPC – Virtual Private Cloud

VPN – Virtual Private Network

VXLAN – Virtual Extensible LAN

WAN – Wide Area Network


Chapter 1 Introduction and Problem Statement

In this chapter, we discuss the general motivation and the direction of the cloud network security paradigm, and state the particular problem we tackled.

1.1 Motivation

1.1.1 Cloud Security Needs

In order to motivate the significance of our discussion on cloud network security, we begin by reviewing some recent statistics relevant to our work. In 2015, the total annual cost of malicious cyber-attacks was estimated to range from $100 billion to $1 trillion (US dollars) [1] [2] [3] (compare this with Canada's 2015 GDP, estimated at $1.548 trillion [4]). From 2014 to 2015, there was a 38 percent increase in detected security attacks [5]. The average length of time a hacker stays inside a network undetected is estimated to be over 140 days [6]. In a survey of 814 qualified IT security decision makers (at organizations with at least 500 employees), at least 52 percent anticipated a successful attack on their network infrastructure within the year 2015 [7]. These statistics increasingly relate to cloud computing, as global organizations increasingly use clouds; in 2015, only about 38 percent of global organizations considered themselves ready to mitigate sophisticated attacks [8].

One may appreciate the attractiveness of the cloud for hackers by looking at NIST's definition of cloud computing: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources" [10]. In other words, the cloud is available everywhere, can launch workloads on demand, and offers convenient remote access; it is hard to find a hacker who would not love such technology. In 2015, it was estimated that about 6.5 million unique hosts were leveraged by cyber criminals to launch attacks [11]. The cloud offers many clusters of "unique" hosts, making it lucrative for hackers to gain control of them.

Similarly, clouds are increasingly becoming an attractive target for attackers. Due to the cloud's economies of scale, both small and large enterprises have been moving part or all of their computing operations to the cloud, implying that we shall see an increasing shift in targets. The statistics bear this out: exploits affecting corporate and internal networks grew to 40% in 2015, up from 18% in 2014 [12].

On the other hand, the detection of cyber-attacks has remained a difficult task. In 2015, an organization that investigated billions of security events in 17 countries reported that 59 percent of the victims did not discover the breaches themselves. While the cloud is attractive for attacks both originating from and targeting it, it is evident that conventional frameworks have many shortcomings in dealing with cloud network security. We will discuss these shortcomings in the sections ahead.

1.1.2 Categorization of Cloud Security Mechanisms

The term “security” may have many different connotations. In order to properly scope our work,

we propose a brief categorization of cloud security mechanisms into four main types:

1- Encryption-based: Utilizing encryption, hashing, Virtual Private Networks (VPNs), etc. to ensure confidentiality and integrity. There are approaches at both lower and higher layers (end-to-end).

2- Intrusion Detection-based: The topic of our work, covered in the background chapter.

3- Hypervisor / OS-based: Isolation of processes, possibly including a firewall below the Virtual Machine (VM).

4- Virtualization-based: E.g. network slicing using Software Defined Networking (SDN), or adding proxies to SDN controllers. Not a direct topic of this work.


1.1.3 Network Security Paradigm Shift

In order to understand the current trends in the network security industry, we first review the traditional network security frameworks. Then, we state our understanding and anticipation of where these trends are heading.

1.1.3.1 Conventional Firewalls and the DMZ Model

Traditionally, the most frequently used module in network security has been the firewall. The firewall is an entity that monitors ingress and egress traffic, taking action (i.e. allowing traffic to pass or blocking it) based on a set of pre-defined rules [13]. In particular, firewalls prevent exposure of / access to certain software ports. A firewall may be implemented in hardware or software, and be hosted (within a computer) or external (a separate module or box). With a hosted firewall (e.g. the default firewall that comes with the operating system), the unused / unnecessary software ports are usually closed. In the case of an external firewall, packets that can be identified (by looking at the packet headers) as communicating with certain unwanted ports are dropped.

Perhaps the best-known traditional architectural framework for deploying firewalls in network security is the DMZ (Demilitarized Zone) model. A DMZ is a small network or subnetwork that exposes front-facing external services to the Internet or to less trusted / outside networks [14]. The DMZ is designated as the network / subnet that is less restricted, since the services it exposes to the outer world require certain firewall ports to be left open. As shown in Figure 1-1, the DMZ usually sits between two secured zones, separated by means of firewalls.

Figure 1-1 – Typical DMZ Style Firewall Deployment


In cloud computing, firewalls are not implemented in quite the same way, and hence are often named differently, namely "Security Groups". A Security Group acts as a virtual firewall [15], usually implemented at the hypervisor, the entity that divides the underlying physical infrastructure into virtually isolated chunks of resources (i.e. Virtual Machines (VMs)). A Security Group may be applied to one or more VM instances in the cloud. Besides the policy and port range, security group rules may also specify a direction (inbound or outbound) or specific source addresses (e.g. open the email port only for communications from/to a specific IP address).
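As a concrete illustration, in an OpenStack-based cloud such a rule could be created with the standard command-line client; the group name and address below are hypothetical:

    # Create a security group, then allow inbound SMTP only from one address
    openstack security group create mail-sg
    openstack security group rule create --ingress --protocol tcp \
        --dst-port 25 --remote-ip 203.0.113.10/32 mail-sg

With these rules in place, inbound TCP traffic on port 25 is admitted only from the given address, while all other inbound traffic to instances in the group remains blocked by default.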

The DMZ model has shortcomings when applied to the cloud. In particular, it relies on the assumption that one network / subnet can be more trusted than another. However, how can one define the interior and the exterior and draw such borders in a cloud, particularly a public one? In a given public cloud, there may be VMs of various entities running on the same physical machine at any given point in time.

1.1.3.2 Intrusion Detection and Prevention Systems

Besides firewalls and DMZs, we shall review the concept and evolution of Intrusion Detection Systems (IDSes) and Intrusion Prevention Systems (IPSes). An IDS is a passive module used to detect attack traffic, while an IPS is an active module that prevents / mitigates detected attacks (e.g. by re-routing or blocking incoming attack traffic). IDSes come in different types, including the following:

1- Signature/Pattern-based: These tend to be network-based, looking for specific signature patterns within the packet header and payload. Their advantage is that their detection accuracy for attacks with well-known signatures is high (e.g. over 99%). In particular, they produce almost no false positives (unless the signature assigned to an attack is dual-use).

2- Analytics-based: This type of IDS typically detects deviations from a norm / baseline established using a mathematical model. As such, an alternative name for this category is Anomaly Detection (AD). AD may flag an authenticated user who merely demonstrates a strange usage pattern as an attacker, labeling otherwise regular activity as anomalous (i.e. a false positive) [16].

Another IDS classification can be made based on the platform monitored. In particular, we have the following categories:

1- HIDS (Host-based IDS): The detection system regularly monitors the system logs, processes, file system, and network interface (from inside the host machine).

2- NIDS (Network-based IDS): An IDS of this type operates at the network layer, looking at the packet header and the content carried above the network / IP layer. The thorough inspection operation that also examines the payload is called DPI (Deep Packet Inspection) [17]. If the NIDS is capable of blocking the attack traffic (e.g. through inline placement), it is then denoted an NIDPS (Network Intrusion Detection and Prevention System). A sample signature of the kind such systems match on appears below.
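As a minimal, hypothetical illustration, a Snort-style signature could look as follows; the subnet and rule ID are made up, and the matched string mirrors the simple "attack" payload used in our functional verification (Figure 5-2):

    alert tcp any any -> 10.0.0.0/24 any \
        (msg:"Test signature: payload contains attack string"; \
        content:"attack"; nocase; sid:1000001; rev:1;)

A signature-based NIDS such as Snort raises an alert whenever a TCP packet towards the protected subnet carries the string "attack" anywhere in its payload; an inline NIDPS deployment would use the drop action instead, so that the packet is also blocked.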

In the next chapters, we propose an interoperable model where IDSes of different categories may be integrated to form a composite IDS system. In particular, an AD module may work together with a signature-based IDS for further verification. For the most part, we focus on signature-based NIDPSes, as our proof-of-concept is mainly built with them.

1.1.3.3 The New Paradigm

Going back to the question of cloud network security: even if we were able to draw the lines between the trusted and untrusted networks in the cloud (e.g. through the use of virtualized networks), port-based firewall technology is inherently flawed, as it can never deal with vulnerabilities that exist in traffic belonging to the well-known ports (e.g. SQL injection). Clearly, packet header information does not provide sufficient criteria for malware detection. As attacks become more sophisticated, the defenses have had to become more detailed and fine-grained.

To answer this challenge, a new paradigm in network security has emerged, in which the defenses are located closer to the resource being protected, sometimes at the application layer itself. This is the approach that Google has chosen [18]. In particular, they have been bringing security operations (access control, authentication, and authorization) to the application layer and do not use traditional firewalls. This framework also aligns with the famous "application knows best" argument (also known as "the end-to-end argument") [19]. Other cloud landscape trends include the growth of in-house custom software to increase security. In addition, interior security is now treated just the same as the perimeter, as if it were all perimeter. This is reflected in Google's Zero Trust Model, which treats the internal and external networks (e.g. the Internet) the same, and which is said to be in the direction of a paradigm shift in the security of future networks [20].

Nevertheless, when it comes to network security, the solution we provide in this work is a mix of local and global, residing both inside the network and at its edge. There are attacks that can best be mitigated in the switches or edge routers; for those, it is the networking and infrastructure layer that can most efficiently be deployed to block the attack. This is especially the case for DoS (Denial of Service) and DDoS (Distributed Denial of Service) attacks, which take advantage of a device or service being network-connected, typically by flooding it with traffic requests [21]. At times, these attacks rely on an asymmetry of resources, which can be achieved through the overhead of application-layer processing (making the flooding traffic more time- and resource-consuming to process).

Of course, the borders between the layers are increasingly collapsing, and our work is aligned with this trend, too. The new paradigm, as we will see in the case of SAVI (Smart Applications on Virtual Infrastructure), is for the network infrastructure to expose accessible APIs to the so-called "smart" applications (perhaps they get their "smartness" from the exceptional access they have to control the network infrastructure). If not already, this may be a game changer in the networking world in the years to come.

Therefore, the DMZ is increasingly being replaced with security groups that act as a defensive ring close to the exact entity being protected. As mentioned, while HIDSes are easier to install and configure, the network flavour of security can be more effective, especially in terms of mitigation.

We need ever more scalable security orchestration to protect our clouds. Furthermore, we need to minimize the amount of human labour required by automating tasks as much as possible. Likewise, it is preferable to be able to control the security platform in a logically centralized way, while the security infrastructure itself may be distributed in implementation in order to scale. In particular, any such platform shall be able to auto-scale up and down based on demand.

Platforms are also becoming more adaptive in their decision making: establishing future trends in input demand by learning from the past is increasingly built into production business solutions [22]. Similarly, analytics-based attack detection techniques are increasingly integrated with conventional security schemes [23].

As well, as we discuss in our NFV (Network Function Virtualization) section, network security is increasingly moving from "appliances" to a set of services (e.g. honeypot, IDS, firewall, etc.) that are connected through service chaining. Such an ecosystem of virtualization, cloud, and NFV motivates interoperability, and hence industry players are increasingly co-operating through open source communities [24].

Motivated by these trends, our solution places security defenses closer to the resources being protected, scales on demand at the scale of the cloud, and is interoperable, leaving room for mixing and matching components creatively.

1.2 Problem Statement

In this section, we define the exact problem we attempted to solve. In particular, we shall note the scope of our solution, in terms of what it covers and what it does not. While defining the scope is important in general, it is especially important in the security field.

In short, our problem is to design a cloud network security platform, which realizes a scalable

architecture for NIDPS deployment. This design is to provide automation, scaling, and

coordination among the network security components. In the next sections, we will discuss our

problem and objectives in further detail.

1.2.1 Interoperable NIDPS Platform

NIDPSes come in many shapes and forms. There are many different implementations of them, including the following:

1- Proprietary hardware: ASICs (Application Specific Integrated Circuits)

2- Standard computer hardware (e.g. the x86 architecture)

3- NetFPGA hardware (and a hardware description language)

4- Software (e.g. Snort)

5- Virtual appliances (usually a virtual machine, container, etc.)

More importantly, NIDPSes come from different vendors, and in general, vendors do not like their enterprise customers to be able to mix and match their products with those of their competitors. The consequence is a lack of interoperability. Hence, one important objective of the problem we attempted to solve is to come up with a design that provides interoperability for NIDPSes, such that they can communicate and coordinate with each other.

Such an architecture enables hybrid and heterogeneous solutions. Hybridity here primarily means the ability to include physical and virtual resources within the same solution / ecosystem. For that, as we discuss in Chapters 3 and 4, we had to come up with a systematic approach to abstracting the IDS resources, regardless of the vendor or implementation they come from.

While the overall trend may be moving towards virtualized resources, there has already been vast CAPEX (Capital Expenses) spent on legacy physical devices, which we denote as PAs (Physical Appliances). As well, virtualized resources (denoted as VAs (Virtual Appliances)) tend not to match the dedicated hardware's performance (in both Bandwidth (BW) and delay). Nonetheless, the VAs may play the role of overflow capacity. The only other alternative is to further increase the capital investment to acquire a greater number of hardware-intensive PAs; of course, that approach comes with increased OPEX (Operational Expenses), too [25].

Hence, interoperability matters because it enables flexible IDS platforms to be realized. While interoperability is important by itself, perhaps its main driving force in today's context of cloud computing is the increasing need for scalability, which we discuss in the next subsection. Scalable solutions can be realized through the hybrid utilization of virtualized resources in addition to physical ones.

1.2.2 Platform Scalability and Component Coordination

Scalability, in general, may be defined as a measure of a system's capability to increase its performance [26], in our case for growing input traffic / demand. To put this into perspective, one may consider the case of an enterprise such as Hewlett-Packard (HP). It has been estimated that in 2013, HP had to deal with 1 trillion security events per day, which amounts to roughly 12 million events per second [27]. Without scalability in mind, no solution can ever be designed to match that level of input inspection traffic.

With cloud’s ever increasing growth in size, the security apparatus has to scale as well. That is

why scalability is a key component of our problem, and we chose it as an objective, detailed in

Chapter 3 and 4. As mentioned, the main goal we consider for virtualized resources is to act as

overflow, essentially increasing the scalability of the system.

As we will discuss in the next chapters, a particular type of attack we focus on in this work is the DoS / DDoS attack, because this type makes up one of the two major categories of attacks that signature-based NIDSes detect. In these attacks, the response time from detection to mitigation (e.g. blocking, rerouting, or honeypotting the traffic) matters a great deal; we discuss this in detail in another work of ours [28]. In particular, in the case of DoS attacks, the higher the delay, the more problematic traffic is allowed into the network, potentially slowing down other connections, not just those of the targeted service node(s).

As we will discuss, the scalable approach to blocking these attacks is to have a distributed architecture in both detection and mitigation (e.g. blocking at the edge of the network). However, the distributed architecture should not result in increased detection-to-mitigation delays. In order to reduce attack response times, we seek an in-depth integration of detection and defense measures, for which coordination matters a great deal. As we discuss in Chapter 4, having logically centralized decision makers plays an important role in providing this coordination.

1.2.3 Scope of Research

In this section, we discuss what we intend to cover and, more importantly, what we will not be covering. In this work, we focus on designing a DPI / signature-based NIDPS platform architecture for clouds. We do not cover the design of analytics-based detection. As well, we will not be getting into the details of the DPI algorithms (e.g. Boyer-Moore pattern matching [29]); this implies that we treat the NIDS modules as black boxes for the most part. We are also not concerned with adding new signatures to IDSes: while we do identify the need for a signature update scheme, we do not detail a design for that feature.


Likewise, we do not aim to cover a great number of exploits as use cases or for testing. Of course, we categorize the attacks in general, and have a representative of each kind in our testing and measurements as well as our use cases. The single type of attack we focus on most is DoS; for the most part, we consider DDoS attacks an extension of DoS. In practice, AD schemes may provide much better detection than signature-based detection for DDoS attacks, as well as for outright abuse of the system by authenticated users. However, an in-depth comparison of IDSes of different types is outside our scope. Security may have inherently become a big data problem (e.g. the HP example in the last subsection, with 1 trillion security events per day). However, we do not cover big data techniques, aiming instead for the scalability of our platform.

Likewise, our testing will be limited to very basic attack types. We do not perform our testing with real captured traffic, as that may require very specific signatures. It is commonly said that a 100% secure system does not exist unless the system is in complete isolation (which would make it useless). We do not claim our work is a direct security solution; rather, it is an architectural framework that can motivate a real solution.

Our work does not provide a sophisticated security alert management system. In particular, our defense system simply blocks any suspicious traffic, which may imply an underlying assumption that the signature-based NIDSes have near-perfect accuracy (of course, this is the safer / militarized approach; false blockings may be reported and dealt with separately). While we believe signature-based detection provides better detection accuracy for attacks with well-known signatures, different categories of inaccuracies (false positives, false negatives) may exist. In fact, the accuracy of the IDS is a function of the accuracy of the signature, and of how well it has been correlated with various attack and non-attack traffic. However, we do not perform a detailed study on that.

It is crucial to emphasize that our signature-based platform does not detect zero-day attacks. As well, it may not detect unusual usage patterns. As mentioned before, analytics-based IDSes may detect these two categories, but at the cost of reduced overall accuracy. In particular, certain anomalies may be just an unusual yet valid usage pattern (e.g. a staff member who has traveled to a part of the world in a different time zone logs into the system at a different UTC time). These types of anomalies typically cause no issues with signature-based detection, since it makes no use of clustering algorithms; however, they may get flagged by an anomaly detection system. This is a typical false positive situation in analytics-based solutions, one that would not typically happen on our platform. Hence, there are trade-offs to consider when choosing between AD and signature-based detection.

Needless to say, DPI-based detection is not directly usable on encrypted traffic. There are various schemes to deal with encrypted traffic, such as a militarized approach where unapproved encrypted traffic (i.e. traffic encrypted with an unapproved key) is simply dropped. However, we do not concern ourselves with the management of encrypted traffic, as this is a wider security aspect of the cloud network that is outside the scope of our architecture.


Chapter 2 Background and Related Work

In this chapter, we will introduce background concepts and some of the terms that are frequently

used throughout the thesis.

2.1 Background

2.1.1 Software Defined Networking (SDN)

SDN is a recent network packet switching framework. Its core principle is to separate the data

and control planes within the packet switches and routers in a network. Generally, within the

switches and routers, the data plane is where the basic packet forwarding operations take place

(e.g. buffering, scheduling, and then forwarding based on a table lookup), while the control plane

is in charge of routing operations (e.g. setting up the values in the lookup table) [30]. Often in

switches and routers, both the control and data planes are implemented together in hardware, all

within the same box.

In SDN, however, the two planes are decoupled: the data plane can be implemented using cheap commodity hardware while the control plane is taken out of the switch. The control plane can then be centralized (logically and/or in implementation), resulting in a single controller managing several switches. The centralized controller is implemented in software. In a sense, the brain / intelligence of the switch is taken out of its hardware, and the switches themselves play the role of dumb packet forwarders. This concept is demonstrated in Figure 2-1, contrasting conventional vs. SDN-based switches.


Figure 2-1 – Conventional Switches vs. SDN-based Switches [31]

The most significant implementation of SDN to date is OpenFlow, a standardized protocol for the controller and the dumb switches to communicate [31]. OpenFlow controllers instruct the switches to act on incoming packets according to certain rules, which are called "flows" in the SDN context. Flows can be used to classify packets according to almost all L2, L3, or L4 header fields (e.g. ingress Ethernet port number, MAC and/or IP destination addresses, choice of IP and/or transport layer protocol, Layer 4 port numbers) [32]. Hence, OpenFlow switches tend to have flow tables instead of simple forwarding tables, typically populated using Ternary Content Addressable Memory (TCAM).
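For illustration, flow entries matching on such header fields can be installed on an Open vSwitch bridge with the ovs-ofctl utility; the bridge name, port number, and addresses below are hypothetical:

    # Forward HTTP traffic destined to 10.0.0.5 out of switch port 2
    ovs-ofctl add-flow br0 "priority=100,tcp,nw_dst=10.0.0.5,tp_dst=80,actions=output:2"

    # Drop all IP traffic from a known-bad source address
    ovs-ofctl add-flow br0 "priority=200,ip,nw_src=198.51.100.7,actions=drop"

The second entry has exactly the shape of rule that an SDN-based distributed firewall pushes to the switches, as we revisit in Section 2.2.2.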

SDN is particularly relevant to our work in two aspects: the first is service chaining, and the second is our distributed defense system. As we discuss in the next section, the service chaining we utilize is implemented using SDN / OpenFlow under the hood. As well, our distributed defense relies on the SDN-based firewall we implemented. We described the distributed firewall in detail in [28], and we will revisit it later in this thesis.

2.1.2 Network Function Virtualization (NFV)

Network Function Virtualization is a recent framework in which various network functions are classified into distinct building blocks / modules that can be implemented in a virtualized manner [33]. Such building blocks are called Virtualized Network Functions (VNFs). Examples of the VNF abstraction are Virtual Machines (VMs) within a multi-tiered cloud that perform specific network functions such as load balancing, firewalling, encryption, Virtual Private Networks (VPNs), Wide Area Network (WAN) acceleration, and DPI. The modules / VMs may then be chained together to provide the intended network services. In its radical form, NFV attempts to virtualize all network functions and nodes.

In terms of objectives, NFV clearly attempts to bring the well-known advantages of virtualization into the world of networking. Such general advantages include scalability, reduced overall power consumption (by dynamically turning the underlying hardware on and off), and a reduced Total Cost of Ownership (TCO) through lower CAPEX and OPEX. However, in our view, the most significant potential advantage of NFV, in comparison to the unabstracted networking service models of the past, is that it enables more rapid innovation than ever, as well as flexibility in assigning hardware. One can see the latter in the capability to migrate network services / VNFs from cloud to cloud, or from cloud edge to cloud core, so long as the delay requirements can be met [25].

Next, we discuss Security as a Service within NFV. Several of the security operations introduced above can be implemented as VNFs, such as encryption, firewall, DPI, NIDS, NIPS, and VPN [34]. Implementing these as VNFs has many advantages, perhaps most importantly scalability. A firewall, for instance, can be located at the edge of the cloud, to keep attackers as far outside the cloud as possible. This is meaningful when considering DoS attacks, which render the cloud nonfunctional by leaving little to no bandwidth for its essential network operations.

Thanks to VM live migration technology [35], system administrators can have little to no downtime for network topology update procedures. By implementing DPI in software rather than buying proprietary hardware boxes, huge cost savings can be achieved [28]. Also, for NIDS and NIPS, the security platform can scale its software-based DPI resources up and down on the fly, and apply a dynamic security investigation policy according to the sensed scale of an attack.

SDN is crucial for the VNF implementation of such security services, as it enables the system to dynamically adjust the flows that connect or disconnect certain network nodes. Also, if we have limited DPI computing resources (i.e. a limited number of expensive DPI hardware boxes), we can use SDN to sample a portion of the network traffic at a time by looping through flows, as sketched below.
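A minimal sketch of this sampling idea is shown below, assuming Open vSwitch bridges; all names and numbers are hypothetical, and a real deployment would program the flows through the SDN controller rather than shelling out to ovs-ofctl:

    import subprocess
    import time

    # Hypothetical tenant subnets to cycle through, and the OVS port
    # that leads to the (scarce) DPI appliance.
    SUBNETS = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    BRIDGE = "br0"
    DPI_PORT = 3
    DWELL_SECONDS = 60  # how long each subnet is mirrored to the DPI

    def steer(subnet):
        # Mirror the subnet's traffic to the DPI port while still
        # forwarding it normally.
        flow = ("priority=500,ip,nw_src=%s,actions=output:%d,normal"
                % (subnet, DPI_PORT))
        subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, flow], check=True)

    def unsteer(subnet):
        # Remove the mirroring flow for the subnet.
        subprocess.run(["ovs-ofctl", "del-flows", BRIDGE,
                        "ip,nw_src=%s" % subnet], check=True)

    while True:
        # Loop through the flows, sampling one subnet at a time.
        for subnet in SUBNETS:
            steer(subnet)
            time.sleep(DWELL_SECONDS)
            unsteer(subnet)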


In this thesis, we argue that network security of a cloud can be modeled as a network function

and then realized as a service. Hence, as we shall see, our work has an NFV flavour to it, as our

implementation of NIDPS as a service can be modeled as a VNF.

2.2 Software Defined Infrastructure

In this section, we introduce Software Defined Infrastructure (SDI) and a particular implementation of it, code-named Janus. As part of that, we review the Smart Applications on Virtual Infrastructure (SAVI) testbed, a multi-tier cloud platform with multiple edge nodes, which we leveraged to implement our architecture.

2.2.1 Conceptual Architecture

Software Defined Infrastructure is a new architectural framework for supporting applications through virtualization and the integrated management of converged heterogeneous resources in a multi-tiered cloud [36] [37]. SDI's goal is to enable programmability of both the cloud applications and the network functions by providing high-level abstraction interfaces for programmers and their applications. The heterogeneous resource types include computing, programmable hardware (e.g. NetFPGAs), and networking resources.

In essence, SDI combines Software Defined Networking (SDN) and cloud computing to realize an infrastructure where applications can be deployed rapidly and with increased flexibility, taking advantage of virtualized heterogeneous resources and a sliced network.

Figure 2-2 depicts the components of the SDI Resource Management System (RMS). The RMS has various Resource Controllers (RCs), each of which provides resource-specific support (similar to hardware-specific drivers). The RCs are accessed and managed through two higher-level modules: the SDI Manager and the Topology Manager.


Figure 2-2 – SDI Components [37]

The Topology Manager monitors computing and network resources, storing particular parameters of interest such as link status, bandwidth, and processor utilization. The SDI Manager is in charge of deploying the applications, consulting the Topology Manager as needed. Both the Topology Manager and the SDI Manager have open interfaces for the user, which could be the system administrator, a developer, or the application itself.

The Canadian Smart Applications on Virtual Infrastructure (SAVI) Testbed has an example implementation of SDI [37]. The SAVI Testbed has a data centre as the Core node and seven Smart Edge nodes at seven Canadian universities; these nodes provide virtualized resources. SAVI includes an SDI RMS, which uses OpenStack as the controller for computing resources and an OpenFlow-based control platform for the networking resources. These components are ruled by a higher-level SDI Manager, code-named Janus.


2.2.2 Opportunities and Capabilities in Providing Security

Due to the centralized management of computing and networking resources, we propose that the best place to implement a security detection and mitigation controller module is within the SDI Manager itself. Such a controller module could either be implemented as a module that runs on top of the SDI Manager, or as a separate entity that interfaces with the SDI Manager. The latter, which is used in this work, is depicted in Figure 2-3. The resources controlled by such a security controller could be physical (e.g. proprietary IDSes) or virtual (e.g. a Virtual Appliance based on Snort).

Figure 2-3 – Realization of Security Module in SDI

Then, when the SDI Manager schedules the resources, it could take into account the security

constraints of the application that is to be run over the resources. In particular, it makes sense to

locate intrusion detection resources closer to the monitored/sensitive resources, when possible.

As well, the SDI Manager can facilitate the service chaining required to ensure all the traffic of

the monitored resources passes through IDS resources.

The SDI Manager could also interface between anomaly detection and signature-based detection systems, passing traffic marked as suspicious by anomaly detection to the signature-based IDS for further investigation. This is particularly useful in a scenario where we have limited Deep Packet Inspection resources for the IDS.

Also, having access to all the OpenFlow switches, the SDI Manager could help realize a distributed firewall that blocks attacker traffic anywhere within the platform, most importantly at the gateway switches and routers. Once an attacker has been identified, the header fields associated with the attack packets (e.g. the source IP address) can be passed to the SDI Manager to install blocking rules in all the switches on the path to the victim. This SDI mechanism leverages SDN; a sketch of the interaction follows.
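As a sketch, issuing such a block from a security controller might look as follows, assuming a purely hypothetical REST endpoint on the SDI Manager (the actual Janus interface is discussed in Chapter 4):

    import requests

    SDI_MANAGER = "http://sdi-manager.example:8080"  # hypothetical address

    def block_attacker(attacker_ip, victim_ip):
        # Ask the SDI Manager to install drop rules for the attacker on
        # every OpenFlow switch along the path to the victim (hypothetical API).
        payload = {
            "match": {"nw_src": attacker_ip, "nw_dst": victim_ip},
            "action": "drop",
            "scope": "path",  # push the rule to all switches on the path
        }
        resp = requests.post(SDI_MANAGER + "/v1/firewall/rules",
                             json=payload, timeout=5)
        resp.raise_for_status()

    # Example: block a detected attacker targeting a protected VM.
    block_attacker("198.51.100.7", "10.0.0.5")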

Finally, the SDI Manager could be used in orchestrating security resources, giving higher priority to hardware-based and NetFPGA-based IDSes (due to their superior bandwidth, lower latency, and higher power efficiency) and using Virtual Appliances as overflow. The interfacing with the Janus SDI Manager is further discussed in the design chapters.

As mentioned, besides the distributed firewall, SDI provides us with an interface for service chaining. A traffic chaining capability is becoming increasingly important, yet most cloud providers do not offer one at the moment. This justifies our choice of the SAVI Testbed, with its available Janus chaining API. We discuss the chaining API that our work leverages in detail in Chapter 4.

2.3 Related Work

Our work is certainly not the first attempt at solving the general cloud IDS deployment and management problem. In this section, we review and analyze the relevant literature. As well, we state how our solution differs from past research, starting in this section and continuing in the contribution section of Chapter 6.

Before we begin our review, we note that, in order to have an in-depth analysis, we limit our review to two particular classes of work, which we found most related to ours: IDS framework design and IDS interface design. Among the areas we did not dive deep into, we would like to mention load balancing and service chaining. For example, in [38], a general framework for middlebox service chaining and load balancing is presented. However, such designs tend to be more general and do not pertain to particular security needs. Nevertheless, the paradigm involved is the same as ours; for instance, we similarly realize our load balancing using centralized service chaining (discussed in Chapter 5). We now move to the two main categories of interest, which are the topic of the next two subsections.

2.3.1 IDS Frameworks for the Cloud

In this subsection, we go over the related work on cloud IDS architectures for security resources. We start with a detailed review of the work by Roschke et al. [39], as it is the work most closely related to ours. In their paper, they analyze the requirements of IDS deployment in the cloud and propose an architecture for the management of IDSes. Their goal, which we share, is to make IDS utilization more user-friendly for non-administrator users in a cloud environment. To some extent, the architecture they present resembles the hierarchical deployment scheme we discuss in Chapter 4. In particular, they recommend that IDSes report to a central management module, which is also in charge of remotely controlling the IDSes. We also intersect with them on the idea of a feature that enables the user to communicate with the cloud to pick their desired IDS. They similarly propose correlating IDS alerts to gain higher accuracy / a bigger picture.

Nevertheless, their work focuses merely on detection and not on prevention. As such, they consider no network security defense (e.g. blocking the traffic at a switch); hence, there is no feedback to the networking resources.

As admitted by the authors, their work does not address scalability (and orchestration). As well, it almost entirely lacks a proof-of-concept, which they state as future work. For their data acquisition, they merely rely on the hypervisor to provide monitoring information (similar to the case of the SAVI Monitoring and Measurement system, which we discuss in a later subsection). However, in practice, such a tapping point at the hypervisor may or may not be exposed to the developer by the cloud infrastructure. In that sense, as we will see in the next chapters, our data acquisition (DAQ) system is more general, as our polling-based DAQ has no such requirement. As well, they provide no comparison to other platforms.

They then state that it is the cloud provider's responsibility to provide various IDS VMs, as well as to enable the attachment of virtual IDSes to specific VMs. In doing so, they do not consider a well-defined scheme for users to communicate their security requirements and have the management system map them to corresponding IDSes (in Chapter 4, we introduce the notion of Enhanced Security Profiles (ESPs), particularly dedicated to solidifying a communication scheme / protocol for this very purpose).

Their scheme is centralized in implementation (not just logically), yet they do not provide a scheme to increase the resiliency of their system (in the event of the centralized manager going down). Likewise, they consider no self-protection scheme for their system (e.g. for the event of the IDS controller itself coming under attack). In addition, their platform focuses only on virtual resources, in particular a combination of virtualized NIDS and application-layer HIDS, and not on NIDSes of different types (physical vs. virtual).

They delegate the responsibility of assigning a particular IDS to the user, and do not consider sharing IDSes across a user's traffic flows with the same security requirement. Their solution lacks an automated IDS assigner and load balancer. Therefore, while their work has certain similarities to ours, it generally states a set of (at times vague) ideas rather than solidifying them in a specification or realizing them in practice.

Next, we briefly discuss the work of Dhage et al. [40]. They also attempt to solve the problem of detection in a distributed cloud ecosystem by providing an IDS model for the cloud. Unlike Roschke et al., they consider mitigation part of their cloud IDS platform. Their platform deploys individual "mini" IDSes per user. These mini IDSes are managed by node controllers, each of which may also contain the mini IDSes of other users and performs analysis on the IDSes it manages. Their assignment scheme is different from ours, and as we will analyze, we find our IDS-to-user-flow assignment / categorization more efficient in terms of resource utilization.

Next, in [41], the authors outline an architecture for DDoS detection using collaboration among

the IDSes. In particular, their solution deals with a vast number of logs, and they present a

quantitative solution (based on Dempster-Shafer Theory) for analyzing alerts. However, their

work only considers virtual appliances.


In [42], the challenge considered is the case of customers of a public cloud who need to implement their own Virtual Appliance based IDS in the cloud. For that, the authors suggest a very simple architecture with three components: the IDS VM, a management VM, and possibly a load balancer VM. Their work does not utilize overlays, SDN, or recent chaining techniques, instead using Amazon's Virtual Private Cloud (VPC) to force the traffic to pass through the IDS VM.

In [43], a framework for sharing IDS data across various levels of infrastructure as well as across

different clouds is presented. The authors hope that their framework would lead to collaborative

IDS clusters across many clouds.

In [44], a scheme for sharing trust in a distributed IDS platform is presented. In particular, the authors argue that increasing collaboration between the IDSes, by providing them with a scheme to decide whether or not to accept an alert, results in reduced detection time. The work covers only DoS attacks, and they consider only the case of virtual appliances.

In [45], DoS attacks are again tackled, this time with a distributed implementation of both detection and prevention. The authors suggest placing the IDS at the cluster controller, the highest entity managing the orchestration / computing aspect of the VMs, through a hierarchical scheme. They also discuss the idea of having one IDS per physical machine. However, this approach may not be scalable, as individual VM traffic may vary, possibly resulting in low levels of resource utilization. As well, it does not consider the case of VMs within a physical machine attacking one another.

In [46], the authors describe a simple scheme to deal with DDoS by watching for traffic spikes. In the event of a spike, they pass the traffic through an IDS, log all the traffic, and if no SYN ACK is detected, they send it to a honeypot. Their system seems not to consider many of the general cases, as it focuses merely on the TCP SYN ACK flood (we used this attack in our testing, too). Scalability concerns are not visited at all, and hence the security orchestration aspects of the problem are ignored. Their IDS implementation, however, has a similarity to ours. In

particular, they assume that the DPI software is installed in the virtual switch, which happens to

be similar to our implementation of VA. However, our VAs are not primarily designed as

switches. In our case, the virtual switch is rather used to facilitate the traffic chaining through the

IDS program.


In [47], the authors present the idea of implementing CIDS (Cloud IDS), which is an IDS

platform, as a network service. They have a dedicated layer for the database operations, which

perhaps makes the memory operations of their system more efficient. However, their work does

not concern interoperability and extensibility for the IDS platform. As well, they do not consider

orchestration, as their system is merely controlled by the users, without any load balancing.

2.3.2 IDS Interface Design

Next, we review the IDMEF (Intrusion Detection Message Exchange Format) protocol [48]. Introduced by the IETF (Internet Engineering Task Force) in 2007, it is an attempt to design a unifying protocol for IDS and IPS communications. The primary purpose of this protocol is to define a universal format for IDS-related information, which in turn may enable coordination among separate security devices and the management modules. The devices and modules may come from different vendors, and IDMEF emphasizes interoperability and extensibility. While IDMEF has been around for a while, it is each vendor's choice whether to implement an interface based on it. As discussed, vendors typically do not desire to implement interoperable protocols. This is the reason that our work does not rely on the NIDPS device/module implementing this protocol.

As well, IDMEF’s extensibility comes at the price of increased overhead, in terms of both added communication overhead and computation (parsing). If a platform is to implement IDMEF, it cannot support just part of it; it either covers it all or not at all. We did not require all the functionalities included in IDMEF. For that reason, we decided not to implement it. Our main

purpose is to abstract resources, regardless of what interfaces they come with, and so long as the

abstracted resources share a consistent interface, that suffices for us. As mentioned, the cloud

industry trend is moving towards in-house custom solutions to increase security. Having a

universal communication protocol is inherently in contradiction with that.

In [49], the authors designed a console to view Snort alerts and a Graphical User Interface, namely, SnortSnarf, to use the Snort IDS. One special scenario they considered is an attacker intentionally filling up the Snort log with decoy entries to overwhelm a human operator, for which they suggest an algorithm to visually divide up the alerts in the display such that the human operator could find an anomalous event in O(log N) interface operations. The work is perhaps useful for small-scale personal systems where a human operator would occasionally check the logs. However, all bigger systems need some degree of automation to find and process certain event logs.

Finally, in [50], the authors present an interface design for applications to report their exceptions to the IDS (e.g. if they expect a false positive is going to be generated). However, one may argue that having such an interface itself adds to the attack vector of the system.


Chapter 3 Overview of Hybrid Security Platform

In this chapter, we discuss our architectural framework, which is the essence of our design. We will go over the high-level design, starting by stating the design requirements, then covering the overall components and use cases. We have then dedicated a section to design considerations, where we state the particular features and design decisions made in order to meet those requirements.

3.1 High-Level Design

In this section, we discuss the high-level design, which includes the architecture. We shall first

explore the purpose of our architecture, which is best stated in terms of its design requirements.

3.1.1 Design Requirements

Design requirements are important as they determine the extent of the design’s success. Such

requirements can be categorized into functions, objectives, and constraints [51]. Functions are the expected behaviour of the system; functional requirements are either met or not (binary), and they describe the general / abstract application of the design. Objectives, in turn, are a set of goals that can be met to a certain degree, and hence we need specific criteria to measure the extent to which they are met (as well as defining their “success” line). Constraints

could be thought of as objectives that must be met; otherwise, the design is inherently flawed /

will fail anyway.

The first function of the security platform is to detect attacks that have a well-known signature.

Such signature may be in the packet header or else in the payload. It is important to emphasize

here that our method of detection is limited to per-packet information pattern detection on the traffic of a certain flow / link. Similarly, our scope of detection is limited to those attacks that happen to have a signature / packet information pattern that is possible to detect and is already associated with a well-known attack. It is important to note that this IDS method does not detect

zero-day attacks. As well, it does not detect misuse of the system, while as mentioned before,

anomaly-based IDSes could detect those, too. However, again, the advantage of pattern-based IDSes is in their relatively lower rate of false positives and detection errors, which stems from the


fact that unlike anomaly-based detection, they do not leverage probabilistic models or machine

learning.

The second function of the system is to mitigate the attacks. Essentially, our platform not only

includes IDS but also IPS, making it a cloud NIDPS platform. The mitigation may come in

different forms. The options include blocking the packets generated from a certain source,

immediately upon detection, as well as potentially blocking the future packets from the same

source. Another possible action is to provide a honeypot as service, re-routing the traffic to the

honeypot. Honeypot is essentially a duplicate of the attacker’s target server, but with less to no

service value (i.e. it is not providing actual service to the clients). This target server is left open

to the attacker so that the attack can be investigated after it is over [52]. Its concept is similar to

sandboxing in the broader cyber-security, except the target server may or may not be actually

sandboxed, depending on how advanced the honeypot system is. A proper sandbox involves

restricting the actual access of the attacker to the system [53], and hence creating such a decoy server is a more sophisticated procedure.

Next, we discuss the objectives of our design. The first objective is that the architecture has to be scalable, in different ways (horizontally and vertically), for various components (detection, mitigation, management, etc.). In particular, the detection capability has to scale based on the number of flows (with “flow” defined as the pair (src, dst)) and on flow bandwidth. Then, the mitigation

has to be scalable in the sense that the defense is to be modular and distributed. With a distributed defense, we can provide feedback to the switches, as we showed in [28].

The feedback to the switches ensures that the blocking action is done not only at the NIDPS modules, but also at the edge switches and routers. This, in turn, avoids bottlenecks at the inline IDSes. For instance, software-based IDSes, such as Snort, tend to suffer from limited processing

bandwidths. Their typical bandwidth is multiple times lower than regular commodity switches

(based on our measurements, detailed further in the evaluation section). This makes sense, as they have to perform computation and process the packet contents. Even the most efficient IDSes (e.g. ASIC or FPGA based ones) do not provide throughputs on the scale of switches. Hence,

scalability in inspection bandwidth matters a lot, and that could be the ultimate purpose of our


platform. In the real world, software-based IDSes are most likely only used for overflow capacity.

This overflow feature ties back to the concept of scalability, which matters a lot in our work.

Our second objective is for the platform design and architecture to be interoperable. In our view,

interoperability can be best defined by the capability to include both physical and virtual

appliances (PAs and VAs), from various vendors. Our idea of such ecosystem where PAs and

VAs live around each other is certainly inspired by the Network Function Virtualization

paradigm, where ultimately it should not matter in any way to the user whether the resource used is physical or virtual (to the point that it is even suggested that this should be fully hidden from the user) [54].

To be realistic, however, we do have to define the scope to which we hope to be interoperable. In

particular, the NIDPS implementations that we are to include shall meet certain minimums, in

that they should provide certain minimal functionalities (e.g. Deep Packet Inspection) as well as

a well-defined interface to use their products. The interface should include features such as the

ability to program remotely, push attack events, or pull attack logs. Our design is to take

detection and mitigation implementations (which meet the criteria described) as black boxes. Our security resources may be chosen even when they are not already interoperable; we then add a computation layer that abstracts the resource. This is again in line with NFV’s vision.

Our interoperability objective makes more sense when one looks at the possible trends for NFV’s

universal adoption. In NFV’s vision (which itself is inspired by the general vision and direction

of cloud computing), virtualized resources will eventually become ubiquitous, which is due to

their long-term cost efficiency. However, enterprises have already spent huge sums of money on

the Capital Expense for the legacy / proprietary / physical detection and mitigation equipment

they have. They are not going to throw away their old firewall machines right away. Therefore, there will be a possibly long transitional period, during which both new virtual and

old physical appliances have to work together. As well, as mentioned, physical appliances may

be more power efficient for their specialized sort of computation (as opposed to x86 hardware).

Hence, we felt the urge to design a hybrid platform that would abstract and include both virtual

and physical resources.


It is also worth noting that our notion of interoperability inherently includes extensibility embedded

into it. Extensibility may be defined as the capability to extend the design / framework to include

new elements. In our case, extensibility is defined by the ease of adding a new network security

resource to the framework. This closely ties together with interoperability, in the sense that if our

design is interoperable, it should be relatively easy to include new resource types in it. At the

same time, modularity in design helps with extensibility. Similar to object-oriented

programming, where adding a new module is usually adding a new class, our architecture is to be

designed to be modular, so that adding new resources does not require a significant change in the

architecture, design, or previously written code.

Now that we have defined our main objectives (scalability and interoperability), we should also

define how we intend to measure the extent of our success in meeting them. For intrusion

detection scalability, perhaps the best way to measure is to look at the growth rate of the size and

the number of required resources as the input demand grows. The input demand can be modeled

in various ways, such as the number of traffic flows (again, flow defined by the pair (src, dst)),

individual flow bandwidth, or total inspection traffic size per unit time (inspection throughput).

Therefore, for example, if we define a function whose input is the size of the total inspection traffic, and whose output is the number of unit intrusion detection modules used, we are interested in looking at its limiting behaviour. Ideally, if the growth is linear, this would imply that our

system is scalable. We define such functions more precisely, and discuss our measured growth

rates for them in our evaluation section.
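For concreteness, this scalability criterion could be formalized roughly as follows; the notation (f, lambda) is ours, introduced here purely for illustration, with lambda denoting the total inspection throughput demand and f(lambda) the number of unit IDS modules required:

f(\lambda) = O(\lambda) \quad \text{as } \lambda \to \infty, \qquad \text{i.e.} \qquad \limsup_{\lambda \to \infty} \frac{f(\lambda)}{\lambda} < \infty .

A super-linear f would indicate a scaling bottleneck, whereas linear (or better) growth is the behaviour we aim for and measure in the evaluation section.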

Next, when it comes to our other objective, namely, interoperability, it is similarly difficult to

define a quantitative measure for its success. An ideally interoperable cloud network security

platform should be able to include any possible or existing NIDPS resource in it. However, as

mentioned, we have reserved some basic requirements for the NIDPSes so that we are able to

abstract them, to begin with. Stating that assumption, we have already limited the set of NIDPS

resources we may include (e.g., some NIDPS resource may not include an API that enables to

program them remotely). Nevertheless, we may still try to define a basic measure, such as the

capability to include at least two virtual appliances and 1 physical appliance, all from different

vendors. However, “being from different vendor” is rather vague when it comes to intrinsic of

28

the resources, as the resources may be very similar inside, just come in different brands.

Considering all this, perhaps the best approach is to have a qualitative measure rather than a quantitative one. In particular, we argue that our platform is able to work with a great majority of

the products out there in the market. We perform a more detailed analysis in our evaluation

section.

One may wonder why keeping the cloud secure is not noted as an objective of ours. This is a crucial scoping consideration. In particular, it is very difficult to define a measure for security.

Even if we define one, our work relies on certain parameters that are outside our control. For

example, the assumption of whether an attack has a well-known signature or not relates to the

field of IDS design and signature extraction, which is outside the scope of this work. Hence, we

do not aim for security in a direct manner; rather, we approach it by coming up with a scalable platform design for existing IDS technology and existing attack signatures.

3.1.2 High-Level Design Components

In this section, we discuss our architecture by going over its high-level design components. In

order to understand the components and the architecture, we first discuss our typical use cases.

From there, we move on to our architecture and workflow.

3.1.2.1 Use Cases

As our work is about cloud network security, our use cases all involve some attack traffic

coming into or through a cloud network. For the typical use cases of the cloud network security

platform, we shall consider all possible attack origins. This is, again, due to the fact that the periphery is undefined, or else very hard to define, in a cloud network. Our attack scenarios include cyber-attacks from within, as well as those from outside, which have some network footprint involved in them. This is almost the case for all attacks in the cloud, as they all tend to require that access to the virtual machine, which resides on the hypervisor host, be done remotely (i.e. they don’t allow people to walk in and connect to the hypervisor directly).

The attack may be generated within the cloud network, or from outside networks. Using a setup that is explained in the workflow section below, all user flows pass through an assigned IDS. The user flows have two end hosts: the user host (which may be within or outside the cloud network) and the server host (which is within the cloud network). We shall note that our notion of “cloud network” is intentionally vague for generalization, as such a notion may be realized as the actual cloud local area network, a particular virtual network, or an isolated subnetwork. Regular user traffic shall experience no significant difference.

Attacks are to be detected through the IDSes that the user flows are chained through. Again, by attack here, we mean attacks that already have a well-known signature, detected over plain-text packets (i.e. not encrypted). Once an attack is detected, it is to be immediately blocked inline, and, as well, feedback to the relevant switches / routers is provided to block future packets originating from the attacker.

From here on, we assume that our given cloud network is SDN-based. We argue that this assumption, while, strictly speaking, not required, does not hurt the generality of our work. In particular, SDN seems to increasingly dominate the cloud network paradigm, due to its economy of scale and its harmony with a virtualized environment, as one of SDN’s main consequences is to virtualize the network itself [55]. SDN also very much matches our framework (e.g. it already has “flows” modeled in it, consistent with our definition of user flow). Conveniently, we may call having an SDN-based network a constraint of our implementation, since the only cloud networks we had access to at the infrastructure level for our proof-of-concept were SDN-based. However,

this can also be seen as a design implementation decision, where we chose SDN due to its

capability to rapidly innovate and experiment with its current open source implementations (e.g.

OpenFlow).

Figure 3-1 depicts the attack from outside scenario, while Figure 3-2 shows the scenario of attack

from the inside. In the case of the attack from outside, the attack can best be blocked at the Gateway / Edge Router, preventing it from getting into the cloud network at all. In the scenario of an attack originating from inside, however, the best place to block the attack is the switch that is nearest to the attacker’s host.


Figure 3-1 – Attack from Outside Scenario

Figure 3-2 - Attack from Inside Scenario

It is important to note that NIDS and DPI entities have been used interchangeably in the two

figures, implying that our NIDS is DPI / signature-based. This module is a logical entity, and

may be distributed in implementation (e.g. a load balancer dividing the inspection task), which is

more scalable.


As we will mention in our implementation section, our platform leverages the SDI Enabler for traffic chaining, rather than using SDN controllers directly. However, we depicted the module as a controller above, as that is a more generic representation.

From SDN’s perspective, there is no inherent difference between a switch and a gateway (as their control plane is separate from them, anyway). This greatly fits our use cases, as in one the attack comes from outside the cloud network, while in the other it originates from within. One attack type passes through the Gateway Router, while the other passes through the Ingress Switch. SDN enables us to provide the same feedback (of the attacker’s address), regardless of whether the audience is the router or the switch.

3.1.2.2 Architecture Workflow

In this subsection, we discuss the workflow of the platform that takes place behind the scenes,

and then review the high-level components of our architecture. When a user host first initiates

contact with a server on the cloud, the user flow traffic, denoted by (src, dst), is assigned an IDS

resource. Which layer will be used for addressing (e.g. Data Link (MAC) or network (IP address)) will be specified in Chapter 4. The assignment of the flow to the IDS may take into account several factors. One is the latency or Quality of Service (QoS) constraints the user might have. The users of the user flows that are to be treated specially need to have notified the administrator in advance about it. Packets of the user flows with high delay sensitivity need to be assigned to physical resources, which are known in general to have less delay overhead (again, due to specialized hardware designed to efficiently parse through the packets rather than bringing them all the way up to the application/software layer).

Based on the QoS requirements, we expand the user flow to become the following triplet: (src, dst, delay sensitivity). The delay sensitivity is a binary/flag parameter whose values are “low” or “high”. Our assumption is that we are already provided with a pre-defined list of all sensitive users, so that we can mark them as “high” or “PA only”.
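A minimal sketch of how such a flow record might be represented is given below; the class name, field names, and the example sensitivity list are our illustrative assumptions, not the exact identifiers used in the proof-of-concept:

from dataclasses import dataclass

@dataclass(frozen=True)
class UserFlow:
    """A user flow expanded with its QoS marking: (src, dst, delay_sensitivity)."""
    src: str                        # source address (MAC or IP; see Chapter 4)
    dst: str                        # destination address
    delay_sensitivity: str = "low"  # "low" (default) or "high" (PA preferred)

# Hypothetical pre-defined list of delay-sensitive sources provided by the admin.
SENSITIVE_SOURCES = {"10.0.0.42"}

def classify(src: str, dst: str) -> UserFlow:
    """Mark the flow "high" if its source appears on the sensitivity list."""
    level = "high" if src in SENSITIVE_SOURCES else "low"
    return UserFlow(src, dst, level)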

A possible extension of delay sensitivity assignment is automatic traffic characterization, where

for example, a module assigns the sensitivity by looking at the protocol used in traffic (e.g.

streaming vs. download). However, we did not implement that as part of the proof-of-concept. This


feature may be more difficult to implement, as the nature of communication in a user flow may change over time, which requires the system to be stateful and dynamic.

Another important note on the PA assignment is that the utilization of the PA has to be taken into account, too, so as not to overwhelm it with too many high sensitivity flows (i.e. not to ruin the QoS of past user flows, since we need the PA for more flows). When the PA cannot accommodate them, our system would by default assign even the high sensitivity flows to a regular / VA type of IDS resource. In addition, if a user flow is found to be rogue or too overwhelming, it may be reassigned a regular weight IDS (VA) upon the detection of such behaviour / bandwidth usage pattern.

Next, we discuss the high-level architecture and a more detailed workflow based on it. Figure 3-3 depicts the general scheme we use for the communication and coordination of our NIDPS

resources. In this scheme, we use software agents to abstract the security resource, such that the

resources are somewhat unified from the viewpoint of the main point of control, which we call

the “Master Agent” (our Master Agent is realized as “Master Security Controller”, later on).

Once the resource is abstracted, the Master Agent can treat the resources as a black box, yet

program it using the API provided by the associated SW agent. Hence, for example, the IDS may

be a legacy / specialized hardware based IDS, VA, or NetFPGA-based. More generally, the IDS

might be even a Host-based one. This should make almost no difference to the Master Control

(excluding the delay requirements, as explained earlier). Similarly, the IPS module may come in

different implementations, a legacy hardware firewall, a software one, or a distributed SDN-

based one. Besides the SW agent, what enables this scheme is the common interface shared by

the SW agents, which is described in more detail towards the end of this chapter.


Figure 3-3 – General Resource Coordination Architecture (Hierarchical)

This general approach, as we will outline in the last subsection of this section, is inspired by

SDI’s architecture, bringing heterogeneous resources together. Some of the resources may be

virtualized, while some are not. This is, again, also in line with NFV’s envisaged ecosystem.

The Master Agent decides on the coordination of resources: how the work is divided, and which user flow is assigned to what IDS (as we will explain later, user flows are not to be broken up between IDSes). This Master Agent is a logical entity, which may be distributed in implementation, in order to distribute the trust among components. This goal is equivalent to reducing single points of failure. In particular, we do not want to lose control over the security resources in the event of a compromise of the Master Agent.

The Master Agent may also coordinate with an Analytics based detection module, which we call

“M&M Anomaly Detection”. The name is inspired by SAVI’s M&M (Monitoring and

Measurement) module. The naming also stems from the fact that in order to have analytics-based anomaly detection, one needs to have some M&M / Data Acquisition system within their cloud system, recording parameters / features that reflect the resource usage profile (so that atypical usages can be detected).


Figure 3-4 depicts the high-level components of the system, which is also useful in visualizing

the workflow. The detailed pre-attack / setup workflow is as follows:

1- Through the Admin UI (User Interface), the RESTful Config API, and then the configuration API, the IDS sensor requirements are passed to the SW agents, which use their Appliance-Specific Configuration API to program the IDS resources.

2- At the beginning of communication, IDS flows (for the forward and reverse paths) are

assigned by the Load Balancers (based on delay sensitivity, and IDS throughput/resource

utilization). The traffic is chained by the SDI Manager through use of the SDI Enabler

API.

Figure 3-4 – High-Level Components


The Admin UI may be command-line based and / or a Graphical User Interface (GUI, for

example, a webpage) that only network administrator(s) of the cloud have access to. It leverages

a RESTful Network Security Configuration API, which we have designed, to communicate the

administrator’s network security requirements to the network security platform. This is detailed further in the next chapter.

There are a few points that need to be mentioned regarding the representation in Figure 3-4. In terms of the reverse path load balancer, it is a design decision whether or not to allocate the reverse traffic to the same IDS that was assigned for the forward path. In particular, it depends on whether the attacks we are to detect require knowledge of the traffic in both directions. As well, in addition to the use of the SDI Enabler, we implemented the reverse path chaining using a tunneling scheme. We will discuss this scheme, as well as its pros and cons, in Chapter 4.

Also, one may notice that there are components in Figure 3-3 that are not in Figure 3-4: the components that were not included in the actual proof-of-concept are not present in Figure 3-4. However, we included them in Figure 3-3 to point out that, in general, they could be realized. We will revisit this in our future work section.

Then, the post-attack workflow is as follows:

1- The attack gets detected and blocked inline immediately by the assigned IDS.

2- At the sampling/polling point, the associated SW agent notices a change in the log file of the IDS (through the Polling API), parses the attacker address, and passes it to the coordinator agent, which then uses the SDI Enabler API to install the appropriate rule in the first accessible switch / gateway on the path from the attacker to the server (a minimal sketch of this polling step is given below). In general, the IDS itself may also have the capability to push the events to its assigned SW agent, but the pull-based model seems to be more encompassing (as it is unlikely to have a device / module that pushes events but does not have the logs available to pull the events from).
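As an illustration of step 2, the following sketch polls a Snort-style fast alert log and extracts newly seen source addresses; the file path, regular expression, interval, and function names are our assumptions for illustration, and the actual SW agent scripts may differ:

import re
import time

ALERT_LOG = "/var/log/snort/alert"  # assumed Snort alert log location
SRC_IP_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})(?::\d+)? ->")  # source IP before "->"

def poll_new_attackers(seen, offset):
    """Read lines appended to the alert log since the last poll; return new attacker IPs."""
    new_attackers = set()
    with open(ALERT_LOG) as f:
        f.seek(offset)
        for line in f.readlines():
            m = SRC_IP_RE.search(line)
            if m and m.group(1) not in seen:
                new_attackers.add(m.group(1))
        offset = f.tell()
    seen |= new_attackers
    return new_attackers, offset

seen, offset = set(), 0
while True:
    attackers, offset = poll_new_attackers(seen, offset)
    for ip in attackers:
        print("new attacker:", ip)  # the real agent hands this to the coordinator
    time.sleep(5)  # sampling/polling interval (assumed)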

In the model shown in Figure 3-4, the SW Agent is in charge of both configuring the IDS and pulling / reading events from it. However, these two tasks could have been broken down further and assigned to separate logical entities. For the sake of simplicity / ease of visualization, we combined the two. Logical entities may be implemented together or separately; the placement of logical entities is an implementation problem, to be visited in Chapter 4.

Lastly, we overview the Anomaly Detection (AD) Correlation Workflow, which is the following:

1- The M&M Anomaly Detection module requests the traffic of a resource (either a specific pair (src, dst) or all of a resource’s traffic) to be inspected, through the AD Coordination API. This API would be very similar, if not identical, to the RESTful Config API.

2- The rest of the workflow is the same as the pre-attack setup.

The AD Correlation feature may be used to verify traffic flagged as suspicious by the M&M AD

module. In particular, for attacks with well-known signatures, the signature-based IDSes tend to have a lower rate of errors / inaccuracies in detection than the AD modules. Hence, this can be used to investigate certain portions of network traffic, similar to a police officer contacting a detective specialized in a certain field to investigate a case in further detail and provide feedback, increasing the officer’s sense of certainty around their guesses; in other words, it is similar to getting a second opinion.

3.2 Design Considerations

In this section, we discuss how we designed our network security platform such that it can perform the necessary functions and meet the requirements we identified for it earlier. In particular, the main headings of this section read as “Design for X”, a terminology derived from the principles of engineering design [51]. We recall that our two main objectives were scalability and

interoperability; here we address the design decisions made for them.

3.2.1 Design for Scalability

As mentioned in our design requirements section, the scalability of the platform itself has

multiple dimensions / different aspects in both definition as well as evaluation. The subsections

here are to address some of these various aspects.


3.2.1.1 Horizontal vs. Vertical Scaling

The number of users and their extent of server utilization varies throughout time. This means that

the amount of traffic to be inspected also varies accordingly. Our NIDPS platform is to scale

based on the demand.

First, we shall discuss the scaling of the detection resource. The act of scaling may be performed

in different ways. In particular, it may be done horizontally or vertically. Horizontal scaling (aka

scaling in and out) involves increasing the number of detection resource instances, working

together to provide a larger detection bandwidth. This requires load balancing, which is

discussed in the next subsection. Vertical scaling (aka scaling up and down), on the other hand,

involves increasing or decreasing the size of a particular resource, in particular, adding more

CPU, memory, storage, etc. [56]. Vertical scaling tends to have a limit, as the size of one resource may only be increased to a certain point. The limit may be due to physical constraints, such as rack size limitations, hardware limitations, and hypervisor utilization by other virtualized resources. Horizontal scaling, however, tends to be more flexible.

As a result, we made the design decision to rely mainly on horizontal scaling. Of course,

vertical scaling remains an option to the cloud administrator (i.e. they can use the cloud

infrastructure API directly to increase the individual size of certain virtualized resources / virtual

machines). In fact, vertical scaling is useful where one wants to increase the bandwidth of a

certain flow, beyond the point of capacity that a certain IDS provides. It is important to note that, in general, breaking a user flow up between a number of IDSes (more than one) is undesirable for us. More specifically, the issue is with the attack information that may be in the user flow traffic: breaking the flow up may distribute the counts of certain patterns in such a way that the threshold of each individual IDS is no longer met, so the traffic is no longer detected as an attack. This problem may be solved by a distributed design of the IDS itself, which is out of the scope of our work.

While breaking a flow up across multiple IDSes is not an option for us, should a flow exceed the current limitations of a virtual resource, the resource may be scaled up. However, beyond a certain point, it might make sense to move a very demanding flow back to the PAs, as they tend to be more power efficient and in general incur lower operational costs.


3.2.1.2 Load Balancer

Now that we have decided that our scalability would be based mostly on horizontal scaling, we shall review the important piece that makes such scaling feasible: the load balancer. A load balancer, as the name suggests, is a module that divides an input load over a number of processing resources. It is a technology commonly used for websites that have significantly high demand. It is particularly interesting to us as we have decided on a distributed IDS approach (again, for scalability reasons).

Our load balancer, however, may be a bit different from the typical website / web services load balancers, in that once a flow is assigned to a certain server (in our case, an IDS module through which the traffic has been chained), the server is not going to change. On the other hand, in web services, there may be a shared database service with a more distributed server system, where any component / server instance might pick up a request and respond to it. Hence, our situation is somewhat more complicated, requiring constraints on the load balancer behaviour. It can be round robin on the initial assignment, but cannot do round robin per packet or per individual user request / segment of user flow traffic.

Initial assignment may be done in several ways. As already mentioned, it could be a simple

round robin. However, we may happen to have some information regarding the expected /

anticipated user flow traffic pattern over time. Utilizing these probabilistic assumptions, we can

come up with an assignment algorithm that on average does better than round robin.

We came up with a greedy IDS assignment algorithm, which tries not only to optimize over cost / power usage but also to ensure that there is always some overflow capacity available in case there are surges in the traffic that need to be investigated by the NIDS platform. This is particularly crucial, as our measured IDS VA spin-up time (using SAVI’s cloud platform, which is based on OpenStack, which is obviously not the best in terms of performance) is on the scale of minutes. We do not want the users to have to wait minutes for their connection setup, nor do we want to allow uninspected traffic through. Hence, the algorithm had to take the at-all-times availability of overflow IDSes as an objective, too.


In order to realize our algorithm, we implemented a distributed monitoring / DAQ system, where

SW agents collect certain resource utilization data. This scheme is discussed in detail in Chapter

4. As part of our algorithm design, we decided to use the CPU utilization percentage as the main deciding factor for the algorithm. The rationale for this selection is explained in Chapter 5.

To ensure spawning enough overflow resources at any point in time, one may consider including calculations based on the rate of change. We considered an advanced machine learning or analytics-based algorithm for this purpose to be out of the scope of this work, falling in the general category of orchestration and cloud resource planning, not so much in the network security discussion (this is the reason that, in the industry, big projects are implemented in teams!). The algorithm first checks if the user flow is of high delay sensitivity (e.g. 4G LTE traffic), in which case it sees if it can assign the flow to the PA (by looking at how utilized the PA is and how many user flows of what weight have already been assigned to it). If either of the two conditions is not met, it starts looking at the size of the requested user flow (small, medium (default), large), as well as the assignment and utilization logs of the existing VA resources. If any of them is free enough, the flow is assigned to it. Otherwise, the flow is assigned to the overflow VA resource, and another overflow VA resource is called for to be spawned.
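The following is a minimal sketch of this greedy assignment logic as we understand it; the threshold value, helper names, and the resource representation (objects exposing a cpu_util percentage and a flows list, with flow sizes small / medium / large mapped to numeric weights) are our illustrative assumptions:

CPU_THRESHOLD = 70.0  # assumed utilization cutoff (percent); the real value is tuned

def assign_ids(flow, pa, vas, overflow_va, spawn_overflow_va):
    """Greedy IDS assignment: PA for delay-sensitive flows if it has headroom,
    otherwise the least-utilized VA, otherwise the overflow VA (and spawn a new one)."""
    # 1- High delay sensitivity flows prefer the physical appliance (PA),
    #    provided the PA is not already too utilized.
    if flow.delay_sensitivity == "high" and pa.cpu_util < CPU_THRESHOLD:
        pa.flows.append(flow)
        return pa
    # 2- Otherwise, pick an existing VA with enough headroom for the flow's weight.
    for va in sorted(vas, key=lambda v: v.cpu_util):
        if va.cpu_util + flow.weight <= CPU_THRESHOLD:
            va.flows.append(flow)
            return va
    # 3- Fall back to the overflow VA, and call for a replacement overflow VA,
    #    since VA spin-up takes minutes and we must never run out of margin.
    overflow_va.flows.append(flow)
    spawn_overflow_va()
    return overflow_va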

3.2.1.3 Auto-scaling

The act of scaling, whether it is leveraging the load balancer or increasing the size of the resource, is useless if it has to be done manually. In particular, requiring manual / human intervention too often is in contradiction with being scalable. Let us imagine the time it would take the admin to scale a single resource, multiplied by hundreds and thousands; we would probably need too many admins at a given point in time, increasing our cost.

Hence, we implemented an auto-scaling feature, which leverages, for the most part, horizontal scaling, using the load balancer. This auto-scaling is not only for scaling up; rather, it can also scale down, upon realizing that there are too many under-utilized resources. In addition, when a user flow is observed to have been inactive for too long, its IDS assignment may be removed.

Another scenario of action is where almost all of the IDS VA resources (which are for overflow) are taken. Then, as previously described, we spawn more resources so as not to lose the overflow margin we have at any given point in time. When a certain flow grows larger than the capacity of a single VA, one of three actions may be taken:

1- Re-assign to the legacy IDS

2- Re-assign to a larger IDS VA / increase the size of the original VA

3- Break up the user flow between IDSes. Again, this is undesired; as explained before, we may lose certain information (e.g. counts of certain patterns) by doing this.

We shall note that procedure 3 itself may be implemented using different algorithms (e.g. round

robin or Weighted Fair Queuing). We implemented the auto-scaling feature using polling; we

will revisit it in Chapter 4.
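As a rough sketch of the polling-based auto-scaler, the following could serve; the interval, thresholds, and helper callbacks are assumptions made for illustration:

import time

POLL_INTERVAL = 30      # seconds between utilization checks (assumed)
SCALE_OUT_UTIL = 80.0   # spawn a new VA when average utilization exceeds this
SCALE_IN_UTIL = 20.0    # a VA counts as idle below this utilization

def autoscale(vas, spawn_va, retire_va, min_overflow=1):
    """One polling iteration: scale the VA pool out or in based on CPU utilization."""
    idle = [va for va in vas if va.cpu_util < SCALE_IN_UTIL and not va.flows]
    avg_util = sum(va.cpu_util for va in vas) / max(len(vas), 1)
    if avg_util > SCALE_OUT_UTIL or len(idle) < min_overflow:
        spawn_va()            # keep an overflow margin available at all times
    elif len(idle) > min_overflow:
        retire_va(idle[0])    # scale down an unused VA to save cost

def run(vas, spawn_va, retire_va):
    """Polling loop driving the auto-scaler; the VA pool and the spawn/retire
    callbacks are supplied by the platform (e.g. via the cloud infrastructure API)."""
    while True:
        autoscale(vas, spawn_va, retire_va)
        time.sleep(POLL_INTERVAL)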

3.2.2 Design for Testability

It is important to consider the testing schemes while figuring out the design. In particular, we

considered specific testing points from which test data can be acquired. Even more important

than that, in networking, is to have debugging points. Debugging network issues could be a more

complicated and involved task than debugging a piece of code (in SDN, some code may need

debugging, too).

In particular, we had extra internal network interfaces for our virtual appliances. They all had an

internal software switch installed in them, which helped not only with installing the appropriate

forwarding rules (i.e. having the NIDS listen to a specific interface, as well as isolating the IDS interfaces by default) but also with tapping the NIDS traffic at any given point in time.

This was helpful when investigating in real-time whether a packet has been let through or not.

As well, component placement matters a lot when it comes to the testability of design. We

grouped the functionalities as closely as possible inside particular virtual machines. For example,

we combined our SW Agents as certain scripts inside the master agent. This made sense as we could have all the logs in one place for consideration, making testing and verification, as well as data collection, much easier. Of course, as we shall discuss, we considered redundancy schemes for the master agent, so that we do not make it too much of a single point of failure and further distribute our points of failure.


3.2.3 Design for Extensibility

As mentioned, extensibility is one of the main objectives of the design. We ensured to have a

modular design that is easy to extend. Extension, in our case, is the addition of new types of IDS resources (physical or virtual). Such an addition shall not require too many changes to the existing

code, but rather merely adding a new SW Agent that communicates with the new type of IDS

resource. Our architecture is plug-and-play based in terms of its IDS SW Agents, making it extensible, similar to an extensible object-oriented design pattern. However, our architecture is also different, in that it does not directly utilize inheritance or object-oriented structures.

In addition, our communication protocol mattered a lot. We decided that Secure Shell (SSH) is a

common secure protocol that most IDS resources provide an interface for. This may seem to

limit the set of IDSes that can join the platform. However, that is not the case. Let us assume

there is an IDS resource that uses some non-SSH protocol (e.g. a REST API on TLS (HTTPS)). For such a resource, its particular SW Agent may utilize that protocol / API. Covering all possible cases, of course, is out of the scope of our work.
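For instance, a minimal sketch of an SSH-based SW Agent using the Paramiko library is given below; the command string and file paths are hypothetical, and an agent for an HTTPS-based appliance would expose the same methods over its own protocol:

import paramiko

class SshIdsAgent:
    """SW Agent that programs a remote IDS over SSH (illustrative sketch)."""

    def __init__(self, host, user, key_path):
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(host, username=user, key_filename=key_path)

    def push_rules(self, rules_text):
        # Append the DPI rules to the appliance's rule file and reload the IDS.
        # The file path and reload command are assumptions; they would be
        # appliance-specific in a real deployment.
        stdin, stdout, stderr = self.client.exec_command(
            "cat >> /etc/snort/rules/local.rules && sudo systemctl reload snort")
        stdin.write(rules_text)
        stdin.channel.shutdown_write()
        return stdout.channel.recv_exit_status()

    def pull_log(self, path="/var/log/snort/alert"):
        # Pull the attack log so the platform can parse new attacker addresses.
        _, stdout, _ = self.client.exec_command("cat " + path)
        return stdout.read().decode()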

3.2.4 Design for Cost Management

No architectural design would make it to production unless it can be shown that it can save some

cost. As already mentioned, IDS VAs are to be used mainly as overflow capacity for the PAs,

due to them being less power efficient (as well as their longer processing delays). Use of the

VAs, together with the auto-scaling feature, would result in savings in both CAPEX (Capital

Expenses) and OPEX (Operational Expenses).

In particular, Capital Expenses are reduced, as there is now less need to spend money on the capital infrastructure. Even in a scenario that involves no usage of VA resources, there still need to be some overflow resources available. Using the cloud, one has to pay only when their virtualized resources are spawned, whereas, if the platform entirely used PAs, there would have to be overflow PAs that were always on, whether used or not. Having such PAs involves both CAPEX and OPEX, for the capital cost of the device as well as keeping it on at all times, which we avoid by leveraging our dynamic auto-scaling feature.


However, PAs are not the only cost saved by our platform design. The platform may also spare

having 24/7 dedicated IT Security Staff, who would need to monitor the system continuously for

attack event notifications. Our automated blocking means that once we distrust a user flow, it is going to be blocked. Such a militarized approach has its own pros and cons. The advantage is saving one from having to hire more people. Nevertheless, the disadvantage is that in the case of false positives, we may block users who did not truly launch an attack on any servers. However, we may assume that our error rate is small enough that this is justified. As well, once a user has been blocked unjustly, they can go to the administrator and ask to be removed from the blacklist.

Overall, automation and orchestration come with pros and cons. Where they save money, with little risk of fault, they are the way to go. We shall note that the rationale used here for the CAPEX and OPEX savings is similar to what we described in [25].

3.2.5 Design for Self-Protection

We consider three aspects of security of our platform, their attack vectors, and how they are self-

protected. In terms of attacks on its availability, DoS attacks could happen on the control

channels. As well, there may be exhaustion of IDS resources through misuse. In terms of

confidentiality, there may be reconnaissance done by tapping control channels, learning existing

deployments or expected defense patterns (including expected attack signatures). As well, there

may be attacks on integrity, such as spoofing the user or master controller to provide false

deployments, using admin / super user for the platform or each component, deleting or

modification of detection signatures, or clearing logs to hide the footsteps of previous attacks.

Our defenses are at three levels: the overall platform, within platform components, and the infrastructure layer. The last of the three relates to cloud virtualization / isolation, which is out of the scope of this work. In terms of confidentiality, all our APIs are encrypted using public-private key cryptography (HTTPS for the UI, SSH for IDS configuration). In terms of

availability, we utilize availability sensors on both IDSes and management module (through

having them log their performance periodically somewhere outside their own VM). This

overlaps with our scaling mechanism. In the event of a management or IDS component not reporting back, it may be replaced automatically. As well, we distributed the possible points of failure, both physically and logically. For integrity defense and access control, we utilize SAVI IAM, which is a derivative of OpenStack’s IAM service. There may be different key distribution

schemes, such as pre-assigned keys and Certificate Authority.

Our design’s redundancy scheme (discussed in further detail in Chapter 4) allows for self-

investigation, by the redundant master controllers adding DPI to the traffic of an IDS or a

controller. For such a scheme, the admin’s SSH traffic may be excluded, for example; however, that adds to the attack vector of the system. It is also important to be careful during a network attack period: the more prepared the platform is ahead of time, the better, as less attack traffic will get into the network.

As a result, our architecture may assign certain IDSes to investigate the platform itself. It is important to protect the IDS killing scripts from outside access. As well, we ensured not to have a platform that would accidentally self-annihilate, as we distributed the trust across various management components. This approach of ours is in line with the direction of the network security industry, where distributed implementations are dominating.

3.3 Interoperable IDS API

Our extensibility, and to an extent our scalability, depend on the design of our IDS API. This API determines the communication and management protocol of the IDSes. If it assumes the IDS shall provide a feature it cannot, that would result in partial coverage of the API across the IDSes. This is inconsistent and creates issues in terms of both scalability (i.e. by complicating the IDS assignment procedure) and extensibility (in that certain types of IDS may not fit in our design at all).

To prevent the aforementioned issues, we performed an analysis of the minimal requirements for the SW Agent and IDS interfaces used, so that they are generic enough to cover as many IDS products as possible. These requirements directly translate to the functions that the API is to provide, which are the following:

1- Requirement: Ability to detect, block, and log an attack with a well-known DPI based

signature


a. Note that the IDS may or may not have an event / push-based mechanism for

notification, which is considered fine by us. We deem it the developer’s

responsibility to add it.

Corresponding API Function: Notification of detectable attack, as well as automated

corresponding actions (blocking, in addition to the feedback to the nearest switch / gateway

router).

2- Requirement: Ability to remotely program

a. One particular access protocol we prefer is SSH, due to its built-in encryption,

which helps the internal communication be kept confidential. However, that is not

a requirement, so long as the IDS can be programmed to encrypt over whatever protocol it uses, or else the IDS uses an already encrypted protocol, such as HTTPS.

Corresponding API Function: Providing the GUI + RESTful API for the user to specify what their programmed IDS should look like in terms of its level of security. This will be further discussed in Chapter 4, as part of our Security Profiles notion.

We consider these two the bare minimum requirements, which are as general and abstract as possible. They also correspond to the bare minimum API functions, as we will see in Chapter 4.
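To make these two minimums concrete, a sketch of the abstract interface that every SW Agent would satisfy might look as follows; the method names are our illustration, not the exact ones in the implementation:

from abc import ABC, abstractmethod

class IdsAgent(ABC):
    """Abstract SW Agent: the minimal contract every abstracted NIDPS must meet."""

    @abstractmethod
    def program(self, esp_rules: str) -> None:
        """Requirement 2: remotely program the appliance with a set of DPI rules
        (e.g. over SSH, or over the appliance's own encrypted API)."""

    @abstractmethod
    def poll_alerts(self) -> list:
        """Requirement 1: pull the attack log and return newly detected attacks,
        each carrying at least the attacker address for the blocking feedback."""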

3.3.1 Integration with Analytics-based Detection

We do not limit our interoperability to IDSes, rather aiming for any security resource possible.

This is in line with NFV’s and SDI’s visions. As such, we considered potential integration with analytics-based detection as part of our design.

As mentioned, analytics-based detection tends to have lower accuracy when it comes to attacks with well-known signatures. Therefore, it could use some help verifying its guesses by

leveraging signature-based IDS detection.

What we find nice about our API design is that we did not have to do anything extra for compatibility with any such AD module. In particular, our API is usable by both human and machine users (provided keys have been distributed to the machine to SSH into the master module). This stems from our modularity, extensibility, and generality in design. Applying design principles pays off.

3.4 Distributed Mitigation System

Scalability is a major design concern for us. However, this pertains not only to detection but also to mitigation. To understand the importance of defense scalability, one may consider the scenario of a DoS attack that is blocked at an inline NIDS. While the IDS may be quite capable of blocking all the DoS packets, there still exists an issue, which is the passage of regular traffic, since the processing bandwidth of the NIDS is limited (in our case, it was roughly an order of magnitude less than that of a regular switch).

In order to prevent a bottleneck at the IDS from limiting our scalability, there is a significant advantage in blocking the attack at the first switch / gateway router. Where there may be several paths available to an attacker, it is desirable to block the attacker at all possible first switches on those paths.

This is even relevant to insider attackers: the distributed defense is needed to isolate the attacker, where they may have access to multiple networks / gateways. Due to the expected delay for the communication system to work, the IDSes shall block attacks inline instantly. We leverage the SDI

Manager as the SDN Controller and the enabler for the feedback to the switches. Essentially, the

SDN provides us with a distributed firewall. Our SDN-based firewall is much more flexible than

just blocking the ports at the hypervisor. We described that firewall in [28].

Each VA IDS is equipped with the SDI API, as well as a cron job regularly checking the attack logs to extract new blacklist addresses. Once a new address comes in, two scripts are run: the first one utilizes Janus to get all the switches on the path, and the second sends the feedback to the first inline switch / gateway router. The PA is equipped with an external checker, which

could be a cron job either in a dedicated VM or else in the master agent and/or its redundant

duplicates.

As well, the blocked addresses are to be recorded in the Master Agent (and its duplicates), which provides a centralized syncing point for the IDSes to communicate. Another possible scheme is peer-to-peer communication between the IDSes’ FW agents. While we did not implement this approach for our proof-of-concept, perhaps it would be more scalable in terms of coordination.


Chapter 4 Software Architecture and Implementation

In the last chapter, we overviewed the high-level design, along with some of our fundamental design decisions, which demonstrated the design principles we have been following. In this chapter, we dive deeper into the architecture, this time looking at it from a detailed software

perspective. We also go over our implementation and name our proof-of-concept components.

As we shall see, our proof-of-concept includes most of the features we designed, missing a few.

4.1 Deployment Architectures

Back in Chapter 3, we discussed the general coordination scheme between various security

components. In particular, the resources do not communicate directly, rather, they have a

software agent assigned to them that does the communication on their behalf. This

communication is made up of the attacks they have detected, and more importantly, the work

division among them (which IDS inspects which user flow). So far, we have only introduced a

hierarchical scheme for this deployment in Chapter 3, where all the SW Agents talk to the

Master Agent and get their assignments from there.

The existence of such a Master Agent, however, may impose a limit on scalability. Of course, such a concern is only valid at very high scales of communication. Nevertheless, having a single Master Agent could be too much concentration of trust in a single component. A particular goal in our design is to distribute the points of failure so that we do not have single points of failure.

One alternative to the hierarchical deployment / communication scheme is a flat architecture.

This architecture is shown in Figure 4-1. In this deployment, the trust is distributed, and no

single component is necessarily trusted by everyone. Of course, the management of such a scheme

could be much more complicated. In particular, if there is a conflict between two components,

there has to be a way to resolve it. As well, each SW Agent should include access to the Janus


API for chaining. One may argue whether distributing that access is a secure approach in itself.

Figure 4-1 – Flat Deployment / Communication Architecture

Our choice of architecture for our proof-of-concept was the hierarchical one, as shown in Figure

4-2. As mentioned, there may be scalability and security concerns regarding the Master Agent

being a single point of failure. Our response to that concern is our scheme for the redundancy of

the master module, where there are one or more redundant Master Agents, which share the same state as the Master Agent by polling it. The moment they realize that the Master Agent is under attack and has been compromised, they come into play, taking over the role of the Master Agent and overriding any of its communications. Of course, this scheme works better for certain cyber-attacks than for others.


Figure 4-2 – Hierarchical Deployment / Communication Architecture

It is important to note that security at any cost is no good security. In particular, the measures we establish to secure the system should be cost-effective in terms of the effort spent on them and the complexities they add to the solution. We do not consider the case of the Master Agent’s compromise common enough; hence, for our proof-of-concept, while we propose the redundancy scheme, we did not implement it.

Overall, we established the trade-offs between the design choices we had over the deployment architecture and implemented the hierarchical scheme without redundancy. Perhaps there may be better coordination with the hierarchical model; of course, that would depend on what distributed scheme (for the case of the flat deployment) we compare our approach with. In general, distributed algorithms may suffer from greater delays in resolving their state, as the decision may not be the output of a single / centralized decision maker.

Lastly, it is important to note that we compared the two approaches (hierarchical vs. distributed)

as two alternative methods. We did so by anticipating the trade-offs and analyzing our guesses.

As such, our comparative analysis lacks metrics, and our argument has been purely a high-level

architectural one.


4.2 SDI Enabler

In this section, we review the specific parts of the SAVI SDI (Janus) API we used in our

implementation. The parts of the API we used correspond to the two main roles of the SDI

Manager in our platform, which were traffic chaining and distributed blocking (making it the

main module on the feedback loop to the switch).

The chaining for different components differed slightly. In particular, the network interface setup

for our VAs and PAs differed. Our VA implementation is a Virtual Machine (VM) that runs

Snort IDS [57], together with Open vSwitch (OVS) software [58] providing virtual interfaces for Snort to use (isolated interfaces for input and output). Consequently, with the help of OVS,

our VA uses only one main network channel to input and output packets.
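For illustration, the in-VA OVS setup could be sketched roughly as follows; the bridge, port, and interface names, as well as the OpenFlow port numbers, are our assumptions, and the exact layout in our VA image may differ:

import subprocess

def sh(cmd):
    """Run a shell command, raising on failure."""
    subprocess.run(cmd, shell=True, check=True)

# Internal software switch inside the VA: one external channel (eth1),
# plus isolated internal ports that Snort listens on inline.
sh("ovs-vsctl add-br br-ids")
sh("ovs-vsctl add-port br-ids eth1")  # main traffic channel
sh("ovs-vsctl add-port br-ids snort-in -- set interface snort-in type=internal")
sh("ovs-vsctl add-port br-ids snort-out -- set interface snort-out type=internal")
# Forwarding rules steering traffic through the IDS interfaces; the OpenFlow
# port numbers (1, 2, 3) are illustrative and depend on port creation order.
sh('ovs-ofctl add-flow br-ids "in_port=1,actions=output:2"')
sh('ovs-ofctl add-flow br-ids "in_port=3,actions=output:1"')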

On the other hand, our PA, which is a Fortigate 111C unit [59], uses two separate network interfaces for input and output. The interfaces have different MAC addresses; however, they are invisible to the user, as we used the transparent mode of the inline NIDS. As a result, we needed

different implementations of the chaining API for the different kinds of resources we had. This could be seen as an extensibility concern. However, we believe we have covered the problem space in this case, as most NIDSes come with either one or two major ports for their main operation (plus, of course, a separate network interface for control).

Our “chain” and “block” are implemented as Python scripts. These scripts can be run on any

device that has a SAVI Client and has been authenticated to use Janus. In particular, these scripts

leverage two fundamental Janus functions:

1- Get path (src, dst): This returns the DPIDs (Datapath IDs) of all the switches that are on the path from src to dst (which are IP addresses). The path is already determined by the Janus network controller, based on its path optimization algorithm.

2- Install rule (OpenFlow Rule, DPID): This call installs a given forwarding rule (which

is made up of a match and an action) in a specific switch.


For chaining, we install a rule whose action forwards the matching traffic to the service box / VM. For blocking, the action is simply to drop the packet. For the reasons discussed above, the chaining scripts for the VA and PA differ slightly.
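To make this concrete, below is a minimal sketch of the two scripts, assuming hypothetical Python bindings janus.get_path(src, dst) and janus.install_rule(rule, dpid) for the two Janus calls above; the rule dictionaries are illustrative, not the exact Janus wire format.

    def chain(janus, user_ip, server_ip, nids_ip):
        """Steer the user-to-server traffic through the NIDS (forward path)."""
        # First half of the path: every switch between user and NIDS
        # forwards the flow toward the NIDS.
        for dpid in janus.get_path(user_ip, nids_ip):
            rule = {"match": {"ip_src": user_ip, "ip_dst": server_ip},
                    "action": {"forward_toward": nids_ip}}
            janus.install_rule(rule, dpid)
        # Second half: from the NIDS onward, deliver to the server.
        for dpid in janus.get_path(nids_ip, server_ip):
            rule = {"match": {"ip_src": user_ip, "ip_dst": server_ip},
                    "action": {"forward_toward": server_ip}}
            janus.install_rule(rule, dpid)

    def block(janus, attacker_ip, server_ip):
        """Drop the attacking flow at every switch on its path, so blocking
        takes effect at the first ingress switch rather than the hypervisor."""
        for dpid in janus.get_path(attacker_ip, server_ip):
            rule = {"match": {"ip_src": attacker_ip, "ip_dst": server_ip},
                    "action": "drop"}
            janus.install_rule(rule, dpid)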

An alternative to using the Janus API was to install and utilize our own OpenFlow controller. Such a controller could work on a dedicated network, a virtual slice, or else on an overlay network. However, Janus had already solved the problem efficiently enough, so we chose it for our proof-of-concept implementation. We may also consider it a constraint, as we had a PA (Fortigate 111C) that needed to be directly connected to the SAVI network; it is hard to imagine how we would dynamically chain the PA's traffic without the use of SAVI's in-house service chaining tool.

4.3 Enhanced Security Groups

In this section, we introduce two notions we came up with, namely, Enhanced Security Groups (ESGs) and Enhanced Security Profiles (ESPs). The first name derives from the "Security Group", a concept we discussed in Chapter 2, which is the usual implementation of a hypervisor-based firewall in the cloud. However, as mentioned, conventional firewalls have shortcomings when dealing with new types of attacks. One of particular interest to us is the fact that firewalls tend to be only port-based. This means that they only look at the header of the packet and not the content / payload.

Realizing this, we decided to extend the concept of the firewall to utilize Deep Packet Inspection (DPI) within it. Using DPI, the packet match space becomes effectively unbounded (unlike the case of firewalls, where the port number space is finite). Therefore, the user definitely needs some help with detection. For that, we provide the user with a pre-programmed set of DPI rules, which we call an ESP. Another feature of ESPs is to provide a scheme for updating the attack signatures. Hence, we call the set of DPI rules the ESP, and once its rules are installed in a given NIDS resource and the NIDS is chained, the rules and the resource together form an ESG, protecting the VM in an enhanced fashion compared to conventional firewalls.

Using ESGs instead of regular security groups has performance and network capacity advantages. This is because, with the distributed defense that comes with our ESGs, the blocking of an attack takes place at the first ingress gateway router / switch. This reduces the blocking time, as well as the network usage the attack would have incurred if it were blocked at the hypervisor level. The drawback of this approach, however, may be a duplication of responsibility, as the hypervisor firewall overhead may exist whether we utilize it or not (depending on the hypervisor implementation).

As discussed in our scope section, our work does not aim to cover many attack types; rather, we focus on two categories of attacks (DoS and keyword-based). Since the attack space for keyword / pattern-based attacks is much bigger, most of our focus is dedicated to DoS attacks. Each ESP is defined by its DoS sensitivity level (which translates to a threshold on the count of packets of the same type before a flow is detected as an attacking one), as well as the keywords associated with it. We have defined three levels of security sensitivity for DoS attacks: "High", "Medium", and "Low". These pre-defined levels correspond to different threshold numbers within the VA and PA. We discuss them, as well as the values associated with them, in our evaluation section.
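To illustrate how a sensitivity level might be realized in the VA, the following sketch renders a Snort SYN-flood rule from a chosen level using Snort's detection_filter option. The per-second thresholds and the VA's 10-second counting window are the values we discuss later; the rule layout and SID are illustrative assumptions.

    # Per-second TCP SYN thresholds for each sensitivity level (as used in
    # our implementation), and the VA's 10-second counting / reset window.
    SYN_THRESHOLDS = {"High": 1000, "Medium": 5000, "Low": 50000}
    WINDOW_SECONDS = 10

    def syn_flood_rule(level, sid=1000001):
        """Render a Snort rule that fires once the SYN rate toward a host
        exceeds the level's threshold (rule layout is illustrative)."""
        count = SYN_THRESHOLDS[level] * WINDOW_SECONDS
        return ('alert tcp any any -> $HOME_NET any '
                '(msg:"TCP SYN flood (%s sensitivity)"; flags:S; '
                'detection_filter:track by_dst, count %d, seconds %d; '
                'sid:%d; rev:1;)' % (level, count, WINDOW_SECONDS, sid))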

Hence, the user specifies in their ESP what level of DoS sensitivity they want. This delegates the responsibility for the trade-offs to the user (as opposed to the cloud manager). As well, if a user wants, they can dictate the type and size of the resource to be used. In particular, they can set their flows to run only on the PA, or on VAs of different sizes (low, medium, large). Overall, ESPs represent, at an accessible granularity, certain security parameters that the user is typically unfamiliar with. This is in line with a side objective of ours, which is user-friendliness. We will see the manifestation of our ESPs in our UI design subsection.

4.4 Prototype Implementation

In this section, we discuss our proof-of-concept implementation in detail. In particular, we note the proposed features that were realized in the implementation, as well as the components associated with them. This will prepare the reader for the next chapter, which is dedicated to testing and measurement.


4.4.1 IDS Appliances

In this section, we describe the particular appliances we used in detail. As already mentioned, we group the appliances under two major categories throughout our design: VA (Virtual Appliance) and PA (Physical Appliance).

A VA is cloud-hosted, running on virtual infrastructure (i.e. a Virtual Machine (VM)). This differs from the PA, in that the PA runs on a dedicated hardware box. This hardware box may be made up of hardware specifically designed for DPI, or else it could be dedicated legacy hardware (e.g. x86) running proprietary software. For competitive reasons, the internals are by default hidden from the users. Another term we may use for the PA is "bare-metal", which is used for physical resources that are connected as part of the cloud, but are not virtualized.

Our VA implementation was a combination of the Snort open-source NIDS software running together with Open vSwitch (OVS), as shown in Figure 4-3. As mentioned, OVS provides the isolated-by-default virtual interfaces that Snort listens on (P2 and P3 in the figure below). There are static OpenFlow rules in the VM that direct any incoming packet not destined for the IDS itself to the interfaces that Snort connects to. Hence, in this inline configuration, Snort acts as a two-port switch between P2 and P3. P1 is the interface to the outside (typically named "eth0"). The blue arrows in Figure 4-3 represent the static OpenFlow rules that steer the relevant inspection traffic to Snort, and from the output of Snort (P3) back to P1. If a packet has been successfully inspected and not blocked, it is then sent from the VA to the destination node.
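As a rough sketch, such static rules could be installed with OVS's ovs-ofctl tool along the following lines; the bridge name and port numbers are assumptions for illustration, and our actual rules also exempted traffic addressed to the IDS itself and handled the MAC rewriting discussed below.

    import subprocess

    # Assumed OVS port numbering on the VA's bridge "br0":
    # 1 = P1 (eth0, to the outside), 2 = P2 (Snort input), 3 = P3 (Snort output).
    STATIC_FLOWS = [
        "in_port=1,priority=10,actions=output:2",  # outside -> Snort input
        "in_port=3,priority=10,actions=output:1",  # Snort output -> outside
    ]

    for flow in STATIC_FLOWS:
        # ovs-ofctl add-flow installs a static OpenFlow rule on the bridge.
        subprocess.run(["ovs-ofctl", "add-flow", "br0", flow], check=True)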


Figure 4-3 - VA Implementation Using Virtual Machine including Snort and OVS – Note the direction of traffic

It is worth noting that our implementation of the VA was not transparent, as it involved changing the MAC address of the packet before sending it out through P1. The reason for this was a default firewall rule implemented in our hypervisors, requiring that a packet going out from a VM carry the source MAC address of that VM. This ensures no spoofing occurs inside the network. We preferred not to touch that default rule, as doing so could lead to the compromise of other VMs. Hence, unlike our PA, our VA implementation was not transparent, as both end-hosts would see the MAC address of the VA IDS in their packets. One may argue that this could help potential attackers identify the IDS and attack it directly. We traded off this risk against the hypervisor firewall concern we discussed. As we will discuss in the next subsections, this also had implications in terms of load balancing for the VAs.

The alternatives we considered for the implementation resources include Snort, the Fortigate 111C unit, Bro IDS, and a custom NetFPGA-based NIDS. We believe Snort was a good choice for the VA, as it is one of the best-written and most popular open-source IDS packages [60]. Its popularity implies that it fits well for the abstraction and generalization of NIDS functions. As well, it covers most attack types and has a flexible rule system, which is easy to use and program remotely. Snort is well documented and has various configuration modes and settings, which we found helpful (e.g. it can be used in both tap and inline modes). We were able to easily make copies of the VA by storing its VM image in our OpenStack image registry module and spawning VMs based on it later on.

Likewise, we found the Fortinet Fortigate 111C a good representative of physical NIDS boxes. In particular, Fortinet has been known to be an industry leader in network security solutions over the recent decade or so (standing alongside other major players such as Cisco) [61]. The Fortigate 111C had all that was needed, including an SSH interface for remote programming, and had relatively low processing delay.

4.4.2 Component Placement

We introduced our high-level components in Chapter 3. In particular, we introduced the Security Master Agent / Controller and a set of SW Agents for security resources, one per NIDS type. However, there are various ways for these components to be realized in implementation. In this subsection, we discuss two main component placement considerations. The first is the placement of the SW and FW Agents, and the second concerns NIDS resource placement and its impact on delay.

In the hierarchical deployment, the SW Agents may be implemented as part of the Master Agent / Security Controller, placing them in its VM. This would not work for the flat deployment architecture, as each SW Agent would need to be implemented in a separate VM (since trust is not centralized anywhere). As already mentioned, we chose the hierarchical deployment. Due to the reduced performance overhead, we chose to implement the SW Agents inside the Master Agent. In particular, as we will discuss in a later subsection, the SW Agents are implemented as scripts inside the Master Agent.

In terms of NIDS placement, our cloud platform (SAVI) has several edges, each edge having several agents, which are the hypervisor servers. As we will show in Chapter 5, the placement of the NIDS has a considerable impact on the round-trip time. Of course, the placement argument only applies to the VAs, as the PA cannot be placed (on demand) with as much flexibility. For our PA's case, which was static, we used a switch near a SAVI Smart Edge node located at the University of Toronto to connect the Fortigate 111C unit to the SAVI network.

For the VAs, however, we decided to place them dynamically and as close as possible to the

server we are protecting. We implemented a simple greedy algorithm that attempts to place the

NIDS in the same agent / hypervisor as the server we are protecting. Of course, a better

alternative would have been to come up with an algorithm that considers the whole default path.

4.4.3 Configuration API

Our NIDS configuration API is closely related to Enhanced Security Groups (ESGs). This API essentially communicates the required configuration to the IDS. We designed a transparent API, in that it requires the user to have no knowledge of the underlying platform. All they provide is their security requirements (through the ESPs), which the Master Security Controller (Master Agent) realizes by applying the configuration API.

Our API is RESTful, and hence enjoys advantages such as interoperability, scalability, and familiarity (what it exposes is known ahead of time) [62]. As well, it is stateless, in that it only functions based on the input it receives, storing nothing about any IDS from the past. That also contributes to ensuring the privacy of the cloud users, as we do not record what levels of security they have had at a particular point in time.
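A minimal sketch of such an endpoint is given below, written in Python with Flask purely for illustration (our actual server is written in Node.js, as discussed in the Web UI subsection); the route, field names, and the configure_nids hook are assumptions.

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def configure_nids(resource_type, dos_level, signature):
        """Hypothetical hook that hands the parsed requirements to the
        SW Agent for the chosen resource type."""
        pass

    @app.route("/esp", methods=["POST"])
    def apply_esp():
        # Stateless by design: the endpoint acts only on the request body
        # and stores nothing about any IDS from the past.
        esp = request.get_json()
        configure_nids(esp.get("resource_type"),      # "Physical" / "Virtual"
                       esp.get("dos_threshold"),      # "Low" / "Medium" / "High"
                       esp.get("attack_signature"))   # DPI keyword / pattern
        return jsonify({"status": "applied"}), 200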

4.4.4 Master Coordination Agent

The title of this subsection is yet another name for our Master Agent / Master Security Controller; we use the different names interchangeably to keep the concept as general as possible. This module has various responsibilities, such as realizing the communicated ESPs as ESGs (through leveraging its SW Agents, which are scripts inside of it), logging and tracking security events (e.g. a detection / FW action by a FW Agent), and logging resource CPU and bandwidth utilizations.


As well, we used it as the HTTPS server that the trusted users utilize in order to communicate their ESPs. If we wanted to be safer, we could have separated this server from the Master Module. However, the trade-off is increased delay (both processing and network delay overhead are added). Given that spawning VA resources may take time on the order of minutes, we did not wish to add any further delay. One may indeed feel that the field of security is all about security-versus-performance trade-offs [63].

For similar performance reasons, we delegated the IAM (Identity and Access Management) to SAVI's OpenStack-based IAM [64]. This IAM is a secure system designed particularly to be scalable. As well, we added a bit of access control on top of it ourselves, in that we decided who would know the specific URLs needed to access the system. Of course, our implementation was rather small in scale, as it only had a single admin account.

For ease of use, increased portability, and easy migration, we implemented the Master Agent and the user webserver inside a single Docker container [65]. We initially took a container that had the SAVI client and Janus API built in, and added the rest of the features of the Master Agent on top of that base.

Figure 4-4 describes the components of the Master Agent in detail. In particular, the user, who has already authenticated through SAVI and has access to SAVI's internal IP, may access the server. RC1 (Resource Controller 1) and RC2 are the SW Agents, named in this figure following the naming convention of the Software Defined Infrastructure (SDI). They are realized through scripts, as discussed in the next subsection. The Master Agent essentially parses the requirements of the ESPs and passes the information to the corresponding SW Agents / Resource Controllers to implement them.


Figure 4-4 – Detailed Components of Master Agent

The ESP Repository is essentially a set of files defining the ESPs. We did not implement this as a proper database, as we did not need one for our scale of attack knowledge. However, we do propose that for bigger enterprise systems that may have many ESPs, it might make sense to implement a dedicated database. This would also help with any ESP / attack definition updating scheme, which our implementation does not include.

4.4.5 Software Configuration Agents

Our Software Configuration Agents, which we have also introduced as SW Agents, are the key backend components that enable the realization of our interoperable NIDS configuration protocol and our ESPs in the actual security resources. For security, they utilize an SSH-based interface, directly and securely running commands within the terminal of the machine itself. They are indeed among the most trusted and crucial components of the design.

In our implementation, they happen to be located together with the ESP Repository in the Master Agent. Their implementations are sets of Bash and Expect scripts that perform the configuration remotely on the NIDSes, whether PA or VA. As mentioned previously, use of SSH is not a requirement, and SW Agents can be implemented for any NIDS that allows some form of secure remote communication for editing its security settings.
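The sketch below illustrates the idea in Python with the paramiko SSH library rather than Bash/Expect; the host, credentials, rule file path, and restart command are assumptions for a Snort-based VA.

    import paramiko

    def push_rule(host, user, key_file, rule):
        """Append a detection rule to the VA's rule file over SSH and reload
        Snort so it takes effect (paths and commands are illustrative)."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, key_filename=key_file)
        try:
            cmd = ("echo '%s' | sudo tee -a /etc/snort/rules/local.rules "
                   "&& sudo service snort restart" % rule)
            _, stdout, stderr = client.exec_command(cmd)
            if stdout.channel.recv_exit_status() != 0:
                raise RuntimeError(stderr.read().decode())
        finally:
            client.close()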


4.4.6 Load Balancer

We performed an in-depth analysis of NIDS load balancing schemes, which we summarize in this subsection. We found that in our case, load balancing, resource scheduling, and service chaining are closely related. In particular, the Load Balancer (LB) implementation depends on the resource sharing policy, and its implementation affects the resource scheduling. As well, to enable an LB, service chaining is needed one way or another. Our idea is to combine these into a logical entity that does all three.

Whether such a combination is a good idea or not depends on how the control architecture distributes its responsibilities. In particular, it may be centralized (logically and/or physically) or distributed (to reduce / distribute points of failure). We reviewed various schemes and compared them in order to make a proper design decision.

4.4.6.1 Distributed Scheme

The first alternative, as shown in Figure 4-5, is the distributed LB architecture. In this scheme,

we divide the user flows among a set of load balancer modules. Each VA pool is dedicated to

one LB, which is assigned a particular security profile and set of requirements (e.g. delay). We

should note that each VA (and our PA) is only capable of handling one security profile at any

given point in time. Hence, we need to consider both bandwidth and delay requirements when

coming up with the VA pools.

Figure 4-5 – Distributed Load Balancing Scheme


We implemented this scheme using an OVS-based VM. In particular, we used a feature of OpenFlow 1.1, where a rule's action can be assigned as a simple round-robin or hash-based selection over a set of actions [66]. Its integration with OVS, however, was recent at the time of our experimentation, and we had to fix an OVS bug to make it work.
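For reference, a hedged sketch of what such a configuration might look like with OVS's command-line tools follows (the bridge name, port numbers, and server IP are assumptions); an OpenFlow "select" group hashes each flow onto one of its buckets, giving a per-flow distribution over the VA pool.

    import subprocess

    # Assumed bridge "br0" with VA pool members reachable on ports 2, 3, and 4.
    # A "select" group distributes matching traffic over its buckets.
    subprocess.run(["ovs-ofctl", "-O", "OpenFlow11", "add-group", "br0",
                    "group_id=1,type=select,"
                    "bucket=output:2,bucket=output:3,bucket=output:4"],
                   check=True)

    # Send user flows destined for the protected server through the group
    # (the server IP 10.0.0.5 is an assumption).
    subprocess.run(["ovs-ofctl", "-O", "OpenFlow11", "add-flow", "br0",
                    "ip,nw_dst=10.0.0.5,actions=group:1"],
                   check=True)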

The chaining of the traffic from the LB to the NIDS could be done in different ways. There was an issue with the default hypervisor firewall, whereby packets leaving a VM should have the source MAC address of that VM in order to prevent L2 spoofing. For that reason, instead of direct chaining, we experimented mainly with VXLAN (Virtual Extensible LAN) overlay connections from the LB to the NIDS, where the load balancing is done by distributing the packets over the VXLAN virtual interfaces of OVS. This worked well for both forward and reverse path load balancing.

Another scheme we tried for our forward load balancing was to change the source and destination MAC addresses. This works well for the forward path, as the ultimate MAC destination (of the server being protected) is known in advance. As well, this works well with our architecture, since packets of user flows destined for different server end nodes never pass through the same NIDS (i.e. each NIDS is tied to the specific server it protects). However, this scheme does not work for reverse / response path load balancing. The reason is that, on the response path, we are not dealing with a single destination MAC address (as user MAC addresses differ), and figuring out which MAC address to translate to is not easy. One possible way is to utilize the overlay for the response path.

We combined the two approaches mentioned in the last two paragraphs. In particular, we used MAC address swapping for the forward path and overlay-based load balancing for the response path. Utilizing this L2 / data-path-based load balancing, we achieved fast decision making (for the forward path).
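A hedged sketch of the forward-path rewriting with ovs-ofctl follows (all addresses and port numbers are assumptions): the first rule retargets the user flow at the chosen NIDS, and the second restores the server's known MAC on the way out.

    import subprocess

    # Assumed values for illustration.
    BR, USER_IP, SERVER_IP = "br0", "10.0.0.7", "10.0.0.5"
    NIDS_MAC, NIDS_PORT = "fa:16:3e:aa:bb:01", 2
    SERVER_MAC, OUT_PORT = "fa:16:3e:aa:bb:02", 3

    # Into the chosen NIDS: rewrite the destination MAC at L2.
    subprocess.run(["ovs-ofctl", "add-flow", BR,
                    "ip,nw_src=%s,nw_dst=%s,actions=mod_dl_dst:%s,output:%d"
                    % (USER_IP, SERVER_IP, NIDS_MAC, NIDS_PORT)], check=True)

    # Out of the NIDS: restore the server's (known) MAC and deliver.
    subprocess.run(["ovs-ofctl", "add-flow", BR,
                    "ip,in_port=%d,nw_dst=%s,actions=mod_dl_dst:%s,output:%d"
                    % (NIDS_PORT, SERVER_IP, SERVER_MAC, OUT_PORT)], check=True)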

The most important factor in load balancing, for us, is resource utilization. We could have each LB module collect resource utilization data (by either leveraging SAVI M&M or a dedicated data acquisition system). However, this may result in the same information being processed several times, which is not as efficient as having a centralized decision-making module.


As well, we had certain concerns regarding this proposed scheme. In this scheme, we always need multiple LBs to avoid bottlenecks / a single point of failure. As well, we need to categorize LBs based on security requirements. However, even with these measures, each LB could still be a bottleneck for its given category of user flows. More importantly, there is extra delay caused by the LB switches because of their addition to the path.

4.4.6.2 Centralized Flow Assigner

Another scheme we implemented, and chose for our final design, was a decision-centralization-based one. The main idea is that instead of using extra LB modules, we have the load balancing folded into the initial service chaining, updated at times based on security resource utilization, if need be. This scheme is depicted in Figure 4-6.

Figure 4-6 – Centralized Load Balancing Scheme

We found this scheme easier to experiment with, as the algorithm is implemented in a central location rather than across a number of LB modules. There is no additional delay or additional bottleneck introduced in this approach (as opposed to the distributed one). The only challenge in this approach is reassigning a flow to a different resource without disruption, should a flow grow too large. Of course, a remedy to this is live upsizing of the VA resource, if the cloud infrastructure provides that capability.

This centralized approach also suits our work, as it has an SDN / SDI flavor to it. It allows establishing a global view of the entire resource utilization, as we shall see in our experiments in Chapter 5. For that, it is desirable to have a distributed hash table to store the states (which we did not implement).

A concern with this approach is the need for resiliency. For that, we propose to allow multiple centralized control modules (Master Agents) to operate at the same time, some running as overflow. In the event of one controller going down, another shall take over. As mentioned, we note again that while we implemented the centralized scheme, we did not implement the proposed redundancy scheme. Our analysis was that this feature was not worth the effort, given the low probability of the Master Agent being attacked directly (of course, that is not the case if an attacker with insider knowledge enters the game). Therefore, not just in our design but also in our implementation, we had to forgo certain features we would ideally have liked to have.

4.4.7 Auto-scaling

Since we decided that our load-balancing scheme is to be essentially reflected in the initial assignment, it became important for us to have an ongoing process / service (i.e. a daemon) that polls the resource utilization data regularly. It should then ensure that there are enough overflow resources available at any given point in time. It must upscale the number of VAs when there is an increase in usage, and downscale when the peak in usage has passed. As well, it may monitor for rogue flows / flows that grow beyond a certain threshold, and move them to a different VA (we did not implement this last feature, but do propose it for future work).

Initially, we considered using the OpenStack Heat project for the orchestration, as well as the SAVI M&M sensor system, for our auto-scaling. In particular, Heat offers built-in auto-scaling [67], and SAVI M&M taps the hypervisor directly to get resource usage numbers. However, we found Heat to have too much computational overhead for our simple incremental algorithm. As well, to ensure compatibility between our components, we preferred to implement our own in-house orchestration tool, which utilizes the Nova client that we included in the Master Agent's Docker container. Similarly, we found that SAVI M&M tends to break often, as it is a research solution with no production-level resiliency and support. Hence, we found it more time-efficient to implement our own data acquisition system than to spend time reverse-engineering and modifying SAVI M&M to make it more reliable.

The most important factor for our auto-scaling system is the number of overflow VA resources we have at any given time. We developed a program that regularly polls the utilization data and, based on it, decides whether upscaling, no action, or downscaling is needed. Each up- or down-scaling step consists of a single resource addition or removal.

We measured that our VAs could take up to 2 minutes to boot up and get running. This is important, as it reflects the importance of having a sufficient number of VAs available in advance. A feature we did not implement, but do propose, is the potential use of the ROC (Rate of Change) of the number and bandwidth of user flows. Another is to perform historical analysis to customize the increments and/or prepare enough overflow resources in advance to deal with traffic surges. In any case, our simpler greedy auto-scaling scheme works fine for a proof-of-concept.

In our implemented algorithm, the data points we use are one sample per minute, reported by the VAs themselves (and by the SW Agent in the case of the PA) to the Master Agent (by SSHing into the system and writing to a log file). As we shall see in Chapter 5, we could have used either bandwidth or CPU utilization. However, we chose CPU utilization, as we found the CPU to be the particular bottleneck, and its usage percentage has a well-known upper bound (i.e. 100%). Our auto-scaling script is called periodically (roughly once every half hour), as well as every time a flow is assigned, to see whether we need to add more overflow VA resources. It works by averaging samples to see if they go below or above a certain threshold (10% for downscaling, 70% for upscaling).
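A condensed sketch of this decision loop is given below; cpu_samples, spawn_va, and delete_va are hypothetical hooks standing in for our log reader and Nova client calls.

    UP_THRESHOLD = 70.0    # average CPU (%) above which we add a VA
    DOWN_THRESHOLD = 10.0  # average CPU (%) below which we retire a VA

    def autoscale(vas, cpu_samples, spawn_va, delete_va):
        """One pass of our greedy scaler: average the recent per-minute CPU
        samples across the VA pool and add or remove a single resource."""
        averages = []
        for va in vas:
            samples = cpu_samples(va)          # recent per-minute readings
            averages.append(sum(samples) / len(samples))
        overall = sum(averages) / len(averages)
        if overall > UP_THRESHOLD:
            spawn_va()                         # upscale by one unit resource
        elif overall < DOWN_THRESHOLD and len(vas) > 1:
            delete_va(vas[-1])                 # downscale, keeping an overflow VA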

4.4.8 Web User Interface

This subsection describes the Web User Interface (UI) that the trusted users of the cloud platform have access to in order to configure their user flows and NIDSes. As mentioned, in our implementation, the backend is located within the Master Agent's VM, but it could have been separated for increased security (at the cost of extra configuration delay).

This UI essentially realizes the requirements of the Enhanced Security Profiles, which the Master and SW Agents later translate into NIDS requirements. As shown in Figure 4-4, we utilize code written in Node.js for the backend, due to its simplicity of implementation (behavioral JavaScript), as well as the significant speed of prototyping it offers thanks to its rich built-in libraries and frameworks such as Express [68]. For example, switching our initial HTTP to HTTPS required changing fewer than 10 lines of code.

As mentioned, our server uses HTTPS, to provide the users with confidentiality. However, we did not anticipate or design a particular certificate distribution scheme, since we consider it outside the scope of our work. This is similar to key distribution for the VMs' access, which we delegated to the SAVI IAM and Key Management Systems (which are themselves extensions on top of the OpenStack project).

Another reason we picked Node.js was its capability of working asynchronously, which is a scalable choice for the server. For increased security, the server required a SAVI login for authentication (for which the server included a built-in SAVI client). For the client-side pages, the Bootstrap [69] JavaScript framework was used. As part of that, we took advantage of the MaxCDN Content Delivery Network, which caches the Bootstrap script library.

Screenshots of our GUI are shown in Figures 4-7 to 4-10. The first page is the sign-in page, where a SAVI user can log in with their SAVI credentials. The second page shows the list of their VMs within the specific tenant and region they logged into, along with their IP addresses, which can later be used for chaining. The third page is where ESPs can be defined and applied. The last page is where traffic-chaining requests are made. To chain all the traffic going to and coming from a VM, the chaining can be made between the gateway switch and the VM.


Figure 4-7 - Sign in Page of the GUI


Figure 4-8 – VM List Page of the GUI

Figure 4-9 - ESP Page of the GUI


Figure 4-10 – Chaining Page of the GUI

The ESPs we defined specify three parameters: Resource Type, DoS Threshold, and Attack Signature. Resource Type can be either "Physical" or "Virtual", Physical meaning the Fortigate 111C unit, while Virtual refers to the Snort-based Virtual Appliance (VA). DoS Threshold can be "Low", "Medium", or "High". These three levels were translated to specific threshold values of 50000, 5000, and 1000 TCP SYN packets/second, respectively (for the VA's case; for the PA, the threshold measure is slightly different). The Attack Signature field contains the signature of a specific attack to block. The ESP fields were communicated and stored in JSON format.
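For illustration, an ESP as communicated to the server might look like the following (the exact key names are an assumption; the field values are the ones described above, with the "attack" keyword matching our Netcat test in Chapter 5):

    {
        "resource_type": "Virtual",
        "dos_threshold": "Medium",
        "attack_signature": "attack"
    }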


Chapter 5 Testing and Evaluation

In this chapter, we present our testing and evaluation, which demonstrates the extent to which we have achieved our objectives. We first go over the functional verifications we performed to ensure our design performs all the necessary functions. Then, we discuss the parameters of interest, which serve as figures of merit for us. We follow with the specific tests we performed to measure scalability, detection accuracy, and information integrity, and finally an analysis of our interoperability. These tests will provide us enough evidence to make our conclusions in the final chapter.

5.1 Functional Verification

Our functional verification of the prototype consisted of several parts. First, we ensured the chaining worked. To check that, we performed chaining for both the PA and VA using the SDI Enabler. To do so, we would spawn three VMs: one as the user host, another as the middleman (the NIDS VM), and a third representing the server on the cloud. We would then call our Janus script, which performs the chaining, providing the internal IP addresses of the three aforementioned VMs as parameters. The script then goes through all the switches on the path, ensuring appropriate forwarding rules are installed so that the traffic first goes to the middleman IDS, and from there to the server, and likewise in the reverse direction.

Once the chaining is done, we used a typical network debugging and packet display tool called TCPDUMP [70]. With TCPDUMP, we could choose to see just the header, or the entire packet, as desired. We used this tool to verify that the packets are indeed chained through the NIDS, as shown in Figure 5-1. In particular, we should note the change in the user's MAC address (from the server's perspective) before and after chaining, as well as the difference in the time the packet takes to go through. The former stems from our IDS's lack of transparency on the server side. However, the IDS remains transparent to the user host.


Figure 5-1 – Verification of Chaining Using Ping and TCPDUMP (above is for chained, below is before chaining)

After that, we needed to check whether the IDS worked. To do so, it had to pass two basic tests: passing a healthy packet, and dropping an unhealthy one (i.e. a packet defined as an attack). For the test setup, we created two virtual machines in EDGE-TR-1 of the SAVI Testbed. The first virtual machine acted as the attacker machine, utilizing hping3 [71] to launch TCP SYN flood attacks. The second virtual machine was the victim machine, utilizing the Speedometer utility [72] to display the amount of ingress traffic.


To test the attack signature blocking, we used the Netcat utility [74], which works like a plain-text chat application over TCP between the two virtual machines. A screenshot of testing using the Netcat utility is shown in Figure 5-2.

Figure 5-2 - Test using Netcat utility (The packet containing the "attack" string from samplevm4 to samplevm3 is dropped)

We did the Netcat test separately for the traffic passing through VA (Open vSwitch and Snort

VM) as well as PA (Fortigate 111C unit). Overall, our functional verification was successful, as

our base IDS operations as well as the chaining were demonstrated to work as expected.

Another way to confirm that chaining is done and the IDS is functional is to observe the round-trip delay between two hosts that are chained through a functional IDS module. In particular, we expect the observed delay to be larger once the chaining has taken place. We will demonstrate this in another subsection of this chapter, but as part of a test to measure delay, not just a functional verification.

5.2 Testing Methodology

This section describes our approach to testing. We did not require a formal methodology for the functional verifications, as whether a function is performed or not is easy to define and usually follows common sense. However, when it comes to testing, we are going to use the outcomes to judge our design and make certain claims. So we shall first define what we refer to as success, and what is to be considered failure. For that, we have to define the testing dimensions, which we discuss as part of our parameters of interest below. First, let us briefly recall our problem and its scope, so that we ensure our tests make sense for them.

In particular, it is vital to note that the actual "security" of the cloud, or "how secure it is", is hard to quantify or even define quantitatively. For example, one measure may be "risk", which is the "cost of compromise" multiplied by the "probability or frequency of it happening". However, digging deeper into that, we would have to deal with many factors that are out of the scope of our work. It is jokingly said that an ideally secure system is a system that has no connection to the outside whatsoever (including any interactions with a human user), in which case it is of no use.

We are not going to test with a diverse set of traffic. Rather, we use a few known attacks, such as "DoS", "DDoS", and "Attack Keyword or RegEx Pattern". Together, these cover two proposed categories of attacks commonly detected by signature-based IDSes:

1- The category of attacks that involve a pattern or packet being repeated a certain number of times. This category is usually detectable using header information alone.

2- The second category covers the mere existence of a certain keyword or pattern within the payload of any given packet. For this category, we definitely require DPI, as the packet payload must be inspected. The IDS, obviously, does not work on encrypted payloads (unless we militarize it to drop any unrecognized packet).

These two proposed categories, combined, represent almost any attack that can be detected by an IDS. If both categories looked at the header and the payload, then the second would have been a special case of the first. However, as mentioned, that is not the case, as one tends to involve mostly the header information, while in the other the payload has to go through complete inspection.

5.2.1 Parameters of Interest

The parameters upon which we base our tests are important, as they are a derivative of our testing methodology. Therefore, in order to have a correct methodology, one has to be looking at the right parameters. An analogy to this is the selection of variables as features in machine learning: only the variables that make sense and directly correlate with the objectives are to be chosen. The particular "learning" we try to achieve is to see how well our design actually works, and to what extent it meets its objectives.

As mentioned in the last subsection, our design performed all of the functions we specified for it. Now, in order to see how well it has met them, we will look at the following parameters:

1- Detection Time: We define this as the time it takes for an attack to be detected from the time it is launched. While we cannot control the detection time of our individual IDS modules (VA or PA), we will attempt to measure whether leveraging SDI to smartly place the modules helps us to reduce this detection time. This could, in turn, provide a better defense / attack mitigation.

2- Relative Delay Overhead: This is a measure of the added delay because of the added

security. It consists of the DPI delay as well as the extra path delay. We can measure at

least the sum of these two, and see what factors may contribute to the sum.

3- Scalability: We define our scalability measure as the ratio of the growth rate of the computing and networking resources, or resource utilization, versus the inspection throughput. In particular, we will be commenting on the slope of the best-fit line (Least Squares Method).

We could also look at the number of flows or number of (identical / unit) resources used.

The latter involves specifying some simplifying assumptions, e.g. having weights for

different sizes of VAs and the PA.

4- Detection Accuracy: We define this parameter over a set of packets, for varying degrees of attack-to-benign traffic mixes, and we intend to measure the detection and attack penetration ratios.

5- Information Integrity: This measure is similar to detection accuracy, except this is the

ratio of the information in the payload of the benign packets received safely and in order

(i.e. not dropped or arriving out of order).


6- Performance: In our case, we can define it by looking at the throughput and delay. The

throughput performance is best understood from the scalability of our platform (its actual

value matters less, as we are not competing with the production systems), while the delay

is best measured by the relative delay concept already discussed.

7- Overall resource utilization: This can be defined as one minus the ratio of the time that a unit is unused. If we wanted to be more specific, we could also look at the degree of utilization (i.e. to ensure a unit is not marked as "utilized" while it is under-utilized (e.g. a single inactive user flow)).

Overall, these parameters of interest help us document and discuss the process as well as the

outcome of our tests, and further aid us in our overall evaluation of our design. Table 5-1 lists the

testing parameters.

Parameter Name               | Exact Measure
Detection Time               | Detection time in ms (milliseconds)
Relative Delay Overhead      | RTT difference (to baseline) in ms
Scalability (horizontal)     | Ratio of rate of resource growth (number of resources) vs. growth in demand (Mbps)
Detection Accuracy           | Ratio of attack packets penetrating (for a given attack)
Information Integrity        | Ratio of non-attack packets safely arriving
Overall Resource Utilization | Ratio of resource utilization (CPU usage) over all deployed resources (over a period of time)

Table 5-1 – Chosen Testing Parameters


5.3 Test for Detection Time

In this test, the testing scheme is similar to that of our functional verification. In particular, we have three VMs, all in the same network: one user host VM, one IDS (VA or PA), and one server host VM. The traffic from user to server, and the reverse, is chained through the IDS. We now measure how long it takes the IDS to detect attacks. For the attacks, we have two, each representing one of the major attack categories described earlier: one DoS attack (TCP SYN flood), representing the DoS category (i.e. attacks that require counting), and another representing attacks whose signature is the mere existence of a keyword or regex pattern.

We performed the detection time test for both the VA and the PA. However, we are more interested in the case of the VA, because we get some options in terms of the placement of the VM (from a specific set of hypervisor servers). Hence, we can play with the placement of the IDS VM to see to what extent its placement makes a difference.

First, we performed the test for a typical / representative case of a single IDS doing detection.

We appreciate that single experiment evidence is not as representative as a repeated test over

time, so we did the latter as well, but first, we discuss the individual case in order to get a better

intuition on the DoS inspection traffic we are dealing with.

As mentioned in Chapter 4, the security profiles for our IDSes come with three different DoS threshold levels. This leveling, while reducing the granularity, is helpful to users with less technical security knowledge when configuring their profiles. Specifically, they have to choose between "low", "medium", and "high" security sensitivity levels. In this way, we also delegate the choice of the possible trade-offs (e.g. detection accuracy vs. how secure the system is) to the user. For the VA, we defined our Snort to have a 10-second DoS (SYN attack) detection threshold window (i.e. the counter's reset period is 10 seconds), while the Fortigate 111C unit's threshold period is fixed at a minute. However, it seems that it is ultimately the rate that matters, as the detection resources do not take the whole window for their detection to kick in.

Figures 5-3 to 5-5 demonstrate the detection time measurements we did for our implemented VA, while Figures 5-6 to 5-8 are for the case of the PA. In particular, each graph shows the Speedometer reading on the server side. The initial peak is the point where the attack starts arriving (e.g. almost midway through Figure 5-3, marked in red). Each sample represents a one-second average. The end point, where a sharp drop can be seen, is where the attack traffic is blocked. The time difference between these two points (reflected in the number of samples) determines the detection time, while the area under the graph would be the bandwidth-delay product. It is important to note that the vertical axis of the graph is log-scaled (i.e. the main four levels are 1KiB, 32KiB, 32MiB, 1GiB).

Figure 5-3 – VA Test Using a Low Sensitivity DoS Threshold (significantly long detection time, which is reflected in the number of data points between the red line (attack starts) and purple lines (attack blocked at IDS))


Figure 5-4 - VA Test Using a Medium Sensitivity DoS Threshold (less detection time compared to the "Low" setting)

Figure 5-5 - VA Test Using a High Sensitivity DoS Threshold (as if the attack traffic hardly ever gets to the victim VM)


One may wonder why the peaks differ for the same flavour of IDS (small-sized). There are at least two factors that affect the performance of the Snort-based VM IDS:

1- VM hypervisor / agent placement: The bottleneck may be certain virtual or physical switches, depending on the paths between the VM nodes / the VM topology.

2- IDS VM performance: It may differ from agent to agent. In particular, in SAVI's hypervisor implementation, there is no true CPU virtualization.

The threshold translates to the delay and the bandwidth of the attack received by the victim (i.e. the bandwidth-delay product, which is analogous to the amount of water that a flooded street has received as a result of the flood). The lower the threshold, the higher the security sensitivity, and the lower the delay and bandwidth of the attack. Of course, a higher level of security has its own well-known trade-off, which is a loss in performance. In our particular case, the loss in performance is realized through an increased number of false positives, as well as a larger number of logs generated (as traffic is detected as an attack more often). This increases the need for the automated management aspect of the problem.

Figure 5-6 - PA Test - Low Sensitivity (50000 packets per second threshold)


Figure 5-7 - PA Test - Medium sensitivity (5000 packets per second threshold)

Figure 5-8 - PA Test - High Sensitivity (1000 packets per second threshold)


Rather than fixed attack rates, we used flooding for our experiments, with which we observed up to 3.5 MiB/s (mebibytes per second) of attack traffic being felt at the victim. We used the tool hping3 to generate these attacks, and the tool Speedometer to display the traffic [72]. We used the default sampling rate of Speedometer, which is one sample per second.

The point of this anecdotal evidence is to help the reader appreciate the amount of traffic that goes through the system before detection. This matters a lot, as such transitional traffic, depending on its size, may be strong enough to disable all network operations for a significant amount of time (even after the attack has been detected). Hence, as we will discuss in the next section, strategic placement of the IDS to reduce the detection time matters. In particular, the DoS traffic may be measured by its bandwidth-delay product, which is a measure of how much data is on the network being delivered. By reducing the delay, we reduce the amount of attack traffic transitionally residing on the network before being detected and mitigated. As well, we took care to separate the control and inspection network channels; otherwise, it could have been troublesome for the detection entities to coordinate with one another.

We do not discuss the case of keyword detection, as we believe the primary factor in the detection time of such attacks is the time spent for (as little as) a single packet to traverse the network to the detection resource. This is the case because no counting, thresholding, or further processing over a group of packets is needed to detect this type of attack, and we believe the detection of a single packet is almost instant (at least by human standards; of course, it takes many CPU cycles). For the measurement of such traversal to be representative, we would have to model cloud network traffic and queuing times, perhaps using a probabilistic model, which is out of the scope of our work.

As discussed, in general, we expect better performance from the PA. However, this is primarily manifested in the reduced processing delay, reflected in the round-trip time, which is discussed in the next section. As for the detection time itself, the PA seems to be slightly more efficient than our VA implementation (the OVS- and Snort-based VM).

Besides the anecdotal evidence, we repeated the VA detection time experiment for the different sensitivity levels (Low, Medium, and High) over approximately 20 hours. In particular, we had the attacker log the attack initiation time, and the IDSes log the detection time. The firewall feedback functionality was disabled so that the attacks could happen repeatedly. Different attacks took different lengths of time, so the actual number of attacks differs: for the high, medium, and low security levels, we had 1890, 991, and 573 attacks, respectively. Table 5-2 summarizes our findings.

IDS Sensitivity | Average    | Min        | Max        | Ratio of Attacks Not Detected
High            | 8.451 sec  | 5.216 sec  | 15.044 sec | 0%
Medium          | 23.310 sec | 22.349 sec | 25.557 sec | 0%
Low             | 38.827 sec | 38.462 sec | 39.432 sec | 21.12%

Table 5-2 – Summary of the Experiment Measuring the Detection Time of Different Security Sensitivities

These results justify our choices of the security levels, as one can deduce the trade-offs from them, the major one being how much bandwidth-delay product the platform gets exposed to versus the accuracy of the detection platform.
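To put rough numbers on this trade-off, a back-of-the-envelope calculation combining Table 5-2 with the roughly 3.5 MiB/s of attack traffic we observed at the victim:

    MIB_PER_SEC = 3.5  # observed attack rate felt at the victim (MiB/s)
    avg_detection_time = {"High": 8.451, "Medium": 23.310, "Low": 38.827}  # seconds

    for level, t in avg_detection_time.items():
        # e.g. the "Low" setting lets roughly 136 MiB of attack traffic onto
        # the network before blocking kicks in, vs. roughly 30 MiB for "High".
        print("%-6s: ~%.0f MiB of attack traffic before blocking"
              % (level, MIB_PER_SEC * t))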

5.4 Test for Relative Delay Measurement

This test was not only to measure the relative delay, but also to see whether the placement of the IDS matters, and if so, how that could be utilized to increase the performance of the platform. Delay matters a lot: together with bandwidth, it is considered one of the most fundamental performance characteristics of any packet delivery network [73].

For this test, we considered several scenarios:

1- Scenario 1: The IDS VA is in the same hypervisor as the server being protected.

2- Scenario 2: The IDS is in a different hypervisor. We expect this scenario to have a higher round-trip delay (topologically, the packets have to traverse an extra hop from the IDS to the server).

3- We also have a baseline scenario (Scenario 0), for the case where the chaining through the IDS has not yet taken place (i.e. packets go directly from the user to the server VM, without going through the IDS VA).

In all cases, our VMs were on the same edge of the cloud. In general, placement of IDSes in separate edges does not make sense (unless it is forced by a constraint at a given point in time, e.g. one edge being completely full). Table 5-3 summarizes our findings for the three scenarios mentioned.

Scenario #   | Min RTT  | Average RTT | Max RTT   | Mean Deviation
0 (Baseline) | 0.484 ms | 0.945 ms    | 41.235 ms | 4.152 ms
1            | 0.897 ms | 1.779 ms    | 47.538 ms | 4.601 ms
2            | 0.737 ms | 2.073 ms    | 43.517 ms | 4.169 ms

Table 5-3 - Round Trip Times (RTTs) for the three VA Scenarios

The test was done with 100 pings. We find it particularly significant to point out the difference in average round-trip time between Scenarios 1 and 2, which is about 15%. While the difference seems small, it matters when dealing with low time budgets (e.g. the case of LTE) [25].

We also performed a similar measurement for the PA, with the results shown in Table 5-4. We only needed to consider two scenarios here, as whether the VMs are in the same agent or not makes no difference in this case, and the physical device cannot be moved on demand to accommodate delay needs (again demonstrating the benefit of the NFV paradigm).


Scenario # | Min RTT  | Average RTT | Max RTT   | Mean Deviation
Baseline   | 0.425 ms | 0.639 ms    | 34.125 ms | 4.112 ms
Chained    | 0.719 ms | 1.549 ms    | 35.595 ms | 4.588 ms

Table 5-4 - Round Trip Times (RTTs) for the two PA Scenarios

As one may observe, the overall round-trip time using the PA is the best among the chained cases (1.549 ms). As expected, this stems from the lower processing time that PAs typically have. In all test cases, since no DoS attack was involved, we had zero packet loss, which demonstrates that our system is reliable in terms of information integrity.

Another observation is that there seems to be considerable variance in the packet delay of our network. This stems from the fact that SDI's chaining is implemented using SDN / OpenFlow under the hood, which requires the first packet to be forwarded to the controller. In fact, that very first packet is the one causing the max RTT (35.595 ms in Table 5-4 and 43.517 ms in Table 5-3). As well, there may be a small amount of variance added by the IDS's deep processing of the packet, resulting in some packets passing through faster or slower than others.

While Scenario 1 has a higher mean deviation than Scenario 2, the difference is negligible (roughly within 10% of each other). If we exclude the first packet, the mean deviation for Scenario 0 decreases by roughly a factor of 50. This may call into question our choice of testing with merely 100 ping attempts. However, we verified our assumption that it is the first packet whose delay deviates heavily; the remaining delays are distributed much closer to one another. In fact, the use of 100 pings emphasizes the importance of the first-packet delay, especially for flows that are small in size / bandwidth yet delay-sensitive (e.g. voice samples in a conversation). Depending on how forgetful the OpenFlow-based switches are, the first-packet delay event may occur regularly, often enough to warrant considering first-packet delay performance. In our case, it occurs 1% of the time, and yet it introduces a significant change in the delay distribution. Further discussion of OpenFlow implementation details is outside the scope of our work; we left it mostly to SAVI Janus.


Another aspect of concern is that the choice of load balancing scheme has an impact on the relative delay. In our case, we did not use middleman load balancers; we did our load balancing in a distributed fashion at the switches, guided by the central controller / Master Agent. Had we gone with an implementation where the decision making itself was distributed (as discussed in Chapter 4), we would have had additional delays. We estimate the additional delay might have been as high as 0.5 to 1 ms, depending on the topology and hypervisor location.

5.5 Test for Scalability

As we mentioned in our parameters of interest section, a good measure for scalability is to show some linear form of growth of resources per unit of input / demand. We further claim that such linearity can be shown through an incremental analysis, where we show that going from one level / step to another, the increase continues to be linear (similar to a proof by induction, though we do not intend to be that mathematical in our argument here).

As discussed, scalability can be horizontal or vertical, and the linearity can be measured against several input parameters. Some of those parameters include individual or total user flow bandwidth, individual or total resource utilization, or the total number of user flows. Here, we focus only on horizontal scalability, and we study two of the mentioned parameters of interest, namely, total user flow bandwidth and total resource utilization. Hence, we measured the number of identical / unit VAs (in this case, the small flavor of the VA) against these two parameters.

We performed an incremental measurement, first for the case of upscaling (Scenario A), where we measured how the system scales up based on bandwidth. Then, in order to argue for our system's efficient resource utilization, we performed a second case analysis of its downscaling behavior (Scenario B). Here, we use scale up / out and scale down / in interchangeably, as we have already specified that our analysis is for horizontal scaling with unit-sized VAs only.

Ideally, we would have liked to test our system to see how far it can maximally scale. However, we were unable to do so, as the portion of our research cloud dedicated to this project was limited to about 17 VMs at any given point in time. So we tested with up to 10 user and server hosts (i.e. five user flows) and five IDS VAs, as well as a master module VM.


Our scaling test is measurement-based: as mentioned earlier, our chosen measure is CPU utilization, upon which we base the threshold for an action (scale up or down) to be performed.

That said, we shall admit that our test is essentially scheduled, as we manually add and remove load (and call the load balancer / assigner script in the Master module) throughout the experiment. Nonetheless, only one overflow resource is assigned in advance; the rest of the VA resources are spawned and deleted automatically, without our intervention.

We had not implemented a real-time IDS assignment and chaining module, and hence our system needs pre-setting. Even if we had, however, our measurement would still have been scheduled, as we would have to decide when in particular the traffic comes in, which may be arbitrary. Hence, the best approach is to test with live traffic, which we later discuss as potential future work.

Figures 5-9 and 5-10 show the incremental addition of resources against certain levels of bandwidth (BW), as pertaining to Scenario A. As the demand grows beyond a certain threshold, the number of VA resources scales up. However, as one may observe, there is some room for flexibility. It is important to note that for our decision-making algorithm, we actually used resource utilization (based on CPU), which correlates very closely with BW, as we will highlight in our Scenario B measurements.


Figure 5-9 – Total BW against time for scenario A (The x-Axis is aligned with Figure 5-10).

Figure 5-10 – Total resource number scaling as the BW sum increases in a step-like manner.

Overall, we see a step-like function, which can be described as the superposition of rectangular step functions. If we connect the threshold points, we find a linear function. This demonstrates the incremental scalability of our design, in that our algorithm automatically adds resources based on demand, growing linearly (O(n), where n is the total input bandwidth).

[Figure 5-9 panel: "Total User Flow Input BW of 10 Host - Server Pair (Mbps)" vs. Time (Minutes). Figure 5-10 panel: "Total # of Identical VA Resources" vs. Time (Minutes).]


Next, we review the case of Scenario B, the downscaling based on resource utilization, as shown in Figures 5-11 to 5-13. The first of these figures shows how our platform downscales under a reduced total user flow bandwidth, retiring overflow resources that may no longer be required. This is important: it is not merely the upscaling, but also the downscaling, that makes our platform as scalable as it can be.

In our Scenario B, a set of huge user flows initially kicks in, each consuming an entire VA resource. We then downscaled as they became inactive over time. The experiment involved IDSes named IDS-low-1 to IDS-low-5 ("low" here meaning the "small" flavor of VM, not to be confused with the low security sensitivity discussed earlier).

Overall, we measured an average resource utilization of 42.34% for our experiment, which ran over 2 hours. We could have studied the utilization in particular periods (user flows active and inactive). However, we find this number sufficient to show that we have a fair resource utilization ratio, considering we had one overflow resource that was unused the entire time.


Figure 5-11 – Number of Low/Small IDS Resources vs. the total inspection bandwidth over time

[Charts for Figure 5-11: Number of IDS VA Resources vs. Time, and Total Inspection BW (Mbps) vs. Time.]

In Figures 5-12 and 5-13, the resource utilizations of IDSes 1-4 are graphed (IDS #5 was ignored, since it was an overflow resource for the entirety of its short-lived operation). The utilization is described through both CPU utilization, as a percentage, and bandwidth, in Mbps (a different scale than the one used in Speedometer, which was log-based MBps). The x-axis is time; in this case, we left the actual timestamps in order to show the matching timeline. We aligned the BW and CPU graphs to see how well they compare. While we did not

calculate their correlation, it can clearly be confirmed visually that they are closely related and more or less follow the same trend. However, the CPU is the bottleneck, while the BW depends on the CPU, especially in the VAs. As mentioned, there is no true CPU virtualization in the SAVI testbed, which means that we must keep an eye on CPU capacity at all times, especially when the CPU is a bottleneck. Moreover, the CPU utilization ratio has a well-known maximum value (i.e., 100%), so we can use it to properly provision overflow resources (we cannot use the BW, since its peak is itself variable).

We now discuss the timeline of events in terms of user flows being chained. In particular, at about time 00:30 (half an hour past midnight), the system was at near-full utilization with five user flows and one overflow VA resource. After that, the user flows incrementally became inactive / idle, which resulted in downscaling of the number of resources. As we can see, this correlates well with both resource utilization and bandwidth. Hence, we have established a linear, step-like relationship among bandwidth, resource utilization, and the number of VA resources. This demonstrates that our platform is scalable.


Figure 5-12 - IDS-1 and 2 Resource Utilizations and Bandwidths throughout the experiment

[Charts for Figure 5-12: IDS-1 CPU Utilization (%) and IDS-low-2 Resource Utilization (%) vs. Time; IDS-1 Inspection Bandwidth (Mbps) and IDS-low-2 BW (Mbps) vs. Time.]


Figure 5-13 - IDS-3 and 4 Resource Utilizations and Bandwidths throughout the experiment

[Charts for Figure 5-13: IDS-3 and IDS-4 Resource Utilization (%) vs. Time; IDS-3 and IDS-4 Inspection BW (Mbps) vs. Time.]


Another point is that there is a delay before a scale-up or scale-down operation kicks in. This delay is required to observe the resource utilization over a window of time, in order to properly decide whether up- or down-scaling is actually needed.
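As an illustration of this observation delay, the following is a minimal sketch of how utilization samples could be smoothed over a window before any decision is allowed; the window length and names are our own assumptions, not the thesis implementation.

    from collections import deque

    # Illustrative observation window (hypothetical parameters): average
    # the last WINDOW samples before allowing a scaling decision, so that
    # short transients do not trigger spurious scale up/down actions.

    WINDOW = 12          # e.g., 12 samples taken 10 s apart = 2 minutes
    samples = deque(maxlen=WINDOW)

    def observe(cpu_utilization):
        """Record one CPU utilization sample (0.0-1.0)."""
        samples.append(cpu_utilization)

    def smoothed_utilization():
        """Return the windowed average, or None until the window fills."""
        if len(samples) < WINDOW:
            return None   # not enough history yet; hold off on any action
        return sum(samples) / len(samples)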

As we can see in the paired graphs, inspection bandwidth and resource utilization have a one-to-one relationship, following the same trends in terms of increase and decrease. Hence, for the VA case, it is the computing capacity of the CPU that determines the maximum inspection bandwidth. Moreover, as discussed, we know the maximum value of its ratio parameter. This justifies our choice of CPU utilization as the measure for up- and down-scaling.

Figures 5-10 and 5-11, together, attest to the high scalability potential of our system. They demonstrate how resources are spawned and assigned based on demand and user requirements. The system realizes IDS virtualization for the various user flows while not mixing the traffic of different protected entities, keeping them isolated. Overall, we showed how our system triggers scaling up/down while using overflow resources to absorb the performance transients between when a change is detected and when the resource adjustments are actually performed. Of course, our thresholds are static; a more adaptive system could use dynamic thresholds that are learned over time.

5.6 Tests for Detection Accuracy and Information Integrity

In this section, we discuss the tests involving two of our figures of merit: the detection accuracy, seen as the ratio of penetrations prevented, and the ratio of packet loss, which is a factor in the information integrity of the communication involved. This is especially the case for transport protocols that do not involve acknowledgements. In particular, we tested using the UDP protocol, so that our packet loss would indeed be a loss of information integrity (as opposed to TCP, which corrects for losses over unreliable networks).

Detection accuracy may be defined in several ways. If we go with a definition based on individual IDS performance, we see 100% accuracy under normal circumstances. This changes, however, when the IDS is tested not with a single isolated flow but with several user flows: once the IDS reaches its capacity, it starts dropping packets.
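For reference, one common formulation from the IDS evaluation literature (stated here for illustration; it is not a metric defined in this work) expresses accuracy through the detection rate and false positive rate:

% A standard definition from the IDS literature (illustrative, not the
% thesis's own metric): TP/FN are attacks detected/missed, and FP/TN
% are benign packets flagged/passed, respectively.
\[
  \text{Detection Rate} = \frac{TP}{TP + FN},
  \qquad
  \text{False Positive Rate} = \frac{FP}{FP + TN}
\]

Under this formulation, the capacity-induced packet drops described above manifest as false negatives, lowering the detection rate even though the signature matching itself is perfect.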


Revisiting our basic functional test, we can say that the probability of a keyword-based attack penetrating the system (in the unencrypted scenario) is zero. However, not every case of an attack failing to penetrate is necessarily due to the IDS blocking it; it could also be a network issue or the IDS buffer running out. In particular, we tested a scenario in which the ratio of attack to benign traffic is much smaller (less than 0.01 percent) than in our functional verification, and still saw zero penetration.

Under attack conditions, however, we expect to initially have zero packet loss until the attack grows beyond the maximum inspection capacity of the IDS, at which point the loss starts increasing at the same rate at which the total input traffic of the specific IDS grows. Figure 5-14 shows an example of our estimate, where the maximum inspection BW of the individual IDS is 10 Mbps and the BW grows at a rate of 2 Mbps per second (2 megabits per second squared).
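This expectation can be written compactly (the notation is our own, matching the figure's parameters): with inspection capacity C = 10 Mbps and input bandwidth growing as B(t) = rt with r = 2 Mbps/s,

% Expected loss rate for an IDS of capacity C under linearly growing
% input B(t) = r t (notation introduced here for clarity):
\[
  L(t) \;=\; \max\bigl(0,\; B(t) - C\bigr) \;=\; \max\bigl(0,\; r\,t - C\bigr),
\]

i.e., zero loss until t = C/r (here, 5 seconds), after which the loss grows at the same rate r as the input traffic.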

Figure 5-14 – Expected growth rate of packet loss for a given IDS

[Chart for Figure 5-14: Expected Packet Loss (Mb) vs. Input Traffic (Mbps).]

For the test, we considered a simple scenario using UDP packets generated by Iperf [75]. We first tested without an attack, and then with an attack. As before, we used hping3 to generate DoS traffic with a period of 1 millisecond (i.e., 1000 packets per second). Figures 5-15 and 5-16 demonstrate the attack as well as the test (the second figure shows first the benign case, then the mixed attack-and-Iperf case).
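A test of this shape could be scripted roughly as follows. This is a hedged sketch under our own assumptions (the server address, rate, and durations are placeholders), not the exact commands used in the thesis.

    import subprocess

    # Illustrative driver for the UDP loss test (placeholder address and
    # rates). Iperf reports the datagram loss ratio at the end of a UDP
    # run, which is the quantity we treat as the integrity loss.

    SERVER = "10.0.0.2"      # hypothetical protected server behind the IDS

    def run_benign(duration_s=60, rate="5M"):
        """UDP Iperf client run: -u selects UDP, -b sets the offered rate."""
        return subprocess.run(
            ["iperf", "-u", "-c", SERVER, "-b", rate, "-t", str(duration_s)],
            capture_output=True, text=True).stdout

    def run_attack(duration_s=10):
        """hping3 at one packet per millisecond (-i u1000 = 1000 us);
        hping3 typically requires root privileges."""
        return subprocess.run(
            ["timeout", str(duration_s), "hping3", "-i", "u1000", SERVER],
            capture_output=True, text=True).stdout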

Figure 5-15 – The attack used in the second scenario to measure the packet loss

Figure 5-16 – Iperf test, first without an attack, and then under an attack scenario

We see a moderate loss of 2.8% in the second scenario. It is important to note that our attack was small and lasted only briefly (as the first figure shows), yet its small effect is still felt. We consider this extent of testing sufficient to illustrate the point; an ideal test would involve production network traffic, or else a replay of carefully chosen traffic simulating a normal network traffic level, which is beyond the scope of our work.

This result could, however, indicate a lower bound on our information integrity, as it was obtained using the smallest flavor of our VA (versus a bigger flavor, or the PA). Hence, we expect almost 100% integrity below the threshold and a linearly growing packet loss beyond the threshold point. These results also give us confidence in our overall detection accuracy.


Chapter 6 Conclusion

In this chapter, we present an overall evaluation of our work and discuss the conclusions we have drawn. The last two subsections discuss our future work and the contributions we have made.

6.1 Overall Evaluation

Perhaps the best way to evaluate our design is to look at the extent to which it has met its desired requirements. As shown by our functional evaluation, the main desired function, signature-based intrusion detection, is indeed provided by our system. In terms of mitigation, we provided blocking both at individual NIDPSes and through our scalable distributed firewall, which includes feedback to the switch.

In terms of scalability, we have indeed established a linear relationship through our incremental analysis in Chapter 5. We saw this in the step-like linear functions for scaling the number of virtualized resources up and down based on demand. We showed that our threshold-based system is very scalable, so long as its overflow-assigning rate is matched to the input traffic. Of course, this requires traffic modelling, which is outside the scope of our work. Nevertheless, our claim of scalability is supported by our experimental data, which demonstrate that our system scales based on demand and leverages overflows, yet maintains a reasonable overall utilization ratio (roughly 40% over 2.5 hours of experiment).

We have realized an interoperable solution and an extensible API, with relatively little overhead (e.g., encryption for SSH), which is justified by the added security of our solution. Our interoperability is demonstrated through the integration of both physical and virtual appliances in our platform.

While we did not measure the security itself, we showed that our distributed-trust approach, the redundant master controllers, and our proposed self-investigation schemes increase the overall reliability of our system.


We have demonstrated various levels of security sensitivity, as well as providing the user with the flexibility of choosing the flavour of security they desire (along with the trade-offs involved). We have demonstrated reasonable detection times for the various sensitivity levels. By implementing a user interface and a user-centric API, we have demonstrated that our system is user-friendly.

Likewise, we have shown that, given correct signatures, our system has near-perfect detection accuracy and information integrity. In particular, packets are not lost unless a user flow demonstrates rogue behaviour (i.e., exceeding its reasonably assigned bandwidth limits). In our case, the capacity of the virtualized resource is between 15 and 20 Mbps, which is reasonable performance for a cloud-based solution.

Through our relative delay analysis, we showed how our system is capable of assigning the IDS resource as close as possible to the protected VM (within the cloud network). This helps address the problem raised in our motivation section, namely the lack of applicability of a "periphery" to cloud networks.

6.2 Future Work

Our future work includes the following:

1- The addition of analytics-based detection: a sample implementation could be a basic clustering algorithm leveraging our IDS assignment API in order to further investigate traffic flagged as anomalous. The clustering algorithm may utilize a cloud monitoring and measurement system (e.g., looking at the CPU utilization profiles of the VMs in a given cloud).

2- The implementation of an automatic, real-time user flow assigner: our current IDS assignment scheme is passive, in that it requires the flows to be pre-defined for the IDS assignment to take place. An alternative approach is to have flows actively set up in real time. This may use a controller that extracts from the packet header the information defining the user flow (i.e., source and destination IPv4 addresses); a hedged sketch of such a controller follows this list. There may also be a static flow setup, giving direct SSH access to the administrator's traffic without it having to go through the controller. A major concern for this scheme is the flow setup time.


3- The inclusion of other security resources: NetFPGA, Host-based Intrusion Detection Systems (HIDSes), legacy firewalls, and other PAs and VAs. Our proof-of-concept included only one VA (a Snort-based VM) and one PA (a Fortigate 111C unit); the more of these included, the more extensible and interoperable the system is demonstrated to be.

4- Testing with real-time live traffic: this test requires expanding our attack signature set, which relates to the next suggestion. One must also ensure much stricter user privacy, in particular anonymization of the traffic logs.

5- The addition of an attack signature / antivirus update scheme, as well as considering implementing the ESP repository using a resilient database rather than the raw filesystem (the current implementation). The database approach is more scalable, and may include redundancy to increase resiliency.
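The following is a minimal sketch of the flow-defining extraction step suggested in item 2, written against the Ryu OpenFlow controller framework as one possible (assumed) controller choice; the class and handler here are illustrative, not part of the thesis implementation.

    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
    from ryu.lib.packet import packet, ipv4

    class FlowAssigner(app_manager.RyuApp):
        """Illustrative packet-in handler: extract the (src, dst) IPv4
        pair that defines a user flow, so an IDS could be assigned in
        real time. (A sketch against Ryu, not the thesis's own module.)"""

        @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
        def packet_in_handler(self, ev):
            pkt = packet.Packet(ev.msg.data)
            ip = pkt.get_protocol(ipv4.ipv4)
            if ip is None:
                return  # non-IPv4 traffic: nothing to assign here
            user_flow = (ip.src, ip.dst)
            # A real assigner would now look up the flow's Enhanced
            # Security Profile, pick the nearest IDS VA, and push the
            # service-chaining flow rules; here we only log the extraction.
            self.logger.info("user flow detected: %s -> %s", *user_flow)

The flow setup time noted in item 2 would be dominated by this packet-in round trip plus the rule installation, which is why a static bypass for administrative traffic may be desirable.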

6.3 Contribution

The main focus of our work has been on the network security architecture. Our contributions include the following:

- We presented a novel, scalable architecture for IDS deployment and management which takes advantage of Software Defined Infrastructure's integrated management of networking and compute resources.

- By leveraging SDI's service chaining and management capabilities, our platform can locate the desired security resource as close as possible to the entity being protected. We also demonstrated a framework where load balancing is embedded in service chaining / IDS assignment, through logically centralized decision making.

- We introduced the notions of Enhanced Security Groups (ESGs) and Enhanced Security Profiles in order to communicate and realize the security needs of the user in the intrusion detection platform. As discussed, this scheme is solid, clear, and novel.


- We discussed various aspects of IDS scalability and of management / deployment architectures (hierarchical and flat), and showed that our hierarchical architecture is quite scalable, distributing trust and avoiding single points of failure.

- Our approach to IDS deployment was a network-service-based one. We presented how our work can be modeled in various paradigms, such as NFV, SDI, and SDN; in doing so, we presented a pluralistic view in the depiction of our design.

- Our proof-of-concept implementation was quite extensive. We arrived at a flexible implementation where our modules are orchestrated as building blocks to realize any given user flow topology, while remaining load balanced and scaling up and down based on demand.

In closing, we note that there is no such thing as an ultimate security solution for cloud networking. However, we hope to have made clear that security schemes may have to change fundamentally when facing evolving compute and networking paradigms.

References

[1] Center for Strategic and International Studies. (2013, July) The economic impact of

cybercrime and cyber espionage. [Online]. Available:

http://www.mcafee.com/us/resources/reports/rp-economicimpact-cybercrime.pdf

[2] A. Zaharia. (2016, May) 10 alarming cyber security facts that threaten your data. [Online].

Available: https://heimdalsecurity.com/blog/10-surprisingcyber-security-facts-that-may-affect-

your-online-safety

[3] H. P. Strategies. Cybercrime costs more than you think. [Online]. Available:

http://www.hamiltonplacestrategies.com/sites/default/files/newsfiles/HPS%20Cybercrime20.pdf

[4] Statistics Canada. (2016, August) Canada: Economic and financial data. [Online]. Available:

http://www.statcan.gc.ca/tablestableaux/sum-som/l01/cst01/dsbbcan-eng.htm

[5] PWC. (2016) The global state of information security survey 2016. [Online]. Available:

http://www.pwc.com/gx/en/issues/cybersecurity/information-security-survey.html

[6] Microsoft Inc. (2016) Advanced threat analytics. [Online]. Available:

https://www.microsoft.com/en-us/cloudplatform/advanced-threat-analytics

[7] CyberEdge Group. (2015) 2015 cyberthreat defense report north america & europe. [Online]. Available: https://www.bluecoat.com/sites/default/files/documents/files/CyberEdge2015CDRReport.pdf

[8] ISACA. (2015, January) 2015 global cybersecurity status report. [Online]. Available:

http://www.isaca.org/pages/cybersecurityglobal-status-report.aspx

[9] A. Jumratjaroenvanit and Y. Teng-Amnuay. "Probability of attack based on system

vulnerability life cycle." 2008 International Symposium on Electronic Commerce and Security.

IEEE, 2008.

[10] P. Mell and T. Grance. “The NIST definition of cloud computing,” 2011.


[11] Kaspersky Inc. (2015, December) Security bulletin 2015. [Online]. Available:

https://securelist.com/analysis/kaspersky-securitybulletin/73038/kaspersky-security-bulletin-

2015-overallstatistics-for-2015

[12] Trustwave. (2016) 2016 trustwave global security report. [Online]. Available:

https://www2.trustwave.com/GSR2016.html?utm_source=library\&utm_medium=web\&utm_ca

mpaign=GSR2016

[13] N. Boudriga. (2010). Security of mobile communications. Boca Raton: CRC Press. pp. 32–

33. ISBN 0849379423.

[14] S. Northcutt, L. Zeltzer, S. Winters, K. Fredrick, and R. Ritchey, “Inside Network Perimeter

Security”, New Riders, 2003, p 4

[15] Amazon Inc. (2016). Amazon EC2 Security Groups for Linux Instances. [Online].

Available: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

[16] A. Milenkoski, , M. Vieira, S. Kounev, A. Avritzer, and B. Payne(2015). Evaluating

Computer Intrusion Detection Systems: A Survey of Common Practices. ACM Computing

Surveys (CSUR), 48(1), 12.

[17] T. AbuHmed, A. Mohaisen, and D. Nyang, “A survey on deep packet inspection for

intrusion detection systems,” arXiv preprint arXiv:0803.0037, 2008.

[18] H. Xu (2016, July). Cloud Native Security Paradigm Shift. [Online]. Available:

https://www.sdxcentral.com/articles/contributed/cloud-native-security-paradigm-shift-2/2016/07

[19] J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-end arguments in system design,” ACM

Transactions on Computer Systems (TOCS), vol. 2, no. 4, pp. 277–288, 1984.

[20] L. Armasu (2016, April). Google’s Zero Trust 'BeyondCorp' Infrastructure Shows Future Of

Network Security. [Online]. Available: http://www.tomsitpro.com/articles/google-beyondcorp-

future-network-security,1-3229.html


[21] A. Hussain, J. Heidemann, and C. Papadopoulos, “A framework for classifying denial of

service attacks,” in Proceedings of the 2003 conference on Applications, technologies,

architectures, and protocols for computer communications. ACM, 2003, pp. 99–110.

[22] H. P. Levy (2015, October). What’s New in Gartner’s Hype Cycle for Emerging Technologies,

2015. [Online]. Available: http://www.gartner.com/smarterwithgartner/whats-new-in-gartners-

hype-cycle-for-emerging-technologies-2015/

[23] A. Raff (2013, December). From Prevention to Detection: A Paradigm Shift in Enterprise Network

Security. [Online]. Available: http://www.securityweek.com/prevention-detection-paradigm-

shift-enterprise-network-security

[24] U. Frank and S. Strecker, “Open reference models-communitydriven collaboration to

promote development and dissemination of reference models,” Enterprise Modelling and

Information Systems Architectures, vol. 2, no. 2, pp. 32–41, 2015.

[25] S. Monfared, H. Bannazadeh, and A. Leon-Garcia, “Software defined wireless access for a

two-tier cloud system,” in IFIP/IEEE International Symposium on Integrated Network

Management (IM). IEEE, 2015, pp. 566–571.

[26] H. El-Rewini and M. Abd-El-Barr (Apr 2005). Advanced Computer Architecture and

Parallel Processing. John Wiley & Son. p. 63. ISBN 978-0-471-47839-3. Retrieved Oct 2013

[27] Cloud Security Alliance Big Data Working Group, “Big data analytics for security intelligence,”

https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_Security_I

ntelligence.pdf, accessed: 2015-12-01.

[28] P. Yasrebi, S. Monfared, H. Bannazadeh, and A. Leon-Garcia, “Security function

virtualization in software defined infrastructure,” in IFIP/IEEE International Symposium on

Integrated Network Management (IM). IEEE, 2015, pp. 778–781.

[29] V. Alfred, “Algorithms for finding patterns in strings,” Algorithms and Complexity, vol. 1,

p. 255, 2014.


[30] L. Yang, R. Dantu, T. Anderson, and R. Gopal, “Forwarding and control element separation

(forces) framework,” RFC 3746, April, Tech. Rep., 2004.

[31] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S.

Shenker, and J. Turner, “Openflow: Enabling innovation in campus networks,” SIGCOMM

Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008. [Online]. Available:

http://doi.acm.org/10.1145/1355734.1355746

[32] P. Gupta and N. McKeown, “Packet classification on multiple fields,” in ACM SIGCOMM

Computer Communication Review, vol. 29, no. 4. ACM, 1999, pp. 147–160.

[33] C. Cui, H. Deng, D. Telekom, U. Michel, H. Damker, T. Italia, I. Guardini, E. Demaria, R.

Minerva, and A. Manzalini, “Network functions virtualisation.”

[34] ETSI GS NFV 001, “Network Functions Virtualisation (NFV): Use Cases,” 2013.

[35] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield,

“Live migration of virtual machines,” in Proceedings of the 2nd conference on Symposium on

Networked Systems Design & Implementation-Volume 2. USENIX Association, 2005, pp. 273–

286.

[36] J.-M. Kang, H. Bannazadeh, H. Rahimi, T. Lin, M. Faraji, and A. Leon-Garcia, “Software-

defined infrastructure and the future central office,” in IEEE International Conference on

Communications (ICC) Workshops, 2013. IEEE, 2013, pp. 225–229.

[37] J.-M. Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia, “Software-defined infrastructure

and the savi testbed,” in International Conference on Testbeds and Research Infrastructures.

Springer, 2014, pp. 3–13.

[38] M. Ghaznavi, N. Shahriar, R. Ahmed, and R. Boutaba, “Service function chaining

simplified,” arXiv preprint arXiv:1601.00751, 2016.

[39] S. Roschke, F. Cheng, and C. Meinel, “Intrusion detection in the cloud,” in Eighth IEEE

International Conference on Dependable, Autonomic and Secure Computing, 2009. DASC’09.

IEEE, 2009, pp. 729–734.


[40] S. N. Dhage and B. Meshram, “Intrusion detection system in cloud computing

environment,” International Journal of Cloud Computing, vol. 1, no. 2-3, pp. 261–282, 2012.

[41] A. M. Lonea, D. E. Popescu, and H. Tianfield, “Detecting ddos attacks in cloud computing

environment,” International Journal of Computers Communications & Control, vol. 8, no. 1, pp.

70–78, 2013.

[42] T. Alharkan and P. Martin, “Idsaas: Intrusion detection system as a service in public

clouds,” in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud

and Grid Computing (ccgrid 2012). IEEE Computer Society, 2012, pp. 686–687.

[43] S. T. Zargar, H. Takabi, and J. B. Joshi, “Dcdidp: A distributed, collaborative, and data-

driven intrusion detection and prevention framework for cloud computing environments,” in 7th

International Conference on Collaborative Computing: Networking, Applications and

Worksharing (CollaborateCom), 2011. IEEE, 2011, pp. 332–341.

[44] C.-C. Lo, C.-C. Huang, and J. Ku, “A cooperative intrusion detection system framework for

cloud computing networks,” in 39th International Conference on Parallel Processing Workshops.

IEEE, 2010, pp. 280–284.

[45] C. Mazzariello, R. Bifulco, and R. Canonico, “Integrating a network ids into an open source

cloud computing environment,” in Sixth International Conference on Information Assurance and

Security (IAS). IEEE, 2010, pp. 265–270.

[46] A. Bakshi and Y. B. Dujodwala, “Securing cloud from ddos attacks using intrusion

detection system in virtual machine,” in Second International Conference on Communication

Software and Networks, 2010. ICCSN’10. IEEE, 2010, pp. 260– 264.

[47] H. Hamad and M. Al-Hoby, “Managing intrusion detection as a service in cloud networks,”

International Journal of Computer Applications, vol. 41, no. 1, 2012.

[48] IETF Network Working Group (2007, March). The Intrusion Detection Message Exchange

Format. [Online]. Available: https://www.ietf.org/rfc/rfc4765.txt


[49] J. Hoagland, S. Staniford et al., “Viewing ids alerts: Lessons from snortsnarf,” in DARPA

Information Survivability Conference & Exposition II, 2001. DISCEX’01. Proceedings, vol.

1. IEEE, 2001, pp. 374–386.

[50] A. Hutchison and M. Welz, “Ids/a: An interface between intrusion detection system and application,” in Recent Advances in Intrusion Detection, Third International Workshop, RAID 2000, Toulouse, France. [Online]. Available: http://www.Raidsymposium.org/raid2000/Materials/Abstracts/21/21.pdf. Citeseer, 2000, p. 13.

[51] S. McCahan, P. Anderson, M. Kortschot, P. Weiss, and K. Woodhouse, Designing Engineers: An Introductory Textbook.

[52] Provos, Niels. "A Virtual Honeypot Framework." In USENIX Security Symposium, vol.

173, pp. 1-14. 2004.

[53] Prevelakis, Vassilis, and Diomidis Spinellis. "Sandboxing Applications." In USENIX

Annual Technical Conference, FREENIX Track, pp. 119-126. 2001.

[54] C. Cui, H. Deng, U. Michel, and H. Damker. "Network Functions Virtualization. An

Introduction, Benefits, Enablers, Challenges & Call for Action". [Online] Available:

http://course.ipv6.club.tw/SDN/nfv_white_paper.pdf

[55] R. Jain, and P. Subharthi. "Network virtualization and software defined networking for

cloud computing: a survey." IEEE Communications Magazine 51, no. 11 (2013): 24-31.

[56] "Network Functions Virtualisation (NFV); Terminology for Main Concepts in NFV" (PDF).

Retrieved July 2016.

http://www.etsi.org/deliver/etsi_gs/NFV/001_099/003/01.02.01_60/gs_nfv003v010201p.pdf

[57] M. Roesch et al., “Snort: Lightweight intrusion detection for networks.” in LISA, vol. 99,

no. 1, 1999, pp. 229–238.

[58] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J.

Stringer, P. Shelar et al., “The design and implementation of open vswitch,” in 12th USENIX

symposium on networked systems design and implementation (NSDI 15), 2015, pp. 117–130.


[59] Fortinet Inc. Fortigate 111C Quick Start Guide. [Online]. Available:

http://docs.fortinet.com/uploaded/files/845/FortiGate-111C_QuickStart_Guide_01-30007-0469-

20090415.pdf

[60] SANS Institute (2002). Using Snort For a Distributed Intrusion Detection System. [Online].

Available: https://www.sans.org/reading-room/whitepapers/detection/snort-distributed-intrusion-

detection-system-352

[61] Gartner Inc. (2016) Gartner fortinet firewall market share 2016. [Online]. Available:

http://www.slideshare.net/zztop_2764/gartner-fortinet-firewall-market-share-2016

[62] F. Belqasmi, R. Glitho, and C. Fu, “Restful web services for service provisioning in next-

generation networks: a survey,” IEEE Communications Magazine, vol. 49, no. 12, pp. 66–73,

2011.

[63] D. A. Menascé, "Security performance." IEEE Internet Computing 7, no. 3 (2003): 84-87.

[64] M. Faraji, “Identity and access management in multi-tier cloud infrastructure,” Ph.D.

dissertation, Citeseer, 2013.

[65] Docker Inc. (2016) What is Docker? [Online]. Available: https://www.docker.com/what-

docker

[66] The OpenFlow Team (2011, February). OpenFlow Switch Specification Version 1.1.0. [Online].

Available: http://archive.openflow.org/documents/openflow-spec-v1.1.0.pdf

[67] OpenStack Foundation. Heat. [Online]. Available: https://wiki.openstack.org/wiki/Heat

[68] “Express” [Online]. Available: https://www.npmjs.com/package/express

[69] “Bootstrap” [Online]. Available: http://getbootstrap.com, accessed: 2015-12-01.

[70] “TCPDUMP Manual” [Online]. Available: http://www.tcpdump.org/tcpdump_man.html

[71] “hping security tool – man page” [Online]. Available: http://www.hping.org/manpage.html

[72] “Speedometer” [Online]. Available: https://excess.org/speedometer, accessed: 2015-12-01.


[73] D. Katabi, M. Handley, and C. Rohrs, “Congestion control for high bandwidth-delay

product networks,” ACM SIGCOMM computer communication review, vol. 32, no. 4, pp. 89–

102, 2002.

[74] “Netcat user manual,” [Online]. Available: http://www.r3v0.net/docs/Delta/man/nc.html,

accessed: 2015-12-01.

[75] “Iperf” [Online]. Available: https://iperf.fr