CHAPTER I: INTRODUCTION

1.1 Overview

This project concerns network forensics, which allows the details of network
events to be recovered after they have happened, and shows how to analyze the
data patterns of attacks on VoIP using WEKA, a data mining tool. WEKA is used to
examine network traffic in order to investigate network and security attacks or
application performance issues. From the data patterns, an investigation will be
conducted to reveal information about network and application interactions, user
sessions, response times and latency metrics. The investigation also aims to
establish the source of the attacks, when they happened, where they came from
and what types of attack were found. Tracking down an attacker depends on
keeping extensive records of network activity with the help of an intrusion
detection system.

The gathered data will help to find a solution for each attack so that it can be
prevented from happening again. The data analysis also reveals who communicated
with whom, when, and how often. This information could be used as evidence by
victims who wish to take further action against the parties that committed
network crimes against them.

1.2 Project Objective

The main objective of this project is to analyze the patterns of attack data in
captured network traffic. The captured data indicate the condition of the
network at the time of each event, so that the source of attacks or other
incidents can be discovered. This helps in identifying unauthorized access to a
computer system and in searching for evidence of other types of threat or attack.

The second objective is to convert pcap data into arff data files that can be
recognized by the WEKA data mining tool. The first objective cannot be carried
out unless the second objective is achieved.

1.3 Project Scope

This project focuses on VoIP and the data patterns of attacks on it, analyzed
using WEKA, a data mining tool. Denial-of-Service (DoS), Spam over Internet
Telephony (SPIT) and Man-in-the-Middle (MiTM) attacks are the three main focuses
of this project.

1.4 Problem Statement

The growth in network connectivity, complexity and activity has increased the
number of crimes committed within networks. Emerging applications such as VoIP
have worsened the situation. Knowing the attack patterns allows network
administrators to protect their networks.

VoIP is one of the newest technologies being rapidly embraced by the market as
an alternative to the traditional Public Switched Telephone Network (PSTN).
Common VoIP threats include network-based DoS, eavesdropping, attacks on
signaling protocols and spam. These attacks allow malicious parties to listen in
on others' conversations, and they can make conversations unintelligible through
network overload, packet loss or congestion that brings the network down. In
addition, the bandwidth available to each application on the network is reduced,
since it is shared among the applications.

1.5 Problem Solving

The problem can be addressed with network analysis tools. In this project WEKA,
a data mining tool, will be used to examine the network traffic history in order
to investigate the attacks and identify their source.

1.6 Chapter Organization

This chapter has described the proposed project, VoIP data forensics using WEKA,
a data mining tool. It has outlined how VoIP data forensics works and how
attacks affect VoIP applications. More details about VoIP data forensics using
WEKA are given in Chapter 4.

Chapter 2 presents the literature review, which describes the research and
findings related to this project.

Chapter 3 discusses the research methodology and the specific methods used to
design the project. It explains the methods and specifications used in this
project and sets out the budget and costing.

Chapter 4 discusses the testing and implementation of the project and explains
how the project is implemented.

Chapter 5 discusses the project verification. It presents the results of the
project implementation and experiments, from which the reader will understand
how the system runs and what its final output is.

Chapter 6 concludes the project. Suggestions and enhancements are listed in that
chapter for future reference.

CHAPTER II: LITERATURE REVIEW

This chapter discusses several subjects related to this project. The review
starts with the definitions and concepts of VoIP, data mining and network
forensics. It also covers the existing VoIP protocols and VoIP security issues.
Work by other researchers in the area of study is then reviewed.

2.1 Background

2.1.1 Voice over Internet Protocol (VoIP)

VoIP, also known as IP telephony, carries voice over an IP network using the
Internet Protocol. With the growth in its popularity and in available bandwidth,
VoIP allows phone calls to be routed over the Internet rather than the Public
Switched Telephone Network (PSTN). VoIP converts the voice signal into digital
data that travels over the Internet. The voice signal is packetized and the
packets are sent over the network one by one. Packetization involves compressing
the caller's voice signal, transferring it over the IP network, and
decompressing it at the receiving end [1]. VoIP can therefore run on any data
network that uses IP, such as Local Area Networks (LANs), the Internet and
intranets [1].

There are several reasons why VoIP telephony is becoming more attractive than
the PSTN to telecommunication providers and users. Lower call cost is one of the
main reasons. It is relatively cheap to make a long-distance call through a VoIP
service compared with the PSTN, because network resources such as bandwidth,
router CPU and memory are shared between applications on the Internet [2]. With
a PSTN line, users pay for each minute spent on the phone. Because the Internet
is the backbone of VoIP, the only cost to the user is the monthly bill from an
Internet service provider (ISP). Another reason is that VoIP services can be
used for conference calls, as opposed to a phone line on which only two people
can speak at a time. With VoIP, a conference can be set up with a whole team
communicating in real time.

Figure 2.1 shows a simple VoIP process. To send voice over the Internet, the
voice data are compressed into small packets to reduce the transmission capacity
required. These packets may arrive in a different order from that in which they
were sent, and they are put back into sequence at the receiving end. Packet loss
can happen during transmission. A recovery mechanism conceals the loss and
rebuilds the data from the pieces of information that were received [3].
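The reordering and loss concealment described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: each
packet is assumed to carry a sequence number (as RTP packets do), the function
name `reassemble` and the sample payloads are hypothetical, and a lost packet is
concealed by simply repeating the previous frame, which is one simple strategy
and not necessarily the mechanism described in [3].

```python
def reassemble(packets, total):
    """Put (sequence_number, payload) pairs back in order.

    Missing sequence numbers are concealed by repeating the previous
    payload. If the very first packet is missing, an empty payload
    (silence) is used instead.
    """
    by_seq = {seq: payload for seq, payload in packets}
    stream = []
    last = b""  # assumption: silence before the first frame
    for seq in range(total):
        payload = by_seq.get(seq)
        if payload is None:
            payload = last  # simple concealment: repeat the last frame
        stream.append(payload)
        last = payload
    return stream


# Hypothetical capture: packets arrive out of order and packet 1 is lost.
received = [(2, b"C"), (0, b"A"), (3, b"D")]
frames = reassemble(received, total=4)
print(frames)
```

Real receivers use a jitter buffer that also waits a short time for late
packets before declaring them lost; that timing logic is left out here.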

Figure 2.1: VoIP Processing [3]

VoIP also has potential problems, such as increased security risks, lower
Quality of Service (QoS) and exposure to Denial of Service (DoS) attacks [4]. In
the PSTN, a circuit or dedicated channel is set up between two points for the
duration of the call. These telephony systems are based on copper wires carrying
analog voice data over the dedicated circuits [5]. A set amount of bandwidth is
reserved between the callers when a call is established, for as long as the
connection is active. One of the main problems with PSTN technology is that the
64 kbps of bandwidth remains reserved even when no data are being sent and the
entire bandwidth is not needed. The actual bandwidth requirement is usually only
a small fraction of what is reserved [4].
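The 64 kbps figure follows from standard digital telephony, in which voice is
sampled 8000 times per second at 8 bits per sample. The short calculation below
reproduces that figure and estimates how much of the reserved circuit sits idle.
The `talk_fraction` parameter and the value used for it are hypothetical
illustrations, not measurements from this project.

```python
SAMPLE_RATE_HZ = 8000  # standard telephony sampling rate
BITS_PER_SAMPLE = 8    # 8-bit PCM samples


def reserved_kbps():
    """Bandwidth reserved for one PSTN voice channel, in kbps."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000  # 8000 * 8 / 1000 = 64.0


def idle_kbps(talk_fraction):
    """Reserved bandwidth that carries no speech.

    talk_fraction is the share of the call during which this direction
    of the circuit actually carries speech (an assumed value).
    """
    return reserved_kbps() * (1 - talk_fraction)


print(reserved_kbps())
print(idle_kbps(0.5))  # assumed: the caller speaks half of the time
```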

VoIP telephony relies on various methods and protocols to establish calls and
transmit data. Most VoIP implementations, however, use the Session Initiation
Protocol (SIP) and the Real-time Transport Protocol (RTP). SIP is a text-based
application-layer protocol used for call initiation, call teardown and other
call-related signaling sent during the conversation [4]. Besides initiating and
tearing down calls, SIP is also used for adding more users to a conference call.
VoIP that uses SIP relies on a SIP proxy server to authenticate the users' login
credentials. The proxy is also used for signaling and for routing the call, and
it acts as a registrar, which is used to locate other users [4].
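Because SIP is text based, a request can be read with ordinary string handling.
The sketch below splits a SIP request into its request line and headers. It is
a minimal illustration, not a full SIP parser: the function name
`parse_sip_request` is hypothetical, the sample message uses made-up addresses
in the example.com and example.org domains, and repeated headers and header
line folding are not handled.

```python
def parse_sip_request(raw):
    """Split a SIP request into method, URI, version, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        # A repeated header overwrites the earlier one in this sketch.
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "uri": uri, "version": version,
            "headers": headers, "body": body}


# Hypothetical INVITE used to initiate a call.
SAMPLE_INVITE = (
    "INVITE sip:bob@example.org SIP/2.0\r\n"
    "Via: SIP/2.0/UDP pc33.example.com;branch=z9hG4bK776asdhds\r\n"
    "From: Alice <sip:alice@example.com>;tag=1928301774\r\n"
    "To: Bob <sip:bob@example.org>\r\n"
    "Call-ID: a84b4c76e66710@pc33.example.com\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

request = parse_sip_request(SAMPLE_INVITE)
print(request["method"], request["uri"])
print(request["headers"]["call-id"])
```

A call teardown uses the same layout with the method BYE in the request line.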

RTP provides generic transport capabilities for real-time multimedia
applications, supporting both streaming and conversational applications such as
video conferencing, video-on-demand, Internet telephony, Internet radio and
music-on-demand. RTP is carried in User Datagram Protocol (UDP) packets to
reduce overhead, which gives greater transmission speed and better call quality.
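The low overhead of RTP can be seen in its fixed header, which is 12 bytes long.
The sketch below unpacks that fixed header from the start of a UDP payload. The
function name `parse_rtp_header` and the sample packet are hypothetical; header
extensions and CSRC entries that may follow the fixed header are not decoded.

```python
import struct


def parse_rtp_header(packet):
    """Decode the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least 12 header bytes")
    # Network byte order: two single bytes, a 16-bit sequence number,
    # a 32-bit timestamp and a 32-bit synchronization source identifier.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, payload type 0, sequence number 1,
# followed by 160 bytes of dummy audio payload.
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x12345678) + b"\x00" * 160
header = parse_rtp_header(sample)
print(header)
```

The sequence number and timestamp fields are what a receiver uses to reorder
packets and to detect loss.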

2.1.2 VoIP Attacks

VoIP is clearly gaining ground over the PSTN, but security remains a major
concern for the VoIP community. Adding security mechanisms can degrade VoIP
service performance. On the other hand, without security mechanisms VoIP
services would be open to threats and attacks [2]. Man-in-the-Middle (MiTM),
Denial of Service (DoS) and Spam over Internet Telephony (SPIT) are among the
attacks on VoIP.

In a MiTM attack, the attacker inserts himself between two communicating parties
so that he can delete or modify the communications. MiTM attacks are a real
threat to security. For example, a MiTM in the VoIP signaling or media path can
easily divert, wiretap and even hijack selected VoIP calls [6]. Such attacks
could have serious effects on the targeted VoIP users; for example, attackers
are able to collect sensitive information from the victims, such as bank account
numbers, credit card numbers and PINs. MiTM is a particular problem in Internet
communication because there is no way to recognize someone's face and voice. In
a real-time conversation an attacker is easier to discover: a suspicious victim
might test the caller by asking about a shared moment in their history, which
the attacker would not be able to answer quickly [7]. MiTM attacks work against
web-based systems because the web is not synchronous [7]. The attacker can
simply relay the messages through to the other end of the communication.

A DoS attack denies a service or connectivity on a network or device, or brings
down the servers offering the service, by overloading the internal resources of
the devices on the network. DoS attacks can be carried out by flooding a target
with unnecessary SIP call-signaling messages, which can cause calls to drop
prematurely and halt call processing [8]. The goal of a DoS attack is to make
the service inoperable for as long as possible. By targeting the victim's
computer and its network connection, or the site the victim is trying to use,
the attacker may be able to prevent the victim from accessing websites or other
services. Floods are a common type of DoS attack. A flood happens when the
attacker overloads the server with requests so that it cannot process the
victim's requests and the victim cannot access the site. The attacker can also
use spam email messages against the victim's email account. Email services
assign each account a specific quota, which limits the amount of data the
account can hold at any given time [9]. By sending many or large email messages
to the account, the attacker can consume the victim's quota and prevent
legitimate messages from being received [9].
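A flood of SIP signaling messages shows up in captured traffic as an unusually
high message rate from one source. The sketch below flags sources that exceed a
rate threshold within a sliding time window. It is an illustration only: the
function name `find_flood_sources`, the one-second window, the threshold of 50
messages and the IP addresses are all assumptions, not values taken from this
project's data.

```python
from collections import defaultdict, deque


def find_flood_sources(events, window_s=1.0, threshold=50):
    """Return the sources that sent more than `threshold` messages
    within any `window_s`-second window.

    events: iterable of (timestamp_seconds, source_ip), sorted by time.
    """
    recent = defaultdict(deque)  # per-source timestamps inside the window
    flagged = set()
    for ts, src in events:
        window = recent[src]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_s:
            window.popleft()
        if len(window) > threshold:
            flagged.add(src)
    return flagged


# Hypothetical traffic: one normal caller and one flooding source.
normal = [(i * 2.0, "10.0.0.5") for i in range(10)]      # one message per 2 s
flood = [(i * 0.001, "10.0.0.99") for i in range(200)]   # 200 messages in 0.2 s
events = sorted(normal + flood)

print(find_flood_sources(events))
```

A fixed threshold is easy to explain but crude; choosing it needs knowledge of
the normal call rate on the network being examined.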

What spam is to email, SPIT is to VoIP: unwanted bulk calls or voicemails sent
over VoIP networks [10]. SPIT may be a bigger problem to deal with than spam.
SPIT can cause bandwidth problems that multiply bandwidth bills several times
over, because voice messages carry far more bytes than emails, which are only a
few kilobytes apiece. SPIT also differs from spam in when it can be stopped.
Spam can be detected before it disturbs the recipient, whereas the phone rings
immediately after session initiation, and once it rings it is too late to
prevent the SPIT call [10]. This disturbs the user's current activity.

2.1.3 Network Forensic

VoIP is an application that resides within the Internet environment. As the
number of people using the Internet increases, the number of illegal activities
such as identity theft and data theft also increases drastically. Network
forensics deals with the capture, recording and analysis of network events. It
makes it possible to analyze historical network traffic in order to investigate
security attacks [11]. The gathered information helps in identifying
unauthorized access to the system and in finding a solution to prevent it from
happening in future. This information can also be used as evidence when such an
incident occurs.

The main goal of network forensics is to provide evidence sufficient to allow
the criminal perpetrator to be successfully prosecuted [12]. Network forensics
requires two steps: first gathering complete network activity data, and then
interpreting the data. Network activity data form the necessary foundation for a
network forensic investigation, and interpreting them can range from extracting
files and reconstructing web sessions to tracing data leakage and detecting
advanced persistent threats [13].

2.1.4 WEKA Data Mining Tool

Data mining is the process of analyzing data from different perspectives and
summarizing it into useful information [14], and data mining software is one of
a number of analytical tools for analyzing data. Data mining can be divided into
two kinds, directed and undirected. Directed data mining tries to predict a
particular data point, whereas undirected data mining tries to find patterns in
existing data or to create groups of data [15]. Data mining has dozens of
techniques and procedures for examining and transforming data. Its aim is to
create a model that improves the way existing and future data are read and
interpreted [15].

The Waikato Environment for Knowledge Analysis (WEKA) is an open-source data
mining tool. WEKA is a collection of machine learning algorithms for data mining
tasks, developed at the University of Waikato, New Zealand [15]. The software is
written in Java. It contains tools for data preprocessing, regression,
clustering, classification, association rules and visualization [16]. WEKA takes
a flat text file describing the data and can work with a variety of data files,
including its own Attribute-Relation File Format (ARFF) and the C4.5 file
format. ARFF is the default WEKA file type for data analysis, but data can also
be imported from various other formats [17]. Data can also be read from a
Structured Query Language (SQL) database or from a Uniform Resource Locator
(URL).
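To make the ARFF layout concrete, the sketch below embeds a small ARFF file and
reads it back. An ARFF file has a header that names the relation and declares
each attribute with its type, followed by an @data section with one
comma-separated row per instance. The relation name, attribute names and rows
here are hypothetical, and `read_arff` is a simplified reader written for this
illustration (it ignores quoting, sparse rows and date attributes); it is not
part of WEKA.

```python
SAMPLE_ARFF = """\
% Hypothetical VoIP traffic summary
@relation voip_traffic

@attribute frame_len numeric
@attribute protocol {SIP,RTP}
@attribute label {normal,attack}

@data
480,SIP,normal
214,RTP,normal
512,SIP,attack
"""


def read_arff(text):
    """Return (relation, attributes, rows) from simple ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        keyword = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = read_arff(SAMPLE_ARFF)
print(relation)
print(attributes)
print(len(rows), "instances")
```

Numeric attributes are declared with the keyword numeric, while nominal
attributes list their allowed values in braces.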

2.2 Previous Work

2.2.1 Skype Forensics in Android Devices

In this research paper, Mohammed I. Al-Saleh and Yahya A. Forihat investigated
the evidence of Skype calls and chats left on Android devices. Smartphones have
capabilities similar to those of PCs and can store large amounts of data and
different categories of information. Android-based smartphones are becoming more
popular because of the wide variety of mobile applications (apps) developed to
extend the functionality of the phones. VoIP apps are used extensively because
of their wide availability and low prices, and Skype is one of the popular VoIP
apps.

Figure 2.2: Investigation Model [18]

The paper assumes that Skype can be one of the means used in committing
cybercrimes. Digital forensics may be conducted on mobile devices, computers and
networks in order to detect cyber-criminal activities and prove them in court.
Fig. 2.2 shows the investigation model the researchers designed. In the model,
the criminal starts a call or conversation session with the victim. The
investigator then needs to recover the conversation sessions from the criminal's
device, extracting evidence by inspecting both the RAM and the NAND flash memory
[18].

After several experiments, the results showed no differences between the call
conversation patterns found in each experiment. Chat messages were found in both
memories, and the average number of occurrences decreased over the different
time durations. This means that chat messages remained in the flash memory for a
long time, although without redundancy. The remaining messages can still be used
as evidence. The researchers concluded that Skype call patterns and chat
messages can be found in both RAM and NAND flash memory for a long time,
regardless of whether the call and chat histories are deleted or the user signs
out of Skype [18].

2.2.2 Network Forensics Models for Converged Architectures

A pattern is a solution to a problem that can be used to guide the design or the
evaluation of systems. The concept of a forensic pattern is introduced and
illustrated using Unified Modeling Language (UML) object-oriented models. Attack
patterns describe the objectives and steps of an attack. These attack patterns
yield useful information for analyzing ways of stopping the attacks. A forensic
pattern is a systematic approach to network forensic collection and data
analysis. Using forensic patterns, investigators or forensic teams have a
structured method to search for, collect and analyze network forensic data.

Firewalls and Intrusion Detection Systems (IDS) are general security mechanisms
that are unable to detect and stop attacks at a higher level. To stop such
attacks in future, details of the attackers' activities need to be collected and
sent for analysis. Sensors with examination capabilities are used to collect
evidence, which helps to reduce human intervention. These sensors capture all
voice packets entering or leaving the system. The evidence collector starts
collecting forensic data when an alarm notifies it of a detected attack against
VoIP components. After collecting the forensic data, the evidence collector
sends the data to the network forensics server. These data are used to discover
and reconstruct the attacking behavior. The forensics server performs the
corresponding forensic analysis.

Log correlation and normalization are among the techniques used to analyze
forensic databases and files. The evidence analyzer presents its results to the
forensic investigator. The results include information such as the IP address,
the MAC address, the topology of the network and possibly the geographic
location of the IP address. In Fig. 2.3, Juan C. Pelaez and Eduardo B. Fernandez
describe how a forensic system and IP telephony integrate. The model has three
primary components: the forensic server, the evidence collector and the network
investigator. The advantages of using the forensic pattern are that automated
evidence analysis reduces the response time of the forensic investigators, that
the analyzer can provide information from logs for tracing back the attackers,
and that it can determine the call history, when a user was using the VoIP
device and with whom the user communicated [19].

Figure 2.3: Class diagram for a VoIP network forensics system [19]

2.2.3 Security Patterns for Voice over IP Networks

The authors [REFERENCE REQUIRED] discuss security attacks, relate them to the
ways the system is used and provide some defense mechanisms. Four security
patterns are presented, which describe good practices for VoIP and help in
identifying and understanding the mechanisms needed. The patterns are VoIP
Tunneling, Network Segmentation, Secure VoIP Call and Signed Authenticated Call.
Unified Modeling Language (UML) is used to make the patterns easier to
implement. There are three types of connection when using the IP protocol:
PC-to-PC, PC-to-Telephone and Telephone-to-Telephone. VoIP uses the Real-time
Transport Protocol (RTP) for transport, the RTP Control Protocol (RTCP) for
reporting Quality of Service (QoS), and SIP, H.323 or the Media Gateway Control
Protocol (MGCP) for signaling.

In this journal paper [REFERENCE REQUIRED], the authors present several attacks.
Theft of service, IP spoofing, Denial-of-Service (DoS), masquerading, call
interception, repudiation, call hijacking and brute force are among the attacks
against VoIP networks that are presented. The authors analyze these attacks in
detail using the concept of attack patterns, taking the forensic aspects into
consideration.

Fig. 2.4 shows the relationships between the VoIP security patterns and related
cryptographic patterns. The double boxes represent the patterns. The Network
Segmentation pattern minimizes disruption in the event of an attack, so that
critical voice traffic is not affected. The VoIP Tunneling pattern uses
encryption to ensure data integrity and confidentiality in VoIP networks.
Tunnels secure the transport of VoIP traffic over the external network and
eliminate the risk of exposing the network. The Signed Authenticated Call
pattern provides a suitable way of authenticating messages in VoIP and is the
best countermeasure against theft-of-service attacks. In the Secure VoIP Call
pattern, VoIP calls are encrypted and decrypted to provide confidentiality.

The paper concludes that using VPNs and encrypting all voice traffic is the best
security approach for VoIP. This ensures that critical voice traffic is
unaffected if an attack occurs on the data network [20]. To enhance VoIP
security further, filtering and firewalls can be implemented to control the
traffic between the data VPN and the voice network [20].

Figure 2.4: Relationships between VoIP security patterns [20]

2.2.4 Enhancing Forensic Investigation In Large Capacity Storage Devices Using WEKA: A Data Mining Tool

This research project focuses on the large data sets that a data mining system
can handle. The WEKA data mining tools are studied to demonstrate the data
mining methodology and obtain results from the data. The WEKA toolkit is
flexible and easily extended. Because WEKA is written in Java, it is easy to use
and portable. It supports data preprocessing and modeling techniques.

WEKA is user friendly and provides a large set of functions and tools, including
attribute selection, preprocessing filters, data clustering, classification,
data visualization and association discovery. WEKA is free open-source software
available to all users, and it can be used to run individual experiments. WEKA
supports various data formats, including ARFF, Comma-Separated Values (CSV) and
the format accepted by decision induction algorithms.

Fig. 2.5 presents the flow of data mining used in WEKA. Data are classified
based on the attribute selection and then divided into clusters based on the
type of grouping the user selects. The output obtained after clustering gives
the accuracy of the clustered data, which can be used for future predictions.
Finally, regression analysis describes how regression can be applied and how the
results can be visualized.

Figure 2.5: Flow of Data Mining Methodology in WEKA [21]

The research project imported bank data into WEKA. The source file can be in
either .arff or .csv format. Fig. 2.6 shows the WEKA preprocessing window with
the bank data. The data are saved to bank-data-final.arff after the parameters
are set up. The project was implemented in four modules, each representing a
stage and task of the data mining process: association, classification,
clustering and regression [21].

Figure 2.6: Preprocessing window [21]

2.3 Critical Analysis

The following table compares the works covered in the literature review.

Table 2.1: Critical Analysis

JOURNAL

JOURNAL 1

[REFERENCE

REQUIRED],

JOURNAL 2

[REFERENCE

REQUIRED],

JOURNAL 3

[REFERENCE

REQUIRED],

JOURNAL 4

[REFERENCE

REQUIRED],

RESEARCH

DATA

Skype

Converged

Network

Converged

Network

Bank

Employee

TOOLS

SOFTWARE

HARDWARE

X

X

VOIP ATTACKS

DoS

X

X

SPIT

X

X

X

X

MiTM

X

X

X

X

PROTOCOL

SIP

X

X

X

RTP

X

X

X

CHAPTER III: RESEARCH METHODOLOGY

This chapter explains in detail the methodology used to complete this project
and to achieve its objectives. Section 3.1 introduces the methodology used in
this project. Section 3.2 lists the hardware and software resources. The budget
and costing of the tools are listed in Section 3.3. Section 3.4 and Section 3.5
present the Work Breakdown Structure (WBS) and the project timeline, a Gantt
chart that consists of activity duration estimates and the project schedule.

3.1 Rapid Application Development (RAD) Methodology

Rapid Application Development (RAD) was selected as the methodology model
because it is a suitable process for software development and it reflects the
flow of the work in this project. The methodology is based on an iterative
approach and prototyping. This project involves existing data and comprises
analysis of and reporting on that data. The RAD process works best in cases
where the data are known, the requirements can be defined and kept unchanged
during development, and the functional requirements can be met within a short
time frame [22]. In this project the RAD methodology has six phases: Initiation,
Planning, Design, Testing and Implementation, Verification and Documentation.

RAD is designed with several advantages, of which speed and quality are the
primary ones. RAD increases the speed of development and decreases delivery time
by focusing on converting requirements to code as quickly as possible [23].
Increased quality is a primary focus of RAD. Quality is defined as both the
degree to which a delivered application meets the needs of users and the degree
to which a delivered system has low maintenance costs. The use of automation
tools and prototyping considerably reduces errors. Errors and omissions are
detected in the early stages of development, thereby preventing extra effort or
cost [24].

3.1.1 Initiation

An initiation or feasibility study is conducted after approval is obtained from
the FYP supervisor. During this first phase, the initiation phase, the project
objectives, project scope and current problem statement are identified. A
feasibility study is conducted to gather all the findings and data related to
the project. The findings draw on sources including the Internet, books,
journals, articles and previous studies of projects or systems similar to this
one. The literature review makes it possible to spot gaps in the literature, to
formulate a research question based on those gaps and to discuss how this
project is likely to address them.

Figure 3.1: RAD model methodology

3.1.2 Planning

In the next phase, the planning phase, all of the work to be done is identified,
including the hardware and software resource requirements and the research
model, along with the strategy for implementing the project. A project plan is
created outlining the activities, tasks, dependencies and timeframes, and a
project budget is drawn up with cost estimates for equipment and materials. The
budget is used to monitor and control expenditure during project implementation.
The project plan is shown in Fig. 3.3 and Fig. 3.4.

3.1.3 Design

During the third phase, the design phase, the hardware and software are defined
and the .pcap data files are collected. The system architecture and topology are
designed in this phase, showing the project workflow and the process of
converting the .pcap data files into a format that WEKA recognizes. Fig. 3.2
shows the architecture of the project.

The .pcap format is the most widely available file format for logging network
traffic and can be read by almost any network analysis tool. Such tools display
huge amounts of data that must be gone through to find problems with the
network. To be recognized by WEKA, the .pcap data files are first converted into
temporary .csv data files using tshark, the Wireshark command-line tool. The
.csv data files are then converted into the .arff format supported by WEKA by
editing them in a plain text editor such as Notepad and saving the result as an
.arff file.
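This project performs the .csv to .arff step by hand in a text editor. For
larger captures the same step could be scripted. The sketch below is one
possible approach, not the procedure used in this project: the function names,
the column names and the sample rows are hypothetical, a column is treated as
numeric only if every value parses as a number, and values are assumed to
contain no spaces, commas or quotes.

```python
import csv
import io


def _infer_type(values):
    """Return 'numeric' if every value is a number, else a nominal set."""
    try:
        for value in values:
            float(value)
        return "numeric"
    except ValueError:
        return "{" + ",".join(sorted(set(values))) + "}"


def csv_to_arff(csv_text, relation="capture"):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        lines.append("@attribute " + name + " " + _infer_type(column))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical CSV as it might come out of the pcap-to-csv step.
sample_csv = "length,protocol,src\n480,SIP,10.0.0.5\n214,RTP,10.0.0.7\n"
arff_text = csv_to_arff(sample_csv, relation="voip_capture")
print(arff_text)
```

Treating IP addresses as nominal values keeps the example short; for a large
capture it may be more useful to declare such columns as string attributes.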

3.1.4 Testing & Implementation

In this phase, the project architecture is tested in order to assess the
effectiveness of the technique for converting the .pcap data files into a format
that WEKA recognizes. Implementation starts when the .pcap data files have been
successfully converted and the hardware and software requirements have all been
gathered. All software installation and setup is completed in this phase; see
Section 4.2 in Chapter 4. The collected data are then imported into WEKA and
analyzed to obtain results. The data collected from the company are subject to
the time we have to work with them.

Figure 3.2: Architecture Topology

3.1.5 Verification

The fifth phase is verification. Here the results of the fourth phase are
verified to determine whether the data and the implemented design meet the
requirements of the project. If the testing phase fails, the system is modified
until it runs successfully. Conclusions can then be drawn based on the
correctness and completeness of development and operation in the testing phase.

3.1.6 Documentation

The last phase is documentation, in which all the information and results
related to the project are documented in a final report, including corrections
and amendments to the report before submission.

3.2 Project Resources

The project requires the following hardware and software. Table 3.1 shows the
hardware specifications and Table 3.2 shows the software specifications. These
are the minimum requirements needed to ensure the success of the simulator.

3.2.1 Hardware Specifications

No.  Device  Quantity  Specifications
1    Laptop  1         ASUS brand
                       Processor: Intel Core i3
                       RAM: 6.00 GB
                       OS: Microsoft Windows 7

Table 3.1: Hardware Requirement

3.2.2 Software Specifications

No. | Software | Descriptions
1 | WEKA | Version: 3.7.10 (latest version); License/Price: Free; OS: Windows 7, 8, XP, Vista, 2000; Programming Language: Java; Size: 25.9 MB
2 | Wireshark | Version: 1.10.1 (64-bit); License/Price: Free; OS: Windows 7, Vista, XP; Networking Software Tools

Table 3.2: Software Requirement

3.3 Budget/Costing

The following is a review of the budget and costing of the hardware and software requirements. Table 3.3 shows the estimated budget and costing for the hardware and Table 3.4 for the software.

3.3.1 Hardware Estimated Budget

No. | Equipment | Quantity | Price (RM) | Remark
1 | Laptop | 1 | 1800 | Student's property

Table 3.3: Project costing for hardware

3.3.2 Software Estimated Budget

No. | Equipment | Quantity | Price (RM) | Remark
1 | WEKA | 1 | - | Open source
2 | Wireshark | 1 | - | Open source

Table 3.4: Project costing for software

3.4 Work Breakdown Structure (WBS)

Figure 3.3 shows the WBS, whose levels provide further definition and detail of the work.

Figure 3.3: Work Breakdown Structure (WBS)

3.5 Project Timeline

The project timeline in Fig. 3.4 shows the time taken to accomplish this project. It shows every phase of the project development and the schedule that was followed to make sure the project is completed on time.

Figure 3.4: Gantt chart


CHAPTER IV: TESTING AND IMPLEMENTATION

This chapter explains the project testing and implementation stages. Section 4.1, the testing stage, discusses the conversion of pcap files into arff format and is divided into two subsections: Section 4.1.1 introduces the conversion of pcap files into csv format, and Section 4.1.2 introduces the conversion of csv files into arff format. Section 4.2 discusses ethical matters, and Section 4.3 discusses ways to analyze the data.

4.1 Testing Stage

This section defines the testing method on the project architecture topology. The project architecture is set up as shown in Figure 3.2 on page 22. This stage is important to ensure that the pcap data files are converted in a systematic manner into a format that WEKA recognizes.

4.1.1 Pcap To Csv Conversion

There is no direct conversion from pcap to arff format. A csv file serves as an intermediate file between the pcap and arff files. Wireshark is used to convert pcap files into csv files.

Open the pcap file in Wireshark and, on the File menu, choose Export Packet Dissections. This menu item allows some of the packets in the capture file to be exported to a file. In this case, choose CSV (Comma Separated Values packet summary), as shown in Figure 4.1. Then save the file in csv format, as shown in Figure 4.2.

Figure 4.1: Wireshark Export Packet Dissections

Figure 4.2: Wireshark Export File

4.1.2 CSV to Arff conversion

This step converts CSV to ARFF using WEKA. First, WEKA needs to be installed. It can be downloaded from http://www.cs.waikato.ac.nz/ml/weka and is free, open-source software. The WEKA window looks like Figure 4.3. The ArffViewer option under the Tools menu is used to load the csv files into WEKA, as shown in Figure 4.3.

Figure 4.3: Weka GUI Chooser

Open the csv file by changing Files of type to CSV data files (*.csv), as shown in Figure 4.4.

Figure 4.4: ARFF-Viewer windows

Figure 4.5: Weka Save Windows

Then choose Save as, delete ".csv" from the file name and change it to ".arff" as in Figure 4.5. The csv file has now been converted to an arff file.
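The csv to arff step can also be scripted. The following is a minimal sketch, not the procedure used in this project: it assumes the csv file has the header row that Wireshark writes (No., Time, Source, Destination, Protocol, Length, Info), treats a column as numeric only when every value parses as a number, and declares every other column as a nominal attribute. The function names and the two sample packets are hypothetical.

```python
import csv
import io


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def quote(value):
    """Quote a string for ARFF, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    numeric = []
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        numeric.append(all(is_number(v) for v in column))
        if numeric[i]:
            kind = "numeric"
        else:
            # Nominal attribute: list each distinct value once, in first-seen order.
            kind = "{" + ",".join(quote(v) for v in dict.fromkeys(column)) + "}"
        lines.append("@attribute " + quote(name) + " " + kind)
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(v if numeric[i] else quote(v) for i, v in enumerate(row)))
    return "\n".join(lines) + "\n"


# Hypothetical two-packet sample in the Wireshark CSV layout.
sample = (
    '"No.","Time","Source","Destination","Protocol","Length","Info"\n'
    '"1","0.000000","10.0.0.1","10.0.0.2","ICMP","74","Echo (ping) request"\n'
    '"2","0.000120","10.0.0.2","10.0.0.1","ICMP","74","Echo (ping) reply"\n'
)
print(csv_to_arff(sample, "icmp_sample"))
```

Declaring Source, Destination and Info as nominal attributes matches how they appear in the J48 output later in this report, where they are used as class values and split attributes.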

4.2 Ethical Matters

The ethical matter pertains to the data that we gathered from third-party companies. We mention it here to protect the companies and ourselves from future legal action should the data leak. The first company that we approached was a security company, through an employee who was one of the speakers at the UniKL Security Talk day. However, the company was unable to release the data because of its sensitivity. The official letter sent to the company is in Appendix X.

The second attempt was through the Malaysian Computer Emergency Response Team (MyCERT), CyberSecurity Malaysia. After a few weeks of attempted phone calls, we got a response from the officer in charge of our request. We then sent a formal letter (refer to Appendix X) to request an interview with the officer. We also asked whether the organization could supply data related to our project. Unfortunately, it did not keep the type of data that relates to network attacks. Instead, it provides advisories on what to do when an attack happens.

The third attempt was to arrange an interview with Vigilnet Company, which provides VoIP analysis. The person in charge was away for a few weeks, though the company agreed to supply the data. In the end the company supplied us with VoIP data. However, the data were clean, with no trace of network or individual attacks. Nevertheless, we still use this data in one of the analyses.

4.3 Analysis Stage

Towards understanding and improving forensic analysis processes, in this stage an analysis experiment was conducted on the collected VoIP attack data. This stage makes it possible to mark or discover the source of security attacks or other problem incidents.

4.3.1 Analyze Using WEKA

This stage focused on some common types of DoS attack, namely ICMP Echo flood, UDP flood and TCP SYN flood, and on data from reliable sources, using WEKA Explorer preprocessing, classification, clustering, and attribute selection.

4.3.1.1 ICMP Echo Flood

4.3.1.1.1 Preprocessing

The file was loaded into WEKA in the Preprocess window, as shown in Fig. 4.8, by clicking the Open file button and choosing the .arff file from the local file system.

Figure 4.8: Weka Open File

Once the data is loaded, WEKA recognizes the attributes, which are shown in the Attribute window.

The left panel of the Preprocess window shows the list of recognized attributes:

No.: a number that identifies the order of the attribute as it appears in the data file.
Selection tick boxes: allow the attributes to be selected for the working relation.
Name: the name of the attribute as it was declared in the data file.

During the scan of the data, WEKA computes some basic statistics on each attribute. The following statistics are shown in the Selected attribute box on the right panel of the Preprocess window:

Name: the name of the attribute.
Type: most commonly Nominal or Numeric.
Missing: the number (percentage) of instances in the data for which this attribute is unspecified.
Distinct: the number of different values that the data contains for this attribute.
Unique: the number (percentage) of instances in the data having a value for this attribute that no other instances have.
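To make the Missing, Distinct and Unique definitions concrete, the sketch below computes the same three figures for a single attribute column. It is an illustration only, not WEKA's code: the function name and the sample values are hypothetical, and "?" is taken as the missing-value marker because that is how ARFF files denote a missing value.

```python
from collections import Counter


def attribute_stats(values, missing_marker="?"):
    """Summarise one attribute column: missing, distinct and unique counts."""
    present = [v for v in values if v != missing_marker]
    counts = Counter(present)
    total = len(values)
    missing = total - len(present)
    # A value is unique when exactly one instance carries it.
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "missing": missing,
        "missing_pct": 100.0 * missing / total,
        "distinct": len(counts),
        "unique": unique,
        "unique_pct": 100.0 * unique / total,
    }


# Hypothetical Protocol column with one missing value.
protocol = ["ICMP", "ICMP", "UDP", "?", "TCP", "ICMP"]
print(attribute_stats(protocol))
```

In this sample one instance is missing, three different values occur (ICMP, UDP, TCP), and two of them (UDP and TCP) are unique because each appears in only one instance.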

Figure 4.9: Weka Selected Attribute Box

No. is numeric. The Selected attribute box therefore shows the following statistics for this attribute:

Missing: 0 means that the attribute is specified for all instances (no missing values).
Distinct: 6 means that No. takes six different values, one for each of the six captured packets.
Unique: 6 means that all six values are unique, that is, no other instance has the same value of No.

Time is also numeric. For a numeric attribute, WEKA reports statistics describing the distribution of values in the data: Minimum, Maximum, Mean and Standard Deviation. Here Minimum = 1 is the lowest time and Maximum = 2.075 is the highest time. Comparing the result with the attribute table in destunreachble.csv shows that the numbers in WEKA match the numbers in the table. Figure 4.11 shows the visualization of all attributes.

Figure 4.10: Matched Attribute


4.3.1.1.2 Classification

Classifiers in WEKA are the models for predicting nominal or numeric

quantities.

Figure 4.11: Attributes Visualization

Figure 4.12: Classify Tab Windows

In Fig. 4.13, J48, WEKA's implementation of the C4.5 decision tree learner, is used to analyze the data sample. The C4.5 algorithm was chosen because it can handle numeric attributes.

Figure 4.13: Weka J48 Algorithm Tree

Figure 4.14: Classifier Test Option

In this data sample, the classifier is trained on 66% of the data and evaluated on how well it predicts the remaining 34%, which is held out for testing. The Percentage split radio button was selected and kept at its default of 66%. The amount of data held out depends on the value entered in the % field. Once the options have been specified, the learning process is started by clicking the Start button.
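The effect of the percentage split on small data sets can be sketched as follows. This is an illustration under stated assumptions rather than WEKA's own code: it assumes the instances are shuffled first and that the training size is the instance count times the percentage, rounded to the nearest integer. The function name and the seed are hypothetical.

```python
import random


def percentage_split(instances, percent=66.0, seed=1):
    """Shuffle the instances, then cut them into a training and a test part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Assumed rounding rule: nearest integer, halves rounded up.
    train_size = int(len(shuffled) * percent / 100 + 0.5)
    return shuffled[:train_size], shuffled[train_size:]


for total in (6, 9):
    train, test = percentage_split(range(total))
    print(total, "instances:", len(train), "train,", len(test), "test")
```

Under these assumptions a file with 6 instances leaves 2 for testing and a file with 9 instances leaves 3, which agrees with the Total Number of Instances reported in the evaluation summaries of Chapter V. With so few test instances, each misclassification changes the accuracy by a third or a half.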

4.3.1.1.3 Clustering

Clustering in WEKA is for finding groups of similar instances in a dataset.

Figure 4.15: Cluster Tab Windows

Figure 4.16: Weka Gui Generic Object Editor Window

Once the SimpleKMeans clustering scheme is selected, the weka.gui.GenericObjectEditor window is opened by right-clicking the algorithm, as shown in Fig. 4.16. The value in the numClusters box was set to 7 because the .arff file has seven clusters.

When training is completed, the Cluster output area on the right panel of the Cluster window is filled with text describing the results of training and testing. A new entry appears in the Result list box on the left.
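For readers unfamiliar with k-means, the sketch below shows the textbook algorithm on which a scheme such as SimpleKMeans is based: points are repeatedly assigned to their nearest centroid, and each centroid is moved to the mean of its members. It is not WEKA's implementation, which also handles nominal attributes. The function name, the starting centroids and the four (time, length) points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical (time, length) pairs: two short early packets, two long late ones.
points = [(0.0, 60.0), (0.1, 62.0), (5.0, 1500.0), (5.2, 1490.0)]
centroids, assignment = kmeans(points, [points[0], points[2]])
print(assignment)
print(centroids)
```

The number of centroids passed in plays the role of the numClusters setting: asking for more clusters than the data naturally contains simply leaves some clusters small or empty.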

4.3.1.1.4 Attribute selection

Attribute selection searches through all possible combinations of attributes in the

data and finds which subset of attributes works best for prediction.

Figure 4.17: Cluster Output

In Fig. 4.18, the CfsSubsetEval evaluator and the BestFirst search method were set up to search the space of attribute subsets and find the subset that works best for prediction. The results of the selection are shown on the right part of the window when the attribute selection process is finished, as shown in Fig. 4.19.
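The idea behind correlation-based feature subset selection can be illustrated with its merit score: a subset of k attributes scores well when its attributes correlate strongly with the class and weakly with each other. The sketch below assumes the standard merit formula, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. The function name and the correlation values are hypothetical and are not taken from the project data.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k attributes under correlation-based selection."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Same relevance to the class, different redundancy between the attributes.
low_redundancy = cfs_merit(3, 0.6, 0.2)
high_redundancy = cfs_merit(3, 0.6, 0.9)
print(round(low_redundancy, 3), round(high_redundancy, 3))
```

With equal relevance to the class, the subset whose attributes are less correlated with each other receives the higher merit, which is why redundant attributes tend to be dropped from the selected subset.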

Figure 4.18: Select Attribute Tab Windows

Figure 4.19: Attribute Selection Output

The implementation for the other data, namely UDP Flood, TCP SYN Flood and the data from the reliable source, is not shown because it follows the same steps as shown for the ICMP Flood data. The results for each data set are analyzed in the next chapter, Chapter V: Result and Analysis.

CHAPTER V: RESULT AND ANALYSIS

This chapter discusses the results of the experiments conducted as described in Chapter 4. The results are separated into sections according to the data set: Section 5.1 discusses the data obtained from reliable sources, Section 5.2 discusses the ICMP Flood attack data, and Section 5.3 discusses the TCP SYN Flood attack data.

5.1 Reliable Data

The protocols involved in the pcap can be viewed in the protocol classifier tree. SIP, RTCP, RTP, and HTTP were the protocols involved, as shown in the run information below:

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     reliabledata
Instances:    4447
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree
------------------

Time 201.325642
Length 202: RTP (3003.0/15.0)

Number of Leaves  : 10

Size of the tree : 19

Time taken to build model: 0.06 seconds

SIP is a signaling protocol used for controlling multimedia

communication sessions, like voice or video calls over IP. The protocol can be

used for modifying, creating and terminating two-party or multiparty sessions

consisting of one or several media streams. In this capture file, SIP is used to

create and tear down VoIP sessions.

RTP defines a standardized packet format for delivering audio and video

over the Internet. RTP is usually used in conjunction with the RTCP. When in

conjunction, RTP is usually originated and received on even port numbers,

whereas RTCP uses the next higher odd port number. In this capture file, RTP is

used as the media protocol to transport voice.
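The port pairing described above can be written as a one-line rule. The sketch below is illustrative only. The function name is hypothetical and the port number in the example is an arbitrary even port, not one taken from the capture file.

```python
def rtcp_port_for(rtp_port):
    """Return the RTCP port conventionally paired with an even RTP port."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP is expected on an even port number")
    return rtp_port + 1


print(rtcp_port_for(16384))
```

When reading a capture, this rule helps to match each RTP media stream with the RTCP stream that reports on it: the two share the same pair of hosts and use adjacent port numbers.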

RTCP partners with RTP in the delivery and packaging of multimedia

data, but does not transport any media streams itself. RTCP itself does not

provide any flow encryption or authentication methods.

HTTP is a request-response protocol standard for client-server

computing. In this capture file, HTTP is used to communicate with the GUI

frontend of the SIP PBX.

=== Run information ===

Relation:     reliabledata
Instances:    4447
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info

=== Classifier model (full training set) ===

J48 pruned tree
------------------

Protocol = SIP

| Source = 172.25.105.43: Request: OPTIONS sip:[email protected] | (1.0)

| Source = 172.25.105.40

| | Length 574

| | | No. 1298: Status: 401 Unauthorized | (2.0)

| Source = 172.25.105.3

| | No. 1302: Request: ACK sip:[email protected] | (3.0/1.0)

| | No. > 1: User-Agent: Asterisk PBX 1.6.0.10 | FONCORE

At the beginning of the attack, the attacker at 172.25.105.43 sent a SIP OPTIONS request for extension 100 at 172.25.105.40, and 172.25.105.40 answered the request with a 200 OK response. The part of the response that is useful to the attacker is the User-Agent message header field. From it, the attacker learns that the target is an Asterisk PBX from the FONCORE (trixbox) distribution family. With these clues in hand, the attacker tried to connect to the box over HTTP.

5.2 ICMP Echo Flood

The Internet Control Message Protocol (ICMP) enables users to send an echo packet to a remote host to check whether it is alive. In an ICMP Echo flood, a large number of these packets request replies from the victim, which saturates the bandwidth of the victim's network connection.

=== Run information ===

Relation:     icmp
Instances:    6
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info

=== Classifier model (full training set) ===

J48 pruned tree
------------------

Source = 10.2.10.2: Echo (ping) request id=0x0200, seq=9472/37, ttl=32 [ETHERNET FRAME CHECK SEQUENCE INCORRECT] (3.0/2.0)
Source = 10.2.99.99: Destination unreachable (Host unreachable) [ETHERNET FRAME CHECK SEQUENCE INCORRECT] (3.0)

Number of Leaves  : 2

Size of the tree : 3

Time taken to build model: 0.01 seconds

=== Evaluation on test split ===

Time taken to test model on training split: 0 seconds

=== Summary ===

Correctly Classified Instances 0 0 %

Incorrectly Classified Instances 2 100 %

Kappa statistic 0

Mean absolute error 0.5

Root mean squared error 0.6124

Relative absolute error 133.3333 %

Root relative squared error 141.4214 %

Coverage of cases (0.95 level) 0 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 2

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.000    0.000    0.000      0.000   0.000      0.000  ?         ?         Echo (ping) request id=0x0200, seq=9472/37, ttl=32 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
               0.000    0.000    0.000      0.000   0.000      0.000  ?         1.000     Destination unreachable (Host unreachable) [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
               0.000    1.000    0.000      0.000   0.000      0.000  ?         ?         Echo ... SEQUENCE INCORRECT]
               0.000    0.000    0.000      0.000   0.000      0.000  ?         ?         Echo ... SEQUENCE INCORRECT]
Weighted Avg.  0.000    0.000    0.000      0.000   0.000      0.000  0.000     1.000

=== Confusion Matrix ===

 a b c d

5.3 TCP SYN Flood

SYN flooding attacks exploit TCP's three-way handshake mechanism and its limitation in maintaining half-open connections. When a server receives a SYN request, it returns a SYN/ACK packet to the client. Until the SYN/ACK packet is acknowledged by the client, the connection remains in a half-open state for a period of up to the TCP connection timeout. A flood of SYN requests that are never acknowledged therefore fills the server's capacity for half-open connections.

=== Run information ===

Relation:     tcp
Instances:    9
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info

=== Classifier model (full training set) ===

J48 pruned tree
------------------

Source = 192.168.0.1: boinc-client > neod2 [ACK] Seq=1 Ack=1 Win=8760 Len=0 [ETHERNET FRAME CHECK SEQUENCE INCORRECT] (3.0/2.0)
Source = 192.168.0.2: [TCP Retransmission] neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT] (6.0/1.0)

Number of Leaves  : 2

Size of the tree : 3

Time taken to build model: 0 seconds

=== Evaluation on test split ===

Time taken to test model on training split: 0 seconds

=== Summary ===

Correctly Classified Instances 1 33.3333 %

Incorrectly Classified Instances 2 66.6667 %

Kappa statistic 0.1429

Mean absolute error 0.2667

Root mean squared error 0.483

Relative absolute error 84.6154 %

Root relative squared error 116.1347 %

Coverage of cases (0.95 level) 33.3333 %

Mean rel. region size (0.95 level) 26.6667 %

Total Number of Instances 3

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.000    0.000    0.000      0.000   0.000      0.000  0.500     0.333     boinc-client > neod2 [ACK] Seq=1 Ack=1 Win=8760 Len=0 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
               0.000    0.000    0.000      0.000   0.000      0.000  0.500     0.333     neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
               0.000    0.333    0.000      0.000   0.000      0.000  ?         ?
               0.000    0.000    0.000      0.000   0.000      0.000  ?         ?
               1.000    0.500    0.500      1.000   0.667      0.500  0.750     0.500     [TCP Retransmission] neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
Weighted Avg.  0.333    0.167    0.167      0.333   0.222      0.167  0.583     0.389

=== Confusion Matrix ===

 a b c d e   <-- classified as
           | a = boinc-client > neod2 [ACK] Seq=1 Ack=1 Win=8760 Len=0
 0 0 0 0 1 | b = neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
 0 0 0 0 0 | c = boinc-client > neod2 [ACK] Seq=1 Ack=2921 Win=8760 Len=0
 0 0 0 0 0 | d = boinc-client > neod2 [ACK] Seq=1 Ack=5841 Win=8760 Len=0
 0 0 0 0 1 | e = [TCP Retransmission] neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
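The Kappa statistic of 0.1429 in the summary can be checked against the confusion matrix. The sketch below computes Cohen's kappa from a matrix whose rows are the actual classes and whose columns are the predicted classes. The counts for row a are not legible in the output above, so the sketch assumes that the single test instance of class a was classified as c. This is an assumption, chosen because it agrees with the FP rate of 0.333 reported for the third class. The function name is hypothetical.

```python
def kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual class)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of the actual and predicted totals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / (n * n)
    return (observed - expected) / (1 - expected)


confusion = [
    [0, 0, 1, 0, 0],  # a: assumed row, not legible in the source output
    [0, 0, 0, 0, 1],  # b
    [0, 0, 0, 0, 0],  # c
    [0, 0, 0, 0, 0],  # d
    [0, 0, 0, 0, 1],  # e
]
print(round(kappa(confusion), 4))  # 0.1429
```

Only one of the three test instances lies on the diagonal, so the observed agreement is 1/3, and the chance agreement is 2/9. A kappa this close to zero means the classifier does little better than chance on this test split.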

From the information above, the file begins with standard TCP ACK

packets sent between 192.168.0.1 and 192.168.0.2. When TCP sends a packet to

a destination and does not get a reply, it waits a specified amount of time then

retransmits the original packet. If a response is still not received, the source

(transmitting) computer doubles the amount of time it waits for a response before

sending another retransmission. Once the retransmission attempts have failed, the

connection has completely failed and the data in the transmission is lost.
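The doubling of the waiting time described above can be sketched as a simple schedule. The initial timeout and the number of retries used here are hypothetical parameters chosen for illustration. Real TCP stacks derive the initial value from measured round-trip times and differ in how many retries they allow.

```python
def retransmission_schedule(initial_timeout, max_retries):
    """Return the wait before each retransmission, doubling every time."""
    waits = []
    timeout = initial_timeout
    for _ in range(max_retries):
        waits.append(timeout)
        timeout *= 2
    return waits


schedule = retransmission_schedule(1.0, 5)
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(schedule))  # 31.0
```

In a capture file, retransmissions of the same segment that arrive at roughly doubling intervals are therefore a sign that the sender is not receiving acknowledgements.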


CHAPTER VI: CONCLUSION

This chapter contains the conclusion and some recommendations and suggestions for future improvement and enhancement of the project, drawn after testing and analyzing the results. The essence of the study is to analyze VoIP traffic traces using WEKA, a data mining tool. We believe that the objective is achieved.

6.1 Project Accomplishment

In the early days of VoIP, there was no big concern about security issues

related to its use. People were mostly concerned with its cost, functionality and

reliability. Now that VoIP is gaining wide acceptance and becoming one of the

mainstream communication technologies, security has become a major issue. To address this issue, network forensics provides the monitoring and analysis of computer network traffic for the purposes of information gathering, legal evidence, or intrusion detection.

This project started with converting pcap (Packet Capture) files into the Attribute-Relation File Format (arff), the format that WEKA recognizes, and with learning how to analyze the data using WEKA Explorer preprocessing, classification, clustering, and attribute selection, before obtaining the data from a company that provides VoIP analysis.

We believe that the objectives set for this project are met. The first objective is to analyze the pattern of attack data from the captured data, where the data indicate the condition of the network events.

The second objective is also achieved. It is to convert the pcap data to an arff data file so that the input is recognized by the WEKA data mining tool. It is important to state that the first objective depends on this second objective.

We had some difficulty in obtaining the right data for our analysis, since many companies are bound by legal constraints that prevent them from sharing their data with us. However, we still obtained simulated data from a related project conducted by another student at UniKL. With real attack data, our research would have produced more interesting findings.

6.2 Future Recommendation

For future work, a few aspects can be further enhanced by expanding some features and criteria to make the analysis more robust.

Suggestion for Improvement | Current Project Situation
Improve the data set, or create a traffic simulation program to collect the required data. | As the data in this project is not related to VoIP attacks due to certain problems, data collected purely from VoIP attacks can be analyzed in future enhancements.
Include more types of attacks that are related to VoIP. Different types of VoIP attacks such as Vishing (VoIP Phishing), Eavesdropping, and Identity and service theft can also be used in order to find different results. | Only looking for SPIT and MiTM attacks.
Expand the analytical knowledge by using WEKA's Simple CLI interface. Scripts can be written to allow the data processing to be executed automatically. | Analysis using the GUI interface is user friendly, but would not speed up the process.

In conclusion, we would like to highlight that VoIP security is one of the concerns raised by the VoIP community. Although the problem is still under control, system administrators are currently not equipped with the right tools to detect VoIP attacks as early as possible. In most cases Wireshark or another network sniffer is used to determine the condition of the network. We are trying to provide an alternative tool for system administrators in the form of pattern reports produced by a data mining tool such as WEKA.
