
COLLEGE OF HEALTH AND SCIENCE

SCHOOL OF COMPUTING AND MATHEMATICS

DETECTION OF BYPASSING TRAFFIC

Eliezer IDJALAHOUE

Master of Science (Honours)

Thesis supervisors: Dr Hon CHEUNG

Dr Ewa HUEBNER

June 2011

TABLE OF CONTENTS

1 Introduction .......... 1

1.1 Problem definition .......... 2

1.2 Scope and limitations .......... 4

1.3 Research method .......... 5

1.4 Thesis overview .......... 6

2 Background .......... 7

2.1 Basic concepts .......... 7

2.1.1 Network security overview .......... 7

2.1.2 Firewalls and packet filtering techniques .......... 8

2.1.3 Types of firewalls and their weaknesses to stop bypassing traffic .......... 11

2.1.3.1 Traditional Packet filters .......... 12

2.1.3.2 Dynamic packet filters .......... 14

2.1.3.3 Circuit level gateways .......... 14

2.1.3.4 Application level gateways .......... 15

2.1.3.5 Distributed firewalls .......... 16

2.2 Web proxy server and filtering mechanisms .......... 17

2.2.1 Web proxy .......... 17

2.2.2 Proxy filtering mechanisms .......... 18

2.3 Bypassing a web proxy .......... 19

2.3.1 Definition of bypassing .......... 19

2.3.2 Bypassing mechanism .......... 19

2.3.3 Bypassing techniques .......... 21

2.3.3.1 Encrypted tunnels .......... 22

2.3.3.2 Anonymizer or bypassing software .......... 23

2.3.3.3 CGI proxy server .......... 25

2.4 Risks of bypassing a firewall .......... 26

Detection of bypassing traffic

Page | iii

2.4.1 Financial impact..................................................................................................27

2.4.2 Productivity loss.................................................................................................27

2.4.3 Shortage in resources.........................................................................................28

2.4.4 Privacy concerns.................................................................................................28

2.5 Summary.......................................................................................................................29

3 Previous works..................................................................................................................30

3.1 Encrypted tunnels.......................................................................................................30

3.2 Anonymizers or bypassing software...........................................................................32

3.3 CGI proxy servers........................................................................................................33

3.4 Summary.....................................................................................................................34

4 Goals and experiments............................................................................................... 35

4.1 Goals...........................................................................................................................35

4.2 Experiments................................................................................................................36

4.3 Summary.....................................................................................................................37

5 Design and implementation..............................................................................................39

5.1 Network profiles.........................................................................................................40

5.1.1 Detection parameters……………………………..…………………………………………………...41

5.1.1.1 Size of embedded objects........................................................................41

5.1.1.2 Inter‐arrival time......................................................................................44

5.1.1.3 TCP flows ……………………………………………………………….............................. 45

5.2 Implementation of the testing network.....................................................................46

5.2.1 Topology of the testing network.......................................................................46

5.2.2 Hardware and configuration.............................................................................48

5.2.2.1 Physical machine......................................................................................48

5.2.2.2 Virtual machine 1: Proxy Firewall............................................................49

5.2.2.3 Virtual machine 2: Blocked server............................................................50

5.2.2.4 Virtual machine 3: Bypassing proxy.........................................................51

5.2.2.5 Virtual machine 4: Routing server............................................................52


5.2.2.6 Virtual machine 5: Client computer.........................................................53

5.2.3 Software............................................................................................................53

5.2.3.1 VMWare Workstation..............................................................................53

5.2.3.2 ISA server 2004........................................................................................55

5.2.3.3 XAMPP for windows.................................................................................55

5.2.3.4 Wireshark.................................................................................................56

5.2.3.5 Fiddler 2................................................................................................ 56

5.2.3.6 Glype proxy script....................................................................................57

5.2.3.7 Traffic generator......................................................................................57

5.3 Summary.....................................................................................................................60

6 Findings: Results and Analyses.....................................................................................61

6.1 Initial experiment: Profile building.............................................................................61

6.2 Single webpage results...............................................................................................62

6.3 Aggregated results.....................................................................................................64

6.4 Summary.....................................................................................................................67

7 Additional experiments.............................................................................................69

7.1 Physical network for accuracy evaluation................................................................. 69

7.2 Accuracy evaluation script.........................................................................................70

7.3 Accuracy evaluation of the detection approach........................................................73

7.3.1 Building phase of network profiles.................................................................. 73

7.3.2 Frequency distribution of the size of embedded objects………........................ 73

7.3.3 Frequency distribution of the header size of embedded objects.....................76

7.3.4 Frequency distribution of the payload size of embedded objects...................79

7.3.5 Inter‐arrival time.............................................................................................. 80

7.3.6 Number of TCP flows........................................................................................82

7.4 Detection rules............................................................................................................83

7.5 Results of the accuracy of the detection approach....................................................84

7.5.1 Results of HTTP bypassing mode…...................................................................85

7.5.1.1 Frequency distribution of the size of payload .......... 85

7.5.1.2 Frequency distribution of payload combined with inter‐arrival time .......... 86

7.5.1.3 Frequency distribution of payload combined with inter‐arrival time and the number of TCP flows .......... 88

7.5.2 Results of HTTPS bypassing mode .......... 90

7.5.2.1 Frequency distribution of the size of payload .......... 90

7.5.2.2 Frequency distribution of payload combined with inter‐arrival time .......... 91

7.5.2.3 Frequency distribution of payload combined with inter‐arrival time and the number of TCP flows .......... 92

8 Conclusion .......... 94

8.1 Contribution .......... 94

8.2 Future work .......... 94

References .......... 95

Appendix .......... 101

TABLE OF FIGURES

1.1 Plethora of articles and tutorials on the Internet on bypassing a proxy firewall .......... 2
1.2 CGI bypassing vs. SSH tunnel bypassing .......... 4
2.1 Illustration of a firewall .......... 9
2.2 Structure of network packets across the layers .......... 10
2.3 Parties involved during a bypassing scenario .......... 20
2.4 Bypassing through an encrypted tunnel .......... 23
2.5 Anonymizer: single‐point vs. networked design .......... 24
2.6 Implementation of a CGI proxy to bypass firewall restrictions .......... 26
4.1 Model for detecting CGI proxies’ traffic .......... 38
5.1 Retrieval of a web page requiring multiple GET to fetch each object .......... 43
5.2 Inter‐arrival time illustration .......... 44
5.3 Illustration of TCP flows and inter‐arrival time of each flow during a session .......... 46
5.4 Detail topology of the virtual network .......... 48
5.5 Flow chart of the traffic generator .......... 59
6.1 Single webpage results .......... 63
6.2 webpage 1 results .......... 65
6.3 webpage 2 results .......... 65
6.4 webpage 3 results .......... 66
6.5 webpage 9 results .......... 66
6.6 webpage 10 results .......... 67
7.1 Topology of the physical network for the accuracy evaluation .......... 70
7.2 Flow chart of the evaluation of the efficiency of the detection approach .......... 72
7.3 Comparison of the frequency distribution of the size of embedded objects within a web page in direct access, HTTP bypassing access and HTTPS bypassing access .......... 75
7.4 Comparison of the frequency distribution of the header size of embedded objects within a web page in direct access, HTTP bypassing access and HTTPS bypassing access .......... 78
7.5 Comparison of the frequency distribution of the payload size of embedded objects within a web page in direct access, HTTP bypassing and HTTPS bypassing modes .......... 80
7.6 Retrieval of www.uws.edu.au through the CGI proxy www.glypeproxy.com .......... 82
7.7 Evaluation of the accuracy of the detection approach according to the frequency distribution in HTTP bypassing mode .......... 86
7.8 Evaluation of the accuracy of the detection approach according to the frequency distribution and the inter‐arrival time in HTTP bypassing mode .......... 88
7.9 Evaluation of the accuracy of the detection approach according to the frequency distribution, the inter‐arrival time and the number of TCP flows in HTTP bypassing mode .......... 89
7.10 Evaluation of the accuracy of the detection approach according to the frequency distribution in HTTPS bypassing mode .......... 90
7.11 Evaluation of the accuracy of the detection approach according to the frequency distribution and the inter‐arrival time in HTTPS bypassing mode .......... 92
7.12 Evaluation of the accuracy of the detection approach according to the frequency distribution, the inter‐arrival time and the number of TCP flows in HTTPS bypassing mode .......... 93

TABLE OF TABLES

4.1 Total number of accesses for the experiments .......... 37
5.1 Description of the physical machine .......... 49
5.2 Description of the proxy server (virtual machine) .......... 50
5.3 Description of the blocked or blacklisted server (virtual machine) .......... 51
5.4 Description of the bypassing proxy (virtual machine) .......... 51
5.5 Description of the Routing Server (virtual machine) .......... 52
5.6 Description of the client computer (virtual machine) .......... 53
6.1 Traffic profile of initial access .......... 62
6.2 Detection conditions of a CGI proxy traffic .......... 68
7.1 Repartition of web pages in relation to the percentage of matches of the size of embedded objects in direct access compared to HTTP and HTTPS bypassing accesses .......... 76
7.2 Repartition of web pages in relation to the percentage of matches of the header size of embedded objects in direct access compared to HTTP and HTTPS bypassing accesses .......... 78
7.3 Repartition of web pages in relation to the percentage of matches of the payload size of embedded objects in direct access compared to HTTP and HTTPS bypassing modes .......... 80
7.4 Repartition of web pages in relation to the inter‐arrival time of the packets in direct access compared to HTTP and HTTPS bypassing accesses .......... 81
7.5 Repartition of web pages in relation to the number of TCP flows in direct access compared to HTTP and HTTPS bypassing accesses .......... 83
7.6 Detection rules of bypassing traffic .......... 84

LIST OF ACRONYMS

CGI : Common Gateway Interface
DNS : Domain Name Service
DoS : Denial of Service
FTP : File Transfer Protocol
HTML : Hypertext Markup Language
HTTP : Hyper Text Transfer Protocol
HTTPS : Hyper Text Transfer Protocol Secure
IP : Internet Protocol
ISA Server : Internet Security and Acceleration Server
NIC : Network Interface Card
NTFS : New Technology File System
OSI : Open System Interconnection
P2P : Peer‐to‐Peer
PHP : Hypertext Preprocessor
RAM : Random Access Memory
SSH : Secure Shell
SSL : Secure Socket Layer
SOCKS : Abbreviation from SOCKetS. Internet Protocol that enables Client/Server application to communicate transparently through firewalls
TCP : Transmission Control Protocol
TELNET : Network Virtual Terminal Protocol
URL : Uniform Resource Locator
VoIP : Voice over Internet Protocol
VPN : Virtual Private Network

ABSTRACT

Over the years, the Internet has become not a luxury asset but a necessary tool for businesses as well as for our daily life. Email, social networking, online shopping and banking are some of the services relying on the Internet to operate. As the popularity of the Internet increased through the decades, a huge surge of viruses and spyware circulating on the web has also been observed. Many network specialists respond to the threats posed by malicious programs by installing antivirus programs and deploying proxy firewalls at the boundary of private networks. The two main functions of proxy firewalls are to lower the risk of computers getting infected by malicious programs and to define and enforce security policies. However, sophisticated tools have been developed to bypass the restrictions of many proxy firewalls, granting unlimited access to internal users contrary to the security policies of their organisation. Herein lies the fundamental problem investigated in this thesis: how to detect traffic emulated by CGI proxies on a private network and thus avoid the bypassing of security policies.

This Master’s thesis covers the design and evaluation of a detection model of CGI proxies. The detection model is built from four non‐payload properties of IP packets: size of embedded object of a webpage, inter‐arrival time of inbound packets, average size of TCP packets and the number of TCP flows emulated by a browsing session. The detection system is tested on a virtual network in order to evaluate the correctness of the model. This virtual network reproduces the bypassing scenario involving four parties: a proxy firewall, a blocked web server, a CGI proxy and the client. An initial test is run by directly accessing web pages stored on the blocked server in order to create network traffic profiles. Two subsequent accesses are made to each webpage, in HTTP and in HTTPS, to find the correlation between direct access, HTTP bypassing access and HTTPS bypassing access. After proving the correctness of the model in a virtual network, bypassing experiments are then conducted in a physical network to evaluate the efficiency of the model in a more realistic situation. The dataset used for the experiments is artificially generated due to the lack of physical users.

STATEMENT OF AUTHENTICATION

The work presented in this thesis is, to the best of my knowledge and belief, original except as acknowledged in the text. I hereby declare that I have submitted this material, either in full or in part, for a degree at this or any other institution.

Eliezer IDJALAHOUE

ACKNOWLEDGMENTS

First of all, I am grateful to GOD for his infinite love towards me. May all the glory be given to him for leading me through the years in reaching this level of my studies.

Secondly, I owe my deepest gratitude to my supervisor Dr Hon CHEUNG for his support and valuable feedback. My thanks go to him for motivating and encouraging me throughout this research. His experience in this discipline has contributed tremendously to the completion of this thesis.

I would also like to thank Dr Ewa HUEBNER, my second supervisor, who has made her support available during the first half of my research.

I would like to show my gratitude to Carolle AKPAKA, my wife for her continual love and care. This work could not have been completed without her patience and understanding.

Thanks also to Reinier VEERMAN for sponsoring my studies in Australia and Michael JOSSEP for proof reading my thesis.

Also thanks to my family, for always praying for me and supporting me. Even though thousands of kilometres separate us, your emails and phone calls always bring a fresh breath to my life and offer me new hope.

Lastly, I extend my sincere gratitude to Harris TCHABOSSOU ANANI, CISCO expert, for spending hours with me to implement a working platform for my experiments.

Chapter 1 ‐‐ Introduction

Page | 1

CHAPTER 1

INTRODUCTION

Nowadays, the Internet has become a key instrument in the expansion and performance of

many companies. Organisations such as hospitals, universities and government institutions

are proceeding with the migration of their activities to the web platform allowing the

acceleration of transactions and making information easy to access for their employees as

well as for their targeted population. E‐commerce, web banking, online conferencing and

social networking are just some of the services relying on the Internet to operate. However,

the plethora of inter‐connected computer networks around the globe has triggered a

massive hunt for sensitive information and copyrighted materials by cyber criminals. In the

same way, the growing popularity of the Internet has brought an exponential growth of

viruses, spyware and Trojan horses. Therefore, the privacy and the protection of private

networks have emerged as a big concern for computer specialists. According to [1], an

unprotected computer can be infected by malicious programs in less than five minutes after

connecting to the Internet.

In the late 1980s [2], a new technology known as a firewall was introduced to combat threats

posed by cyber criminals and malicious programs. Through the years, firewalls evolved from

inspecting packets to highly sophisticated proxies performing complex tasks such as Deep

Packet Inspection (DPI), Intrusion Detection System (IDS), Network Address Translation

(NAT) and caching [1, 3]. Firewalls are deployed at the boundary of private or corporate

networks to create a security perimeter between the Internet and the private network.

They enhance the security of the network by inspecting and analyzing inbound and

outbound traffic. In other words, a firewall can automatically detect and block some attacks

originating from the Internet. In addition, firewalls can be used as a tool to limit access to

some resources on the Internet by enforcing an Access Control Policy (ACP). Firewalls have

gained a lot of momentum in network security, with a large number of institutions investing

huge amounts of money every year to acquire the latest firewall technology in order to

minimise the risk of attacks. The World Wide Web (WWW) is the main service through

which malicious code is introduced into private networks [4]. Therefore, a filtering

mechanism for web traffic is implemented in many firewalls to restrict access only

to websites considered safe and productive for the company. However, several techniques

exist to bypass the security policies of corporate firewalls and thereby gain unlimited access

to the web. At the time this research was conducted, the keywords “bypass a firewall”,

“bypass a censorship” or “bypass a proxy” entered into the search engine “google.com”

returned more than a million articles, tutorials and forum discussions on the topic (see

figure 1.1).

Figure 1.1: Plethora of articles and tutorials on the Internet on bypassing a proxy firewall.

1.1 PROBLEM DEFINITION

A previous study [5], completed in 2008, investigated the anomalies observed on a

private network resulting from the use of three bypassing techniques: SSH tunnels,

anonymizers (Lozdodge) and CGI proxies. This was achieved by choosing five network

metrics: throughput, amount of data received and sent by a client, number of

packets retransmitted, the format of the URLs and the average time required for the

retrieval of a webpage. From the experiments carried out throughout this research, it was

discovered that the use of SSH tunnels to bypass a firewall generated a high amount of data

sent, a low throughput and a high average time to complete an HTTP session [5]. As for the

anonymizer (in this case Lozdodge), anomalies were only observed with the average time

and the throughput [5]. Furthermore, the experiments on CGI proxies outlined anomalies

related to the throughput and the amount of data sent [5]. Yet, the anomalies detected

with the different bypassing techniques were not enough to implement a detection system.

Viruses and spyware can generate similar anomalies, which would cause the detection

system to trigger many false alerts, also known as “false positives”. A more specific

investigation, focused on the properties of the TCP packets exchanged during a bypassing

session instead of aggregated statistics of the session, was then necessary to increase the

robustness of the proposed detection system.
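
To make the distinction concrete, the following is a minimal sketch of the kind of session-level aggregates discussed above (amount of data sent and received, session duration and throughput). It is an illustration only, not the tooling used in [5]: the `Packet` record and the `session_aggregates` function are hypothetical names, and the input is assumed to be a list of already captured packets belonging to one HTTP session.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical record for one captured packet of a session."""
    timestamp: float  # seconds since the start of the capture
    size: int         # packet size in bytes
    outbound: bool    # True if the packet was sent by the client


def session_aggregates(packets):
    """Return aggregated statistics for one session (illustrative only)."""
    if not packets:
        raise ValueError("a session needs at least one packet")
    sent = sum(p.size for p in packets if p.outbound)
    received = sum(p.size for p in packets if not p.outbound)
    start = min(p.timestamp for p in packets)
    end = max(p.timestamp for p in packets)
    duration = end - start
    # Throughput in bytes per second; zero when all packets share a timestamp.
    throughput = (sent + received) / duration if duration > 0 else 0.0
    return {
        "bytes_sent": sent,
        "bytes_received": received,
        "duration_s": duration,
        "throughput_Bps": throughput,
    }


example = [
    Packet(0.0, 100, True),
    Packet(0.5, 1500, False),
    Packet(2.0, 400, False),
]
stats = session_aggregates(example)
```

Because such aggregates collapse a whole session into a few numbers, unrelated traffic can produce the same values, which is the weakness the paragraph above points out.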

In this research, the investigation is narrowed down to CGI proxies especially those

implemented with both HTTP and HTTPS protocols. This decision is motivated by the fact

that CGI proxies are free, easy to use and the most popular technique. In fact, thousands of

CGI proxies are accessible on the Internet free of cost to bypass proxy firewalls. On one

hand, the SSH and anonymizer techniques require the installation of software and in some

cases an advanced knowledge in networking such as the mastering of port forwarding, the

configuration of an SSH server and the modification of a web browser’s settings to use a

SOCKS proxy. On the other hand, with the CGI proxy bypassing technique, a user can bypass

a proxy firewall by simply typing the URL of the CGI proxy into his web browser. In other

words, no software installation and no configuration are required for the CGI proxy

technique (see figure 1.2).

The main problem addressed in this research is how to find patterns related to the use of

CGI proxies on private networks. Hence, the goal of this thesis is to test the correctness of a

detection model for the fingerprinting of CGI proxies’ traffic that can be used to increase



the efficiency of proxy firewalls. This investigation will design and test a detection

mechanism for proxy firewall bypassing traffic in a virtual network. A real world platform

made of the different parties involved during a bypassing technique will be reproduced in a

virtual network. This thesis shall investigate possible patterns related to CGI proxies through

network profiles built from:

The size of the objects embedded within a webpage

The inter‐arrival time of inbound packets

The number of TCP flows

The average size of the packets
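As an illustration only, the four profile features listed above could be computed from a packet trace along the following lines. The `Packet` record, its field names and the `build_profile` helper are hypothetical and are not taken from the thesis; the sketch assumes that each packet has already been assigned to a TCP flow and that the sizes of the embedded objects have been extracted separately.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    timestamp: float  # arrival time in seconds
    size: int         # packet size in bytes
    flow: tuple       # (client_ip, client_port, server_ip, server_port), same for both directions
    inbound: bool     # True if the packet enters the private network


def build_profile(packets, object_sizes):
    """Summarise one webpage retrieval with the four features listed above."""
    inbound_times = sorted(p.timestamp for p in packets if p.inbound)
    gaps = [later - earlier for earlier, later in zip(inbound_times, inbound_times[1:])]
    return {
        "object_sizes": sorted(object_sizes),
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
        "tcp_flows": len({p.flow for p in packets}),
        "mean_packet_size": mean(p.size for p in packets) if packets else 0.0,
    }


# Hypothetical trace: two TCP flows between one client and one server.
trace = [
    Packet(0.00, 60, ("10.0.0.5", 50000, "203.0.113.7", 80), False),
    Packet(0.10, 1500, ("10.0.0.5", 50000, "203.0.113.7", 80), True),
    Packet(0.30, 1500, ("10.0.0.5", 50000, "203.0.113.7", 80), True),
    Packet(0.35, 540, ("10.0.0.5", 50001, "203.0.113.7", 80), True),
]
profile = build_profile(trace, object_sizes=[2048, 512])
```

Profiles built this way for direct browsing and for CGI proxy browsing could then be compared feature by feature.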

Figure 1.2: CGI bypassing vs. SSH tunnel bypassing. (a) CGI proxy bypassing: the user needs no configuration and simply uses the URL of the CGI proxy to reach the blocked server. (b) SSH tunnel bypassing: the user must configure an SSH client, use a SOCKS proxy, set up port forwarding and configure the web browser, while the SSH proxy requires an installed and configured SSH server and an open port on the router.

1.2 SCOPE AND LIMITATIONS

Many techniques exist to bypass proxy firewalls, but the most popular are encrypted tunnels, anonymizers and CGI proxies. Over the past decades, many investigations, with good detection accuracy, have been conducted on identifying encrypted tunnels and anonymizers that carry unauthorized traffic. However, few investigations have been done on CGI proxies. This thesis will provide background information on the three main bypassing techniques, but the investigation will focus on CGI proxies rather than on encrypted tunnels and anonymizers. The experiments will be carried out in a virtual network. However, the accuracy of the proposed detection approach will be tested in a real network.

1.3 RESEARCH METHOD

In [6], Gordana identified three main scientific methods to approach computer science

problems:

Theory: This scientific approach is based on logic and sound mathematics to build

theories as well as proving or deriving theorems. In this branch of computer science,

researchers seek to design and evaluate the performance of new algorithms,

understand computational problems and investigate solutions.

Experiment: In this discipline, scientific experiments are conducted on computational

phenomena with the aim to verify hypotheses or create new models.

Simulation: In this approach, scientists investigate a real world situation or

computational phenomena by conducting experiments in virtual laboratories instead

of building physical ones. In this quest, applied mathematics as well as

experimentation and applied theory investigation are intensively used by

researchers for simulation [6].

Prior to using one of the scientific approaches mentioned above, modelling is always

applied to the computational phenomenon. During the modelling process, the phenomenon is

analyzed, simplified and reduced to an understandable model that can be studied. In this

thesis, the main approach has been on the simulation of a detection model of bypassing

traffic in a virtual network. Nonetheless, theory is used to formulate the problem while

experimentation contributed to the evaluation of the performance of the detection model.



1.4 THESIS OVERVIEW

This chapter has described the network breach investigated in the present research and

outlined the scope and limitations of the study. Moreover, both the methodology adopted

for the investigation and the aims of the thesis are defined in this chapter. The rest of the

thesis is organized as follows. Chapter 2 will present some relevant background information

on firewalls and provide an overview about web proxy and content filtering concepts. That

chapter will also explain how the bypassing of proxy firewalls is achieved and will discuss

the potential threats posed by CGI proxies to corporate networks. Related work in the

discovery and blocking of the bypassing of proxy firewalls will be presented in Chapter 3. In

Chapter 4, the goals of this work are clearly outlined. Chapter 5 will provide more details

about the design of the detection model and the implementation of the virtual network

used for testing. The results of the correctness of the detection model will be presented and

discussed in Chapter 6. Chapter 7 has been added to the research to evaluate the accuracy

of the detection model in a more realistic environment and on a large dataset. Finally, Chapter

8 will provide a conclusion to the investigation and identify further work.

Chapter 2 ‐‐ Background


CHAPTER 2

BACKGROUND

This chapter presents some relevant background information. Section 2.1 introduces basic

information on firewalls while Section 2.2 focuses on the concepts of web proxy and

content filtering. Section 2.3 explains the mechanism utilised to perform the bypassing of a

proxy firewall. Finally, Section 2.4 outlines the risks posed by bypassing traffic to private and

corporate networks.

2.1 BASIC CONCEPTS

To help in the understanding of the work presented in this thesis, an overview of firewalls

is given in this section. In addition, firewall classifications according to the OSI

architecture and the main purpose of each type of firewall are outlined. The concepts of

a web proxy, content filtering and blacklists are also clarified.

2.1.1 Network security overview

The Internet was created to provide connectivity between computers and offer an

infrastructure for resource and service sharing [2]. Rapidly, a lot of communities such as

corporations, universities, schools, government institutions, hospitals, banks and private

users joined the Internet to speed up or improve their activities. The openness of the

Internet to various communities also means the Internet is open to more sinister

communities which consist of hackers and other cyber criminals. The exponential growth of

users, throughout the years, also resulted in a dramatic increase of the number of attacks

and infections [7, 8, 9]. This can be easily explained by the fact that security mechanisms



were not implemented in the technology [9] supporting the Internet such as protocols. The

Internet was initially developed to share information between trusted parties. Therefore,

security was not considered an issue as the Internet was not a public tool. TCP/IP for

instance, has many weaknesses that can lead to attacks such as DoS attacks, impersonation

of a trusted party through IP spoofing and the interception of messages [9]. Through the

years, many filtering techniques have been developed to overcome some of the

weaknesses of the TCP/IP protocol. Some of these techniques involve the scanning of

inbound traffic for viruses, Trojan horses and spyware, deep packet inspection, the blocking

of vulnerable services and intrusion detection systems.

2.1.2 Firewalls and packet filtering techniques

Firewalls were introduced in the late 1980s [2] to prevent and minimise attacks on

private networks as well as to increase users’ privacy. Routers were the first tools used to

separate networks [2] before the introduction of firewalls but rarely provided any method

of security.

According to [7], a firewall can be defined as:

“A network security product that acts as a barrier between two or more network

segments. The firewall is a system that provides an access control mechanism

between your network and the network(s) on the other side of it.”

A firewall is a computer system made of software, hardware or a combination of both [10]

that enforces an access control policy between a trusted network and an untrusted network

such as the Internet as seen in Figure 2.1. Mostly, a firewall is located at the outer boundary

of a private or corporate network and controls the outgoing and incoming traffic. Many

corporations and organizations deploy a firewall as the first line of defence against attacks.

A firewall offers a network administrator the possibility to block or allow access to some resources according to a security policy. For instance, HTTP traffic can be allowed to pass through a firewall using port 80 while FTP traffic, which is commonly found on port 21, is blocked. The filtering of port numbers is one of the keys to setting the allowed/disallowed access in a firewall security policy.

Furthermore, a firewall is also a good tool for monitoring a network and logging sensitive information about security breaches [7, 3]. For example, the Uniform Resource Locators (URLs) visited by a user, the amount of data received or sent during a session, the date, time and IP addresses of the computers connected to a server are some of the statistics that a firewall can record. This information is very important when performing forensic investigations after an attack has been detected on a network.

Figure 2.1: Illustration of a firewall placed between a trusted network and the Internet. In this figure the firewall sits at the edge boundary of the private network.

The OSI architecture is made of 7 layers: physical, data link, network, transport, session, presentation and application. In practice, the 7 layers of the OSI architecture are reduced to 4 main layers [11]: application layer, transport layer, internet layer and physical layer. On one hand, a header is appended to existing data at each layer as the packet is transmitted from a higher layer to a lower layer when packets are exiting a network. On the other hand, when the packets are entering a network, the header data corresponding to each layer is stripped from the existing data as the packets travel from a lower layer to a higher layer. As seen in Figure 2.2, a packet is made of two main parts: a header and a payload. Filtering mechanisms are enforced on a network by either scrutinising the information contained in the header of the packet or inspecting the payload of the packet for keywords or anomaly signatures. However, this is only achievable if the packets are not encrypted.

Figure 2.2: Structure of network packets across the layers.

Overall, four filtering mechanisms can be identified:

Physical layer filtering: At this stage, the filtering of network packets is performed

by inspecting the Ethernet header of network packets. Therefore, the only

parameter of interest at the physical layer is the MAC address. An attack is possible

at this layer if the attacker has direct access to a private network device such as a

user’s computer or a router. ARP spoofing and packet sniffing can then be used to

collect network traffic and scan for vulnerabilities such as user accounts and

passwords.

Internet layer filtering: The source IP and destination IP are two main parameters

for filtering network traffic at this layer [12, 11]. These parameters are contained in

the IP header of the packets. Most often, a blacklist and a whitelist are enforced on

the perimeter of a private network to restrict and allow access to some IP addresses,

respectively. An Intrusion Detection System (IDS) can fingerprint policy violations by

inspecting the source IP and destination IP of the packets [13] derived from the IP

header of the packets. In addition, spoofed IP and inconsistent IP headers are more

likely to be detected by an IDS at this layer.

[Figure 2.2 content: Application layer: Payload (Plaintext). Transport layer: Payload (Plaintext), TCP Header. Internet layer: Payload (Plaintext), TCP Header, IP Header. Physical layer: Payload (Plaintext), TCP Header, IP Header, Ethernet Header.]

Transport layer filtering: At this layer, security policies are established based on the

source port and destination port contained in the TCP header. The port number is a

reliable parameter to identify the service running on a host machine. For example,

Ports 80, 443 and 22 are the default ports for HTTP, HTTPS and SSH traffic, respectively.

Therefore, at the transport layer, security policies rely on the source and destination

port numbers to block or allow services on a private network. Furthermore, some

network attacks are traceable by an Intrusion Detection System (IDS) using the

source port and destination port of TCP packets. These attacks range from DoS

and SYN attacks to port scanning [13].

Application layer filtering: The application layer is the highest layer in the OSI

architecture. At this stage, the filtering is applied on the payload of the packets. As a

result, the payload is searched for keywords in the access control policies. In

addition, the payload can be scanned for viruses, spyware and Trojan horses.

Malware and suspicious code are detected either by using signatures or by detecting

anomalies related to the malware. Intrusion Detection Systems implemented at this

layer are able to detect anomalies and attacks related to protocols such as HTTP,

DNS and FTP [13].
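The four filtering mechanisms described above can be sketched as successive checks on a single frame. This is a simplified illustration rather than a real firewall: the `Frame` record, the policy sets and the `filter_frame` function are hypothetical names, and the policy values are chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    src_mac: str
    src_ip: str
    dst_ip: str
    dst_port: int
    payload: bytes


# Hypothetical policy values, chosen only for this example.
BLOCKED_MACS = {"de:ad:be:ef:00:01"}
BLOCKED_IPS = {"198.51.100.9"}
ALLOWED_PORTS = {80, 443, 22}
BLOCKED_KEYWORDS = (b"spyware",)


def filter_frame(frame):
    """Return the name of the layer that rejects the frame, or None if it passes."""
    if frame.src_mac in BLOCKED_MACS:                        # physical layer: MAC address
        return "physical"
    if frame.src_ip in BLOCKED_IPS or frame.dst_ip in BLOCKED_IPS:
        return "internet"                                    # internet layer: IP blacklist
    if frame.dst_port not in ALLOWED_PORTS:                  # transport layer: port number
        return "transport"
    if any(k in frame.payload.lower() for k in BLOCKED_KEYWORDS):
        return "application"                                 # application layer: payload keywords
    return None


allowed = Frame("aa:bb:cc:dd:ee:ff", "10.0.0.5", "203.0.113.7", 80, b"GET / HTTP/1.1")
ftp = Frame("aa:bb:cc:dd:ee:ff", "10.0.0.5", "203.0.113.7", 21, b"USER anonymous")
```

The keyword check only works on plaintext payloads; once the payload is encrypted, the last check no longer sees the keywords.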

2.1.3 Types of firewalls and their weaknesses to stop bypassing traffic

William et al. [8] classify firewalls into five types according to the OSI architecture. As seen in

Figure 2.2, each type of firewall is implemented at a specific layer of the OSI architecture. These

five types of firewall are:

Packet filters;

Application level filtering;

Circuit level gateways;

Dynamic packet filters;

Distributed firewalls.



2.1.3.1 Traditional Packet filters

Packet filtering firewalls are implemented at the network layer of the OSI architecture [7, 3]. They allow a packet to enter or exit a network, or prevent it from doing so, based on information contained in the packet headers [8, 14]. The source IP address, the destination IP address and the transport level protocol found in the IP header, together with the source and destination ports found in the transport header, are some of the properties used for filtering a packet [2]. The blocking mechanism of packet filtering firewalls is fast but the filtering rules are difficult to implement [2]. This mechanism is present in a lot of routers through a program incorporated into the hardware [8]. Although the use of packet filters on private networks is widespread, they do not stop malicious traffic, such as bypassing traffic, from crossing the security perimeter.

Packet filters are unable to stop bypassing traffic because some fields of the IP header,

such as the IP address, can be altered. The weaknesses of packet filters are:

Spoofing: This attack is achieved by changing the source IP address of a packet to a

random or spoofed IP address. Spoofing is tricking a user or a computer into thinking

that a packet is originating from a trusted source while it is not [15]. This is easily

achievable because the authentication of other parties is not implemented with TCP

[16]. Spoofing is mostly used to bypass the access control policies of packet filters

and is difficult to detect for many firewalls and intrusion detection systems [16]. In

the CGI bypassing scenario, the CGI proxy spoofs the packets received from a

blacklisted server by replacing the source IP of the packets with its own IP address.

Source routing: Source routing is a technique used to specify the route a packet

must take through a network [17, 18]. The route of a packet is either specified by the

sender (source) or the network device receiving the packet. If the path of a packet is



not defined by the sender, the router or network device receiving the packet will

decide on which route to forward the packet. Source routing is an ideal tool utilized

by hackers to bypass firewalls [17] and access computers which are normally blocked

by the firewall. This technique is similar to the spoofing attack during the CGI

bypassing technique. By adding a CGI proxy between a packet filter firewall and a

blocked server, a user is able to go around the access control rules.

Fragmentation attacks: The transmission of large packets is enabled by the Internet

protocol (IP) through a mechanism called fragmentation. This mechanism consists

of splitting a large packet into small packets, each containing an offset for

reassembling. These fragments are transmitted through a network and reassembled

at the other end to reconstruct the original packet [12]. Packet filtering firewalls

check the authenticity of the first fragmented packet of the original packet and

allow the remaining fragmented packets to pass through if the header data of the

first packet complies with the access control policies [15]. By doing this, a firewall

can permit unauthorized traffic to enter the network. Fragmentation attacks are

categorized in two main groups: tiny fragment attack and overlapping attack [53,

15]. In a tiny fragment attack, the TCP header information is sent to a packet filtering

firewall in three small fragments [52]. Packet filtering firewalls will fail to block the

first fragmented packet from entering the network due to the first packet not

containing all the TCP header information necessary to authenticate the packet. Because the

TCP header data is split into the three smaller fragments [51, 15], the

filtering mechanism is unable to check the legitimacy of the following incoming

packets. The overlapping fragment attack is achieved by sending a zero offset packet

containing incomplete data or a legitimate TCP header complying with the firewall

rules. Additional non‐zero offset packets are then transmitted to modify the TCP

header data during the reassembling process [15] resulting in a malicious packet.
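The overlapping fragment attack described above can be illustrated with a toy reassembly routine. This is a minimal sketch under strong simplifying assumptions: the "TCP header" is reduced to a source port and a destination port, fragment offsets are counted in bytes rather than in the 8-byte units used by IP, and the functions `first_fragment_allowed` and `reassemble` are hypothetical names, not part of any real firewall.

```python
import struct

ALLOWED_PORTS = {80}  # hypothetical policy: only destination port 80 is allowed


def first_fragment_allowed(fragments):
    """Naive filter: inspect only the zero-offset fragment, as described above."""
    offset, data = min(fragments, key=lambda f: f[0])
    _src_port, dst_port = struct.unpack("!HH", data[:4])
    return offset == 0 and dst_port in ALLOWED_PORTS


def reassemble(fragments):
    """Rebuild the packet; later fragments overwrite any overlapping bytes."""
    size = max(offset + len(data) for offset, data in fragments)
    buffer = bytearray(size)
    for offset, data in fragments:  # processed in arrival order
        buffer[offset:offset + len(data)] = data
    return bytes(buffer)


# The zero-offset fragment shows destination port 80; the second fragment
# overlaps bytes 2-3 and rewrites the destination port to 23 on reassembly.
fragments = [
    (0, struct.pack("!HH", 40000, 80) + b"DATA"),
    (2, struct.pack("!H", 23)),
]
passed_filter = first_fragment_allowed(fragments)
src_port, dst_port = struct.unpack("!HH", reassemble(fragments)[:4])
```

In this sketch the filter accepts the packet on the basis of the first fragment, yet the reassembled header carries a different destination port.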



2.1.3.2 Dynamic packet filters

Dynamic packet filters are another method of preventing attacks on private networks. They

are also referred to as stateful packet inspection firewalls. They are implemented at the

transport layer of OSI architecture [7, 17] and offer transparency to users while applying

security measures. Dynamic packet filters apply security mechanisms during the

establishment of a connection by recording session information such as source IP address,

destination IP address, source port and destination port. This allows them to maintain an

array of active and authorized connections in order to monitor the traffic. All incoming

packets are then analysed against the active connections table to determine whether the

packet is legitimate or unwanted. Dynamic packet filters offer a higher security level [8]

than packet filtering firewalls by keeping track of the state of open connections and

matching them with inbound traffic to detect unwanted traffic.

Dynamic packet filters have similar weaknesses to traditional packet filters. Spoofing the IP

address of incoming packets will allow the packets to enter the private network. Because the

connection to the blocked server is established through the CGI proxy, dynamic packet

filters are powerless to detect traffic originating from a blocked server.
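The connection table kept by a dynamic packet filter, and the reason it cannot see a blocked server hidden behind a proxy, can be sketched as follows. The `StatefulFilter` class and its method names are hypothetical; real stateful firewalls also track TCP state and timeouts, which are left out here.

```python
class StatefulFilter:
    """Toy connection table keyed on (src_ip, src_port, dst_ip, dst_port)."""

    def __init__(self):
        self.connections = set()

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Record a connection opened from inside the private network."""
        self.connections.add((src_ip, src_port, dst_ip, dst_port))

    def inbound_allowed(self, src_ip, src_port, dst_ip, dst_port):
        """Accept an inbound packet only if it answers a recorded connection."""
        return (dst_ip, dst_port, src_ip, src_port) in self.connections


fw = StatefulFilter()
# The internal client opens a connection to a (non-blocked) CGI proxy.
fw.outbound("10.0.0.5", 50000, "203.0.113.7", 80)

# Replies relayed by the proxy match the table, whatever server produced the content.
reply_from_proxy = fw.inbound_allowed("203.0.113.7", 80, "10.0.0.5", 50000)
# An unsolicited packet from another host does not match any entry.
unsolicited = fw.inbound_allowed("198.51.100.9", 80, "10.0.0.5", 50000)
```

The table only records the proxy's address, so content fetched from a blocked server on the client's behalf arrives as part of an authorized connection.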

2.1.3.3 Circuit level gateways

Circuit‐level gateways work at the transport layer of OSI architecture. During the

establishment of TCP connections, they create a virtual circuit between the source and the

destination by acting like a relay host or man in the middle [8]. A TCP connection initiated

by a client is terminated at the circuit level gateway which establishes another TCP

connection with the external server in order to handle the user’s requests [18]. Contrary to

the packet filters, this type of firewall does not allow packets to flow from end to end. The

IP address of the clients and other connection information are concealed by the circuit level



gateway. For example, an external server will only see the IP address of the circuit level

gateway instead of the client’s IP address. In the early days of the Internet, circuit level

gateways were used to bridge two networks [8]. This type of firewall hides the topology of

private networks and provides authentication, audit and logging mechanisms. By doing

that, circuit level gateways provide a higher security environment to private networks

compared to simple packet filters. In addition, statistics recorded by circuit level gateways

are very useful for forensic investigations to reconstruct the source of an attack.

As with the two previous types of firewalls, circuit level gateways are vulnerable to IP

spoofing. This vulnerability can therefore be exploited by users of the World Wide Web (WWW)

to get unauthorized traffic past the gateway by routing it through a CGI proxy.

2.1.3.4 Application level gateways

Application level gateways are the most advanced type of firewalls. They are implemented

at the application layer of the OSI architecture. Mostly, they are deployed on private

networks and act as an intermediary [7, 15] between internal users and the Internet during

TCP sessions. All requests made by internal users go through the application level gateway.

Authorized requests are then appended with the identification information of the

application level gateway and forwarded to the intended server. This transformation protects

internal users and hides the topology of the private network. Apart from acting as an

intermediary, an application level gateway can perform deep packet inspection on

inbound and outbound traffic. In other words, they can scrutinise the payload of incoming

and outgoing packets because they are working at the application layer of the OSI model.

For instance, an application level gateway acting as a web proxy can block all HTTP requests

containing the word “hackers” or “virus” or “download spyware”. In addition, the

monitoring and logging [7, 10, 18] of users’ activities are easily achievable by application

level firewalls. For example, they can record the URLs visited by users, the connection

attempts made to a server and the date and time of the sessions. Network

administrators can use this information to identify the source of an attack or investigate the

entrance point of infections.

Contrary to the other types of firewalls, application level gateways perform a deep

inspection of the packets by analysing the content of the payload. In other words, the

payload is searched for keywords categorised as illegal in the access control policies.

However, the bypassing of application level gateways is still achievable by using an

encrypted channel to hide the payload and therefore defeat the content filtering of the firewall.

2.1.3.5 Distributed firewalls

A distributed firewall is a new concept and implementation of a firewall system. This form

of firewall is a newer technology and is more secure and efficient than the more traditional

types of firewalls which have been operating for decades. With this type of firewall, the

client is responsible for enforcing the security policies which are provided by the main

firewall. The main firewall’s role is to provide the security rules and to supervise the client’s

enforcement [8] of these rules. With this type of firewall, the enforcement of the security

policies is decentralised to the clients [8]. A distributed firewall operates according to the

server/client concept. On one hand, the server representing the central firewall maintains a

database of security rules. This central firewall is responsible for providing the security rules

to each client and ensuring that these rules are enforced. On the other hand, the client

drops a packet or allows a packet to be transmitted on the network in accordance with the

access control policies.

Distributed firewalls are more efficient compared to the other types of firewalls. More

specifically, network traffic is analysed on the server and the client side of the firewall to

detect policy violations. However, distributed firewalls are also vulnerable to IP spoofing.



2.2 WEB PROXY SERVER AND FILTERING MECHANISMS

This section clarifies the distinction between a firewall and a web proxy. The aim

of a web proxy is to filter web traffic while a firewall is limited to a specific function

depending on its type as described in Section 2.1. A censorship mechanism is mostly

installed on a firewall to filter web traffic by inspecting web protocols such as HTTP, HTTPS,

DNS and FTP. In general, a web proxy is a combination of hardware and software. First, the

hardware is used to separate two networks and to define the rules for allowing traffic in

and out of the private network. This hardware can be a router or a computer. Then, a piece

of software is installed on this hardware to handle web traffic and restrict access to

some resources on the World Wide Web.

2.2.1 Web proxy

A proxy server or proxy firewall is a computer system or application, located between two

networks or two computers, which acts as an intermediary. For example, the

employees of a company may use a proxy server to connect to the Internet. This is done to

ensure the security of the private network and improve the network performance through

caching.

A web proxy is aimed at the filtering of web traffic. In other words, a web proxy applies

filtering mechanisms on web related protocols such as HTTP, FTP and HTTPS [19]. User

authentication enforcement is also part of the role performed by a web proxy [20]. A web

proxy receives requests from users in the form of a Uniform Resource Locator (URL). All

requests conforming to the access control policies are completed while unauthorized

requests are simply rejected by the web proxy. In most cases, a web proxy is the point of

entrance and exit of web traffic. So, it is the best place to control the browsing activities of

users by applying filtering mechanisms.



2.2.2 Proxy filtering mechanisms

Many filtering mechanisms are implemented in web proxies to restrict the access to

external web servers classified as unsafe for private networks. According to Michael E.

Whiteman [21], a basic proxy server has at least two filtering mechanisms:

URL filtering: This filtering technique can be achieved in two modes [19]. In the first

mode, a network administrator creates and regularly updates a list of forbidden

websites called a blacklist to perform URL blocking. Blacklists are also

commercialised by web filtering companies and can be acquired by network

administrators and uploaded onto the proxy firewall. Access to a website is denied by

using a domain name such as www.example.com. In other words, when a domain

name is blacklisted all the URLs referring to this domain are automatically banned by

the proxy server. URLs are mostly categorised according to the content of the

website such as news or games, and stored in plain‐text [21]. In this mode, a user

has access to all the websites except those listed in the blacklist. The first mode is

also referred to as “block some and allow the rest” filtering. The concept of the second

mode is opposite to the first mode. Instead of blocking some URLs and allowing the

other URLs, in the second mode, network administrators use a list of authorized

URLs called a whitelist and block all the rest of the websites.

Content filtering: In this filtering technique, network traffic is deemed legal by

scrutinising the payload of the packets. That is to say that a deeper inspection is

performed on the packet in search of keywords explicitly classified by the network

administrator as harmful. For example, universities and schools block websites

containing the keyword “drugs” to avoid students being exposed to illicit drugs.

In addition to the two filtering mechanisms described above, Ari Luotonen [21] outlined

that the filtering can also be applied to the headers of the packets. For instance, headers



containing users’ credentials such as username and password should not be forwarded to

the Internet without removing this information [21].
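The two URL filtering modes and the keyword-based content filtering described in this section can be sketched as follows. The function names `url_allowed` and `content_allowed`, the `mode` parameter and the example domains are assumptions made for illustration; they do not describe any particular proxy product.

```python
from urllib.parse import urlsplit


def url_allowed(url, domains, mode):
    """mode="blacklist": block listed domains and allow the rest.
    mode="whitelist": allow listed domains and block the rest."""
    host = urlsplit(url).hostname or ""
    # A listed domain also covers every URL and sub-domain under it.
    listed = any(host == d or host.endswith("." + d) for d in domains)
    return not listed if mode == "blacklist" else listed


def content_allowed(body, keywords):
    """Reject a page whose (plaintext) body contains a forbidden keyword."""
    text = body.lower()
    return not any(k.lower() in text for k in keywords)


blacklist = {"www.example.com"}
blocked = url_allowed("http://www.example.com/games/index.html", blacklist, "blacklist")
other = url_allowed("http://news.example.org/today", blacklist, "blacklist")
whitelisted = url_allowed("http://news.example.org/today", {"example.org"}, "whitelist")
page_ok = content_allowed("Cheap DRUGS here", ["drugs"])
```

Both checks depend on what the proxy can read: the host name of the request for URL filtering and an unencrypted body for content filtering.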

2.3 BYPASSING A WEB PROXY

This section clarifies the concept of bypassing a firewall. It also explains the mechanism

used to bypass a firewall. The three techniques commonly used on the Internet are also

presented in this section.

2.3.1 Definition of bypassing

The bypassing of a firewall is a breach of security policies enforced on a private network. It

is the routing of unauthorized traffic through a bypassing proxy in order to get around the

access control rules of a firewall. This is achieved by routing unauthorized traffic through

encrypted tunnels or CGI web servers with the purpose of fooling the firewall into believing

that the traffic is originating from a trusted source.

2.3.2 Bypassing mechanism

Three main parties are involved in the bypassing of a firewall. These are as follows:

An anonymous user: A computer inside a private network is called an anonymous

user. This computer connects to the bypassing proxy and redirects all its requests to

this server instead of connecting directly to an external server on the Internet. An

anonymous user is also called the client or source of a request.

Unauthorized server: This is a computer located outside a private network, generally

on the Internet. An unauthorized server is a server blocked by a firewall. Access to



this server is disallowed to users in order to prevent infections by viruses and

spyware. An unauthorized server is also known as the destination of a request.

A bypassing server: This server is located on the Internet and acts as an

intermediary system between an anonymous user and an unauthorized server. The

requests received by a bypassing proxy from a client are signed with the

identification information of the bypassing proxy such as the IP address and then

forwarded to the intended server. After the retrieval of the data from the intended

server, the packets are updated once again with the identification information of the

bypassing proxy and relayed back to the client. Figure 2.3 outlines the different

phases and transformations performed on the packets during a bypassing process.

Figure 2.3: The three parties involved in the bypassing scenario of the firewall.

As can be seen in Figure 2.3, the packets originating from an anonymous user go through

a modification process before being forwarded to the unauthorized server. The same

applies to the packets returned by the unauthorized server. During the bypassing process,

the anonymous user is not directly connected to the unauthorized server but to the

bypassing server. The unauthorized server will never see the anonymous user because the

requests arriving at it originate from the bypassing server.
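The address rewriting performed by the bypassing server can be sketched with the three parties and the labels IP 1, IP 2 and IP 3 used in Figure 2.3. The `Packet` record and the functions `firewall_allows` and `proxy_forward` are hypothetical, and the firewall is reduced to a simple IP blacklist for the purpose of the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: str


USER_IP, PROXY_IP, SERVER_IP = "IP 1", "IP 2", "IP 3"
BLACKLIST = {SERVER_IP}  # the unauthorized server is blocked by the firewall


def firewall_allows(packet):
    """Toy firewall: reject any packet whose source or destination is blacklisted."""
    return packet.src_ip not in BLACKLIST and packet.dst_ip not in BLACKLIST


def proxy_forward(packet, target_ip):
    """Bypassing server: re-address the packet with its own identification."""
    return replace(packet, src_ip=PROXY_IP, dst_ip=target_ip)


direct = Packet(USER_IP, SERVER_IP, "GET page")               # blocked at the firewall
request = Packet(USER_IP, PROXY_IP, "GET page from IP 3")     # crosses the firewall
to_server = proxy_forward(request, SERVER_IP)                 # IP 2 -> IP 3
response = Packet(SERVER_IP, PROXY_IP, "page content")
to_user = proxy_forward(response, USER_IP)                    # IP 2 -> IP 1, crosses the firewall
```

The firewall only ever sees packets exchanged between IP 1 and IP 2, which is why a blacklist entry for IP 3 has no effect.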

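[Figure 2.3 content: the anonymous user (IP address: IP 1) sits inside the private network; the bypassing server (IP address: IP 2) and the unauthorized server (IP address: IP 3) are outside it. Packets sent by the anonymous user carry Source IP: IP 1 and Destination IP: IP 2.]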

2.3.3 Bypassing techniques

Nowadays, three common techniques are used to bypass firewalls. These techniques are

easy to perform because several ports are opened by network administrators on firewalls in

order to communicate with other networks mainly with the Internet. A good firewall must

allow a corporation to conduct its activities on the Internet while guarding the private

network of the corporation from all sorts of attacks. To achieve this, many network

administrators allow HTTP (port 80), HTTPS (port 443), SSH (port 22) and DNS (port 53)

traffic. Ports 80 and 443 are mostly opened for users to browse and retrieve information

from the Internet. In addition, the encryption, authentication and integrity mechanisms

offered by the SSH protocol [22] make it one of the favourite tools for network administrators

to access remote servers.

The SSH protocol is preferred to the TELNET protocol because TELNET lacks the three

mechanisms offered by SSH. That is to say that TELNET can easily be exploited to perform

attacks. DNS traffic is authorised for the resolution of addresses and is necessary for

accessing resources on the Internet. Some of the protocols listed above can be exploited to

encapsulate other traffic. For example, a user can exploit the forwarding and encapsulation

possibilities of SSH to convey HTTP traffic.

All three bypassing techniques require an external server located on the Internet. This

external server must not be blocked in the access control policies of the firewalls. The most

common techniques used to get around censorship are:

Encrypted tunnels;

Anonymizers or bypassing software;

CGI proxies.



2.3.3.1 Encrypted tunnels

To overcome the lack of encryption and authentication mechanisms in the TCP protocol,

tunnelling has been introduced to secure the traffic transmitted between two computers.

Tunnelling allows two computers to communicate or exchange data through an encrypted

channel. Protocols such as SSH and VPN allow users around the world to

access their organisation’s servers through a secure tunnel [22, 23, 24]. For instance, SSH

tunnels are used by many university students to access their files from home while

protecting their privacy through encryption and authentication. However, legal and illegal

traffic can both be transmitted through encrypted tunnels. The firewalls are unable to

perform their filtering functions because the payloads of the packets transmitted over

SSH or VPN tunnels are encrypted.

SSH and VPN are built according to the Client/Server architecture. The client connects to

the server and sends requests which are handled by the server. To bypass a firewall using

the tunnelling technique, a user needs the following tools:

A tunnelling client: A tunnelling client refers to a piece of software used to

communicate with a tunnelling server. For example, PuTTY [23] is an SSH client

developed to exchange data with an SSH server. The tunnelling client is mostly

installed and configured on the anonymous user’s computer and resides inside the

private network.

A tunnelling server: The tunnelling server is the server version of a tunnelling client.

In other words, it is a program which listens on a specific port for commands sent by

a client. To bypass a firewall, the tunnel server must be implemented outside the

private network. A home computer running an SSH server or VPN server is mostly

designated as the bypassing proxy.



An open port: the communication between a tunnelling client and a tunnelling

server is only possible if the firewall allows this traffic through an open port. This

port must be explicitly allowed in the access control policies.

A typical scenario showing access to an unauthorised server through an SSH tunnel is

illustrated in Figure 2.4. A user performs the bypassing of his organisation’s firewall by

firstly installing an SSH server on his home computer. This server must be connected to the

Internet and configured to accept connections from SSH clients on a dedicated port. The

user can also host the SSH server with an Internet Service Provider (ISP) or pay to use the

service of one of the dedicated SSH servers present on the Internet. An SSH client such as

PuTTY can then be used by the user to connect to the SSH server from his organisation’s

network and dynamically forward the traffic arriving on a local port. Finally, the user must configure

the web browser to use the tunnel instead of accessing the Internet through the proxy

firewall. This is achieved through SOCKS proxy support, which many web browsers provide.

Figure 2.4: Bypassing performed by the use of an encrypted tunnel. The filtering mechanism

of the firewall is defeated by the encrypted traffic flowing between the client side and the

server side of an encrypted tunnel.

[Figure 2.4 diagram: internal user inside the private network; bypassing server and unauthorized server on the Internet; encrypted SSH, SSL or VPN tunnel between the internal user and the bypassing server.]

2.3.3.2 Anonymizer or bypassing software

An anonymizer is a program designed for the bypassing of proxy firewalls. This

bypassing tool is installed on a computer and acts like a proxy server between the Internet

and end‐users. The identity of users is concealed by the proxy server, making their online

activities untraceable. A large variety of bypassing software is freely accessible by users on

the Internet. Anonymizers are classified in two main groups [25]:

Single‐point design: As seen from Figure 2.5 (a), an anonymizer implemented with

the single‐point design routes all the traffic through a single machine [25]. The

requests of the user travel through a single bypassing server and then hit the

intended server. The response from the server is also relayed to the user through

the same bypassing server. Lozdodge [26], a popular bypassing tool, is implemented

on the single‐point design.

Networked design: Contrary to the single‐point design, this design is more complex

and offers more privacy. In this design, the user’s requests go through a network of

computers before reaching their target. A random path is defined each time the user

makes a new request. From Figure 2.5 (b), the packets exchanged between the client

and the server travel through computers A, D, C, G and H. Tor [24] is a widespread

application using the networked approach for proxy bypassing. A user joins the Tor

network by installing a Tor client on his home computer to avoid his company or

school firewall.
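The difference between the two designs can be illustrated with a short Python sketch. It is a toy model only: the relay labels are borrowed from Figure 2.5 (b), the function names and the hop count of five are assumptions made here, and the uniform random choice shown is not Tor's actual path selection algorithm.

```python
import random

# Relay labels taken from Figure 2.5 (b); everything else is assumed.
RELAYS = ["A", "B", "C", "D", "E", "F", "G", "H"]


def single_point_path(server="bypassing server"):
    """Single-point design: every request crosses the same machine."""
    return [server]


def networked_path(relays, hops, rng):
    """Networked design: pick `hops` distinct relays in random order."""
    return rng.sample(relays, hops)


rng = random.Random(0)  # seeded only to make the sketch repeatable

# A fresh path is drawn for each new request.
paths = [networked_path(RELAYS, 5, rng) for _ in range(3)]
fixed = [single_point_path() for _ in range(3)]
```

Because a new path is drawn per request, no single relay in the networked design sees both the user and the destination for every exchange, which is the extra privacy the text attributes to this design.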

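[Figure 2.5 diagram: (a) single‐point design, in which the user reaches the server through one machine; (b) networked design, in which the user reaches the server through a network of computers labelled A to H.]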



Figure 2.5: Anonymizer: Single‐Point vs. Networked design

2.3.3.3 CGI proxy server

The Common Gateway Interface (CGI) is a standard that specifies the way web servers and

client programs such as web browsers interact [27]. The Common Gateway Interface

enables a web browser to send requests to a web server. At the same time, this mechanism

allows a web server to dynamically extract, process and forward the requested data in a

proper manner to a web browser [27]. CGI scripts are executable programs installed on web

servers to perform a specific task. For example, when a student requests the average of his

marks, a CGI script is executed on the web server to extract his marks, sum them up and

compute the average. This computation is automatically executed on the web server and

transparent to the student.
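The marks example above might look as follows when written as a CGI‐style program in Python. This is a minimal sketch under assumptions: the `MARKS` table, the `student` query parameter and the function names are hypothetical, and a real script would read the marks from a database. The only CGI convention relied upon is that the web server passes the query string in the `QUERY_STRING` environment variable and relays whatever the script writes to standard output.

```python
import os
from urllib.parse import parse_qs

# Hypothetical marks table standing in for a student database.
MARKS = {"s001": [72, 85, 64], "s002": [90, 78]}


def average(marks):
    """Sum the marks and divide by their count."""
    return sum(marks) / len(marks)


def handle_request(query_string):
    """Build the full CGI response (headers, blank line, body)."""
    params = parse_qs(query_string)
    student = params.get("student", [""])[0]
    if student not in MARKS:
        return ("Status: 404 Not Found\r\n"
                "Content-Type: text/plain\r\n\r\n"
                "unknown student\n")
    return ("Content-Type: text/plain\r\n\r\n"
            f"average: {average(MARKS[student]):.2f}\n")


if __name__ == "__main__":
    # When run by a web server, the query arrives via the environment.
    print(handle_request(os.environ.get("QUERY_STRING", "")), end="")
```

A request such as `?student=s001` is answered entirely on the server; the browser only receives the computed average.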

Through the years, many CGI scripts have been developed including bypassing scripts. A

bypassing script is a piece of code mostly written in Perl or PHP which is used by other

computers to retrieve web pages. A CGI proxy server or a web based proxy server is a

computer hosting a bypassing script accessible from the Internet in order to get around the

restrictions of proxy firewalls. Many CGI scripts, which allow users to bypass their

company’s firewalls, are available on the Internet. Contrary to the other bypassing

techniques mentioned above, this technique does not require the installation of software

nor does it require a deep understanding of the Internet protocols. Thousands of free CGI

proxies are advertised online and can easily be found using a search engine. After receiving

a request from a user, a CGI proxy retrieves the web page from the Internet and stores it

locally. The URLs of the objects contained in the webpage such as images, frames and

videos are then modified to point to the CGI proxy instead of the source server. Finally, the

modified webpage is sent back to the client. Figure 2.6 depicts the use of a CGI proxy to

access a banned server indirectly.
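The URL rewriting step is what distinguishes a CGI proxy from a plain relay, so a small Python sketch of it may help. It is an illustration only and not the code of any particular bypassing script: the `PROXY_ENDPOINT` address and the `rewrite_page` helper are hypothetical, and the regular expression handles only double‐quoted `src` and `href` attributes, whereas real scripts also deal with style sheets, scripts and forms.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical address of the script on the CGI proxy.
PROXY_ENDPOINT = "https://proxy.example/browse.cgi?u="

# Matches double-quoted src="..." and href="..." attributes.
ATTRIBUTE = re.compile(r'\b(src|href)="([^"]*)"', re.IGNORECASE)


def rewrite_page(html, page_url):
    """Point every embedded URL at the proxy instead of the source server."""
    def replacement(match):
        # Resolve relative references against the page that was fetched.
        absolute = urljoin(page_url, match.group(2))
        encoded = quote(absolute, safe="")
        return f'{match.group(1)}="{PROXY_ENDPOINT}{encoded}"'
    return ATTRIBUTE.sub(replacement, html)


page = '<img src="/logo.png"><a href="http://blocked.example/a.html">a</a>'
rewritten = rewrite_page(page, "http://blocked.example/index.html")
```

After rewriting, the browser requests every object from the proxy host, so the firewall never observes a request addressed to the banned server.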



Figure 2.6: Implementation of CGI proxy to bypass firewall restrictions. The traffic in blue is

blocked by the firewall. The traffic in red represents the use of a CGI proxy to access the

banned server illegally.

[Figure 2.6 diagram labels: user, blocked server, bypassing proxy with CGI script.]

2.4 RISKS OF BYPASSING A FIREWALL

Bypassing a firewall can have serious consequences for a private or corporate network. In

recent years, the Internet went from being a safe platform for sharing data to an

environment infested with threats. Hundreds of malicious tools called “crimeware” have

been developed by cybercriminals to conduct their attacks. As defined by Aaron Emigh et al.

[28], “Crimeware is software that performs illegal actions unanticipated by a user running

the software; these actions are intended to yield financial benefits to the distributor of the

software”. In many cases, computers are infected by crimeware during their online

activities on the Internet. Malicious code is embedded in emails or transmitted through

malicious URLs or social networking websites. The 2009 report of SOPHOS [29] on security

threats clearly states that every 4.5 seconds a malicious webpage is detected, which leads

to 7,008,000 new threats every year. According to the same report [29], a large number of

anonymizers or CGI websites present on the Internet are infected with malware. The

consequences posed by proxy bypassing are classified in four main groups: financial impact,


productivity loss, shortage of resources and privacy concerns. This section presents these

four consequences.
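The yearly figure quoted above from [29] follows directly from the detection interval, as the short check below shows. The only assumption is a 365‐day year; the variable names are introduced here for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds in a 365-day year
DETECTION_INTERVAL_S = 4.5              # one malicious page every 4.5 seconds [29]

pages_per_year = SECONDS_PER_YEAR / DETECTION_INTERVAL_S
```

The result is 7,008,000, matching the number of new threats per year stated in the text.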

2.4.1 Financial impact

In 2006, Computer Economics estimated the damage caused by malware at US$ 13.2 billion

[30]. Similarly, the Federal Bureau of Investigation (FBI) estimated in 2005 that national

organisations lost US$ 67.5 billion to cybercrime [31]. Bypassing traffic is a good

route to introduce malware into private networks. Once installed, malware can propagate

to the rest of the network and leak private information such as copyrighted materials valued

at thousands of dollars. For example, in mid‐December 2009, Google lost sensitive data due

to a breach of its network. The malware responsible for the leakage of information was

embedded in an email as a link pointing to a malicious website. Further investigations

identified 20 other companies, including Adobe, as victims of the same infection [32].

Apart from disclosing sensitive data, malware can also carry out DoS attacks on private

networks and disrupt some services. DoS attacks are also very costly to companies such as

online stores, banks and universities. The recovery of an attacked system is time consuming

for network administrators and costs thousands of dollars to organisations and corporations

[33].

2.4.2 Productivity loss

In general, a proxy firewall is referred to as a tool for security enhancement but it can also

help to increase the productivity of employees in workplaces. News, sports, social

networking, emails and video streaming websites are common websites visited by the

majority of people. At the present time, the majority of internet users have an email

account with some having a social networking account in addition. According to the



statistics of Compete Inc. [34], Facebook, YouTube and MySpace were some of the most

visited websites in 2009. Many employees do not limit their browsing habits to their home

or personal computers. They are likely to visit the same websites even when in their

workplace. They tend to access their emails or post messages on social networking

websites while at work. Blocking access to these websites is a good way for employers to

decrease the loss of concentration during working hours and in the same way to maximise

productivity. Instead of giving unlimited access to employees during working hours, a

restricted list of websites relevant to the activities of the organisation is enforced on the

proxy firewall. By doing this, network administrators minimise the risk of infection by

malware while increasing the productivity of employees.

2.4.3 Shortage in resources

The third possible consequence of proxy bypassing is the shortage in network resources. As

mentioned earlier, bypassing activities increase the risk of malware infections and hence

can cause the disruption of network resources and services. In many cases, the bandwidth

of many organisations is illegally exploited by employees in downloading large files such as

movies, music files or games instead of conducting their organisation’s activities. An

organisation with limited bandwidth can experience a shortage in network resources if

users exploit it illegitimately. All in all, deploying a restrictive policy on the

corporate firewall can help to ensure that internal users make good use of network

resources.

2.4.4 Privacy concerns

A bypassing proxy helps a user to browse the Internet while remaining anonymous. The

identity of the user is concealed because the bypassing proxy removes the identifying

information of the user before forwarding his requests to other servers. The same task is



performed by the bypassing proxy on the data received from other servers before passing

the data back to the requesting user. The bypassing process is completely controlled by the

proxy server which raises privacy concerns. In fact, an untrustworthy proxy can log all the

traffic exchanged between a user and other servers. Authentication credentials and

personal information such as usernames, passwords, credit card, driver’s licence and bank

account numbers are susceptible to being disclosed by bypassing proxies [35]. Additionally,

a bypassing proxy is a good tool for phishing. As an intermediary between a user and the

other servers, a bypassing proxy can collect personal information on users by displaying

illegitimate web forms. For example, a user trying to access his bank account online through

a bypassing proxy can be asked by the bypassing proxy to enter his bank account number,

password and additional information such as date of birth and home address.

2.5 SUMMARY

In this chapter, background information was provided to explain the different concepts

examined in this thesis. The risks to private networks posed by the bypassing of a proxy

firewall are seen to range from financial losses to the disclosure of sensitive information. In

addition, the operational mechanism of CGI proxies, encrypted tunnels and anonymizers

has been clarified in this chapter. The encryption capabilities of some web protocols and

the intermediary role played by bypassing proxies contribute greatly to avoiding censorship and

evading the restrictions of proxy firewalls.


CHAPTER 3

PREVIOUS WORK

In this chapter, related work in the discovery and blocking of the bypassing of proxy

firewalls will be presented. The three following sections will cover three common

techniques utilised for proxy bypassing: encrypted tunnels, anonymizers and CGI proxy

servers. Section 3.1 will present the relevant investigations that have been performed

through the years to detect the illegal use of encrypted tunnels to carry bypassing

traffic. Section 3.2 will explore the different approaches to minimizing the use of

anonymizers on private networks. Finally, the last section will focus on related work for

detecting CGI proxy servers.

3.1 ENCRYPTED TUNNELS

The increasing number of filtering mechanisms and access control infrastructures has been

closely accompanied by the intensive use of encrypted tunnels to bypass these restrictions.

Encrypted tunnels, operating mostly at the application layer of the OSI model, are

commonly utilised to carry illegal traffic past firewalls. This is done by protocol encapsulation.

Encapsulation means wrapping one protocol inside another protocol. This enables unwanted

traffic to cross through firewalls and enter private networks. More specifically, a trusted

and authorised protocol is used in many cases as an envelope to carry illegitimate traffic in

and out of the security perimeter.

Much research has been done through the years to fingerprint encrypted tunnels that

deviate from their normal use. In general, an illegitimate use of an encrypted tunnel is

detectable by looking at the payload or non‐payload statistics of the traffic generated by



the tunnel. One of the major studies in this field was carried out by Manuel et al. [36]

who designed and implemented a technique to detect, with a high accuracy, illegal HTTP

and SSH flows crossing a network. By analysing the inter‐arrival time, the size and order of

the packets transmitted during a session, their proposed statistical mechanism can predict

with high accuracy the encapsulation of other protocols in HTTP and SSH traffic. In previous

research [37], the conclusion was reached that by monitoring the behaviour of the three IP

properties mentioned earlier (size, inter‐arrival time and order of the packets), it was

possible to derive the protocol used for the data exchange. This discovery was then applied

on encrypted tunnels by collecting legitimate HTTP and SSH profiles and then comparing

them to a dataset made up of sessions of encapsulated protocols inside HTTP and SSH

protocols as well as sessions recorded from acceptable use of these protocols. Their

approach detected with high accuracy encapsulated protocols inside HTTP and SSH

sessions. Their detection mechanism was later tested on real network traffic with great

success.
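The three non‐payload properties used in this line of work (packet size, inter‐arrival time and packet order) can all be read from a packet trace without decrypting anything. The Python sketch below shows one way to extract them. It is not the statistical model of [36] or [37]: the trace values are made up, and the tuple layout and the `flow_features` helper are assumptions introduced here.

```python
from statistics import mean

# Each observed packet: (timestamp in seconds, size in bytes, direction).
# direction is +1 for client-to-server and -1 for server-to-client.
# The values below are invented for illustration.
session = [
    (0.000, 74, +1),
    (0.030, 74, -1),
    (0.031, 66, +1),
    (0.120, 1514, -1),
    (0.121, 1514, -1),
]


def flow_features(packets):
    """Summarise a session using header-level information only."""
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "mean_size": mean(sizes),
        "mean_gap": mean(gaps),
        # Signed sizes preserve both the order and the direction of packets.
        "order": [direction * size for _, size, direction in packets],
    }


features = flow_features(session)
```

A classifier would then compare such feature vectors against profiles collected from legitimate sessions of the same protocol.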

In [38] Jeffrey Horton and Rei Safavi‐Naini investigated the inappropriate use of SSH tunnels

to hide unwanted traffic. They approached the problem by investigating the deviation

between a normal and an abnormal SSH session based on two Internet Protocol (IP)

properties: size and inter‐arrival time of IP packets. After analysing a large dataset of SSH

sessions, they established a direct relationship between the size of the packets and the use

of SSH protocol. Their investigation pointed out that during a normal SSH session such as

remote access or an interactive session with an SSH server, smaller IP packets are

exchanged between the SSH client and the SSH server. However, protocols such as HTTP

and FTP, respectively encapsulated inside SSH tunnels for web browsing and file transfer,

generated larger IP packets. An investigation similar to [38] was also conducted by Riyad

et al. in [39] using two supervised learning algorithms: AdaBoost and RIPPER. The goal of

their study was to classify SSH traffic and non‐SSH traffic using the two algorithms and thus

utilise the most efficient algorithm to predict the service running behind the SSH traffic. A



capture of network traffic generated by many protocols, including SSH, was used as a

dataset. RIPPER was shown to be the best classifier with a 99% prediction accuracy of the

protocol encapsulated inside SSH traffic [39]. Finally, Kevin et al. identified HTTP tunnels

concealing illegitimate traffic generated by spyware and viruses in [40]. The approach

included four non‐payload properties.
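The packet size observation reported for [38] suggests a very simple heuristic, sketched below in Python. The two thresholds and the sample sessions are assumptions chosen for illustration; they are not values taken from [38] or [39], whose classifiers are considerably more elaborate.

```python
LARGE_PACKET_BYTES = 1000    # assumed cut-off for a "large" packet
LARGE_FRACTION_LIMIT = 0.5   # assumed share of large packets that raises a flag


def looks_tunnelled(packet_sizes):
    """Flag an SSH session dominated by large packets as possible tunnelling."""
    if not packet_sizes:
        return False
    large = sum(1 for size in packet_sizes if size >= LARGE_PACKET_BYTES)
    return large / len(packet_sizes) > LARGE_FRACTION_LIMIT


# Invented samples: keystroke-sized packets versus bulk transfer.
interactive_session = [90, 106, 90, 122, 90, 138]
bulk_session = [1514, 1514, 90, 1514, 1514, 1514]
```

Such a rule would misclassify legitimate bulk uses of SSH such as file copies, which is one reason the cited studies combine several properties.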

3.2 ANONYMIZERS OR BYPASSING SOFTWARE

Many firewalls incorporate a filtering mechanism to control the connectivity of internal

programs executed by users to the Internet. This mechanism is designed to enhance the

security of private networks by enforcing a list of appropriate software to be run by users.

The detection of unsafe programs on a private network has been intensively investigated

in recent decades. In [41], Liang et al. investigated the detection of Skype traffic

passing through proxy firewalls. Skype is a popular program used for chatting but

mostly for Voice over IP (VoIP). Voice over IP is made possible through Skype by

maintaining a continuous stream of data between the caller and the receiver. Their

investigation shows that Skype traffic, even though encrypted, can be detected by

analysing payload as well as non‐payload statistics of the data transmitted.

A more significant study on blocking specific applications from accessing the Internet was

conducted on P2P traffic by Subhabrata et al. [42]. Signatures related to five common P2P

applications were collected and matched to real network traffic. The designed classifier

identified with high accuracy not only P2P traffic but also the application generating the traffic.

Although much progress has been made in detecting unwanted traffic generated by illegal

programs, few, if any, investigations focus on the detection of

anonymizers within a private network. All in all, an application can be blocked by a proxy

firewall if a detection system, based on signatures or patterns inherent to the application, is

deployed.



3.3 CGI PROXY SERVERS

Little research has been done in identifying the usage of CGI proxies to bypass proxy

firewalls. Many investigations have concentrated on detecting the encapsulation of a

protocol inside an encrypted tunnel and the fingerprinting of traffic generated by a specific

program. The traffic generated by CGI proxies is very similar to normal web traffic, which

makes it difficult to detect. Contrary to encrypted tunnels, CGI proxies are not real‐time

applications and do not need to maintain a tunnel for the data exchange. Data is

transmitted between the CGI proxy and the client in chunks of packets over a short period of

time. CGI proxies operating over the HTTP protocol can be blocked with deep packet

inspection, to a certain extent. However, the encryption possibilities offered through the

Secure Socket Layer (SSL) and exploited by many CGI proxies nullify the efficiency of content

based filtering mechanisms. In addition, the absence of a bypassing program on users’

computers increases the complexity of characterising and detecting CGI proxy traffic. A

previous investigation [5] into classifying CGI proxy traffic based on non‐payload statistics only

revealed low throughput, a high amount of data sent and irregular URL formats associated with

their use on a private network.

In [54], Heyning Cheng et al. investigated the source of a web page, retrieved with the

HTTPS protocol, based on the size of received objects. Their investigation was conducted on

a local mirror of an external website. Even though the traffic was encrypted, their

investigation revealed that the origin of a web page can be derived by scrutinising the size

of the objects sent by a web server to a browser. A similar investigation on HTTP traffic was

carried out by Andrew Hintz [55] on 5 web pages using the same parameter as in [54]. As in

[54], the size of received objects proved to be a reliable parameter for fingerprinting a web

page. Another study on HTTPS traffic was also conducted by Qixiang Sun et al. [56] on

100,000 web pages. In this case, the detection parameters were: the size of objects



received and the number of objects received. A large number of web pages were

successfully fingerprinted using these two parameters.
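The idea of recognising a page from the sizes and number of its objects can be illustrated with a small Python sketch. It uses a multiset Jaccard similarity, which is a choice made here for illustration and is not claimed to be the measure used in [54], [55] or [56]; the page names and byte counts are invented.

```python
from collections import Counter


def similarity(sizes_a, sizes_b):
    """Multiset Jaccard similarity between two lists of object sizes."""
    count_a, count_b = Counter(sizes_a), Counter(sizes_b)
    shared = sum((count_a & count_b).values())   # objects present in both
    total = sum((count_a | count_b).values())    # objects present in either
    return shared / total if total else 1.0


# Hypothetical fingerprints: object sizes (bytes) recorded for known pages.
known_pages = {
    "page-a": [5120, 20480, 734, 734],
    "page-b": [880, 15000, 42000],
}

# Object sizes observed in an encrypted session of unknown origin.
observed = [734, 5120, 20480, 734, 310]

best_match = max(known_pages,
                 key=lambda name: similarity(observed, known_pages[name]))
```

Using a multiset rather than a plain set keeps track of how many objects were received, so both parameters mentioned above contribute to the score.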

Overall, traffic analysis is a useful tool to infer the source of the data whether the traffic is

encrypted or not. However, the approach has not been applied to bypassing traffic, especially

that generated by the Glype script [49]. This research covers the analysis of HTTP and

HTTPS bypassing traffic to fingerprint banned web pages by looking at the size of

received objects, the inter‐arrival time of packets, the number of TCP flows and the average

size of the packets.


Many investigations have been carried out through the years to eradicate circumventing

traffic. However, the widespread use of encryption algorithms and the complexity of

network topologies require more investigation to keep up with existing and new bypassing

techniques. This chapter presented previous work conducted to fingerprint and detect

illegal traffic carried by encrypted tunnels, CGI proxies and dedicated bypassing programs.

The evidence suggests that CGI proxies are favoured for getting around restrictions because

of their flexibility and their vast number.


CHAPTER 4

GOALS AND EXPERIMENTS

This chapter starts by defining the goals and expectations of this investigation. It also

provides a brief description of the experiments which were formulated to attempt to solve

the problem investigated in this thesis.

4.1 GOALS

CGI proxies are efficient tools to bypass censorship. Filtering mechanisms based on the IP

address of inbound packets were adequate in the early years of networking and the

Internet, but these mechanisms lost their effectiveness with the advent of CGI proxies. The IP

address, on its own, is not enough to deem inbound traffic trustworthy or legitimate. In effect, CGI

proxies perform an indirect form of IP spoofing on packets forwarded to a client,

making the packets appear as if originating from a trusted source instead of a banned

server. Payload properties are essential in many cases to detect anomalies and illegal

activities. However, the approach, adopted in this investigation, will use non‐payload

properties of inbound packets. These detection properties are: the size of embedded

objects within a web page, the inter‐arrival time of the packets, the number of TCP flows

emulated during browsing activity and the average size of the packets. These properties are

derived just by observing network traffic exchanged between a client and a server during

web page retrieval.

An experimental setup of a network in a virtual environment will be used to investigate

the correlation between the direct access, HTTP and HTTPS bypassing accesses in terms of

size of objects embedded within a webpage, inter‐arrival time of the packets and number



of TCP flows. A successful classification of CGI bypassing traffic will prevent unwanted

traffic from entering private networks.

The self‐designed network profiles are built from the non‐payload parameters mentioned

previously. In addition, it is assumed that a blacklist is clearly defined on the web proxy.

Therefore, prior to launching the detection of illegal access to any entry of the blacklist,

each entry of a blacklist is accessed in order to build network profiles for each blacklisted

domain or URL.

4.2 EXPERIMENTS

The experiments conducted are focused on a small dataset. This is justified by the lack of

physical users to emulate large traffic. Training physical users and allowing them to bypass

the proxy server of the University of Western Sydney (UWS) was considered unsafe and

likely to compromise the security of the network. Moreover, the experiments could be expanded

to a large dataset if the proposed detection mechanism proves efficient. The correctness of

the detection model, if proven through the experiments, will lead to evaluating the

detection prototype on a large dataset or orientate the investigation toward other aspects

of bypassing traffic. The studied dataset consisted of 10 heterogeneous web pages each

containing objects of different size. Each web page is accessed three separate times and the

resulting network traffic recorded.

A direct access to each web page is performed, followed by two subsequent accesses in HTTP

and HTTPS bypassing modes. For each webpage, the network profile in direct mode is

compared to those in HTTP and HTTPS bypassing modes in order to find out the correlation

between the direct access and the HTTP and HTTPS bypassing accesses with respect to the detection

parameters investigated in this study. If significant correlations are established between the

three accesses, this could lead to the detection of bypassing traffic. The total number of



times a web page is accessed to carry out the experiment is described in the following table

(Table 4.1). The evaluation of the correctness of the proposed model, if successful, would

lead to the expansion of this study by carrying out additional experiments in a more realistic

environment to confirm the results obtained in the virtual network.

Table 4.1: Total number of accesses for the experiments

Access mode                Single webpage    Overall (10 web pages)

Direct access              1                 10

HTTP bypassing access      1                 10

HTTPS bypassing access     1                 10

Total                      3                 30


The proposed detection mechanism is illustrated in Figure 4.1. During phase one, the

blacklisted URLs and domains are provided to the detection mechanism installed on a

computer. A web browser is then used to access each entry of the blacklist during phase

two. The statistics, generated by the traffic from phase two for each blocked URL or

domain, are then collected and stored in a data structure (phase three). The aggregate,

made of the size of embedded objects within a web page, the inter‐arrival time of the

packets and two characteristics of the TCP flows, in particular the number of flows,

constitutes the pre‐built profiles for detecting further accesses to the same web pages. The

detection rules are established during the initial experiments by investigating the

correlation between bypassing traffic and direct access traffic. In other words, the initial

experiment investigates whether the sizes of the objects retrieved in direct access mode are similar to

those of the objects fetched in the two bypassing modes. The same comparison is then made



for the inter‐arrival time of the packets and the number of TCP flows between the traffic in

direct access and the traffic in the two bypassing modes.

Web traffic generated during the browsing activity of a user (phase four) is matched to the

pre‐built profiles in phase five. Bypassing traffic is then fingerprinted in phase six if live traffic

matches a pre‐built profile according to the rules established during the building phase of

the profiles. The efficiency of the proposed detection mechanism will be examined through

the results presented in the chapters to come.
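Phases four to six can be pictured as a lookup of a live session against the stored profiles. The Python sketch below is a simplified illustration, not the detection prototype itself: the `Profile` fields mirror the parameters named above, while the relative tolerance of 10%, the comparison of total object size and all sample values are assumptions, since the actual detection rules are only established by the experiments.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    url: str
    object_sizes: list   # sizes in bytes of the embedded objects
    mean_gap: float      # mean inter-arrival time of the packets (seconds)
    tcp_flows: int       # number of TCP flows used to fetch the page


def close(a, b, tolerance):
    """True when a and b differ by at most `tolerance` in relative terms."""
    return abs(a - b) <= tolerance * max(abs(a), abs(b), 1e-9)


def matches(profile, live, tolerance=0.10):
    """Assumed rule: every parameter must agree within the tolerance."""
    return (close(sum(profile.object_sizes), sum(live.object_sizes), tolerance)
            and close(profile.mean_gap, live.mean_gap, tolerance)
            and close(profile.tcp_flows, live.tcp_flows, tolerance))


def detect(live, profiles):
    """Return the blacklisted URLs whose profile the live session matches."""
    return [p.url for p in profiles if matches(p, live)]


profiles = [
    Profile("http://blocked.example/", [5120, 20480, 734], 0.020, 6),
    Profile("http://other.example/", [880, 15000], 0.050, 3),
]
live_session = Profile("(live session)", [5100, 20600, 734], 0.021, 6)
alerts = detect(live_session, profiles)
```

In this example the live session is reported as an access to the first blacklisted page even though it was fetched through another host.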

Figure 4.1: Model for detecting CGI proxies’ traffic

[Figure 4.1 diagram labels: blacklisted websites, web traffic, network profiles, bypassing traffic detection; arrows numbered 1 to 6 mark the six phases.]


CHAPTER 5

DESIGN AND IMPLEMENTATION

This chapter will present the parameters investigated in this research and the

implementation of the testing platform. Anomaly detection systems rely on patterns or

signatures to perform their task. These patterns or signatures are mostly derived from the

IP and TCP headers of network packets containing information such as the source IP address, the

destination IP address, the source port and the destination port. In the same way, the

detection model, proposed in this research, is based on some properties of the IP headers

of the packets exchanged between the bypassing server and the client.

In this approach, bypassing traffic is fingerprinted by comparing pre‐built profiles to live

HTTP and HTTPS sessions. To obtain pre‐built profiles, each entry of the proxy firewall

blacklist is retrieved in direct access mode and the detection parameters related to each

entry are stored in a text file. These parameters are the size of the objects embedded

within a webpage, the inter‐arrival time of the packets, the number of TCP flows required to

fetch each webpage and the average size of the packets. A positive alarm is triggered by the

detection system if a live HTTP or HTTPS session matches only one of the pre‐built profiles.

A virtual network, composed of the different parties involved in a bypassing scenario, was

set up to test the correctness of the proposed detection model.

Section 5.1 will describe the network metrics involved in the building of the network

profiles. Section 5.2 will focus on the description of the hardware, software and network

configuration of the testing platform. The different programs written in Python and JScript

to carry out the experiments are also provided in section 5.2.


5.1 NETWORK PROFILES

This section will describe the network metrics chosen to investigate a way to detect CGI

bypassing traffic on a private network. The metrics investigated in the course of this

research are: the frequency distribution of the size of the objects embedded within a web

page, the inter‐arrival time of the packets and the number of TCP flows. The detection system

relies on pre‐built network profiles. The process for building network profiles involves two

main steps, the first one being the data collection phase and the other the validation of the

correctness of this detection approach.

During the first phase, a preliminary experiment is run by fetching a list of web pages and

the statistics related to the network metrics investigated are collected. This phase is

decisive in proving the correctness of the proposed detection model. The purpose of

this phase is to establish the correlation between normal traffic and bypassing traffic. This is

achieved by retrieving a list of blocked web pages in three different modes:

1‐ Direct Access: In this mode a webpage is retrieved directly from the source server. In

other words, no proxy server is intercepting the request and relaying it to the source

server and sending the replies back to the client.

2‐ Bypassing access through HTTP protocol: Most Common Gateway Interface (CGI)

proxy servers are implemented through the Hypertext Transfer Protocol (HTTP). In

this mode, the objects contained in a webpage such as images, videos, CSS files and

scripts are transferred unencrypted. However, the address (URL) of each object of the

webpage is rewritten to point to the bypassing server instead of the source server.

3‐ Bypassing access through HTTPS protocol: This mode is similar to the previous

mode. However, encryption is enabled through the use of the Secure Sockets Layer

(SSL) to enhance the privacy of the user and conceal the nature of the data exchange

between the client and the bypassing server.


Each web page from the blacklist is accessed three separate times corresponding to each of

the three specific modes previously mentioned. A web page in the blacklist is represented

by a URL. The statistics of the metrics related to the direct access of each webpage are then

extracted from the IP headers of the packets and stored in a data structure to obtain a

profile. Once all the network profiles from direct access mode are obtained, each web page

is retrieved two more times: one in HTTP bypassing mode and the other in HTTPS bypassing

mode. The network profiles derived from the two subsequent accesses are then compared

to the preliminary profiles to establish the rules of the proposed detection approach.
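The data structure holding a profile is not detailed at this point, so the following is only a minimal sketch of one possible layout. The class name `NetworkProfile`, its field names, the one-JSON-record-per-line text format and the example URL are assumptions made for illustration; the numeric values are illustrative as well.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class NetworkProfile:
    """Direct-access statistics recorded for one blacklisted URL (hypothetical layout)."""
    url: str
    object_sizes: list = field(default_factory=list)  # bytes, one entry per embedded object
    mean_inter_arrival_ms: float = 0.0                # mean gap between consecutive packets
    tcp_flows: int = 0                                # distinct TCP flows used to fetch the page
    mean_packet_size: float = 0.0                     # bytes


def save_profiles(profiles, path):
    """Write the profiles to a text file, one JSON record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for profile in profiles:
            handle.write(json.dumps(asdict(profile)) + "\n")


def load_profiles(path):
    """Read the profiles back into a dictionary keyed by URL."""
    with open(path, encoding="utf-8") as handle:
        records = [NetworkProfile(**json.loads(line)) for line in handle if line.strip()]
    return {profile.url: profile for profile in records}


# Round trip with illustrative values.
profile = NetworkProfile(
    url="http://blocked.example/page1",
    object_sizes=[551, 1613, 7602, 388],
    mean_inter_arrival_ms=0.03,
    tcp_flows=4,
)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "profiles.txt")
    save_profiles([profile], path)
    restored = load_profiles(path)
```

Keying the loaded profiles by URL keeps the later comparison step simple: each live session only needs to be checked against the entries of one dictionary.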

The second phase is the evaluation of the correctness of the detection approach. To do this,

the sizes of the objects embedded within each web page retrieved in HTTP and HTTPS

bypassing modes are compared with the network profile obtained during the direct access

mode. The aim is to find out for each web page the percentage of object matches between

bypassing mode and direct access mode in terms of the size of embedded objects. A high

percentage of object matches would be an accurate indicator of the real source of a webpage

accessed in HTTP or HTTPS bypassing mode. The next step is to compare the inter‐arrival

time of the packets and the number of TCP flows in bypassing mode and direct access

mode. The expectation is to observe a higher inter‐arrival time and a higher number of TCP flows in

bypassing mode compared to direct access mode. This can be justified by the fact that a

bypassing server adds one or more hops between the client and the source server.
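The two comparisons of this second phase can be sketched as follows. This is a minimal illustration, not the script used in this research: the function names are hypothetical, object sizes are assumed to match only when they are exactly equal, and the sample values are illustrative.

```python
from collections import Counter


def object_match_percentage(direct_sizes, bypass_sizes):
    """Percentage of direct-access objects whose size reappears in bypassing mode."""
    if not direct_sizes:
        return 0.0
    # Multiset intersection: a bypassing object can match only one direct-access object.
    common = Counter(direct_sizes) & Counter(bypass_sizes)
    return 100.0 * sum(common.values()) / len(direct_sizes)


def bypass_overhead(direct, bypass):
    """Differences (bypassing minus direct) for inter-arrival time and TCP flow count."""
    return {
        "inter_arrival": bypass["inter_arrival"] - direct["inter_arrival"],
        "tcp_flows": bypass["tcp_flows"] - direct["tcp_flows"],
    }


direct_sizes = [551, 1613, 7602, 388]        # bytes, direct access
bypass_sizes = [551, 1613, 7602, 388, 2048]  # same objects plus one added by the proxy
percentage = object_match_percentage(direct_sizes, bypass_sizes)
overhead = bypass_overhead(
    {"inter_arrival": 2, "tcp_flows": 4},
    {"inter_arrival": 5, "tcp_flows": 6},
)
```

Positive values in `overhead` would correspond to the expectation stated above, namely a higher inter-arrival time and a higher number of TCP flows in bypassing mode.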

5.1.1 Detection parameters

5.1.1.1 Size of embedded objects

Definition: The size of a webpage object is defined as the number of bytes, kilobytes or

megabytes occupied by this object. A webpage is a collection of objects, such as graphics,

scripts, Cascading Style Sheet (CSS) files and Hypertext Markup Language (HTML) files,


accessible to a web browser through the Internet. A client retrieves web pages from a web

server by sending requests to the hosting web server. The usual method to do this is for a

user to click on a hyperlink or enter the URL of a web page into a web browser. The

request is then submitted to the hosting web server through the GET or POST methods

implemented in the HTTP protocol [43]. The GET and POST methods are used to request

web pages and send data to a web server, respectively. Mostly, the retrieval of a web page

requires the use of subsequent HTTP requests to fetch the different objects that are present

in the webpage. It can be seen from Figure 5.1 that a request for the web page

www.example.com created four other requests to download objects 1, 2, 3 and 4. The

frequency distribution of the size of embedded objects within a webpage is an array made

of two columns. The first column of each entry in the array represents a distinct size of

objects contained within a webpage while the second column corresponds to the frequency

or number of objects matching the same size.
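As an illustration of this two-column array, the short sketch below builds the frequency distribution from a list of object sizes. The function name and the sample sizes are assumptions chosen for the example.

```python
from collections import Counter


def size_frequency_distribution(object_sizes):
    """Return [size, frequency] rows, one per distinct size, sorted by size."""
    counts = Counter(object_sizes)
    return [[size, counts[size]] for size in sorted(counts)]


sizes_kb = [150, 100, 50, 150]  # illustrative object sizes in kB
distribution = size_frequency_distribution(sizes_kb)
print(distribution)  # [[50, 1], [100, 1], [150, 2]]
```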

Justification: The size of the objects contained within a web page can be a key element to

identifying the source of a web page even if the traffic is encrypted. Sophisticated CGI

proxies use HTTP over Secure Sockets Layer to defeat the deep packet inspection

mechanism of proxy firewalls. Moreover, many CGI proxies wrap web pages fetched from a

source server with some proprietary information (mostly the CGI script) making the

detection of blacklisted web pages harder. For example, to retrieve the web page

www.example.com (see Figure 5.1), a CGI proxy will return four consecutive objects to the

client with respective sizes of 150 kB, 100 kB, 50 kB and 150 kB.

By creating a profile based on the size of the objects embedded within a webpage for each

entry of a blacklist and then monitoring the size of the different objects as they are received

by the client during a live HTTP or HTTPS session, the proposed detection system will try to

identify if a web page transmitted is blacklisted or not. However, this rule will be more

accurate on heterogeneous web pages. In other words, web pages received from different

sources, but containing identical objects with nearly the same size, will be more difficult to

differentiate. The dataset of this research is generated by retrieving web pages of different

sizes. These web pages are made of a large range of objects such as PDF files, graphic files,

video files, flash animation files, plaintext files and scripts.

Figure 5.1: Retrieval of a web page requiring multiple GET to fetch embedded objects


5.1.1.2 Inter-arrival time

Definition: The inter‐arrival time between packet (n) and packet (n+1) is defined as the

difference (see Figure 5.2) between the arrival time of packet (n+1) and packet (n).
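The definition translates directly into code. The sketch below is only an illustration: the function names are hypothetical, and the arrival times are assumed to be available as a list of numbers in a single unit.

```python
def inter_arrival_times(arrival_times):
    """Gaps between consecutive packet arrival times, in the unit of the input."""
    ordered = sorted(arrival_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def mean_inter_arrival(arrival_times):
    """Average gap, or 0.0 when fewer than two packets were observed."""
    gaps = inter_arrival_times(arrival_times)
    return sum(gaps) / len(gaps) if gaps else 0.0


timestamps_ms = [0, 20, 50, 90]                 # T1..T4, illustrative values
gaps = inter_arrival_times(timestamps_ms)       # [T2 - T1, T3 - T2, T4 - T3]
average_gap = mean_inter_arrival(timestamps_ms)
print(gaps, average_gap)  # [20, 30, 40] 30.0
```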

Justification: This metric, derived from the IP properties of two consecutive packets, has

been used as an anomaly detection parameter in many cases [37, 38] with satisfactory

results. However, no major experiment has been performed on CGI proxies using the inter‐

arrival time as an anomaly detection parameter. With this in mind the expectation is to

detect anomalies related to the inter‐arrival time of packets transmitted by CGI proxies. In

fact, the use of a CGI proxy to bypass a firewall adds one or more additional hops to the

route of the packets transmitted between the source of a web page and the destination.

The intermediary role performed by a bypassing proxy may impact the inter‐arrival time of

the packets. Fingerprinting the source of a web page based on the size of the objects

received by the client would be a step forward. Furthermore, the discovery of a CGI proxy on

the route of incoming packets through the identification of a correlation between the use of

CGI proxies and the inter‐arrival time will increase the accuracy of the detection

mechanism.

Figure 5.2: Inter‐arrival time illustration

[Figure: packets 1 to 4 travel from Source (IP, Port) to Destination (IP, Port) and arrive at times T1 to T4; Inter‐arrival Time 1 = T2 - T1, Inter‐arrival Time 2 = T3 - T2, Inter‐arrival Time 3 = T4 - T3]


5.1.1.3 TCP Flows

Definition: During a TCP session, one or more streams of packets are exchanged between

two processes located on two separate machines. The transmission of the data is achieved

through a TCP socket implemented on both the client and the server side of the

communication. A TCP socket is a pair made of <IP address, Port number>. A TCP flow is a

unique TCP stream made of the combination of the client and server sockets that occurs

during a TCP session. In other words, a TCP flow is identified by the four‐tuple consisting of

<Source IP, Source Port, Destination IP, Destination Port>. Therefore, one or more TCP

flows are needed to retrieve a web page from a web server depending on its structure.
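Counting the flows of a captured session can be sketched as below. This is a minimal illustration under two assumptions: each captured packet is available as a (source IP, source port, destination IP, destination port, size) record, and both directions of a connection are counted as one flow by ordering the two sockets. The function name is hypothetical.

```python
from collections import defaultdict


def summarise_flows(packets):
    """Group packets by TCP flow; return (number of flows, average packet size)."""
    flows = defaultdict(list)
    for src_ip, src_port, dst_ip, dst_port, size in packets:
        # Order the two sockets so that requests and replies share one key.
        key = tuple(sorted(((src_ip, src_port), (dst_ip, dst_port))))
        flows[key].append(size)
    sizes = [size for flow_sizes in flows.values() for size in flow_sizes]
    average = sum(sizes) / len(sizes) if sizes else 0.0
    return len(flows), average


packets = [
    ("192.168.1.2", 1025, "172.168.18.2", 80, 60),
    ("172.168.18.2", 80, "192.168.1.2", 1025, 1500),
    ("192.168.1.2", 1026, "172.168.18.2", 80, 60),
    ("172.168.18.2", 80, "192.168.1.2", 1026, 780),
]
flow_count, average_size = summarise_flows(packets)
print(flow_count, average_size)  # 2 600.0
```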

Justification: During an HTTP or HTTPS session, the web browser creates threads to handle

the transfer of data received from the server hosting the web page. Once the first

three‐way handshake is completed and the connection established, the web browser sends the

first GET to retrieve the home page. The response from the web server is then parsed by

the web browser in order to identify embedded objects. Depending on the number and

type of the embedded objects, the web browser decides to spawn more threads to

download each object or reuse currently opened connections. A web browser can initiate

consecutive connections to the web server in order to speed up the transmission of the

data (see Figure 5.3). However, when the web server is heavily loaded, a limited number of

connections will be opened by the web server to transfer the data to the client. A CGI script,

commonly written in Java or PHP, plays the role of a web server by servicing the requests of

a client during a CGI bypassing technique. By inspecting the TCP flows generated during an

HTTP session, the expectation is to prove the presence of a CGI proxy in the route of

transmitted packets. More precisely, the aim is to find out the difference between the TCP

flows produced during the direct access and bypassing access of a web page in terms of the

number of flows required to fetch the web page and the average size of the packets

transmitted by the bypassing server.


Figure 5.3: TCP flows emulated during a TCP session and inter‐arrival time of each flow

5.2 IMPLEMENTATION OF THE TESTING NETWORK

5.2.1 Topology of the testing network

The implementation of a testing platform, comprising the parties involved in a bypassing

scenario, was a key element to this study. The purpose of the testing network is to:

• Validate the three parameters identified in this study to detect HTTP and HTTPS

bypassing traffic and establish the rules of the detection model by comparing the

statistics of these parameters in direct access mode and in HTTP and HTTPS bypassing

modes.

• Evaluate the correctness of the detection model proposed in this research. The

testing network will evaluate the correctness of the detection model based on the

lower bound situation. As can be seen in Figure 5.4, only one router separated the

bypassing server, the blocked server and the proxy firewall.

The avoidance of censorship involves four different parties:

• A client or internal user is a computer from which the bypassing process is initiated.

This computer represents the first endpoint of the bypassing traffic.

• A proxy firewall is either a hardware device or software or a combination of both. Its

main function is to prevent the users within a private network from accessing

resources explicitly categorized as contrary to the security policies of that network.

The access control policies are enforced either by blocking some services such as

FTP, SSH and TELNET or filtering the traffic according to some predefined rules.

• A bypassing server or external proxy is the second endpoint of the bypassing traffic.

This endpoint is generally a web server hosting a bypassing program such as a CGI

script.

• A blocked server or blacklisted server represents a web server to which access is

restricted for the private network users.

A virtual network environment was reproduced for this investigation (see Figure 5.4). This

environment contained the four different parties involved in the bypassing of a proxy

firewall. More specifically, the testing network was made of five virtual machines: a proxy

server, a client computer, a bypassing server, a blocked server and a routing server. The virtual machines were

created using VMware [50]. In addition, a total of 10 web pages, hosted on the blocked

server, were blacklisted on the proxy firewall and therefore represented the blacklist of the

bypassing environment. The role of each machine is described in the next section of this

chapter. Internal traffic originating from the private network was routed through the proxy

server (192.168.1.1) which acted as a default gateway for internal hosts. The

communication between the proxy server, the bypassing server (172.168.17.2) and the

blocked server (172.168.18.2) was enabled through the routing server.
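The addressing plan of the virtual network can be written down and checked with a few lines of code. The sketch below is an illustration only: the /24 prefix length, the network labels and the function name are assumptions, while the addresses are those quoted in this chapter.

```python
import ipaddress

# Networks of the virtual topology (the /24 prefix length is an assumption).
NETWORKS = {
    "internal": ipaddress.ip_network("192.168.1.0/24"),
    "proxy-to-router": ipaddress.ip_network("172.168.16.0/24"),
    "bypassing": ipaddress.ip_network("172.168.17.0/24"),
    "blocked": ipaddress.ip_network("172.168.18.0/24"),
}

HOSTS = {
    "client": "192.168.1.2",
    "proxy (internal card)": "192.168.1.1",
    "proxy (external card)": "172.168.16.1",
    "bypassing server": "172.168.17.2",
    "blocked server": "172.168.18.2",
}


def network_of(address):
    """Return the label of the network containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in NETWORKS.items():
        if ip in network:
            return label
    return None


placement = {host: network_of(address) for host, address in HOSTS.items()}
for host, label in placement.items():
    print(f"{host}: {label}")
```

A check of this kind makes it easy to confirm that the client and the blocked server sit on different networks, so that traffic between them has to cross the proxy server and the routing server.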


Figure 5.4: Detailed topology of the virtual network

5.2.2 Hardware and Configuration

5.2.2.1 Physical machine

The testing platform was set up and run on a single physical machine. The description of the

different hardware resources of the physical computer is outlined in Table 5.1. It can be

seen from Table 5.1 that Windows® Vista® Professional was installed on the physical

computer along with VMware Workstation [50] which was used to create virtual machines

in order to reproduce the different actors involved in a bypassing scenario. No additional

software was installed on the physical machine to avoid interference and therefore allocate

all the hardware resources to the virtual machines.


Table 5.1: Description of the physical machine

Physical Machine

Processor                 Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM)     4 GB
File system size          500 GB
File system type          NTFS
Operating system          Windows® Vista® Professional
Network interface cards   Intel® 82566DM‐2 Gigabit Network Connection
                          2 x VMware Virtual Ethernet Adapter
Software & Services       VMware Workstation 6.5.2

5.2.2.2 Virtual machine 1: Proxy firewall

The first virtual machine was the proxy server. Two network cards were implemented on

the proxy server. The first network interface card was connected to the internal network

(192.168.1.0) while the second card allowed the proxy to communicate with external

networks (172.168.16.1). The proxy server played a key role in the investigation by

restricting access to the web pages hosted on the blocked server. Microsoft Internet

Security and Acceleration Server (ISA) 2004 was installed on the proxy server and used as

the firewall application. Web traffic through HTTP and HTTPS protocols was the only traffic

allowed by the proxy server. Moreover, the proxy server was configured to automatically

reject inbound connections to internal hosts while at the same time scrutinising outbound

connections to ensure their compliance with the security policies. Access to the blocked

server as well as the web pages hosted on it was explicitly denied to the private network

clients in the firewall rules. Wireshark was installed on the proxy server to sniff network

traffic. The description of the hardware and configuration of the proxy server is outlined in

Table 5.2.


Table 5.2: Description of the proxy server (virtual machine)

Virtual Machine: Proxy Server


Physical Memory (RAM)     512 MB
Operating system          Microsoft® Windows® Server 2003 R2 EE SP2
Network interface cards   Intel® PRO/1000 MT Network Connection
                          VMware Accelerated AMD PCNet Adapter
Software & Services       Microsoft ISA Server 2004
                          Wireshark 1.2.4

5.2.2.3 Virtual machine 2: Blocked server

The blocked server represents the unauthorized server accessed by a user using a bypassing

server. A virtual machine running a web server application (XAMPP) was implemented in

the testing platform to impersonate the banned server. The blocked server hosted web

pages blacklisted by the proxy server but accessible to the bypassing server. That is to say

that the blocked server was unreachable by hosts located on the private network

(192.168.1.0). The web pages hosted on the blocked server were accessible to internal users

through the bypassing server in two modes: HTTP bypassing or HTTPS bypassing mode. This was

made possible by enabling the OpenSSL library embedded within the web server

application XAMPP. Apart from XAMPP, no additional programs were installed on the

blocked server to maximise its performance. A virtual network interface card with the IP

address 172.168.18.2 was connecting the blocked server to the rest of the network

topology. Detailed information about the configuration of the blocked server is provided in

Table 5.3.


Table 5.3: Description of the blocked or blacklisted server (virtual machine)

Virtual Machine: Blocked Server






Network interface cards VMware Accelerated AMD PCNet Adapter

Software & Services XAMPP for Windows Version 1.7.3

5.2.2.4 Virtual machine 3: Bypassing Proxy

The configuration of the bypassing server was similar to the blocked server except that a

Common Gateway Interface (CGI) bypassing script was installed on the bypassing server for

the experiments (see Table 5.4). The bypassing server (172.168.17.2) was not blacklisted in

the proxy firewall security policies. This server was simulating the services offered by CGI

bypassing proxies available on the Internet to bypass censorship. A single virtual network

interface card was required to connect the bypassing server with the rest of the network.

Table 5.4: Description of the bypassing proxy (virtual machine)

Virtual Machine: Bypassing Proxy







Software & Services XAMPP for Windows Version 1.7.3 and Glype proxy v1.1


5.2.2.5 Virtual machine 4: Routing server

The routing server was simulating an Internet Service Provider (ISP) by routing the traffic

between the different networks (172.168.16.0, 172.168.17.0 and 172.168.18.0) except for

the internal network (192.168.1.0). The remote access and VPN functionalities of

Microsoft® Windows® Server 2003 were configured on this virtual machine to achieve the

routing. Three network interface cards were necessary on this virtual machine. It can be

seen from Figure 5.4 that the first card (172.168.16.2) was assigned to incoming and

outgoing traffic to and from the proxy server while the second network card (172.168.17.1)

and third card (172.168.18.1) were used for communicating with the bypassing server

(172.168.17.2) and the blocked server (172.168.18.2), respectively. By using only one router

to separate the different networks, the experiments will test the lower boundary of the

inter‐arrival time of the packets. In a real life scenario, bypassing packets will cross several

routers to reach their destination. Therefore, if a high inter‐arrival time is recorded in this

virtual network then these results can be applied to a more realistic situation. The hardware

configuration of the routing server is described in Table 5.5.

Table 5.5: Description of the Routing Server (virtual machine)

Virtual Machine: Routing Server






Network interface cards   2 x Intel® PRO/1000 MT Network Connection
                          VMware Accelerated AMD PCNet Adapter
Software & Services       Remote Access/VPN Server


5.2.2.6 Virtual machine 5: Client computer

The fifth virtual machine was the client computer (192.168.1.2) located inside the private

network. This computer was subjected to the restrictions of the proxy server. In other

words, the client computer was denied access to the blocked server but allowed to access

the bypassing server. This virtual machine was used for bypassing the access control rules of

the proxy firewall by accessing the blocked server. The hardware configuration of the

client computer is described in Table 5.6.

Table 5.6: Description of the client computer (virtual machine)

Virtual Machine: Client Computer





Operating system          Microsoft® Windows® XP Professional SP2
Software & Services       Fiddler2
                          Traffic Generator Script

5.2.3 Software

5.2.3.1 VMware Workstation

VMware Workstation is a virtualization application used to emulate multiple virtual

machines, also called guests, on a physical machine, also known as host [50]. The physical

machine can be a desktop, laptop or server computer. For this investigation a desktop was

used. The hardware resources of the physical machine such as the processor, RAM, network


interface card(s) and the hard drive are shared between the virtual machines [50]. In

addition, VMware Workstation is capable of mounting peripherals such as DVD/CD‐ROM

drives and USB, serial and parallel ports. Furthermore, it is possible to create virtual networks

with VMware Workstation. Besides the virtual network adapters that can be assigned to

virtual machines, VMware Workstation provides a virtual network that allows virtual

machines to communicate with each other. The virtual network platform generated by

VMware is similar to a TCP/IP network interconnecting physical machines through a switch.

The connectivity of virtual machines to physical networks is achieved through various

techniques such as:

• Creating a bridge between the NIC of the virtual machine and the NIC of the physical

machine;

• Using the Network Address Translation to share the IP address of the host;

• Using a virtual network which is connected to the logical or physical network

interface card of the host.

VMware Workstation is mainly used for the following purposes:

• To run many operating systems on a single PC;

• To implement a testing environment;

• To develop and test software updates and patches;

• To provide computer‐assisted training.

5.2.3.2 ISA server 2004

Microsoft ISA Server 2004 is a layer 7 firewall that protects a private network against

threats from the Internet [44]. The ISA server also provides users with a secure method to

remotely access their data and applications. This is achieved by implementing a secure

channel between two separate networks.

Microsoft® ISA Server 2004 offers several features such as [44]:


• Caching: ISA server accelerates web traffic by storing a local copy of web pages

accessed by users;

• Advanced firewall functions: packet filtering, application filtering, content filtering,

access control rules and a web proxy;

• Server publishing functions: secure web publishing, preservation of source IP

addresses in Web publishing rules and inspection of SSL packets;

• VPN functions: implementation of a Virtual Private Network between two remote

sites, including filtering and inspection of VPN traffic, publishing of VPN servers and

IPSec tunnel mode for point‐to‐point VPN connections;

• Intrusion detection system: detection of attacks such as ping of death, IP half scan,

UDP bomb and port scanning.

5.2.3.3 XAMPP for Windows

XAMPP is an easy‐to‐install Apache distribution designed for developers. XAMPP is an

acronym where X stands for cross‐platform (such as Windows and Linux), A for Apache, M for

MySQL, the first P for PHP and the last P for Perl [45]. XAMPP is available on Linux platforms,

Windows, Mac OS X and Solaris. The distribution for Windows contains Apache, MySQL,

PHP + PEAR, Perl, mod_php, mod_perl, mod_ssl, OpenSSL, phpMyAdmin, Webalizer,

Mercury Mail Transport System for Win32 and NetWare Systems v3.32, Ming, JpGraph,

FileZilla FTP Server, mcrypt, eAccelerator, SQLite, and WEB‐DAV + mod_auth_mysql [46].

XAMPP is licensed under the GNU General Public License and was not designed to be executed in a production

environment but in a development environment. As a result, the security configuration of

XAMPP is as open as possible for testing purposes. In this study, XAMPP was configured to

host the bypassing script and allowed both HTTP and HTTPS accesses to the script.


5.2.3.4 Wireshark

Wireshark is an open source “packet sniffer” that captures network packets and analyses

live network traffic or a capture of network traffic that has been previously saved to mass

storage [47]. According to [47], Wireshark is the most popular packet analyser used by

network professionals to troubleshoot network problems, understand protocols and

examine network traffic for security holes. In addition, spyware, virus activities and other

network anomalies are detectable using Wireshark [47]. In this research, Wireshark was

installed on the proxy firewall to capture network traffic to be used in later analysis.

5.2.3.5 Fiddler2

Fiddler2 [48] is an application that displays the HTTP and HTTPS traffic generated by a web

browser. It is a tool which offers the ability to record and view HTTP/HTTPS interactions

between a web browser and a web server [48]. It is a useful application for debugging,

repairing, optimizing and verifying the safety of web sites. In addition, Fiddler2 can be used

to analyse the characteristics of web traffic such as HTTP headers, cookies, query strings

and the length of queries. As a result, Fiddler2 was used to capture direct access traffic

and HTTP and HTTPS bypassing traffic, and to generate network statistics after the completion

of the user’s request.

Fiddler2 was essential in this research because of the JScript.NET scripting capabilities

embedded within this software. This feature allows users to write scripts to manipulate the

raw data captured by Fiddler2. Taking advantage of the capabilities offered by Fiddler2, a

script was written in JScript.NET to automatically compute the statistics of the three

parameters of the detection system. The output was saved in a Comma‐Separated Values

(CSV) file and the raw data was dumped in a file for further investigations. The source code

of this script is provided in the Appendix (WebTrafficStats).


5.2.3.6 Glype Proxy Script

Glype Proxy Script is a free PHP script installed on a web server to bypass censorship [49].

As a web based script, Glype Proxy Script downloads the web page(s) requested by a client’s

computer and then transfers them back to the client. This service is offered by many online

web proxies and allows users to browse the Internet anonymously. In other words, the web

proxy hides the IP address of the client’s computer by using its own IP to access the server

hosting the requested web page/web pages. Contrary to other bypassing techniques such

as the use of SSH and VPN tunnels, Glype Proxy Script eliminates the need to modify the

web browser settings in order to bypass censorship [49]. That is to say that no software

installation is required on the client’s computer. After accessing the web based proxy by

entering its IP address or domain name in a web browser, the user can start immediately to

browse the web anonymously through that proxy.

5.2.3.7 Traffic Generator

A dataset of bypassing traffic was an important factor in the experiments. Due to the lack of

physical users, a script called “traffic generator” was written in Python to simulate network

traffic by requesting web pages that have been blacklisted on the proxy firewall. A list of 10

URLs, each directing to a banned web page, was provided to the script in the form of a text

file. Each URL was then accessed sequentially by the script through Microsoft Internet

Explorer 8. The resultant traffic generated was captured on both the client’s machine using

Fiddler2 and the proxy firewall using Wireshark. The source of the traffic generator is

provided in the Annexes.

The different steps executed by the traffic generator are presented in Figure 5.5. The flow

chart can be divided into five main steps:


• Initialisation: During this phase, the traffic generator fetches the file containing URLs

randomly selected from the internet. Also, Fiddler2 is started on the client machine

and is ready to capture web traffic.

• Decision: After reading the first line of the URLs’ file, the script will decide to move

to the next step which is the retrieval of a URL if the End Of File (EOF) is not reached.

Otherwise the execution of the script is stopped. This step allows the script to iterate

through the URLs.

• URL retrieval: A URL can be retrieved in three modes as mentioned before (Section

5.1): direct access, HTTP bypassing access or HTTPS bypassing access. Before the

retrieval of a URL, the cache and the cookies of previous web accesses are deleted

with the purpose of ensuring that the data retrieved are not served to the web

browser from the cache but directly fetched from the source server. This is

critical for the investigation because caching can reduce the inter‐arrival time

of the packets and therefore compromise the results. In a real world situation,

caching is disabled by CGI proxy servers on the client computer to erase any traces

of bypassing activities. The next action is to create a Microsoft Internet Explorer

Object. In the direct access mode, the address bar of the newly created Microsoft

Internet Explorer is filled with the current URL. However, in HTTP and HTTPS bypassing modes, the

bypassing server URL is accessed first by the script through the Microsoft Internet

Explorer Object and the current URL is then passed to the bypassing server for

retrieval.

• Statistics computation: The resulting web traffic generated during the access of a

URL is automatically captured by Fiddler2. Once the webpage is fully loaded, a

command written in JScript.NET is executed on Fiddler2 to compute the statistics of

the parameters of the network profile for each URL.

• Dumping of data: During this step, the raw data and the statistics are dumped

in Fiddler format (SAZ) and Comma Separated Values (CSV) files, respectively. Finally,

the Microsoft Internet Explorer object is released to free the memory it occupied.
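The control flow of these five steps can be outlined as follows. This is a simplified sketch and not the script provided in the Annexes: the browser and the capture tool are replaced by caller-supplied functions, and every name used here (`read_urls`, `run_traffic_generator`, `fake_retrieve`, `fake_stats`, the mode labels and the example URLs) is hypothetical.

```python
import csv
import os
import tempfile


def read_urls(path):
    """Initialisation: load the URLs to visit, one per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def run_traffic_generator(urls, mode, clear_cache, retrieve, compute_stats, csv_path):
    """Visit each URL in turn and dump one row of statistics per URL."""
    rows = []
    for url in urls:                    # decision: the loop ends with the URL list
        clear_cache()                   # nothing may be served from the browser cache
        session = retrieve(url, mode)   # "direct", "http-bypass" or "https-bypass"
        rows.append({"url": url, "mode": mode, **compute_stats(session)})
    fieldnames = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Stand-ins for the browser and the capture tool (hypothetical).
def fake_retrieve(url, mode):
    return [551, 1613, 7602, 388]       # object sizes "captured" for this page


def fake_stats(session):
    return {"objects": len(session), "bytes": sum(session)}


with tempfile.TemporaryDirectory() as folder:
    url_file = os.path.join(folder, "urls.txt")
    with open(url_file, "w", encoding="utf-8") as handle:
        handle.write("http://blocked.example/page1\nhttp://blocked.example/page2\n")
    rows = run_traffic_generator(
        read_urls(url_file), "direct", lambda: None,
        fake_retrieve, fake_stats, os.path.join(folder, "stats.csv"),
    )
```

Passing the retrieval and statistics functions as arguments keeps the loop independent of any particular browser automation or capture tool, so the same skeleton could be exercised in the three access modes.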

Figure 5.5: Flow chart of the traffic generator


The simulation of a CGI bypassing scenario was fundamental to evaluate the accuracy of the

proposed detection approach. As a result, a bypassing environment was reproduced by

implementing a virtual network made of five virtual machines. The hardware configuration

and software necessary to reproduce the bypassing scenario were provided in this chapter.

Moreover, the different parameters used to create a network profile and

the justification of the choice of each parameter were also described in this chapter.

Successfully distinguishing and classifying CGI traffic in a virtual network is necessary to

expand the experiments to a large dataset generated from real world traffic.

Chapter 6 – Findings: Results and Analyses

CHAPTER 6

FINDINGS: RESULTS AND ANALYSES

This chapter presents the results of the experiments performed and the evaluation of the

efficiency of the proposed model. After creating a model to simulate a bypassing scenario

and implementing it in a virtual network, experiments were then carried out to determine a

possible way to detect CGI proxy bypassing traffic. Section 6.1 will describe the building

phase of the network profiles. After completing the building of initial network profiles, each

webpage is accessed randomly in HTTP and HTTPS bypassing modes. The live HTTP and

HTTPS bypassing traffic profiles are then compared with the pre‐built profiles to fingerprint

the webpage. Section 6.2 will present the comparison of the bypassing network profiles of

the first webpage with its corresponding pre‐built profile. The aggregation of the results of

all the web pages is then provided in Section 6.3 to validate the trends observed with a

single web page. Finally, the last section focuses on proposing a solution from the

aggregation of the different results.

6.1 INITIAL EXPERIMENT: PROFILE BUILDING

In the first experiment, web pages hosted by the blocked server were directly accessed

from the client computer to collect the size of embedded objects for each webpage, the

inter‐arrival time of the packets, the number of TCP flows and the average size of the

packets. The direct access to the web pages was not routed through the bypassing server.

The collected data was then manually analysed and classified to obtain a network profile for

each webpage. An example of a profile is described in Table 6.1. It can be seen from this

table that the first webpage under investigation contains 4 embedded objects occupying

respectively 551 bytes, 1613 bytes, 7602 bytes and 388 bytes. In addition, the inter‐arrival



time of the packets was on average 0.03 ms. In total, the web browser initiated 4 TCP flows
to acquire the first web page. The average size of the packets transmitted was 582 bytes. HTTP
was the protocol used to download the web pages. Similar profiles were created for
the rest of the web pages.

Table 6.1: Traffic profile of initial access

Network Profile
Text/HTML              551 bytes
Text/CSS               1613 bytes
Image 1                7602 bytes
Image 2                388 bytes
Inter‐arrival time     0.03 ms
TCP flows              4
Average packet size    582 bytes
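As a hedged illustration, the profile of Table 6.1 could be held in a small record like the one below. The class name `NetworkProfile` and its field names are assumptions made for this sketch; only the values come from the table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NetworkProfile:
    """Hypothetical container for one web page profile (names are assumptions)."""
    object_sizes: Dict[str, int]  # size in bytes of each embedded object
    inter_arrival_ms: float       # average inter-arrival time of the packets
    tcp_flows: int                # number of TCP flows used to fetch the page
    avg_packet_size: int          # average packet size in bytes


# Values taken from Table 6.1 (direct access of the first web page).
WEB_PAGE_1 = NetworkProfile(
    object_sizes={"Text/HTML": 551, "Text/CSS": 1613, "Image 1": 7602, "Image 2": 388},
    inter_arrival_ms=0.03,
    tcp_flows=4,
    avg_packet_size=582,
)

total_object_bytes = sum(WEB_PAGE_1.object_sizes.values())
print(total_object_bytes)
```

One such record would be built per blacklisted web page during the profile building phase.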

6.2 SINGLE WEBPAGE RESULTS

Three network traffic profiles were created from the statistics of the traffic generated by

the 3 subsequent accesses to each web page. In this case, the bypassing server was used to

access each web page in HTTP and HTTPS bypassing modes. It was expected that the
pre‐built profiles from the direct access of each web page would match the profiles
collected through the bypassing server. The results obtained confirmed that even though a
web page is accessed securely with HTTPS through a bypassing proxy, it is possible to
predict with a high accuracy the source of the web page based on the size of embedded
objects.

It can be seen from Figure 6.1 that the direct access profile collected for web page 1 is

nearly identical to those collected later in terms of the size of embedded objects. During the direct

access, the size of the four embedded objects was 551 bytes, 1613 bytes, 7602 bytes and

388 bytes. The CGI traffic produces comparable results in HTTP and HTTPS bypassing mode.



The HTTP bypassing access recorded four objects with sizes of 564 bytes, 1642 bytes, 7611

bytes and 391 bytes. The difference between the size of each object for direct access, HTTP

and HTTPS bypassing accesses is negligible. However, a higher inter‐arrival time was

observed while accessing web pages through the bypassing server. In addition, the

bypassing traffic initiated 5 TCP flows instead of the 4 required for the

retrieval of the webpage in direct access mode. By analysing the additional TCP flow, it was

discovered that this flow was initiated in order to download the bypassing script necessary

for the user for future bypassing accesses. As mentioned earlier, CGI proxies can wrap a

requested webpage with proprietary information making the total size of the web page

larger. Additional TCP flows are then needed to download the extra data that is appended

to the webpage. Overall, according to the results of the experiments, the size of embedded

objects contained in a web page is a reliable parameter to predict the origin of a web page

as long as a profile of the web page was obtained beforehand. Nonetheless, the aggregated
results of all the web pages investigated in the experiments are crucial to confirm this
hypothesis.
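To make the word "negligible" concrete, the relative size difference between the direct access and the HTTP bypassing access of web page 1 can be computed as in the sketch below. The object sizes are the ones quoted in this section; the function name is an assumption made for illustration.

```python
# Object sizes (bytes) quoted in Section 6.2 for web page 1.
DIRECT = [551, 1613, 7602, 388]
HTTP_BYPASS = [564, 1642, 7611, 391]


def relative_differences(reference, observed):
    """Relative size change of each object, as a fraction of the reference size."""
    return [abs(obs - ref) / ref for ref, obs in zip(reference, observed)]


diffs = relative_differences(DIRECT, HTTP_BYPASS)
for name, diff in zip(["Text/HTML", "Text/CSS", "Image 1", "Image 2"], diffs):
    print(f"{name}: {diff:.2%}")
```

For these four objects, every relative difference stays below 3%.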

Figure 6.1: Single webpage results

[Bar chart, Web page 1: size (in bytes) of the embedded objects Text/HTML, Text/CSS, Image 1 and Image 2 in direct access, HTTP bypassing and HTTPS bypassing modes.]

6.3 AGGREGATED RESULTS

The aggregated results of the experiments on five out of the ten web pages accessed are

shown in Figure 6.2, Figure 6.3, Figure 6.4, Figure 6.5 and Figure 6.6. It can be seen from

these figures that the trends described by the single web page results are also observed for

the rest of the web pages. It was also observed during the simulation that the size of

embedded objects remained nearly constant for each access. Additionally, more TCP flows
occur when the CGI proxy is used to access a web page. The inter‐arrival time of the

packets remained higher at 0.04 milliseconds for the bypassing traffic throughout the

experiments. The average size of the packets transmitted using the circumventing method

compared to the direct access of each web page was lower (see annexes). As seen from

Figure 6.2 to Figure 6.6, the similarity between the pre‐built profiles and the bypassing

traffic profiles, which were obtained by accessing each web page in direct access mode and

bypassing modes, is crucial in predicting the source of the web page. This is evident even

when the HTTPS protocol is being utilized to access the web page. The variation of the

inter‐arrival time, number of flows and average size of the packets will then enable the

detection system to confirm the presence of a CGI proxy.

The web pages being investigated in this simulation remained static throughout the

experiments. In addition, the cache was cleared after each round in order to ensure that the

web pages are fetched from the blocked server and not served from the cache. In a real

world scenario, a monitoring mechanism would be necessary to track the updating of

blacklisted web pages and to re‐build network profiles. In other words, 100 updates of a

web page during a day will result in the web page being accessed 100 times by the

monitoring system to re‐build a new profile based on the new objects appended to the web

page. Furthermore, it can be seen from Figure 6.2 that web pages originating from the

bypassing proxy are easily detectable as blacklisted by matching them to the pre‐built

profiles. However, a conflict can occur between the data appended to a web page by a CGI
proxy and existing objects embedded in the webpage if the two types of objects are similar
in size. In this case, the detection system will mismatch web pages and raise many false
alarms.

Figure 6.2: Web page 1 results

[Bar charts, one per web page: size (in bytes) of the embedded objects in direct access, HTTP bypassing and HTTPS bypassing modes. Panels: Web page 1 (Text/HTML, Text/CSS, Image 1, Image 2), Web page 2 (Image 1 to Image 5), Web page 3 (Image 1 to Image 5) and Web page 9.]



The results of the experiments outlined the necessity to implement two sub‐mechanisms to

detect bypassing traffic. The fingerprinting of a blocked web page is performed by the first

sub‐mechanism by analysing the size of the embedded objects of a web page while the

second sub‐mechanism inspects the traffic for anomalies related to the average size of the

packets, number of TCP flows and the inter‐arrival time of the packets. According to the

results obtained, network traffic generated by a CGI proxy is characterised by a high
inter‐arrival time of the packets and an abnormal average size for the packets transmitted.

From the results of the experiments, a bypassing proxy can be detected on a virtual

network by applying the rules outlined in Table 6.2.

[Bar chart, Web page 10: size (in bytes) of the embedded objects Image 1 to Image 8 in direct access, HTTP bypassing and HTTPS bypassing modes.]

Table 6.2: Detection conditions of CGI proxy traffic

Pre‐built profile          Condition    CGI profile
Size of object             <= or >=     Size of object
Inter‐arrival time         <            Inter‐arrival time
Average size of packets    >            Average size of packets
Number of TCP flows        <            Number of TCP flows
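The conditions of Table 6.2 can be sketched as a single predicate. This is only an illustration under stated assumptions: the dictionary keys, the function name and the 5% size tolerance (standing in for "<= or >=", that is, approximately equal) are hypothetical, and the live average packet size of 570 bytes is an invented value, since the measured one is only given in the annexes.

```python
def matches_cgi_conditions(prebuilt, live, size_tolerance=0.05):
    """Return True when `live` satisfies the four conditions of Table 6.2.

    `size_tolerance` is an assumed threshold, not a value from the thesis.
    """
    if len(prebuilt["object_sizes"]) != len(live["object_sizes"]):
        return False
    sizes_close = all(
        abs(live_size - pre_size) <= size_tolerance * pre_size
        for pre_size, live_size in zip(prebuilt["object_sizes"], live["object_sizes"])
    )
    return (
        sizes_close
        and prebuilt["inter_arrival_ms"] < live["inter_arrival_ms"]
        and prebuilt["avg_packet_size"] > live["avg_packet_size"]
        and prebuilt["tcp_flows"] < live["tcp_flows"]
    )


# Pre-built profile from Table 6.1; live values from Sections 6.2 and 6.3,
# except avg_packet_size, which is hypothetical.
PREBUILT = {"object_sizes": [551, 1613, 7602, 388], "inter_arrival_ms": 0.03,
            "avg_packet_size": 582, "tcp_flows": 4}
LIVE_HTTP = {"object_sizes": [564, 1642, 7611, 391], "inter_arrival_ms": 0.04,
             "avg_packet_size": 570, "tcp_flows": 5}

print(matches_cgi_conditions(PREBUILT, LIVE_HTTP))
```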


CHAPTER 7

ADDITIONAL EXPERIMENTS

Additional investigation has been carried out to evaluate the accuracy of the proposed

detection approach in a more realistic situation. Firstly, a blacklist made of 70 web pages

was accessed to validate the reliability of the detection parameters in a physical network,

define the rules of the detection mechanism and build the network profile for each entry of

the blacklist. Then, 542 random accesses were made on the 70 web pages to evaluate the

efficiency of the detection approach. More specifically, 271 accesses were made in HTTP

bypassing mode and 271 in HTTPS bypassing mode. The results of the accuracy of the

detection model are presented in this chapter.

7.1 Physical network for accuracy evaluation

The purpose of the implementation of a physical network is to evaluate the accuracy of the

detection model in a more realistic situation (see Figure 7.1). This physical network was

made of three physical machines located on different networks across the Internet: a proxy

firewall, a client computer and a bypassing server. In addition, a total of 70 websites,

located on the Internet, were blacklisted on the proxy firewall and therefore represented

the blocked servers of the bypassing environment. It can be seen from the network

topology (see figure 5.4) that the client of the private network (192.168.1.0) was connected

to the proxy firewall (192.168.1.1) via an IP router. A blacklist was enforced on the proxy

firewall to deny access to 70 websites. Two home computers, connected to BIGPOND

network, were used as proxy firewall and client. The bypassing server was a third computer

hosting an Apache web server containing a CGI script. This server was connected to the
Internet through the Internet Service Provider (ISP) IINET and was not explicitly listed as a
blocked server in the blacklist.

Figure 7.1: Topology of the physical network for the accuracy evaluation.

7.2 Accuracy evaluation script

The aim of this script is to evaluate the accuracy of the detection approach covered in this
research. This script takes as inputs: a random web page selected from the proxy firewall
blacklist, the pre‐built profiles obtained during the direct access of each web page of the
blacklist and the detection rules. The different steps executed for the evaluation of the
accuracy are presented in Figure 7.2. The flow chart can be divided into five main steps:

Initialisation: Fiddler is started to capture network traffic and the blacklist file
containing URLs is fetched during this step.

Loop: The script will loop through until the maximum number of executions is
reached. During each loop, a random web page is selected from the blacklist.

Retrieval of webpage in HTTP and HTTPS bypassing modes: During this step, the

random web page chosen during the previous step is retrieved in HTTP bypassing

mode. Once the web page is fully loaded, Fiddler2 computes automatically the live

traffic profile of the random web page. The same process is then repeated in HTTPS

bypassing mode.

Comparison of live traffic profile to pre‐built profiles: The live traffic profile of the

random webpage is fingerprinted during this step by comparing it to the pre‐built

profiles. At the end of this process, the live traffic profile will match zero, one or

many pre‐built profiles.

Classification of the web page: A webpage can be classified as a positive alarm, a

false alarm or unknown. After the comparison of the live traffic profile to the pre‐

built profiles, a positive alarm is raised if a unique pre‐built profile matches the

random webpage. This pre‐built profile must correspond to the network profile of

the direct access of the random webpage. In case the live traffic profile matches two

or more pre‐built profiles, the webpage is classified as a false alarm. An unknown

flag is triggered if no pre‐built profile is similar to the live traffic profile.
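The classification step described above can be sketched as follows. The function and label names are assumptions made for illustration. The text does not say how a single match against the wrong pre‐built profile is handled, so this sketch treats it as a false alarm.

```python
def classify(matching_profiles, accessed_page):
    """Classify one random access from the pre-built profiles it matched.

    matching_profiles: identifiers of the pre-built profiles matched by the
    live traffic profile (zero, one or many).
    accessed_page: identifier of the web page that was actually retrieved.
    """
    if not matching_profiles:
        return "unknown"
    if len(matching_profiles) == 1 and matching_profiles[0] == accessed_page:
        return "positive alarm"
    # Two or more matches, or (assumption) a single match on the wrong page.
    return "false alarm"


print(classify(["page-12"], "page-12"))
print(classify(["page-12", "page-40"], "page-12"))
print(classify([], "page-12"))
```

The page identifiers used in the calls are hypothetical.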

Figure 7.2: Flow chart of the evaluation of the efficiency of the detection approach

7.3 ACCURACY EVALUATION OF THE DETECTION APPROACH

7.3.1 Building phase of network profiles

The dataset of the experiments was increased from 10 web pages to 70 web pages to

evaluate the accuracy of the detection approach. In the initial experiment, the 70 web

pages were directly accessed from the client’s computer to determine the size of

embedded objects for each webpage, the inter‐arrival time and the number of TCP flows.

The direct access of each web page was then followed with two separate accesses of the

same web page: one in HTTP bypassing mode and the other in HTTPS bypassing mode.

Thus, three initial accesses were necessary for each web page to identify the correlation

between the three access modes.

Contrary to the web pages used in the virtual network, the web pages retrieved in this

physical network did not remain static throughout the experiments. This was due to the fact

that they were fetched from the Internet. Therefore, some websites were regularly updated

with new information. As for the virtual network, the cache was cleared after each access in

order to ensure that the web pages are fetched from the blocked server and not served

from the cache.

7.3.2 Frequency distribution of the size of embedded objects

The size of an object embedded on a web page is obtained by summing up the header size

and the payload size of the IP packets received during the downloading of the object.

Size of embedded object = ∑ (IP packets size)

or

Size of embedded object = ∑ (Header Size) + ∑ (Payload Size)
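A minimal sketch of the two formulas above, assuming each captured packet is reduced to a header size and a payload size. The `Packet` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    header_size: int   # bytes of header in the packet
    payload_size: int  # bytes of payload in the packet


def object_sizes(packets: List[Packet]) -> Tuple[int, int, int]:
    """Return (header total, payload total, object size) for one embedded object."""
    header_total = sum(p.header_size for p in packets)
    payload_total = sum(p.payload_size for p in packets)
    return header_total, payload_total, header_total + payload_total


# Hypothetical capture of one object carried by three packets.
sample = [Packet(40, 1460), Packet(40, 1460), Packet(40, 500)]
header_total, payload_total, object_size = object_sizes(sample)
print(header_total, payload_total, object_size)
```

Keeping the header and payload totals separate makes it possible to compare each part on its own across the three access modes.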



Many researchers have identified the size of the objects embedded within a web page as a

reliable parameter to fingerprint a web page. Therefore, it was expected that the frequency

distribution of the size of the objects embedded within a web page would be similar in

direct access mode, HTTP and HTTPS bypassing modes. As mentioned earlier, 3 initial

accesses are performed on each web page during the profile building phase. After the first

access, which is the direct access of a web page, the frequency distribution of the size of the

objects embedded within each web page in direct access mode is then compared with

those in HTTP and HTTPS bypassing modes. The percentage of objects for each web page in

HTTP and HTTPS bypassing modes matching the objects of the same web page in direct

access mode, in terms of size, is depicted in Figure 7.3. It can be seen from this figure that

the percentage of matches in the two bypassing modes compared to the percentage in

direct access mode is below 20% throughout the 70 web pages accessed in this

investigation. For some web pages, no matches were found between direct access and

bypassing modes.
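One possible way to compute the percentage of matching objects is sketched below. It assumes that an object matches when its size is exactly equal, that each object can be matched at most once, and that the percentage is taken over the number of objects seen in direct access; the text does not spell out these details. The bypassing sizes used in the example are hypothetical.

```python
from collections import Counter


def match_percentage(direct_sizes, bypass_sizes):
    """Percentage of direct-access objects whose size also appears in bypassing mode."""
    if not direct_sizes:
        return 0.0
    # Multiset intersection: each object size is matched at most once.
    common = Counter(direct_sizes) & Counter(bypass_sizes)
    return 100.0 * sum(common.values()) / len(direct_sizes)


direct = [551, 1613, 7602, 388]
bypass = [564, 1613, 7611, 388]  # hypothetical: two of the four sizes are unchanged
print(match_percentage(direct, bypass))
```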

The discrepancy between the size of the objects transmitted in direct access, HTTP and

HTTPS bypassing mode, can be explained by the fact that CGI proxies alter the headers’

information of the data retrieved from the source server before forwarding the data back to

the client. By doing that, the original headers are replaced by those of the CGI proxies. This

alteration, performed by CGI proxies on the headers of the IP packets relayed to the client,

increases or decreases the size of embedded objects.

An in‐depth view of the trends observed in Figure 7.3 is provided in Table 7.1. It can be seen

from this table that the objects of more than 50% of the web pages accessed in HTTP and

HTTPS bypassing did not match any objects of the same web pages accessed in direct mode.

37 and 39 web pages were recorded with no matches in HTTP and HTTPS bypassing modes

respectively (row in red). In addition, the majority of the other half of the web pages had a
match rate ranging between 1% and 6% of the objects fetched in direct access (rows in blue).



The intermediary role played by a CGI proxy between a source server and a client minimises

the possibilities of fingerprinting a web page. For this reason, the size of the objects

received by a web browser during an HTTP or HTTPS session is not a reliable parameter for

the detection of bypassing traffic carried out by CGI proxies. As a result, the frequency

distribution of each web page was divided into two: the first based on the header size and

the second on the payload of embedded objects.

The purpose of dividing the frequency distribution of the size of embedded objects is to

understand the origin of the discrepancies observed between the three access modes. An

increase or decrease of the size of the header or the payload of the packets in the HTTP or

HTTPS bypassing mode could justify the difference in size of the objects received in direct

access compared to those received in bypassing mode.

Figure 7.3: Comparison of the frequency distribution of the size of embedded objects within

a web page in direct access, HTTP bypassing access and HTTPS bypassing access.

[Chart, Frequency distribution of size received: percentage of object matches (%) per web page (1 to 70) for HTTP bypassing, HTTPS bypassing and direct access.]

Table 7.1: Repartition of web pages in relation to the percentage of matches of the size of
embedded objects in direct access compared to HTTP and HTTPS bypassing accesses.

Range (%)       HTTP Bypassing    HTTPS Bypassing
0%              37                39
0% ‐ 1%         4                 2
1% ‐ 2%         7                 6
2% ‐ 3%         7                 5
3% ‐ 4%         5                 4
4% ‐ 5%         2                 6
5% ‐ 6%         5                 1
6% ‐ 7%         0                 2
7% ‐ 8%         0                 2
8% ‐ 9%         1                 1
9% ‐ 10%        0                 0
10% ‐ 20%       2                 2
20% ‐ 100%      0                 0
TOTAL           70                70

7.3.3 Frequency distribution of the header size of embedded objects

An IP packet is made of two main parts: a header and a payload. The information contained
in the packet headers, such as the source and destination IP addresses (IP header) and the source
and destination ports (TCP header), is used to deliver the packet to its final destination. The total header size of an

embedded object is obtained by summing up the header size of the IP packets transmitted

to a web browser during the retrieval of the object.

Header Size of embedded object = ∑ (Header Size)

The frequency distribution of the header size of embedded objects is a good approach to

understand the inconsistency observed between the size of embedded objects within a web

page fetched in direct mode and the size of the same objects retrieved in HTTP and HTTPS

bypassing modes. In fact, if the header size of the packets received in direct mode is

marginally higher or lower than the header size of the packets received in HTTP and HTTPS



bypassing modes then this would explain the discrepancy observed in the frequency

distribution of the size of embedded objects within each web page (see Figure 7.4).

Figure 7.4 outlines the classification of web pages in relation to the percentage of matches

of the header size of embedded objects in direct access compared to HTTP and HTTPS

bypassing accesses. As can be seen from this figure, only a tiny fraction of the header sizes

of embedded objects in HTTP and HTTPS bypassing modes match the header sizes of the

object received in direct mode for the same web page. More specifically, there were no

matches between the header size of embedded objects of 31 and 30 web pages in HTTP

and HTTPS bypassing modes respectively when compared to the same web pages retrieved

in direct mode (see Table 7.2, row in red). Additionally, the matching percentage was
low in HTTP and HTTPS bypassing modes for the rest of the web pages. As shown in

Table 7.2, 64 web pages accessed in HTTP bypassing mode had a matching percentage

comprised between 0% and 10% (rows in blue). From the remaining 6 web pages only 1

webpage reached nearly 30%. The same trend was also observed in HTTPS bypassing mode.

63 web pages out of 70 have a matching rate comprised between 0% and 10% (row in red)

and only 2 web pages reached nearly 30% (row in blue).

The findings made in this section imply that neither the size of embedded objects within a

web page nor the size of the header of the same objects relayed back to a client during an

HTTP or HTTPS bypassing traffic are trustworthy parameters to fingerprint the origin of a

web page. Although the alteration of original headers is necessary to hide circumventing traffic, the
actual data retrieved from a web server is left unmodified by most CGI proxies. For that reason,

investigating the frequency distribution of the size of the payload may be an alternative

way for identifying a reliable parameter which would be almost constant in direct mode,

HTTP and HTTPS bypassing modes.



Figure 7.4: Comparison of the frequency distribution of the header size of embedded

objects within a web page in direct access, HTTP bypassing access and HTTPS bypassing

access.

Table 7.2: Repartition of web pages in relation to the percentage of matches of the header
size of embedded objects in direct access compared to HTTP and HTTPS bypassing accesses.

Range (%)       HTTP Bypassing    HTTPS Bypassing
0%              31                30
0% ‐ 1%         1                 1
1% ‐ 2%         5                 5
2% ‐ 3%         6                 5
3% ‐ 4%         5                 6
4% ‐ 5%         5                 3
5% ‐ 6%         5                 4
6% ‐ 7%         1                 3
7% ‐ 8%         2                 2
8% ‐ 9%         3                 0
9% ‐ 10%        1                 4
10% ‐ 20%       5                 5
20% ‐ 100%      1                 2
TOTAL           70                70

[Chart for Figure 7.4, Frequency distribution of header size: percentage of object matches (%) per web page (1 to 70) for HTTP bypassing, HTTPS bypassing and direct access.]

7.3.4 Frequency distribution of the payload size of embedded objects

The payload is the second part of an IP packet. It contains the data which is being

exchanged between a client and a server. The total payload size of an embedded object is

obtained by summing up the payload size of the IP packets transmitted to a web browser

during the retrieval of the object.

Payload Size of embedded object = ∑ (Payload Size)

It was expected that the size of the payload of embedded objects fetched in direct access

would be similar to the size of the same objects fetched in HTTP and HTTPS bypassing

modes. It can be seen from the trends shown in Figure 7.5 that the size of the payload of
embedded objects within a web page in HTTP and HTTPS bypassing modes matches that of most of
the objects in direct access mode.

As shown in Table 7.3, more than 50% of the objects collected from the direct access of a

web page are identical to those collected in HTTP bypassing mode for 64 web pages (rows

in blue). Only a few web pages fetched in HTTP bypassing mode (rows in red) recorded a

matching percentage below 50%. The same observations were made with the HTTPS

bypassing mode. Even though the traffic was encrypted, a total of 65 web pages accessed in

direct mode had more than 50% of their embedded object payloads matching the payloads

of the same web pages accessed in HTTPS bypassing mode. All together, more than 90% of

the web pages composing the dataset of this research recorded a high percentage of

matches related to the payload of the objects embedded within each web page in direct

access, HTTP and HTTPS bypassing accesses. Consequently, the findings made in this section

indicate that the size of the payload of the objects embedded within a web page is a

reliable parameter for tracing the source of a web page.
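The counts quoted above can be checked directly against the rows of Table 7.3. The sketch below sums the rows at or above 50% for each bypassing mode; the variable names are assumptions, and the figures are the ones listed in the table.

```python
# (lower bound of the range in %, HTTP bypassing pages, HTTPS bypassing pages),
# copied from Table 7.3.
TABLE_7_3 = [
    (0, 0, 1), (10, 0, 0), (20, 1, 2), (30, 3, 2), (40, 2, 0),
    (50, 10, 11), (60, 13, 14), (70, 23, 21), (80, 16, 17), (90, 2, 2),
]

total_pages = sum(http for _, http, _ in TABLE_7_3)
http_at_least_50 = sum(http for low, http, _ in TABLE_7_3 if low >= 50)
https_at_least_50 = sum(https for low, _, https in TABLE_7_3 if low >= 50)

print(http_at_least_50, f"{http_at_least_50 / total_pages:.1%}")
print(https_at_least_50, f"{https_at_least_50 / total_pages:.1%}")
```

Both shares exceed 90% of the 70 web pages, which agrees with the statement made in the text.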



Figure 7.5: Comparison of the frequency distribution of the payload size of embedded

objects within a web page in direct access, HTTP bypassing and HTTPS bypassing modes.

Table 7.3: Repartition of web pages in relation to the percentage of matches of the payload
size of embedded objects in direct access compared to HTTP and HTTPS bypassing modes.

Range (%)       HTTP Bypassing    HTTPS Bypassing
0% ‐ 10%        0                 1
10% ‐ 20%       0                 0
20% ‐ 30%       1                 2
30% ‐ 40%       3                 2
40% ‐ 50%       2                 0
50% ‐ 60%       10                11
60% ‐ 70%       13                14
70% ‐ 80%       23                21
80% ‐ 90%       16                17
90% ‐ 100%      2                 2
Total           70                70

7.3.5 Inter-arrival time

The average inter‐arrival time of a TCP flow is obtained by dividing the sum of the inter‐

arrival of the packets of the TCP flow by the number of packets.
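A small sketch of this definition is given below, assuming the capture provides one arrival timestamp per packet of the TCP flow. Following the wording above, the sum of the gaps is divided by the number of packets rather than by the number of gaps. The function name and the sample timestamps are hypothetical.

```python
def average_inter_arrival(timestamps):
    """Average inter-arrival time of one TCP flow, as defined in Section 7.3.5.

    timestamps: arrival time of each packet of the flow, in any single unit.
    """
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Divide by the number of packets, as stated in the text.
    return sum(gaps) / len(ordered)


# Hypothetical arrival times in microseconds.
print(average_inter_arrival([0, 20, 50, 80]))
```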

[Chart for Figure 7.5, Frequency distribution of payload: percentage of object matches (%) per web page (1 to 70) for HTTP bypassing, HTTPS bypassing and direct access.]

The fetching of a web page through a CGI proxy adds one or more hops to the path of the

packets exchanged between a client and the source server. For this reason, it was expected

that the inter‐arrival time of the packets in bypassing mode would be higher than the
inter‐arrival time of the packets in direct access mode. However, the results obtained from the
experiments did not confirm this expectation for every web page. It is clear from Table 7.4 that 54 web pages,
which is equal to nearly 77% of the dataset, recorded a higher inter‐arrival time during the
HTTP bypassing mode compared to direct access mode. In the same way, the average
inter‐arrival time of 56 web pages, representing 80% of the web pages accessed in HTTPS
bypassing mode, was higher compared to the inter‐arrival time of the packets in direct access

mode. Thus, it is evident from the observations made from the comparison between the

inter‐arrival time of packets in direct access and bypassing access that the inter‐arrival time

is a potential parameter to identify bypassing traffic. However, the evaluation of the

accuracy of the detection system will establish the degree of reliability of this parameter in

detecting bypassing traffic.

Table 7.4: Repartition of web pages in relation to the inter‐arrival time of the packets in
direct access compared to HTTP and HTTPS bypassing accesses.

Inter‐arrival time                   HTTP bypassing               HTTPS bypassing
                                     Web pages    Percentage      Web pages    Percentage
Direct access < bypassing access     54           77.14%          56           80%
Direct access > bypassing access     16           22.86%          14           20%

Average inter‐arrival time = ∑ (inter‐arrival of packets) / Number of packets

7.3.6 Number of TCP flows

The objects embedded within a webpage are downloaded from the source server by the
web browser through a TCP connection. The number of TCP flows is the total number of
connections established between a client and a server during the retrieval of a web page.
This number can vary depending on the size of the web page or the load of the server. The
expectation by investigating the number of TCP flows was that bypassing traffic will
generate more TCP flows than direct access traffic. This can be explained by the fact that
additional objects, such as the CGI bypassing scripts, are added to the web page requested
by a client’s computer in order to allow the user to make further requests. Additional
information implies that more TCP flows are needed in bypassing traffic to fetch all the
objects. As an example, it can be seen from Figure 7.6 that during the fetching of the
homepage of the university website (www.uws.edu.au) through the CGI proxy
(www.glypeproxy.com), the web browser Firefox received not only the data related to the
homepage but also a toolbar representing the CGI script (red square on Figure 7.6).

Figure 7.6: Retrieval of www.uws.edu.au through the CGI proxy www.glypeproxy.com.
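Counting TCP flows from a capture can be sketched as counting distinct connections, where a connection is identified by its two endpoints regardless of direction. The tuple layout, the function name and the addresses are assumptions made for this illustration.

```python
def count_tcp_flows(packets):
    """Count distinct TCP connections in a list of captured packets.

    packets: iterable of (src_ip, src_port, dst_ip, dst_port) tuples.
    Both directions of a connection are counted as the same flow.
    """
    flows = set()
    for src_ip, src_port, dst_ip, dst_port in packets:
        endpoints = frozenset([(src_ip, src_port), (dst_ip, dst_port)])
        flows.add(endpoints)
    return len(flows)


# Hypothetical capture: two connections from the same client to one server.
capture = [
    ("192.168.1.10", 50001, "203.0.113.5", 80),
    ("203.0.113.5", 80, "192.168.1.10", 50001),  # reply on the first connection
    ("192.168.1.10", 50002, "203.0.113.5", 80),
]
print(count_tcp_flows(capture))
```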

The results from the experiments, related to the number of TCP flows, are highlighted in
Table 7.5. It is evident from this table that around 21% and 17% of the web pages of the dataset
did not match the expectation in HTTP and HTTPS accesses respectively. In other words, the
number of TCP flows of 15 web pages was lower in HTTP bypassing mode than the
number of TCP flows of the same web pages in direct access. In HTTPS bypassing access, the same trend
was observed for only 12 web pages. Even though these percentages are marginal
compared to the percentages of web pages matching the expectation (nearly 79% for HTTP
bypassing access and 83% for HTTPS), these findings can have a significant impact on the
accuracy of the detection system. As with the inter‐arrival time, this parameter would need
to be validated as a reliable indicator of bypassing traffic during the evaluation of the
accuracy of the detection approach.

Table 7.5: Repartition of web pages in relation to the number of TCP flows in direct access
compared to HTTP and HTTPS bypassing accesses.

Number of TCP flows                  HTTP bypassing               HTTPS bypassing
                                     Web pages    Percentage      Web pages    Percentage
Direct access < bypassing access     55           78.57%          58           82.86%
Direct access > bypassing access     15           21.43%          12           17.14%

7.4 DETECTION RULES

During the experiments, it was observed that the size of the payload of embedded objects
remained nearly constant in direct access, HTTP and HTTPS bypassing accesses. This
observation is crucial in predicting the source of the web page. Additionally, for the majority
of web pages, more TCP flows occur when the CGI proxy is used to access a web page,
either with the HTTP protocol or the HTTPS protocol. Also, the inter‐arrival time of the
packets was generally higher for the bypassing traffic throughout the experiments.

The results of the experiments, carried out during the profile building phase, outlined the

necessity to implement two sub‐mechanisms to detect bypassing traffic. The fingerprinting

of a blocked web page is performed by the first sub‐mechanism by matching the payload

size of embedded objects within a web page while the second sub‐mechanism inspects the

traffic for anomalies related to the number of TCP flows and the inter‐arrival time of the

packets. According to the results obtained, the bypassing traffic generated by a CGI proxy is

generally characterised by a high inter‐arrival time of the packets and an abnormal number of

TCP flows initiated to fetch a web page.

To sum up, a web page circumvented by a HTTP or HTTPS bypassing traffic, can be detected

by comparing pre‐built profiles with live traffic profiles according to the rules outlined in

Table 7.5. If the profile of live network traffic matches one of the pre‐built profiles after

applying the detection rules, a positive alarm is then raised.

Table 7.6: Detection rules of bypassing traffic

Live traffic profile                     Rule                  Pre-built profile
Frequency distribution of the size of    Match at least 50%    Frequency distribution of the size of
payload of embedded objects                                    payload of embedded objects
Inter-arrival time                       >                     Inter-arrival time
Number of TCP flows                      >                     Number of TCP flows
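The rules of Table 7.6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the `Profile` class, the function names and the sample values are hypothetical, and the payload match is assumed to be the share of the pre-built profile's embedded objects whose payload size also appears in the live profile. The direction of the TCP flow comparison follows Table 7.6 as printed; Section 7.5.1.3 words this rule the other way round, so that direction should be treated as an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical traffic profile of one web page."""
    payload_sizes: Counter  # payload size in bytes -> number of embedded objects
    inter_arrival: float    # average inter-arrival time of the packets
    tcp_flows: int          # number of TCP flows used to fetch the page


def payload_match(live: Profile, prebuilt: Profile) -> float:
    """Share of the pre-built profile's objects whose payload size is also seen live."""
    total = sum(prebuilt.payload_sizes.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count per size (multiset intersection).
    shared = sum((live.payload_sizes & prebuilt.payload_sizes).values())
    return shared / total


def is_bypassing(live: Profile, prebuilt: Profile, threshold: float = 0.5,
                 use_timing: bool = True, use_flows: bool = True) -> bool:
    """Apply the rules of Table 7.6; the two flags switch the optional rules on or off."""
    if payload_match(live, prebuilt) < threshold:
        return False
    if use_timing and not live.inter_arrival > prebuilt.inter_arrival:
        return False
    if use_flows and not live.tcp_flows > prebuilt.tcp_flows:  # direction as in Table 7.6
        return False
    return True


# Made-up sample values, for illustration only.
prebuilt = Profile(Counter([1200, 1200, 530, 8040]), inter_arrival=0.12, tcp_flows=6)
live = Profile(Counter([1200, 530, 8040, 77]), inter_arrival=0.30, tcp_flows=9)

print(payload_match(live, prebuilt))  # 0.75 (3 of the 4 pre-built objects are matched)
print(is_bypassing(live, prebuilt))
```

Setting `use_timing=False` and `use_flows=False` leaves only the payload rule, while raising `threshold` corresponds to requiring a higher matching percentage.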

7.5 RESULTS OF THE ACCURACY OF THE DETECTION APPROACH

The significance of this study lies in the evaluation of the accuracy of the proposed

approach. Therefore the accuracy of the detection approach will be analysed based on the

combination of the parameters, as follows:



Frequency distribution of payload

Frequency distribution of payload and average inter‐arrival time

Frequency distribution of payload, inter‐arrival time and number of TCP flows

7.5.1 Results of HTTP bypassing mode

7.5.1.1 Frequency distribution of the size of payload

The first evaluation of the accuracy of the detection model was carried out with the
frequency distribution of the size of the payload of embedded objects as the only detection
parameter. It is clear from Figure 7.7 that the accuracy of the detection mechanism was
above 50% for matching percentages between 50% and 70%. More specifically, 233 of the
randomly accessed web pages were successfully fingerprinted when the percentage of
payload size matches between pre-built profiles and live HTTP bypassing traffic was set
between 50% and 55%. At the same time, 31 web pages were classified as unknown because
no match was found for them among the pre-built profiles. That is to say, the frequency
distribution of the payload of these web pages did not match at least 50% of the embedded
objects of any pre-built profile. Finally, the detection recorded more than one match for
7 web pages.

It can also be seen from Figure 7.7 that, as the percentage of payload size matches
increases, the number of positive alarms drops sharply while, simultaneously, many web
pages become unclassified. However, increasing the percentage of matches reduced the
number of false alarms from 7 web pages to 1 once the matching percentage exceeded 70%.
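The three outcomes and their rates can be illustrated with a short Python sketch. It assumes, as the paragraphs above suggest, that a page with no matching pre-built profile is counted as unknown, a page with exactly one match as a positive alarm, and a page with more than one match as a false alarm. The function names are hypothetical; the counts are the ones reported for the 50%-55% range.

```python
def classify(match_count: int) -> str:
    """Map the number of matching pre-built profiles to an outcome (assumed rule)."""
    if match_count == 0:
        return "unknown"
    if match_count == 1:
        return "positive"
    return "false"


def rates(counts: dict) -> dict:
    """Convert outcome counts into percentages of all accessed web pages."""
    total = sum(counts.values())
    return {outcome: round(100 * n / total, 2) for outcome, n in counts.items()}


# Counts reported for a matching percentage between 50% and 55% (HTTP bypassing mode).
first_range = {"positive": 233, "false": 7, "unknown": 31}
print(rates(first_range))  # {'positive': 85.98, 'false': 2.58, 'unknown': 11.44}
```

The same helper applies to any other matching range or parameter combination by substituting the corresponding counts.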



Figure 7.7: Evaluation of the accuracy of the detection approach according to the frequency
distribution in HTTP bypassing mode.

[Figure 7.7: bar chart "DETECTION ACCURACY: PAYLOAD SIZE". x-axis: percentage of payload size matches between direct access and bypassing access (%); y-axis: number of web pages; series: POSITIVE, FALSE, UNKNOWN. Plotted values and the accompanying percentages:]

Matching range    POSITIVE        FALSE        UNKNOWN          TOTAL
50%-55%           233 (85.98%)    7 (2.58%)    31 (11.44%)      100.00%
55%-60%           218 (80.44%)    4 (1.48%)    49 (18.08%)      100.00%
60%-65%           206 (76.01%)    4 (1.48%)    61 (22.51%)      100.00%
65%-70%           163 (60.15%)    4 (1.48%)    104 (38.38%)     100.00%
70%-75%           135 (49.82%)    1 (0.37%)    135 (49.82%)     100.00%
75%-80%           103 (38.01%)    1 (0.37%)    167 (61.62%)     100.00%
80%-85%           68 (25.09%)     1 (0.37%)    202 (74.54%)     100.00%
85%-90%           28 (10.33%)     1 (0.37%)    242 (89.30%)     100.00%
90%-95%           12 (4.43%)      1 (0.37%)    258 (95.20%)     100.00%
95%-100%          0 (0.00%)       1 (0.37%)    270 (99.63%)     100.00%

7.5.1.2 Frequency distribution of payload combined with inter-arrival time

The second evaluation of the detection model accuracy was carried out with the frequency
distribution and the inter-arrival time as the parameters for detection. In this evaluation a
web page is fingerprinted if the following rules are met:

Live traffic profile                     Rule                  Pre-built profile
Frequency distribution of the size of    Match at least 50%    Frequency distribution of the size of
payload of embedded objects                                    payload of embedded objects
Inter-arrival time                       >                     Inter-arrival time



The results of the second evaluation are shown in Figure 7.8. From this figure, it is evident
that adding the inter-arrival time as a detection parameter did not increase the accuracy of
the detection approach. In contrast to the first evaluation, where 85.98% of web pages were
successfully fingerprinted, Figure 7.8 shows that the accuracy of the detection system
dropped to 75.66% for a matching percentage set between 50% and 55%. For this range, the
detection system recorded 205 positive alarms, 7 false alarms and 59 unclassified web pages
in this evaluation, compared to 233 positive alarms, 7 false alarms and 31 unclassified web
pages in the first evaluation. The same trends were observed throughout this evaluation as
the matching percentage increased. However, the number of false alarms was the same in the
first and second evaluations. It is clear from this observation that the accuracy of the
detection system dropped because of an increase in unclassified web pages.

Overall, the findings from the second evaluation indicate that the inter-arrival time is not a
reliable parameter to detect bypassing traffic in HTTP mode. This parameter decreased the
accuracy of the detection approach by increasing the rate of unclassified web pages. At the
same time, the inter-arrival time had no effect on reducing the number of false alarms.


Figure 7.8: Evaluation of the accuracy of the detection approach according to the frequency
distribution and the inter-arrival time in HTTP bypassing mode.

[Figure 7.8: bar chart "DETECTION ACCURACY: PAYLOAD SIZE + INTER-ARRIVAL TIME". x-axis: percentage of payload size matches between direct access and bypassing access; y-axis: number of web pages. Plotted values:]

Matching range    Positive    False    Unknown
50%-55%           205         7        59
55%-60%           192         4        75
60%-65%           181         4        86
65%-70%           146         4        121
70%-75%           120         1        150
75%-80%           91          1        179
80%-85%           59          1        211
85%-90%           27          1        243
90%-95%           12          1        258
95%-100%          0           1        270

7.5.1.3 Frequency distribution of payload combined with inter-arrival time and the number of TCP flows

The third evaluation of the detection approach accuracy was carried out combining the
frequency distribution, the inter-arrival time and the number of TCP flows as parameters
for detection. In this evaluation a web page is fingerprinted if, in addition to the rules of the
second evaluation, the number of TCP flows of a pre-built profile is greater than the number
recorded for a live traffic profile. The results obtained from the third evaluation are
presented in Figure 7.9. It can clearly be seen from this figure that the accuracy of the
detection system dropped by nearly half throughout this evaluation in relation to the
accuracy recorded during the first evaluation. The accuracy decreased from about 75% in
the second evaluation to just 54% during the third evaluation. As in the second evaluation,
adding a parameter, here the number of TCP flows, caused a sharp rise in unclassified web
pages and a dramatic fall in positive alarms. However, the rate of false alarms decreased
steadily to reach 0 when the matching percentage was set between 70% and 75%. Overall,
the number of TCP flows observed during an HTTP session is not a reliable indicator to
expose bypassing activities on a private network. However, this parameter is a good metric
to minimise the false alarms raised by a detection mechanism of bypassing traffic in HTTP
mode.


Figure 7.9: Evaluation of the accuracy of the detection approach according to the frequency
distribution, the inter-arrival time and the number of TCP flows in HTTP bypassing mode.

[Figure 7.9: bar chart "DETECTION ACCURACY: PAYLOAD SIZE + INTER-ARRIVAL TIME + NUMBER OF TCP FLOWS". x-axis: percentage of payload size matches; y-axis: number of web pages. Plotted values:]

Matching range    Positive    False    Unknown
50%-55%           147         7        117
55%-60%           140         3        128
60%-65%           133         3        135
65%-70%           103         3        165
70%-75%           87          0        184
75%-80%           61          0        210
80%-85%           36          0        235
85%-90%           12          0        259
90%-95%           4           0        267
95%-100%          0           0        271

7.5.2 Results of HTTPS bypassing mode

7.5.2.1 Frequency distribution of the size of payload

The rules applied in this evaluation were identical to those used during the first evaluation
of the accuracy in HTTP bypassing mode. The trends observed in HTTPS bypassing mode
were very similar to those observed in HTTP. As shown in Figure 7.10, an accuracy of
84.13% was recorded when the matching percentage was set between 50% and 55%. As with
HTTP, the accuracy dropped as the matching percentage increased. However, it is
important to notice that the number of false alarms stayed low compared to the results
obtained in HTTP bypassing mode.

Figure 7.10: Evaluation of the accuracy of the detection approach according to the frequency distribution in HTTPS bypassing mode

[Figure 7.10: bar chart "DETECTION ACCURACY: PAYLOAD SIZE". x-axis: percentage of payload size matches; y-axis: number of web pages. Plotted values:]

Matching range    Positive    False    Unknown
50%-55%           228         9        34
55%-60%           217         2        52
60%-65%           200         0        71
65%-70%           165         0        106
70%-75%           129         0        142
75%-80%           111         0        160
80%-85%           61          0        210
85%-90%           35          0        236
90%-95%           14          0        257
95%-100%          0           0        271

7.5.2.2 Frequency distribution of payload combined with inter-arrival time

The results of the second investigation are shown in Figure 7.11. The same detection rules
as those applied in the second investigation of HTTP bypassing mode were used in this
investigation. From Figure 7.11, it can be seen that the trends of the second evaluation in
HTTP and HTTPS bypassing modes are almost identical. However, the impact of the
inter-arrival time on the accuracy of the detection is marginal in the HTTPS bypassing mode.
In other words, the majority of the web pages randomly accessed in HTTPS bypassing mode
had a higher inter-arrival time compared to the direct access mode.

As seen in Figure 7.11, the accuracy dropped from 84.13% in the first investigation to 80%
in this investigation for a matching percentage set between 50% and 55%. For the same
investigation in HTTP bypassing mode, the accuracy dropped by about 10 percentage points,
which is more than double the drop recorded for HTTPS mode. In fact, during the HTTPS
bypassing scenario, encryption is applied to the packets exchanged between a client and a
server. The process of encrypting and decrypting the packets delays their arrival at the
client. This could explain why the inter-arrival time of the packets is affected in this mode.
In addition, it is clear from Figure 7.11 that the inter-arrival time had no impact in
decreasing the number of false alarms.

Overall, the inter-arrival time is a reliable parameter to find out the origin of a web page
when HTTPS is used to bypass censorship. High accuracy was scored by the detection model
when the matching range was set between 50% and 65%. This optimal range was larger in
HTTP bypassing mode, where high accuracy was still recorded up to a matching percentage
of 75%. As a result, it is crucial to fine-tune the detection system depending on the
protocol used during a bypassing mode to fingerprint the majority of the web pages
blacklisted on the proxy firewall.



Figure 7.11: Evaluation of the accuracy of the detection approach according to the
frequency distribution and the inter-arrival time in HTTPS bypassing mode.

[Figure 7.11: bar chart "DETECTION ACCURACY: PAYLOAD SIZE + INTER-ARRIVAL TIME". x-axis: percentage of payload size matches; y-axis: number of web pages. Plotted values:]

Matching range    Positive    False    Unknown
50%-55%           217         9        45
55%-60%           208         2        61
60%-65%           191         0        80
65%-70%           157         0        114
70%-75%           124         0        147
75%-80%           106         0        165
80%-85%           61          0        210
85%-90%           35          0        236
90%-95%           14          0        257
95%-100%          0           0        271

7.5.2.3 Frequency distribution of payload combined with inter-arrival time and the number of TCP flows

The results obtained from the use of the three parameters are summarized in Figure 7.12. It
can be seen from this figure that the accuracy of the detection approach declined to 51.29%
for the first matching percentage range (50%-55%). The same observations hold across the
271 web pages accessed randomly in HTTPS bypassing mode. As in the HTTP bypassing mode,
the number of TCP flows is not a good indicator to identify the source of a web page with a
high accuracy.

Figure 7.12: Evaluation of the accuracy of the detection approach according to the
frequency distribution, the inter-arrival time and the number of TCP flows in HTTPS
bypassing mode.

[Figure 7.12: bar chart "DETECTION ACCURACY: PAYLOAD SIZE + INTER-ARRIVAL TIME + NUMBER OF TCP FLOWS". x-axis: percentage of payload size matches; y-axis: number of web pages. Plotted values:]

Matching range    Positive    False    Unknown
50%-55%           139         8        124
55%-60%           133         1        138
60%-65%           125         0        146
65%-70%           96          0        175
70%-75%           81          0        190
75%-80%           70          0        201
80%-85%           34          0        237
85%-90%           15          0        256
90%-95%           6           0        265
95%-100%          0           0        271


CHAPTER 8

CONCLUSION

8.1 Contribution

The design and testing of a mechanism to detect CGI traffic was the main aim of this research.
The bypassing of proxy firewalls was investigated in this thesis by using simulation. Many
firewall products lack an efficient mechanism to detect and block circumventing traffic. A
detection approach, to detect CGI bypassing traffic, was proposed in the research and
evaluated by testing it in a virtual network. The detection is performed in two phases. In the
first phase network profiles of blacklisted web pages were created. The incoming traffic was
then matched to pre-built profiles to fingerprint traffic emanating from a blocked server. The
size of the different objects embedded in the web page is utilized as the detection parameter
during the first phase. Once a blacklisted web page is successfully fingerprinted, an inspection
of the TCP flows used to fetch the web page, the average size of the packets transmitted and
the inter-arrival time of the packets is then compared with the statistics of incoming traffic to
predict if the traffic is originating from a CGI proxy or a normal web server.

8.2 Future work

Integrating CGI proxy detection in proxy firewalls will greatly increase the privacy and security
of sensitive data. This research raised a lot of issues that need to be investigated in the future.
The proposed detection model was only tested in a virtual network. Therefore, the next step of
our investigation will be to carry out the same experiments on a large dataset generated from
real world scenario. Furthermore, the investigation will be expanded to cover more bypassing
scripts because only the bypassing script glype [46] was covered in this thesis. The
implementation of different bypassing scripts can produce different results than those obtained
in this investigation. For the detection of CGI proxies, our proposed model is a start but more
testing needs to be carried out to overcome the problem presented in the thesis.


REFERENCES

[1] Myth: A connected PC will be infected in less than 5 minutes

Available at: http://en.kioskea.net/faq/455‐myth‐a‐connected‐pc‐will‐be‐infected‐in‐

less‐than‐5‐minutes

Accessed in 2009.

[2] Kenneth Ingham and Stephanie Forrest. Rep. A History and Survey of Network

Firewalls, University of New Mexico Computer Science Department, 2002.

[3] Vacca, John R. Jumpstart for network and systems administrators, Elsevier Digital

Press, 2005.

[4] Brian Baskin, Tony Bradley, Jeremy Faircloth, Craig A. Schiller, Ken Caruso, Paul

Piccard, et al. Combating spyware in the enterprise, Syngress Publishing, 2006.

[5] Eliezer Idjalahoue, Approaches for limiting the bypassing of proxy firewalls, University

of Western Sydney, 2008.

[6] Gordana DODIG‐CRNKOVIC, Scientific Methods in Computer Science

Department of Computer Science Mälardalen University Västerås, Sweden

Available at: http://www.mrtc.mdh.se/~gdc/work/cs_method.pdf

Accessed in 2009.

[7] Thomas W Shinder. The Best Damn Firewall Book Period, Second Edition, Syngress

publishing, Elsevier: December 2007.

[8] William R. Cheswick, Steven M. Bellovin, Aviel D. Rubin. Firewalls and Internet

Security, Second Edition: Repelling the Wily Hacker. Addison‐Wesley Longman

Publishing, 2003.

[9] John Chirillo. Hack attacks revealed: A complete reference with custom security

hacking toolkit, John Wiley & Sons, New York, 2001.

[10] Stephen Hochstetler, Harry Tanner, Ramachandra Kulkarni, Sebastian Mika. Extending

Network Management Through Firewalls, IBM, June 2001.


[11] Elizabeth D. Zwicky, Simon Cooper, D. Brent Chapman

Building Internet Firewalls, Second Edition

O'Reilly Media, June 2000

[12] Behrouz A. Forouzan, Sophia Chung Fegan. TCP/IP Protocol Suite Third edition,

McGraw Hill, 2006.

[13] Karen Scarfone, Peter Mell

Guide to Intrusion Detection and Prevention Systems (IDPS)

Recommendations of the National Institute of Standards and Technology

February 2007

[14] Brian Komar, Ronald Beekelaar, and Joern Wettern. Firewalls for dummies, second

edition, Wiley Publishing, 2003.

[15] Eric Cole, Ronald Krutz, James W. Conley. Network security bible, second edition, John

Wiley & Sons 2005.

[16] Jan L. Harrington.

Network Security: A Practical Approach, Elsevier 2005.

[17] Joachim von zur Gathen. How to bypass a firewall, Bonn‐Aachen International Centre

for Information Technology, 2006.

[18] William Stallings. Network security essentials, Applications and standards, third

edition, Prentice Hall, 2007.

[19] Ari Luotonen.

Web proxy servers, Prentice Hall, 1997.

[20] John W. Rittinghouse, William M. Hancock. Cybersecurity operations handbook,

Elsevier Digital Press, 2003.

[21] Michael E. Whiteman, Herbert J. Mattord, Richard D. Austin, Greg Holden.

Guide to firewalls and network security: with intrusion detection and VPNs, Second

edition, Course Technology Press Boston, MA, United States, 2003.

[22] Daniel J. Barrett, Richard E. Silverman. SSH, the Secure Shell: The Definitive Guide,

O’Reilly Media, Inc., 2005.


[23] Srinivas Sampalli. Security in Virtual Private Networks, in Network Security: Current

Status and Future Directions, C. Douligeris and D. Serpanos (editors), Wiley‐IEEE Press,

March 2007

[24] Floss manuals, sesame. Bypassing Internet Censorship

Available at: http://www.scribd.com/doc/12714224/how‐to‐bypass‐internet‐

censorship, Accessed in 2009.

[25] The living Internet

Available at: http://www.livinginternet.com/i/is_anon_work.htm

Accessed in 2009.

[26] Lozdodge: proxy avoidance application

Available at: http://www.lozware.com/

Accessed in 2009.

[27] Jeffry Dwight, Michael Erwin et al. Using CGI, Special edition, Que Corp.

Indianapolis, IN, USA, 1997.

[28] Markus Jakobsson, Zulfikar Ramzan. Crimeware, understanding new attacks and

defences, Addison‐Wesley Professional, 2008.

[29] SOPHOS, Security threat report: 2009

Available at : http://www.sophos.com/sophos/docs/eng/marketing_material/sophos‐

security‐threat‐report‐jan‐2009‐na.pdf

Accessed in 2009.

[30] Computer Economics. 2007 Malware Report: The Economic Impact of Viruses,

Spyware, Adware, Botnets and other Malicious Code, Tech. rep., June 2007.

[31] Cybercrime: Public and Private Entities Face Challenges in Addressing Cyber Threats

Available at: http://www.gao.gov/new.items/d07705.pdf, June 2007

Accessed in 2009.

[32] ScanSafe: Annual global threat report 2009

Available at: http://www.scansafe.com/downloads/gtr/2009_AGTR.pdf

Accessed in 2009.


[33] Michael Erbschloe. Trojans, Worms, and Spyware: A computer security professional’s

guide to malicious code, MA: Elsevier Butterworth‐Heinemann, 2005.

[34] Compete Inc: web traffic analysis

Available at: http://www.compete.com/

Accessed in 2009.

[35] IEEE Computer Society, Guy‐Vincent (University of Ottawa)

Centralized Web Proxy Services: security and privacy considerations, pp. 46‐52.

December 2007.

[36] Manuel Crotti, Maurizio Dusi, Francesco Gringoli, Luca Salgarelli. Detecting HTTP

Tunnels with Statistical Mechanisms, in Proceedings of the 42th IEEE International

Conference on Communications (ICC 2007), (Glasgow, Scotland), pp. 6162–6168, June

2007.

[37] Manuel Crotti, Maurizio Dusi, Francesco Gringoli, Luca Salgarelli. Traffic Classification

through Simple Statistical Fingerprinting, Computer Communications Review, 37(1):7–

16, 2007.

[38] Jeffrey Horton and Rei Safavi‐Naini. Detecting policy violations through traffic analysis,

22nd Annual Computer Security Applications Conference (ACSAC '06), Miami Beach,

Florida, USA, December 2006, 109‐120.

[39] Riyad Alshammari, Nur Zincir‐Heywood. A Flow Based Approach For SSH Traffic

Detection, In Systems, Man and Cybernetics, IEEE International Conference on, pages

296–301, Oct. 2007.

[40] Kevin Borders, Atul Prakash. Web Tap: Detecting Covert Web Traffic, In Proceedings of

ACM CCS, October 2004.

[41] Liang Lu, Jeffrey Horton, Reihaneh Safavi‐Naini, and Willy Susilo. Transport Layer

Identification of Skype Traffic, International Conference ICOIN 2007, Estoril, Portugal,

January 2007.

[42] Sen, S., Spatscheck, O., Wang, D.: Accurate, Scalable In‐Network Identification of P2P

Traffic Using Application Signatures. In: Proceedings International WWW

Conference, New York, USA (2004).

[43] Stephen Thomas. HTTP essentials: Protocols for secure, scalable web sites, John Wiley

& Sons Inc, 2001.

[44] Bud Ratliff and Jason Ballard with the Microsoft ISA server team.

Microsoft® Internet Security and Acceleration (ISA) Server 2004, Administrator’s

Pocket consultant, Sams Indianapolis, IN, USA, 2005.

[45] Nils‐Erik Frantzell, IBM

Install XAMPP for easy, integrated development

Available at: http://www.ibm.com/developerworks/linux/library/l‐xampp/

Accessed in 2009.

[46] Apache friends - XAMPP

Available at: http://www.apachefriends.org/en/xampp.html

Accessed in 2009.

[47] Angela Orebaugh, Gilbert Ramirez, Josh Burke, Larry Pesce, Joshua Wright, Greg

Morris

Wireshark & Ethereal: Network Protocol Analyzer Toolkit, Syngress Publishing, Inc.

2007.

[48] Fiddler2

Available at: http://www.fiddler2.com/fiddler2/

Accessed in 2009.

[49] Glype proxy: free browsing

Available at: http://www.glype.com/

Accessed in 2009.

[50] Eric Hammersley,

Professional VMwareServer, Wrox Press Ltd., Birmingham, UK, 2006.

[51] G. Ziemba, D. Reed and P. Traina. RFC 1858, Security Considerations for IP Fragment

Filtering

Available at: http://www.ietf.org/rfc/rfc1858.txt

Accessed in 2009.

[52] I. Miller. RFC 3128, Protection against a variant of the tiny fragment Attack, June 2001

Available at: http://tools.ietf.org/html/rfc3128

Accessed in 2009.

[53] J. Anderson. An Analysis of Fragmentation Attacks

Available at: http://www.ouah.org/fragma.html

Accessed in 2009.

[54] Heyning Cheng and Ron Avnur,

Traffic Analysis of SSL Encrypted Web Browsing

[55] Andrew Hintz, Workshop on Privacy Enhancing Technologies PET2002

Fingerprinting websites using traffic analysis. The university of Texas at Austin

[56] Qixiang Sun; Simon, D.R.; Yi‐Min Wang; Russell, W.; Padmanabhan, V.N.; Lili Qiu,

2002 IEEE Symposium on Security and Privacy

Statistical Identification of Encrypted Web Browsing Traffic

APPENDIX

JSCRIPT.NET EMBEDDED WITH FIDDLER2 TO COMPUTE STATISTICS:
WEBTRAFFICSTATS

// This function is designed to generate and dump the statistics of a web session (HTTP and HTTPS)
// The statistics of each session are saved in a unique file: Webstatfile + file index + .CSV
static function WebTrafficStats(FileIndex)
{
    // Select all the streams generated by a web session after the completion of a request
    FiddlerObject.UI.actSelectAll();
    // Dump the raw data of the web session in a ZIP file
    FiddlerObject.UI.actSaveSessionsToZip(CONFIG.GetPath("Captures") + "dump" + FileIndex + ".saz");
    // Process the streams and extract statistics of interest
    var WebTraffic = FiddlerApplication.UI.GetAllSessions();
    if (null == WebTraffic || WebTraffic.Length < 1)
    {
        FiddlerObject.StatusText = "No web traffic available for analysis!";
        return;
    }
    var StatFilename = CONFIG.GetPath("Captures") + "WebStatFile" + FileIndex + ".csv";
    var WebStatFile:StreamWriter = null;
    try
    {
        // Create a statistics file for the web session
        // The statistics file is saved in CSV format.
        // This file is later passed to Microsoft Excel to generate graphs
        WebStatFile = File.CreateText(StatFilename);

        // Heading of the statistics file
        WebStatFile.Write("ExecutionOrder,");
        WebStatFile.Write("ProcessID,");
        WebStatFile.Write("Protocol,");
        WebStatFile.Write("Method,");
        WebStatFile.Write("ServerName,");
        WebStatFile.Write("ServerIP,");
        WebStatFile.Write("ServerPort,");
        WebStatFile.Write("ClientIP,");
        WebStatFile.Write("ClientPort,");
        WebStatFile.Write("BytesReceived,");
        WebStatFile.Write("HeaderSize,");
        WebStatFile.Write("DataSize,");
        WebStatFile.WriteLine("InterArrivalTime" + "\t");
        // This code goes through the traffic streams generated during a web session
        // and dumps all their statistics in the log file
        for (var webstream = 0; webstream < WebTraffic.Length; webstream++)
        {
            var BytesRcv = 0;
            var BytesHeader = 0;
            var BytesData = 0;
            var InterArrivalTime = 0;
            // The following lines of code compute:
            // 1- Total size of data received by a stream
            // 2- Total size of the headers
            // 3- Total size of the payload
            InterArrivalTime = WebTraffic[webstream].Timers.ServerDoneResponse - WebTraffic[webstream].Timers.ServerBeginResponse;
            if (null != WebTraffic[webstream].responseBodyBytes)
            {
                BytesData = WebTraffic[webstream].responseBodyBytes.LongLength;
            }
            if ((null != WebTraffic[webstream].oResponse) && (null != WebTraffic[webstream].oResponse.headers))
            {
                BytesHeader = WebTraffic[webstream].oResponse.headers.ByteCount();

            }
            BytesRcv = BytesHeader + BytesData;
            // Write the statistics in the log file
            WebStatFile.Write(webstream + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-ProcessInfo"] + ",");
            WebStatFile.Write(WebTraffic[webstream].oRequest.headers.UriScheme + ",");
            WebStatFile.Write(WebTraffic[webstream].oRequest.headers.HTTPMethod + ",");
            WebStatFile.Write(WebTraffic[webstream].hostname + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-hostIP"] + ",");
            WebStatFile.Write(WebTraffic[webstream].port + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-clientIP"] + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-clientport"] + ",");
            WebStatFile.Write(BytesRcv + ",");
            WebStatFile.Write(BytesHeader + ",");
            WebStatFile.Write(BytesData + ",");
            WebStatFile.WriteLine(InterArrivalTime + "\t");
        }
    }
    catch (ErrorMsg)
    {
        MessageBox.Show(ErrorMsg);
    }
    finally
    {
        if (WebStatFile != null)
        {
            WebStatFile.Close();
            WebStatFile.Dispose();
            FiddlerObject.UI.actRemoveAllSessions();
        }
    }
}

TRAFFIC GENERATOR

#! /usr/bin/env python
#==================================================================================
# IMPORTATION OF LIBRARIES
# cPAMIE is a high-level library used to automate the Microsoft Internet Explorer client.

# cPAMIE is used in this project to simulate user browsing activities by:
#   1- Creating a Microsoft Internet Explorer object
#   2- Passing a URL to the object
#   3- Retrieving the URL automatically
# Python scripting is used to interact with and send commands to the Microsoft Internet Explorer object
#==================================================================================
from cPAMIE import PAMIE
import os, sys, time, datetime, subprocess
from time import sleep

#==================================================================================
# This function displays error messages during execution
#==================================================================================
def ExecMessage(ExecOutput, nMode):
    # ExecOutput is the exit code returned by ExecAction.exe
    if ExecOutput == 0:
        print("%s statistics Generation... DONE" % nMode)
        os.system('ExecAction.exe "clear"')
        sleep(2)
    elif ExecOutput == 1:
        print("Number of arguments to Fiddler incorrect")
    elif ExecOutput == 2:
        print("Fiddler not working")
    else:
        print("%s statistics Generation... FAILED" % nMode)

#==================================================================================
# This function simulates the bypassing traffic in HTTP or HTTPS mode
# The Microsoft Internet Explorer object first accesses the bypassing server and then
# passes the URL to the bypassing server
# BypassServer: Name of the bypassing server
# CurrentUrl: Current URL
# CurrentRound: Current round if many rounds are specified, 1 by default
# CurrentFile: Index of the current file to store the statistics and raw data
# BypassMode: Bypassing mode: HTTP or HTTPS
#==================================================================================
def ExecBypassing(BypassServer, CurrentUrl, CurrentRound, CurrentFile, BypassMode):
    global ie
    bText = 'u'
    bButton = 'Go'
    print("%s Bypassing traffic for: %s" % (BypassMode, CurrentUrl))
    print("%s bypassing traffic simulation... In Progress" % BypassMode)
    ie.navigate(BypassServer)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(10)
    # Pass the URL to retrieve to the bypassing server
    os.system('ExecAction.exe "clear"')

    ie.setTextBox(bText, CurrentUrl)
    ie.clickButton(bButton)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(60)
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    FiddlerExec = 'ExecAction.exe ' + '"stats ' + CurrentRound + '-' + CurrentFile + BypassMode + '"'
    print("Generating statistics in file.... %s" % FiddlerExec)
    ExecResp = os.system(FiddlerExec)
    sleep(3)
    ExecMessage(ExecResp, "Bypassing Traffic")

#==================================================================================
# This function retrieves a URL from the Internet in HTTP and HTTPS bypassing modes
#==================================================================================
def WebBrowsing(nRound, uUrlList, HTTPBypassServer, HTTPSBypassServer):
    global ie
    print("Starting web browser")
    FileCount = 0
    if len(uUrlList) > 0:
        for line in uUrlList:
            ie = PAMIE()
            os.system('ExecAction.exe "clear"')
            sleep(2)
            print("Normal traffic for: %s" % line.strip())
            ie.navigate(line.strip())
            # Wait until the page is completely loaded
            while ie.getIE().Busy and ie.getIE().readyState != "complete":
                sleep(3)
            sleep(40)
            while ie.getIE().Busy and ie.getIE().readyState != "complete":
                sleep(3)
            FileCount += 1
            FiddlerExec = 'ExecAction.exe ' + '"stats ' + str(nRound) + '-' + str(FileCount) + '"'
            sleep(2)
            print("Generating statistics in file.... %s" % FiddlerExec)
            ExecResp = os.system(FiddlerExec)
            os.system('ExecAction.exe "clear"')
            sleep(2)
            ExecMessage(ExecResp, "Normal Traffic")
            ExecBypassing(HTTPBypassServer, line.strip(), str(nRound), str(FileCount),
                          "HTTP")
            ExecBypassing(HTTPSBypassServer, line.strip(), str(nRound), str(FileCount),
                          "HTTPS")
            ie.quit()
            sleep(5)
    else:
        print("No Url to retrieve")


#=================================================================================="
# This function performs the following tasks:
# 1- Retrieve a webpage using its URL
# 2- Collect all the traffic streams generated during the retrieval of the webpage
# 3- Compute the statistics of interest.
# 4- Store the statistics in a CSV file and dump the raw data in a ZAR file
#=================================================================================="
def ProfileGenerator():
    HTTPBypassingServer = 'SPECIFY THE HTTP BYPASSING SERVER HERE'
    HTTPSBypassingServer = 'SPECIFY THE HTTPS BYPASSING SERVER HERE'
    os.system('cls')
    os.system('ExecAction.exe "start"')
    os.system('ExecAction.exe "clear"')
    print ("Starting browsing simulator at: %s" % (str(datetime.datetime.now())))
    if len(sys.argv) > 1:
        try:
            file = open(sys.argv[2], 'r')
            UrlList = file.readlines()
            file.close()
            print ("Parameters loaded successfully")
            Round = 1
            while(Round <= int(sys.argv[1])):
                os.system('ExecAction.exe "nuke"')
                sleep(5)
                WebBrowsing(Round, UrlList, HTTPBypassingServer, \
                            HTTPSBypassingServer)
                Round += 1
            else:
                print ("Profile Generation ...Done")
        except:
            print ("Generator initialization... FAILED")
    else:
        print ("Usage: python simulator.py <Number of rounds> <path of the file containing the URLs>")
#=================================================================================="
# Usage of the Script: python ProfileGenerator.py parameter1 parameter2
# Parameter1: Number of rounds
# Parameter2: Path of the file containing the list of URLs to retrieve
#=================================================================================="
if len(sys.argv) == 3:
    if (int(sys.argv[1]) >= 1):
        ProfileGenerator()

    else:
        print ("The number of rounds must be greater or equal to 1")

EFFICIENCY OF DETECTION APPROACH SCRIPT

#! /usr/bin/env python
#=================================================================================="
#= IMPORTATION OF LIBRARIES
#=================================================================================="
from cPAMIE import PAMIE
import os, sys, time, datetime, random
from time import sleep
#=================================================================================="
# This function converts the inter arrival from hh:mm:ss into milliseconds
#=================================================================================="
def TimeToMillisecondes(TimeElapsed):
    TotalTime = 0
    hours, minutes, seconds = TimeElapsed.split(":")
    TotalTime += 3600 * float(hours) + 60 * float(minutes) + float(seconds)
    return (TotalTime*1000)
#=================================================================================="
# This function displays error messages during execution
#=================================================================================="
def ExecMessage(ExecOutput, nMode):
    if ExecOutput == 0:
        print("%s statistics Generation... DONE" % nMode)
        os.system('ExecAction.exe "clear"')
        sleep(2)
    elif ExecOutput == 1:
        print("Number of arguments to Fiddler incorrect")
    elif ExecOutput == 2:
        print("Fiddler not working")
    else:
        print("%s statistics Generation... FAILED" % nMode)
#=================================================================================="
# This function retrieves a web page in HTTP or HTTPS bypassing modes and computes the live traffic
# profile for each URL
#=================================================================================="
def ExecBypassing(BypassServer, CurrentUrl, BypassMode):
    global ie
    LiveProfile = '"livestats C:\Python31\Projects\Masters\LiveProfile.csv"'
    bText = 'u'
    bButton = 'Go'


    print ("%s bypassing traffic simulation... In Progress" % BypassMode)
    ie = PAMIE()
    ie.navigate(BypassServer)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(10)
    # Pass the URL to retrieve to the bypassing server
    os.system('ExecAction.exe "clear"')
    ie.setTextBox(bText, CurrentUrl)
    ie.clickButton(bButton)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(40)
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    FiddlerExec = 'ExecAction.exe ' + LiveProfile
    print("Generating statistics in file.... %s" % FiddlerExec)
    ExecResp = os.system(FiddlerExec)
    sleep(3)
    ExecMessage(ExecResp, "Bypassing Traffic")
    ie.quit()
#=================================================================================="
# This function compares the objects of a pre-built profile to the objects of a live profile and
# outputs the number of matches
#=================================================================================="
def MatchObject(cLiveProfileArray, ObjSize, ObjPosition):
    ObjCounter = 0
    TotalObjects = 0
    InterArrivaltime = 0
    for line in cLiveProfileArray:
        TrafficArray = line.strip().split(',')
        if (ObjSize == float(TrafficArray[ObjPosition].strip())):
            ObjCounter += 1
        if float(TrafficArray[9].strip()) > 0:
            InterArrivaltime += TimeToMillisecondes(TrafficArray[12].strip())
            TotalObjects += 1
    if (TotalObjects > 0):
        InterArrivaltime = InterArrivaltime / TotalObjects
    return ObjCounter, InterArrivaltime
#=================================================================================="
# This function saves the statistics of HTTP and HTTPS bypassing traffic in a CSV file
#=================================================================================="
def UpdateFingerprintFile(FileIndex, InputLine, BypassMode, ParamMode):
    if BypassMode == "HTTP":
        cFileName = 'HTTPFingerprint' + str(ParamMode) + '-' + str(FileIndex) \
                    + '.csv'
    elif BypassMode == "HTTPS":
        cFileName = 'HTTPSFingerprint' + str(ParamMode) + '-' + \
                    str(FileIndex) + '.csv'
    FingerprintFile = open(cFileName, 'a')
    FingerprintFile.write(InputLine)
    FingerprintFile.close()
#=================================================================================="
# This function searches for a pre-built profile matching the live traffic
# The function compares each pre-built profile to the live profile and outputs the number of matches
# recorded for the live traffic profile
#=================================================================================="
def FingerprintURL(RandURL, LowerBoundary, ObjPosition, BypassingMode):
    print ("Fingerprinting Webpage... STARTING")
    print ("Reading Live Profile statistics")
    ProfileFile = 'LiveProfile.csv'
    sFile = open(ProfileFile, 'r')
    LiveProfileArray = sFile.readlines()
    if len(LiveProfileArray) > 1:
        LiveProfileArray.pop(0)
    sFile.close()
    IndexBlackList = 1
    while(IndexBlackList <= 70):
        cProfileFilename = 'Profile' + str(IndexBlackList) + '.csv'
        print ("File: %s" % cProfileFilename)
        sFile = open(cProfileFilename, 'r')
        BlackListProfileArray = sFile.readlines()
        if len(BlackListProfileArray) > 2:
            InterArrivaltimeArray = BlackListProfileArray[len(BlackListProfileArray)-1].split(',')
            InterArrivaltime = float(InterArrivaltimeArray[1])
            BlackListProfileArray.pop(0)
            BlackListProfileArray.pop(len(BlackListProfileArray)-1)
        sFile.close()
        NbrElements = 0
        MatchCounter = 0
        for CurrentObject in BlackListProfileArray:
            ObjectArray = CurrentObject.strip().split(',')
            NbrElements += int(ObjectArray[1])
            ObjectCounter, LiveInterArrivaltime = MatchObject(LiveProfileArray, \
                float(ObjectArray[0]), ObjPosition)
            if (ObjectCounter > 0) and (ObjectCounter <= int(ObjectArray[1])):
                MatchCounter += ObjectCounter
            elif (ObjectCounter > int(ObjectArray[1])):
                MatchCounter += int(ObjectArray[1])
        BoundaryRange = LowerBoundary
        while(BoundaryRange <= 100):
            RulesCounter1 = 0
            RulesCounter2 = 0
            RulesCounter3 = 0


            if ((MatchCounter/len(LiveProfileArray))*100 >= BoundaryRange):
                RulesCounter1 += 1
            if ((len(LiveProfileArray)<= NbrElements)):
                RulesCounter2 = RulesCounter1 + 1
            if ((LiveInterArrivaltime > InterArrivaltime)):
                RulesCounter3 = RulesCounter2 + 1
            UpdateIndex = 1
            while (UpdateIndex <= 3):
                if (UpdateIndex == 1):
                    RulesCounter = RulesCounter1
                if (UpdateIndex == 2):
                    RulesCounter = RulesCounter2
                if (UpdateIndex == 3):
                    RulesCounter = RulesCounter3
                if (IndexBlackList == 1):
                    UpdateFingerprintFile(BoundaryRange, \
                        str(RandURL+1), BypassingMode, UpdateIndex)
                if (RulesCounter == UpdateIndex):
                    if (BoundaryRange >= LowerBoundary) and (BoundaryRange <= 100):
                        UpdateFingerprintFile(BoundaryRange, ',' + \
                            str(IndexBlackList), BypassingMode, UpdateIndex)
                if (IndexBlackList == 70):
                    UpdateFingerprintFile(BoundaryRange, '\n', \
                        BypassingMode, UpdateIndex)
                UpdateIndex += 1
            BoundaryRange += 5
        IndexBlackList += 1
#=================================================================================="
# This function performs the following tasks:
# 1- Generate a random URL to retrieve
# 2- The random URL is then accessed in HTTP bypassing and HTTPS bypassing modes
# 3- The live traffic profile obtained after each access is fingerprinted (compared to pre-built profiles)
#=================================================================================="
def RandomBrowsing(NbrURLs, URLFilename, ObjPosition, Boundary):
    HTTPBypassingServer = 'SPECIFY THE HTTP BYPASSING SERVER HERE'
    HTTPSBypassingServer = 'SPECIFY THE HTTPS BYPASSING SERVER HERE'
    ExecFlag = 0
    os.system('cls')
    print ("Starting Efficiency Testing engine at: %s" % (str(datetime.datetime.now())))
    try:
        file = open(URLFilename, 'r')
        UrlList = file.readlines()
        file.close()
        print ("Parameters loaded successfully")
        ExecFlag = 1
    except:
        print ("Loading of URLs file... FAILED")
    if (ExecFlag == 1):
        nRound = 1
        while (nRound <= 4):
            IndexRun = 0


            RandomRun = []
            while(IndexRun < NbrURLs):
                RandomRun.append(IndexRun)
                IndexRun += 1
            random.shuffle(RandomRun)
            print (RandomRun)
            IndexURL = 0
            for RandomURL in RandomRun:
                IndexURL += 1
                print ("Current Round: %s Run: %s" % (str(nRound), str(IndexURL)))
                ExecBypassing(HTTPBypassingServer, UrlList[RandomURL], "HTTP")
                FingerprintURL(int(RandomURL), Boundary, ObjPosition, "HTTP")
                sleep(3)
                ExecBypassing(HTTPSBypassingServer, UrlList[RandomURL], "HTTPS")
                FingerprintURL(int(RandomURL), Boundary, ObjPosition, "HTTPS")
            nRound += 1
#==================================================================================="
# Usage of the Script: python EfficiencyTest.py parameter1 parameter2 parameter3 parameter4
# Parameter1: Total number of URLs
# Parameter2: Path of the URLs' file
# Parameter 3: Position of the parameter, as follows:
#   0 - ExecutionOrder  1 - ProcessID   2 - Protocol    3 - Method         4 - ServerName   5 - ServerIP
#   6 - ServerPort      7 - ClientIP    8 - ClientPort  9 - BytesReceived  10 - HeaderSize  11 - PayloadSize
# Parameter 4: lower boundary
#==================================================================================="
if len(sys.argv) == 5:
    if (len(sys.argv[1]) < 5):
        RandomBrowsing(int(sys.argv[1]), sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
else:
    print ("Usage: Python EfficiencyTest <Maximum URLs> <URL File Path> <Payload Index> <Boundary>")
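
The three chained rules applied in FingerprintURL can be restated more compactly. What follows is only a minimal sketch under stated assumptions, not the script used in the thesis: the function names capped_matches and rules_satisfied, the in-memory lists that stand in for the CSV profile files, and the sample numbers are all hypothetical. It illustrates that a pre-built profile is credited at most as many hits per object size as it contains, and that rule 2 and rule 3 only count when the preceding rule already holds.

```python
def capped_matches(profile, live_sizes):
    """Count live objects whose size appears in the pre-built profile.

    profile    -- list of (object_size, count) pairs (hypothetical layout)
    live_sizes -- list of object sizes observed in the live traffic
    """
    total = 0
    for size, count in profile:
        seen = sum(1 for s in live_sizes if s == size)
        # Never credit more hits than the profile holds for this size.
        total += min(seen, count)
    return total


def rules_satisfied(profile, profile_gap_ms, live_sizes, live_gap_ms, threshold_pct):
    """Return how many of the chained rules hold, from 0 to 3.

    Rule 1: share of matched live objects reaches threshold_pct.
    Rule 2: rule 1 holds and the live profile has no more objects than the
            pre-built profile.
    Rule 3: rule 2 holds and the live mean inter-arrival time exceeds the
            pre-built one.
    """
    if not live_sizes:
        return 0
    match_pct = 100.0 * capped_matches(profile, live_sizes) / len(live_sizes)
    if match_pct < threshold_pct:
        return 0
    if len(live_sizes) > sum(count for _, count in profile):
        return 1
    if live_gap_ms <= profile_gap_ms:
        return 2
    return 3


if __name__ == "__main__":
    # Hypothetical pre-built profile: two objects of 1200 bytes, one of 530.
    profile = [(1200.0, 2), (530.0, 1)]
    live = [1200.0, 530.0, 999.0]
    print(rules_satisfied(profile, 80.0, live, 120.0, 60))  # prints 3
```

Returning a single count mirrors the way the original compares each RulesCounter value with its rule index: a result of n means the profile would be recorded under rules 1 to n.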
