Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers
Authors: Antonio Miguel Mora García, Paloma de las Cuevas Delgado, Juan Julián Merelo Guervós
ECTA 2014, Rome, Italy

Upload: paloma-de-las-cuevas

Post on 28-Nov-2014


DESCRIPTION

Corporate systems can be secured using a wide range of methods, and the implementation of black or white lists is among them. With these lists it is possible to restrict (or allow) users' execution of applications or access to certain URLs, among other things. This paper focuses on the latter option. It describes the whole processing of a data set composed of URL sessions performed by the employees of a company, from the preprocessing stage, including labelling and data balancing processes, to the application of several classification algorithms. The aim is to define a method for automatically deciding whether to allow or deny future URL requests, considering a set of corporate security policies. Thus, this work goes a step beyond the usual black and white lists, since they can only control those URLs that are specifically included in them; they cannot make decisions based on similarity (through classification techniques), or on other variables of the session, as is proposed here. The results show a set of classification methods which obtain very good classification percentages (95-97%), and which infer some useful rules based on additional features (rather than just the URL string) related to the user's access. This led us to consider that this kind of tool would be very useful for an enterprise.

TRANSCRIPT

Page 1: Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers

Authors: Antonio Miguel Mora García, Paloma de las Cuevas Delgado, Juan Julián Merelo Guervós

ECTA 2014, Rome, Italy

JJ Merelo
These section-divider slides aren't needed either. Either let it follow from the context or use some marker, icon, colour...
JJ Merelo
In short presentations you don't need to include this. They all follow the same outline.
Paloma de las Cuevas Delgado
but in this morning's talks they all had an index hahaha your argument is invalid :$ but I will remove all the later step-by-step repetitions...
Antonio Mora
About the hash, same as last time: you don't need to mention it if you don't want to. ;)
Antonio Mora
I would remove subsections 'a' and 'b' and rename 'Data mining process' to 'Preprocessing'. Data mining implies that information/knowledge is extracted from the data. ;) Say that the labelling process is done to build the training and test sets for the classifiers. I would title point 4 'Analysis of Results', since the rules have also been examined, for example. ;)
Paloma de las Cuevas Delgado
All of this is what I wanted to discuss on Monday but you weren't there xD so I'll see how I approach it. Thanks for the comments.
Antonio Mora
I would remove this slide and mention it in each section; that is, when you talk about the rules you say that the Drools format is considered, and when you talk about the classification you mention that WEKA was used. The implementation details aren't relevant for the conference; I wouldn't even mention them. ;)
Page 2: MUSES is an EU-funded research project

Page 3: Bring Your Own Device

What happens to corporate assets in a BYOD environment?

Page 4: Structure of the MUSES server

Page 5: Underlying Problem

Enterprise Security applied to employees' connections to the Internet (URL requests).

● Proxies
● Firewalls
● Corporate Security Policies (CSP), which may include Blacklists and Whitelists

Page 6: What do Black and White lists cover?

● Every URL inside a Blacklist is denied; any other URL is allowed. What if something is allowed by default but should not be?
● Every URL inside a Whitelist is allowed; any other URL is denied. What if something is denied by default but should not be?

Therefore, we want to go a step beyond.
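The default behaviour of both kinds of list can be sketched in a few lines of Python. This is only an illustration: matching on the host name, and the example hosts themselves, are assumptions that do not come from the slides.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def blacklist_decision(url, blacklist):
    # Only listed hosts are denied; anything unknown is allowed by default.
    return "DENY" if host_of(url) in blacklist else "ALLOW"


def whitelist_decision(url, whitelist):
    # Only listed hosts are allowed; anything unknown is denied by default.
    return "ALLOW" if host_of(url) in whitelist else "DENY"


# Hypothetical lists and request, for illustration only.
blacklist = {"bad.example.com"}
whitelist = {"intranet.example.com"}
unseen = "http://www.one.example.com/file"
```

A URL that appears in neither list gets opposite answers from the two functions, which is the gap the slides point at: the decision for an unseen URL depends only on the list's default, not on anything about the request.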

Page 7: Objectives

● Objective → to obtain a tool that automatically decides whether to allow or deny URLs that are not included in the black/whitelists.
o This decision would be based on the one made for similar URL accesses (those with similar features).
o The tool should consider other parameters of the request in addition to the URL string.

Page 8: Followed Schema

[Diagram with the stages: Unlabelled requests; Data Mining (Labelling Process); Labelled requests; Machine Learning (Classification methods); Analysis of results (Classification accuracies and Rules)]

Page 9: Working Scenario

Employees requesting access to URLs (records from an actual Spanish company with around 100 employees), from 8 to 10 am.

● Log file of 100k entries (patterns), in CSV format.
● A set of rules (specification of the security policies as if-then clauses).

Page 10: Data description: Entries in the Log

● An Entry (unlabelled)

http_reply_code: 200
http_method: GET
duration_milliseconds: 1114
content_type: application/octet-stream
server_or_cache_address: X.X.X.X
time: 08:30:08
squid_hierarchy: DEFAULT_PARENT
bytes: 106961
url: http://www.one.example.com
client_address: X.X.X.X

● It has 7 categorical fields and 3 numerical fields.
● This leads to classifiers that support both types:
o Rule-based classifiers
o Tree-based classifiers
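A minimal sketch of reading one such entry from the CSV log is shown below. Several things here are assumptions rather than facts from the slides: the comma-separated column order, treating duration, time, and bytes as the three numerical fields, and converting the time to milliseconds since midnight (a reading suggested by rule thresholds such as `time <= 31532000` later in the deck). The attribute names `time_ms` and `num_bytes` are also illustrative.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class LogEntry:
    # Categorical fields are kept as strings (assumed split: 7 of them).
    http_reply_code: str
    http_method: str
    duration_milliseconds: int  # numerical
    content_type: str
    server_or_cache_address: str
    time_ms: int                # numerical, milliseconds since midnight
    squid_hierarchy: str
    num_bytes: int              # numerical
    url: str
    client_address: str


def time_to_ms(hhmmss):
    """Convert 'HH:MM:SS' to milliseconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000


def parse_row(row):
    """Build a LogEntry from a list of 10 strings in the assumed order."""
    code, method, duration, ctype, server, time, hierarchy, size, url, client = row
    return LogEntry(code, method, int(duration), ctype, server,
                    time_to_ms(time), hierarchy, int(size), url, client)


# The example entry from the slide, written as one CSV line.
SAMPLE = ("200,GET,1114,application/octet-stream,X.X.X.X,08:30:08,"
          "DEFAULT_PARENT,106961,http://www.one.example.com,X.X.X.X\n")
entry = parse_row(next(csv.reader(io.StringIO(SAMPLE))))
```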

Page 11: Data description: Policies and Rules

● A Policy and a Rule

“Video streamings cannot be reproduced”

rule "policy-1 MP4"
attributes
when
    squid:Squid(dif_MCT=="video", bytes>1000000, content_type matches "*.application.*", url matches "*.p2p.*")
then
    PolicyDecisionPoint.deny();
end

● It has a set of conditions, and a decision (ALLOW/DENY).
● Each condition has: Data Type, Relationship, Value.

Page 12: Labelling Process

● The two data sets are compared during the labelling process.
● Conditions of each rule are checked in each entry/request.
● If an entry meets all conditions, it is labelled with the corresponding decision of the rule.

WHEN - Entry meets conditions of a rule that allows making the request.
AND - Entry meets conditions of a rule that denies making the request.
THEN - DENY is chosen.
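The labelling step, including the "DENY wins" conflict rule, can be sketched as follows. This is not the authors' implementation (the slides express rules in Drools format); the `Condition` and `Rule` classes, the operator table, and the field names in the usage example are hypothetical. The `field` attribute stands in for what the slide calls the data type, which is an assumed reading.

```python
import operator
import re
from dataclasses import dataclass

# Supported relationships; "matches" is a regular-expression search.
OPS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "matches": lambda value, pattern: re.search(pattern, value) is not None,
}


@dataclass
class Condition:
    field: str         # which attribute of the entry is tested
    relationship: str  # one of the keys of OPS
    value: object      # value (or pattern) to compare against

    def holds(self, entry):
        return OPS[self.relationship](entry[self.field], self.value)


@dataclass
class Rule:
    conditions: list
    decision: str      # "ALLOW" or "DENY"


def label(entry, rules):
    """Return 'DENY', 'ALLOW', or None if no rule covers the entry."""
    decisions = {r.decision for r in rules
                 if all(c.holds(entry) for c in r.conditions)}
    if "DENY" in decisions:   # a matching DENY rule overrides any ALLOW
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None               # entry stays unlabelled


# Hypothetical rules, loosely modelled on the video policy above.
rules = [
    Rule([Condition("content_type_MCT", "==", "video"),
          Condition("bytes", ">", 1000000)], "DENY"),
    Rule([Condition("http_method", "==", "GET")], "ALLOW"),
]
```

Entries for which `label` returns `None` correspond to the patterns that are not covered by any rule and therefore cannot be used for training.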

Page 13: Data Summary

● The CSV file, now containing only the patterns that could be labelled (the others were not covered by the rules), has 57502 entries/patterns (roughly a 2:1 ratio):
o 38972 with an ALLOW label.
o 18530 with a DENY label.

● Application of data balancing techniques:
o Undersampling: random removal of patterns in the majority class.
o Oversampling: duplication of each pattern in the minority class.
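Both balancing techniques are simple enough to sketch directly. The version below is an illustration under assumptions: undersampling is taken to shrink the majority class to the size of the minority class (the slides do not state the target size), and the function names are hypothetical. Duplicating each minority pattern once would turn the 18530 DENY patterns into 37060, close to the 38972 ALLOW patterns.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly keep as many majority patterns as there are minority ones."""
    rng = random.Random(seed)
    kept = rng.sample(list(majority), len(minority))
    return kept + list(minority)


def oversample(majority, minority):
    """Duplicate every minority pattern once and keep the majority as is."""
    return list(majority) + list(minority) * 2
```

Undersampling discards information from the majority class, whereas oversampling keeps every pattern but introduces exact copies, which matters when the data are later split into training and test files.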

Page 14: Experimental Setup

● The classifiers are tested, firstly, with a 10-fold cross-validation process.
o The top five classifiers in accuracy are chosen for the following experiments.
o Also, the Naïve Bayes classifier is taken as a reference.
● Secondly, the initial (labelled) log file is divided into training and test files.
● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
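The difference between a random and a sequential division can be sketched as below. The function name and the fixed seed are illustrative assumptions; the slides do not describe how the division was implemented.

```python
import random


def split_entries(entries, train_fraction, randomly, seed=0):
    """Split entries into (training, test) lists.

    With randomly=False the original log order is kept, so the test
    file holds the last requests of the log.
    """
    items = list(entries)
    if randomly:
        random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]
```

For example, `split_entries(log, 0.8, randomly=False)` puts the first 80% of the log in the training file, while `randomly=True` mixes early and late requests across both files.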

Page 15: Flow Diagram

1) Initial labelling process.
Experiments with unbalanced and balanced data. From those, divisions are made, randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing

2) Removal of duplicated requests.
Experiments with unbalanced data. From those, divisions are made, randomly and sequentially:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing

3) Enhancing the creation of training and test files.
Experiments with unbalanced data. From those, divisions are made, with patterns taken randomly:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing

4) Filtering the features of the URL.
Experiments with unbalanced and balanced data. From those, divisions are made, with patterns taken randomly:
● 80% training, 20% testing
● 90% training, 10% testing
● 60% training, 40% testing

Page 16: 10-fold cross-validation experiments
1) Initial labelling process.

● The classifiers are tested, firstly, with a 10-fold cross-validation process over the balanced data.
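As a reminder of what 10-fold cross-validation does, here is a plain-Python sketch that yields the index sets for each fold. It illustrates the general idea only and is not the procedure of the tool used for the experiments; the function name and the round-robin fold assignment are assumptions.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Round-robin assignment: index i goes to fold i % k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test
```

Each pattern is used for testing exactly once and for training in the other nine folds; the reported accuracy is then typically the average over the ten runs.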

Page 17: Using separate training/test files
1) Initial labelling process.

● Naïve Bayes and the top five classifiers are tested with separate training and test divisions, so that no pattern is used for both training and testing.

Page 18: Serendipity rocks
1) Initial labelling process.

Divisions made over unbalanced data

JJ Merelo
Put the "headline" of the experiments and use a single slide with animations for the charts.
Paloma de las Cuevas Delgado
but animations don't show up in the PDF u_u
JJ Merelo
You can present it directly from the browser if you publish it; the PDF isn't mandatory. You could even export it to PP
Paloma de las Cuevas Delgado
I guess so, but the only difference is that the animation happens by moving from one slide to the next; don't think I'm going to pause much. You weren't at my master's thesis (TFM) defence; otherwise you'd have seen that although there were forty-odd slides, I had enough time in the 20 minutes :(
JJ Merelo
The headline isn't "first set". "Serendipity rocks", because it is random forest and randomly taken.
Paloma de las Cuevas Delgado
You'll see when my jokes come out completely flat because I'll be hysterical xD
JJ Merelo
If you're hysterical they'll come out super destroyer. If you look at the audience and smile at them, you'll win them over. You won't wake up the Japanese guy unless you burst out laughing, but the rest, yes.
Page 19: Results continue falling
1) Initial labelling process.

Divisions made over balanced data (undersampling)

Page 20: Results continue falling
1) Initial labelling process.

Divisions made over balanced data (oversampling)

Page 21: Why are accuracies still high?
2) Removal of duplicated requests.

● We studied the field squid_hierarchy and saw that it had two possible values: DIRECT or DEFAULT_PARENT.

http_reply_code: 200
http_method: GET
duration_milliseconds: 1114
content_type: application/octet-stream
server_or_cache_address: X.X.X.X
time: 08:30:08
squid_hierarchy: DEFAULT_PARENT
bytes: 106961
url: http://www.one.example.com
client_address: X.X.X.X

Page 22: Repeated entries affect accuracies
2) Removal of duplicated requests.

● The connections are made, firstly, to the Squid proxy, and then, if appropriate, the request continues to another server.
o As a result, some of the entries were repeated, and the results may be affected by that.

[Diagram labels: “Some local IP”, 192.194.2.2, “Some server IP”]
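One possible way to drop such repeated entries is sketched below. The slides do not say how duplicates were identified, so the criterion here is an assumption: two entries are treated as the same request when client, URL, time, and HTTP method coincide, even if `squid_hierarchy` or the server address differ. Keeping the first occurrence is also an arbitrary choice.

```python
def drop_repeated(entries,
                  key_fields=("client_address", "url", "time", "http_method")):
    """Keep the first entry for each combination of key_fields."""
    seen = set()
    kept = []
    for entry in entries:
        key = tuple(entry[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept
```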

Page 23: Serendipity rocks again
2) Removal of duplicated requests.

Divisions made over unbalanced data

Page 24: Where are the URL features going?
3) Enhancing the creation of training and test files.

● Repeated URL core domains could lead to misleading results.
● During the division process, we ensured that requests with the same URL core domain went to the same file (either for training or for testing).
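A division that keeps every core domain on one side can be sketched as follows. This is an illustration, not the authors' procedure: `core_domain` naively takes the label just before the top level domain (so it would mishandle hosts such as `example.co.uk`), and whole domains are assigned to training until the requested fraction is reached, which makes the final ratio approximate.

```python
import random
from urllib.parse import urlsplit


def core_domain(url):
    """Naive core domain: the label right before the top level domain."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def grouped_split(entries, train_fraction, seed=0):
    """Split entries so that no core domain appears in both files."""
    groups = {}
    for entry in entries:
        groups.setdefault(core_domain(entry["url"]), []).append(entry)
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    target = train_fraction * len(entries)
    train, test = [], []
    for name in names:
        # Fill the training file first; remaining domains go to testing.
        (train if len(train) < target else test).extend(groups[name])
    return train, test
```

With this kind of split, a classifier that has only memorised core domains seen in training gets no help on the test file, which is consistent with the drop in accuracy reported for this experiment.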

Page 25: Accuracies fall down automatically
3) Enhancing the creation of training and test files.

Page 26: Created Rules During Classification

● In the experiments that included only the URL core domain as a classification feature, rules were too focused on that feature.

PART decision list
------------------

url = dropbox: deny (2999.0)

url = ubuntu: allow (2165.0)

url = facebook: deny (1808.0)

url = valli: allow (1679.0)

Page 27: Created Rules During Classification

● Other kinds of rules were found, but they always depended on the URL core domain.

url = grooveshark AND
http_method = POST: allow (733.0)

url = googleapis AND
content_type = text/javascript AND
client_address = 192.168.4.4: allow (155.0/2.0)

url = abc AND
content_type_MCT = image AND
time <= 31532000: allow (256.0)
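A decision list such as the ones above is applied in order: the first rule whose conditions all hold gives the decision. The sketch below shows that evaluation with a few of the rules from the slides. The relative order of the rules and the default decision for entries that match nothing are assumptions, since the slides only show excerpts of the list.

```python
def classify(entry, decision_list, default):
    """Return the decision of the first rule whose conditions all hold."""
    for conditions, decision in decision_list:
        if all(condition(entry) for condition in conditions):
            return decision
    return default


# A few rules taken from the excerpts; their order is assumed.
RULES = [
    ([lambda e: e["url"] == "dropbox"], "deny"),
    ([lambda e: e["url"] == "ubuntu"], "allow"),
    ([lambda e: e["url"] == "grooveshark",
      lambda e: e["http_method"] == "POST"], "allow"),
]
```

Every rule here tests `url` first, so an entry whose core domain never appeared in training can only fall through to the default, which is the weakness the slide describes.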

Page 28: Training with other URL features
4) Filtering the features of the URL.

● Rules created by the classifiers are too focused on the URL core domain feature.
● We did the experiments again with the original file, but including as a feature only the Top Level Domain of the URL, and not the core domain.
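Replacing the URL by its top level domain can be sketched as below. The helper names are hypothetical, and taking the last dot-separated label of the host is a simplification (for a host given as an IP address it would return the last octet).

```python
from urllib.parse import urlsplit


def tld(url):
    """Return the last label of the host name, e.g. 'com' or 'es'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def with_tld_only(entry):
    """Copy an entry, replacing its 'url' field with a 'tld' feature."""
    out = dict(entry)
    out["tld"] = tld(out.pop("url"))
    return out
```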

Page 29: Random Forest defeats everyone
4) Filtering the features of the URL.

Divisions made over balanced data

Page 30: Created Rules During Classification

● After including the URL top level domain as a classification feature, instead of the URL core domain, rules classify mainly by server address.

PART decision list
------------------

server_or_cache_address = 173.194.34.248: allow (238.0/1.0)

server_or_cache_address = 91.121.155.13: deny (235.0)

server_or_cache_address = 90.84.53.48 AND
client_address = 10.159.39.199 AND
tld = es AND
time <= 31533000: allow (138.0/1.0)

Page 31: Created Rules During Classification

● The URL TLD appears, but now the rules are not always dependent on this feature.

server_or_cache_address = 90.84.53.19 AND
tld = com: deny (33.0/1.0)

server_or_cache_address = 87.248.20.254 AND
content_type_MCT = image AND
duration_milliseconds > 21: deny (15.0)

server_or_cache_address = 23.38.17.224 AND
time > 30532000 AND
http_reply_code = 200 AND
content_type_MCT = image AND
bytes <= 520 AND
time <= 33677000: allow (40.0)

Page 32: Conclusions

● In most cases, the Random Forest classifier is the one that yields the best results.
● The loss of information when analysing a log of URL requests lowers the results. This happens when:
o Undersampling data (because we randomly remove data).
o Keeping the sequence of the requests of the initial log file while dividing it into training and test files.

Page 33: Conclusions

● As seen in the rules obtained, it is possible to develop a tool that automatically decides whether to allow or deny a URL request, and that decision would depend on other features of the request and not only on the URL.

Page 34: Future Work

● Run experiments with bigger data sets (e.g. a whole workday).
● Include more lexical features of a URL in the experiments (e.g. number of subdomains, number of arguments, or the path).
● Consider sessions when classifying.
o A session is defined as the set of requests that are made from a certain client during a certain time.
● Finally, implement a system and test it with real data, in real time.
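The lexical features and the session notion mentioned above could be computed along these lines. Everything in this sketch is an assumption about work that the slides only propose: the feature names, counting subdomains as the host labels before the last two, and cutting sessions into fixed time windows per client.

```python
from urllib.parse import parse_qsl, urlsplit


def lexical_features(url):
    """Extract a few lexical features from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        # Labels before "<core domain>.<tld>" are counted as subdomains.
        "num_subdomains": max(len(labels) - 2, 0),
        "num_arguments": len(parse_qsl(parts.query, keep_blank_values=True)),
        "path": parts.path,
    }


def sessions(entries, window_ms):
    """Group entries by (client, time window); time is in milliseconds."""
    grouped = {}
    for entry in entries:
        key = (entry["client_address"], entry["time"] // window_ms)
        grouped.setdefault(key, []).append(entry)
    return grouped
```

A fixed window is the simplest choice; a gap-based definition (a new session after a period of inactivity) would be an equally plausible alternative.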

Page 35: Thank you for your attention

Questions?

Twitter: @amoragar, @jjmerelo, @unintendedbear