Download - 1 Data Mining Approaches for Network Intrusion Detection Karla Bracamonte Jeffrey Gawlinski Jordan Harstad Omar Rodriguez Michael Wright

1

Data Mining Approaches for Network Intrusion Detection

Karla Bracamonte, Jeffrey Gawlinski, Jordan Harstad, Omar Rodriguez, Michael Wright

2

Intrusion Detection Current

Detection is best if it occurs during the scanning step

Real-time intrusion detection

Pros: Scans network traffic on the fly, looking for well-known scan patterns

Cons: Tuned specifically to detect known service-level network attacks

Intrusion detection should follow a proactive approach

3

Visualization: Presenting a Graphical Summary of the Data

Communication: presenting a graphical summary of the data

PROS: Possible to communicate the most important aspects of the collected data

CONS: Not all information can be communicated visually, and visualization is limited by the complexity that the human eye can appreciate.

4


Visual Techniques:

Scatterplots

Projection Matrices

Coplots

Parallel Coordinates

Etc.

5


Distortion Methods: narrow the view to a portion of the data, so that a particular subset can be studied without losing the overall perspective.

Interactive Methods: view output dynamically through a user interface that can project, zoom, and manipulate the data on demand.

6


Audio Data Mining: applies visual techniques by converting signal pitches into graphs in order to recognize unique patterns.

Example: helping to find a pattern of early warning signs of human anger in telephone communication.

7

Data Summarization

Data summarization is an important data analysis task in data warehousing and online analytical processing (OLAP). Another commonly used term for data summarization is summary statistics.

[Figure: the data mining process. Stages: Selection, Preprocessing, Mining, Postprocessing. Inputs and outputs: Data, Selected Data, Preprocessed Data, Patterns, Knowledge. A second panel shows a grid labeled Product and Feature.]

8

Data Summarization: Offline Data Mining and Importance of Statistics

For example, networks with high traffic face a larger amount of data to analyze. With data summarization, however, the data can be analyzed pattern by pattern to detect abnormal behavior and/or results.

9


Summary statistics are quantities, such as the mean and standard deviation, that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.

Indeed, for many people, summary statistics are the most visible manifestation of statistics.

10


Frequencies and Mode: Given a set of unordered categorical values, there is not much that can be done to further characterize the values except to compute the frequency with which each value occurs for a particular set of data.

Percentiles: For ordered data, it is more useful to consider the percentiles of a set of values.
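As a rough illustration of these two ideas, the sketch below computes relative frequencies and the mode for a categorical attribute, and nearest-rank percentiles for an ordered one. The sample protocol names and packet sizes are invented for the example, and nearest-rank is only one of several common percentile definitions.

```python
import math
import statistics
from collections import Counter


def frequencies(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need non-empty data and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical categorical attribute: protocol of each observed packet.
protocols = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]
freq = frequencies(protocols)
mode = statistics.mode(protocols)

# Hypothetical ordered attribute: packet sizes in bytes.
packet_sizes = [40, 52, 60, 64, 128, 512, 576, 1024, 1500, 1500]
p50 = percentile(packet_sizes, 50)
p90 = percentile(packet_sizes, 90)
```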

11

Measures of Location: Mean and Median

For continuous data, two of the most widely used summary statistics are the mean and median, which are measures of the location of the set of values.

Measures of Spread: Range and Variance

Another set of commonly used summary statistics for continuous data are those that measure the dispersion or spread of a set of values. Such measures indicate whether the attribute values are widely spread out or relatively concentrated around a single point such as the mean.
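A minimal sketch of these measures using Python's standard `statistics` module follows. The connection counts are made up; the single large value is there to show that the mean is pulled toward an outlier while the median is not.

```python
import statistics


def location_and_spread(values):
    """Summarize a list of numbers with common location and spread measures."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }


# Hypothetical attribute: connections per minute, with one unusual burst.
connections_per_minute = [12, 15, 11, 14, 13, 95]
summary = location_and_spread(connections_per_minute)
```

Here the median stays near the typical values while the mean roughly doubles, which is one reason to report both.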

Data Summarization: Offline Data Mining and Importance of Statistics

12

Data Summarization: Offline Data Mining and Importance of Statistics

Multivariate Summary Statistics: Measures of location for data that consists of several attributes (multivariate data) can be obtained by computing the mean or median separately for each attribute.

Other Ways to Summarize Data: skewness
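The sketch below shows one way to compute a per-attribute mean for multivariate records, plus a simple skewness measure. The record layout (duration, bytes, packets) is hypothetical, and the skewness formula used here is the population moment ratio m3 / m2^(3/2); other definitions apply a sample-size correction.

```python
import statistics


def column_means(rows):
    """Mean of each attribute, computed separately per column."""
    columns = list(zip(*rows))
    return [statistics.fmean(column) for column in columns]


def skewness(values):
    """Population skewness: third central moment over the 1.5 power
    of the second central moment. Returns 0.0 for constant data."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical records: (duration in seconds, bytes sent, packets sent).
records = [(0.5, 200, 3), (1.5, 400, 5), (1.0, 300, 4)]
means = column_means(records)

symmetric = skewness([1, 2, 3])          # symmetric data, no skew
right_skewed = skewness([1, 1, 1, 10])   # long tail to the right
```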

13


Off-line processing is a reasonable solution.

Off-line processing provides the techniques for broader analysis of network traffic.

14

Network Intelligence Gathering

Footprinting

Administrative, technical, and billing contacts, which include employee names, email addresses, and phone and fax numbers

IP address range

DNS servers

Mail servers

15

Network Intelligence Gathering

Enumeration: the process of extracting valid accounts or exported resource names from systems

Scanning: the art of detecting which systems are alive and reachable via the Internet, and what services they offer, using techniques such as ping sweeps, port scans, and operating system identification
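From the defender's side, a port scan tends to show up as one source touching many distinct ports on the same host. The sketch below flags such sources in a list of connection records. The record format, the addresses (taken from documentation ranges), and the threshold of 10 ports are all assumptions chosen for illustration, not values from the presentation.

```python
from collections import defaultdict


def find_port_scanners(connections, port_threshold=10):
    """Return the set of source addresses that contacted at least
    `port_threshold` distinct ports on any single destination host.

    `connections` is an iterable of (src_ip, dst_ip, dst_port) tuples.
    """
    ports_seen = defaultdict(set)
    for src, dst, port in connections:
        ports_seen[(src, dst)].add(port)
    return {
        src
        for (src, _dst), ports in ports_seen.items()
        if len(ports) >= port_threshold
    }


# Hypothetical traffic: one ordinary client and one source sweeping ports 20-39.
traffic = [("192.0.2.10", "203.0.113.5", 80), ("192.0.2.10", "203.0.113.5", 443)]
traffic += [("198.51.100.9", "203.0.113.5", port) for port in range(20, 40)]

suspects = find_port_scanners(traffic)
```

A fixed threshold like this only catches fast, noisy scans; slow scans spread over hours would need the kind of off-line analysis discussed earlier.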

16

Network Based Attacks

Attack on availability: Making a network unavailable or unusable to a user or a group of users.

Attack on confidentiality: Many attacks target personal data. Whether it is a name, address, email, social security number, or credit card number, many network-based attacks exist solely to gather confidential and/or personal information about an individual, group of individuals, company, or object.

17

Network Based Attacks

Attack on integrity: It is possible for the data to be intercepted altogether and thus never reach the intended recipient.

Attack on authenticity: Modifies an original data cluster and then passes it on as unmodified.

18

Network Based Attacks

Attack on access control: This method attacks a legitimate machine within a secure network in the hope of gaining access to network and server resources.

Attack on privacy: An attack on privacy is mainly used to record data in one way or another. Whether by tracking specific website usage, online video game play, or email addresses, attackers use this method to exploit an individual’s activity on a computer.

19

Network Based Attacks

Prevention

Firewalls

Virus Scanners

Common Sense

20

Known Flags: Data Mining for Security

Suspicious red flags are not conclusive proof that fraud has been committed.

They are simply one tool of many for preventative measures.

Data mining yields no single catch-all rule and should not be solely relied upon.

A consistent pattern is a must for possible fraud identification.

21


Example: telecommunications fraud

Nodes represent different countries

Lines represent international phone calls

Unusually bright activity represents strange activity determined to be fraud

22


Example: compromised credit card accounts

A distinct pattern usually involves a lost or stolen card being swiped at a gas station. No gas is purchased; the swipe is only used to check whether the account is still active.

Large jewelry and electronics purchases follow shortly afterward.
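The pattern described above can be written down as a simple rule. The sketch below is only an illustration: the `Transaction` fields, the category names, and the amount and time thresholds are all hypothetical, and a match is a red flag for review rather than proof of fraud.

```python
from dataclasses import dataclass

HIGH_VALUE_CATEGORIES = {"jewelry", "electronics"}  # assumed category names


@dataclass
class Transaction:
    minute: int      # minutes since an arbitrary start time
    category: str    # merchant category, e.g. "gas_station"
    amount: float    # purchase amount in dollars


def matches_probe_then_spend(transactions, probe_max=1.00,
                             large_min=500.00, window_minutes=120):
    """True if a near-zero gas station swipe is followed, within the time
    window, by a large jewelry or electronics purchase."""
    ordered = sorted(transactions, key=lambda t: t.minute)
    for i, probe in enumerate(ordered):
        if probe.category != "gas_station" or probe.amount > probe_max:
            continue
        for later in ordered[i + 1:]:
            if later.minute - probe.minute > window_minutes:
                break
            if later.category in HIGH_VALUE_CATEGORIES and later.amount >= large_min:
                return True
    return False


suspicious = [
    Transaction(0, "gas_station", 0.00),
    Transaction(25, "jewelry", 2400.00),
    Transaction(40, "electronics", 1800.00),
]
ordinary = [
    Transaction(0, "gas_station", 42.50),
    Transaction(90, "grocery", 63.10),
]

flag_suspicious = matches_probe_then_spend(suspicious)
flag_ordinary = matches_probe_then_spend(ordinary)
```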

23


Example: terrorist activity has not been countered as a result of data mining.

It has no distinct pattern; a terrorist’s profile has no clear definition.

Large government (NSA) programs have attempted to use data mining for preventive measures without success.

Total Information Awareness generated thousands of tips every month for over a year without a single lead into terrorist organizations.

24

Classification: Predicting the Category to Which a Particular Record Belongs

A major part of the classification process is the initial information gathering task.

The idea behind this collection of data is that normal and abnormal patterns of occurrences can be differentiated from one another, and algorithms can then be created to detect such patterns.

Once such patterns can be detected, these algorithms can flag suspicious events as abnormal in real time and alert the appropriate person(s) to the potential intrusion(s).

25


These algorithms can operate in several ways. Commonly they are implemented as decision trees or simply as a set of predefined rules that the system data must satisfy.

26


There are many different options available that employ decision trees. Some of these options include:

Classification and Regression Trees (CART)

Chi-Square Automatic Interaction Detection (CHAID)

CART works by inducing two-way splits in a dataset, causing it to become segmented, whereas CHAID uses chi square tests to create splits in a dataset of variable size, also causing the data to become segmented.
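To make the idea of a two-way split concrete, the sketch below searches one numeric attribute for the threshold that best separates two classes, scoring candidates by weighted Gini impurity as CART-style trees commonly do. It covers a single split on a single attribute only, not a full tree, and the "failed logins" data is invented. CHAID's chi-square based splitting is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split of the form
    value <= threshold versus value > threshold.
    Returns (None, gini(labels)) if no split improves on the unsplit data."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical attribute: failed logins per session, with known labels.
failed_logins = [0, 1, 0, 2, 8, 9, 7, 10]
labels = ["normal"] * 4 + ["intrusion"] * 4

threshold, impurity = best_binary_split(failed_logins, labels)
```

A full tree would apply the same search recursively to each side of the split, over every available attribute.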

27


Lee and Stolfo conducted several experiments pertaining to classification methods in their paper “Data Mining Approaches for Intrusion Detection.” The first of these experiments was on a set of sendmail system call data. This data consisted of sendmail traces, with the trace data consisting of two columns of integers.

The traces contained within the data were classified as either normal or abnormal, where the normal data constituted “a trace of the sendmail daemon and a concatenation of several invocations of the sendmail program” and the abnormal data was composed of the following attacks:

Three traces of sunsendmailcp (sscp)

Two traces of syslog-remote

Two traces of syslog-local

Two traces of decode

One trace of sm5x

One trace of sm565a

28


After this data was obtained, system call sequences had to be derived and labeled as normal or abnormal so that they could be supplied to RIPPER, the rule learning program used to generate rules that predict whether a sequence is normal or abnormal. The intrusion detection system then followed a post-processing scheme that used the RIPPER predictions to decide whether the current trace was an intrusion.

The logic here is that when there is an intrusion on the system, most of the adjacent system call sequences will be abnormal.
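The sketch below illustrates this two-stage idea in simplified form: cut a trace into fixed-length system call sequences, label each one, then look at overlapping regions of adjacent predictions and call a region abnormal when most of its predictions are abnormal. Several things here are assumptions rather than the paper's setup: a lookup in a set of known normal sequences stands in for the RIPPER rules, and the window length, region size, and threshold are arbitrary illustrative values.

```python
def sliding_windows(trace, length):
    """All contiguous sequences of `length` system calls in the trace."""
    return [tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)]


def predict_sequences(trace, normal_sequences, length):
    """Stand-in for learned rules: a sequence is 'normal' only if it
    appeared in the normal training data."""
    return [
        "normal" if window in normal_sequences else "abnormal"
        for window in sliding_windows(trace, length)
    ]


def abnormal_region_fraction(predictions, half_width):
    """Slide a region of 2 * half_width + 1 predictions in steps of
    half_width; a region is abnormal if more than half_width of its
    predictions are abnormal. Returns the fraction of abnormal regions."""
    if half_width < 1:
        raise ValueError("half_width must be at least 1")
    region_len = 2 * half_width + 1
    starts = range(0, len(predictions) - region_len + 1, half_width)
    regions = [predictions[s:s + region_len] for s in starts]
    if not regions:
        return 0.0
    abnormal = sum(1 for r in regions if r.count("abnormal") > half_width)
    return abnormal / len(regions)


def is_intrusion(trace, normal_sequences, length=3, half_width=1, threshold=0.02):
    predictions = predict_sequences(trace, normal_sequences, length)
    return abnormal_region_fraction(predictions, half_width) > threshold


# Hypothetical system call numbers; a repeating pattern plays the normal program.
training_trace = [1, 2, 3, 4] * 5
normal_sequences = set(sliding_windows(training_trace, 3))

clean_trace = [1, 2, 3, 4] * 3
attack_trace = [1, 2, 3, 4, 1, 2, 9, 9, 9, 9, 9, 3, 4, 1, 2, 3, 4]

clean_flag = is_intrusion(clean_trace, normal_sequences)
attack_flag = is_intrusion(attack_trace, normal_sequences)
```

Voting over regions rather than single sequences means that one isolated misprediction does not by itself mark a trace as an intrusion.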

29


From the results it is important to notice that generally speaking, intrusion traces will create much larger abnormal regions than normal traces.

Also note that the results show that the rules that were generated can be applied to intrusion traces not included in the training dataset.

This means that the rules for normal patterns can be used to detect anomalies.

The rules from experiments C and D, on the other hand, represent the abnormal sequence patterns. These rules work very well for detecting intrusion types seen in the training data, but perform worse than those from experiments A and B when it comes to detecting intrusions in traces that were not seen in the training data. The implication is that the rule set for abnormal patterns performs well on predictable intrusions, such as misuse or other repeatable events for which good baseline data is available to generate the rules, but is unreliable at flagging new types of intrusions that may occur in the future.

30


The next approach that Lee and Stolfo attempted involved creating an anomaly detection routine using only normal traces for training data. Experiments were carried out to determine the normal correlation between system calls, i.e. the nth or the middle system calls in normal sequences of length n.
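One way to picture this is a model that, having seen only normal traces, predicts the nth system call from the n-1 calls before it; a trace where those predictions often fail looks anomalous. The sketch below uses a plain frequency table as the predictor, which is an assumption made for brevity and not the rule learner used in the paper. The system call numbers are invented.

```python
from collections import Counter, defaultdict


def train_next_call_model(normal_traces, n):
    """Map each (n-1)-call prefix seen in normal traces to the system
    call that most often followed it."""
    table = defaultdict(Counter)
    for trace in normal_traces:
        for i in range(len(trace) - n + 1):
            prefix = tuple(trace[i:i + n - 1])
            table[prefix][trace[i + n - 1]] += 1
    return {prefix: counts.most_common(1)[0][0] for prefix, counts in table.items()}


def misprediction_rate(trace, model, n):
    """Fraction of length-n sequences whose last call differs from the
    model's prediction. Prefixes never seen in training count as misses."""
    total = len(trace) - n + 1
    if total <= 0:
        return 0.0
    misses = 0
    for i in range(total):
        prefix = tuple(trace[i:i + n - 1])
        if model.get(prefix) != trace[i + n - 1]:
            misses += 1
    return misses / total


# Train on normal behavior only, then score unseen traces.
model = train_next_call_model([[1, 2, 3, 4] * 5], n=3)

normal_rate = misprediction_rate([1, 2, 3, 4, 1, 2, 3, 4], model, n=3)
odd_rate = misprediction_rate([1, 2, 9, 9, 3, 4], model, n=3)
```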

Lee and Stolfo declared that “improvement in accuracy can come from adding more features, rather than just system calls, into the models of program execution.” Items such as the file structure and paths within that were traversed (directories and names of touched files) could be used to generate stronger rules.

31


Lee and Stolfo further examined network intrusion detection by monitoring network traffic directly, using a packet capturing program, tcpdump, to collect data.

In conclusion, when the data is not designed specifically for security purposes (as in this case), it cannot be used to build a detection model without a certain amount of modification (pre-processing). Because of the changes that must be made, considerable knowledge of the domain being tested is required, and as such the process is not easily automated. On the other hand, it is important to note again that adding extra measures can improve the accuracy of the classification model.
