1
Data Mining Approaches for Network Intrusion Detection
Karla Bracamonte, Jeffrey Gawlinski, Jordan Harstad, Omar Rodriguez, Michael Wright
2
Intrusion Detection Current
Detection is best if it occurs during the scanning step.
Real-time intrusion detection
  Pros: Scans network traffic on the fly, looking for well-known scan patterns.
  Cons: Tuned specifically to detect known service-level network attacks.
Intrusion detection should follow a proactive approach.
3
Visualization: Presenting a Graphical Summary of the Data
Communication: presenting a graphical summary of the data
  Pros: Makes it possible to communicate the most important aspects of the collected data.
  Cons: Not all information can be communicated visually, and visualization is limited by the complexity that the human eye can appreciate.
4
Visualization: Presenting a Graphical Summary of the Data
Visual Techniques:
  Scatterplots
  Projection Matrices
  Coplots
  Parallel Coordinates
  Etc.
5
Visualization: Presenting a Graphical Summary of the Data
Distortion Methods: narrow the scope of the data so that a particular subset can be studied without losing the overall perspective.
Interactive Methods: view output dynamically through a user interface that can project, zoom, and manipulate the data on demand.
6
Visualization: Presenting a Graphical Summary of the Data
Audio Data Mining: applies visual techniques by converting signal pitches into graphs so that distinctive patterns can be recognized.
  Can help find a pattern of early warning signs of human anger in telephone communication.
7
Data Summarization
Data summarization is an important data analysis task in data warehousing and online analytical processing. Another term used for data summarization is summary statistics.
[Figure: data mining process. Stages: Selection, Preprocessing, Mining, Postprocessing. Data states: Data, Selected Data, Preprocessed Data, Patterns, Knowledge.]
8
Data Summarization: Offline Data Mining and Importance of Statistics
Networks with high traffic, for example, face a larger amount of data to analyze. With data summarization, the data can nevertheless be analyzed pattern by pattern to detect abnormal behavior and/or results.
9
Data Summarization: Offline Data Mining and Importance of Statistics
Summary statistics are quantities, such as the mean and standard deviation, that capture various characteristics of a potentially large set of values with a single number or a small set of numbers.
Indeed, for many people, summary statistics are the most visible manifestation of statistics.
10
Data Summarization: Offline Data Mining and Importance of Statistics
Frequencies and Mode
  Given a set of unordered categorical values, there is not much that can be done to further characterize the values except to compute the frequency with which each value occurs for a particular set of data.
Percentiles
  For ordered data, it is more useful to consider the percentiles of a set of values.
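The frequency, mode, and percentile summaries above can be sketched in a few lines of Python. This is only an illustration: the `protocols` and `packet_sizes` samples are made-up values, and `percentile` uses the nearest-rank definition, which is one of several conventions in use.

```python
import math
import statistics
from collections import Counter

def frequencies(values):
    """Return the fraction of records that take each categorical value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample data, not taken from any real capture.
protocols = ["tcp", "tcp", "udp", "tcp", "icmp", "udp"]
packet_sizes = [40, 52, 60, 64, 120, 512, 576, 1024, 1500, 1500]

protocol_freq = frequencies(protocols)            # categorical data: frequencies
most_common_protocol = statistics.mode(protocols)  # categorical data: mode
median_size = percentile(packet_sizes, 50)         # ordered data: 50th percentile
p90_size = percentile(packet_sizes, 90)            # ordered data: 90th percentile
```

For unordered values such as protocol names, only the frequencies and the mode are meaningful. Percentiles require an ordering, which the packet sizes have.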
11
Data Summarization: Offline Data Mining and Importance of Statistics
Measures of Location: Mean and Median
  For continuous data, two of the most widely used summary statistics are the mean and median, which are measures of the location of a set of values.
Measures of Spread: Range and Variance
  Another set of commonly used summary statistics for continuous data are those that measure the dispersion or spread of a set of values. Such measures indicate whether the attribute values are widely spread out or relatively concentrated around a single point such as the mean.
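A minimal sketch of the location and spread measures, using Python's standard `statistics` module. The two samples (`steady` and `bursty`, imagined as packets per second) are invented for illustration and chosen so that they share a mean but differ in median and spread.

```python
import statistics

def summarize(values):
    """Compute basic location and spread statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1 denominator)
    }

# Hypothetical traffic rates; both samples have the same mean.
steady = [98, 100, 101, 99, 102]
bursty = [10, 20, 30, 40, 400]

steady_summary = summarize(steady)
bursty_summary = summarize(bursty)
```

Both samples have a mean of 100, but the median of `bursty` is 30 and its variance is far larger. The mean alone would hide the single burst that an analyst would likely want to inspect.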
12
Data Summarization: Offline Data Mining and Importance of Statistics
Multivariate Summary Statistics
  Measures of location for data that consists of several attributes (multivariate data) can be obtained by computing the mean or median separately for each attribute.
Other Ways to Summarize Data
  Skewness
13
Data Summarization: Offline Data Mining and Importance of Statistics
Off-line processing is a reasonable solution.
Off-line processing provides techniques for a broader analysis of network traffic.
14
Network Intelligence Gathering
Foot Printing
  Administrative, technical, and billing contacts, which include employee names, email addresses, and phone and fax numbers
  IP address range
  DNS servers
  Mail servers
15
Network Intelligence Gathering
Enumeration: the process of extracting valid accounts or exported resource names from systems
Scanning: the art of detecting which systems are alive and reachable via the Internet, and what services they offer, using techniques such as ping sweeps, port scans, and operating system identification
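Because a port scan touches many services on one host, a simple way to look for it in connection records is to count the distinct destination ports each source contacts. The sketch below assumes a hypothetical record format of `(source, destination, port)` tuples and an arbitrary threshold of 10 ports; neither comes from the slides.

```python
from collections import defaultdict

def find_port_scanners(connections, min_distinct_ports=10):
    """Return sources that contacted many distinct ports on a single destination."""
    ports_seen = defaultdict(set)
    for source, destination, port in connections:
        ports_seen[(source, destination)].add(port)
    suspects = {
        source
        for (source, _destination), ports in ports_seen.items()
        if len(ports) >= min_distinct_ports
    }
    return sorted(suspects)

# Hypothetical records: one host sweeping ports 20-39, one ordinary web client.
connections = [("10.0.0.9", "192.168.1.5", port) for port in range(20, 40)]
connections += [("10.0.0.7", "192.168.1.5", 80)] * 15
connections += [("10.0.0.7", "192.168.1.5", 443)] * 15

suspects = find_port_scanners(connections)
```

The web client makes more connections in total than the scanner, but only to two ports, so it is not flagged. A slow or distributed scan would evade a fixed threshold like this one.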
16
Network Based Attacks
Attack on availability: Making a network unavailable or unusable to a user or a group of users.
Attack on confidentiality: Many attacks target personal data. Whether it is a name, address, email, Social Security number, or credit card number, many network-based attacks exist solely to gather confidential and/or personal information about an individual, group of individuals, company, or object.
17
Network Based Attacks
Attack on integrity: The data may be intercepted altogether and thus never reach the intended recipient.
Attack on authenticity: Modifies an original data cluster and then passes it on as unmodified.
18
Network Based Attacks
Attack on access control: Attacks a legitimate machine within a secure network in the hope of gaining access to network and server resources.
Attack on privacy: Mainly used to record data in one way or another. Whether by tracking specific website usage, online video game play, or email addresses, attackers use this method to exploit an individual's activity on a computer.
19
Network Based Attacks
Prevention
  Firewalls
  Virus Scanners
  Common Sense
20
Known Flags: Data Mining for Security
Suspicious red flags are not conclusive proof that fraud has been committed.
They are simply one tool of many for preventative measures.
Data mining provides no single catch-all rule and should not be solely relied upon.
A consistent pattern is a must for possible fraud identification.
21
Known Flags: Data Mining for Security
Example: telecommunications fraud
  Nodes represent different countries.
  Lines represent international phone calls.
  Unusually bright activity represents strange activity determined to be fraud.
22
Known Flags: Data Mining for Security
Example: compromised credit card accounts
  A distinct pattern usually involves a lost or stolen card being swiped at a gas station. No gas is purchased; the swipe is only used to check whether the account is active.
  Large jewelry and electronics purchases follow shortly afterward.
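The credit card pattern can be written as a simple rule over a transaction log. The sketch below is illustrative only: the record fields, the category names, and the thresholds (`probe_max`, `spend_min`, `window`) are all assumptions. A match is a red flag for review, not proof of fraud.

```python
from datetime import datetime, timedelta

HIGH_VALUE_CATEGORIES = {"jewelry", "electronics"}  # assumed category labels

def flag_probe_then_spend(transactions, probe_max=1.00, spend_min=500.00,
                          window=timedelta(hours=24)):
    """Flag accounts with a near-zero gas station swipe followed soon by a large purchase."""
    last_probe = {}  # account -> time of most recent probe-like swipe
    flagged = set()
    for tx in sorted(transactions, key=lambda t: t["time"]):
        account = tx["account"]
        if tx["category"] == "gas_station" and tx["amount"] <= probe_max:
            last_probe[account] = tx["time"]
        elif (tx["category"] in HIGH_VALUE_CATEGORIES
              and tx["amount"] >= spend_min
              and account in last_probe
              and tx["time"] - last_probe[account] <= window):
            flagged.add(account)
    return flagged

# Hypothetical transactions for three accounts.
day = datetime(2024, 1, 1)
transactions = [
    {"account": "A", "time": day.replace(hour=9), "category": "gas_station", "amount": 0.00},
    {"account": "A", "time": day.replace(hour=10), "category": "jewelry", "amount": 2400.00},
    {"account": "B", "time": day.replace(hour=9), "category": "gas_station", "amount": 45.00},
    {"account": "B", "time": day.replace(hour=11), "category": "electronics", "amount": 900.00},
    {"account": "C", "time": day.replace(hour=9), "category": "gas_station", "amount": 0.00},
    {"account": "C", "time": day + timedelta(days=3), "category": "electronics", "amount": 900.00},
]

flagged_accounts = flag_probe_then_spend(transactions)
```

Account B is not flagged because gas was actually purchased. Account C is not flagged because the large purchase falls outside the assumed 24-hour window.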
23
Known Flags: Data Mining for Security
Example: terrorist activity has not been countered as a result of data mining.
  It has no distinct pattern; a terrorist's profile has no clear definition.
  Large government (NSA) programs have attempted to use data mining for preventive measures without success.
  Total Information Awareness generated thousands of tips every month for over a year without a single lead into terrorist organizations.
24
Classification: Predicting the Category to Which a Particular Record Belongs
A major part of the classification process is the initial information gathering task.
The idea behind this collection of data is that normal and abnormal patterns of occurrences can be differentiated from one another, and algorithms can then be created to detect such patterns.
Once created, these algorithms could flag suspicious events as abnormal in real time and alert the appropriate person(s) to the potential intrusion(s).
25
Classification: Predicting the Category to Which a Particular Record Belongs
These algorithms can operate in several ways. Commonly they are implemented using decision trees or simply a set of predefined rules that the system data must satisfy.
26
Classification: Predicting the Category to Which a Particular Record Belongs
There are many different options available that employ decision trees. Some of these options include:
  Classification and Regression Trees (CART)
  Chi Square Automatic Interaction Detection (CHAID)
CART works by inducing two-way splits in a dataset, causing it to become segmented, whereas CHAID uses chi square tests to create splits in a dataset of variable size, also causing the data to become segmented.
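A two-way split of the kind CART induces can be illustrated by searching for the threshold on one numeric attribute that best separates the classes. The sketch below scores candidate splits with Gini impurity, a criterion commonly used with CART. The `failed_logins` attribute and its labels are hypothetical, and a full tree builder would repeat this search recursively over all attributes.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Find the threshold whose two-way split has the lowest weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in zip(values, labels) if value <= threshold]
        right = [label for value, label in zip(values, labels) if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical attribute: failed logins per connection record.
failed_logins = [0, 1, 0, 2, 8, 9, 12, 7]
labels = ["normal"] * 4 + ["intrusion"] * 4

threshold, impurity = best_binary_split(failed_logins, labels)
```

On this toy data the split at 4.5 failed logins separates the two classes completely, so the weighted impurity is 0.0. Real audit data would rarely split this cleanly.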
27
Classification: Predicting the Category to Which a Particular Record Belongs
Lee and Stolfo conducted several experiments pertaining to classification methods in their paper “Data Mining Approaches for Intrusion Detection.” The first of these experiments was on a set of sendmail system call data. This data consisted of sendmail traces, with the trace data consisting of two columns of integers.
The traces contained within the data were classified as either normal or abnormal, where the normal constituted “a trace of the sendmail daemon and a concatenation of several invocations of the sendmail program” and the abnormal was composed of the following attacks:
  Three traces of sunsendmailcp (sscp)
  Two traces of syslog-remote
  Two traces of syslog-local
  Two traces of decode
  One trace of sm5x
  One trace of sm565a
28
Classification: Predicting the Category to Which a Particular Record Belongs
After this data was obtained, system call sequences had to be derived and labeled as normal or abnormal so that they could be supplied to RIPPER, the rule learning program used to generate rules that predict whether a sequence is normal or abnormal. The intrusion detection system then followed a post-processing scheme to decide whether the current trace was an intrusion, using the RIPPER predictions.
The logic here is that when there is an intrusion on the system, most of the adjacent system call sequences will be abnormal.
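The pipeline described above (derive sequences, predict each one, then post-process the predictions) can be sketched as follows. This is not the authors' implementation. A set lookup against known normal sequences stands in for the RIPPER rules, the system call numbers are invented, and the window lengths and the 0.3 threshold are assumed parameters.

```python
def sliding_windows(items, length):
    """Return every run of `length` consecutive items as a tuple."""
    return [tuple(items[i:i + length]) for i in range(len(items) - length + 1)]

def predict_abnormal(sequences, known_normal):
    """Stand-in classifier: a sequence is abnormal if it never occurred in normal traces."""
    return [sequence not in known_normal for sequence in sequences]

def abnormal_region_fraction(predictions, region_length):
    """Fraction of regions in which more than half of the predictions are abnormal."""
    regions = sliding_windows(predictions, region_length)
    if not regions:
        return 0.0
    abnormal = sum(1 for region in regions if sum(region) > region_length // 2)
    return abnormal / len(regions)

def is_intrusion(calls, known_normal, seq_length=3, region_length=3, threshold=0.3):
    """Flag a trace when its share of abnormal regions exceeds the threshold."""
    predictions = predict_abnormal(sliding_windows(calls, seq_length), known_normal)
    return abnormal_region_fraction(predictions, region_length) > threshold

# Hypothetical system call numbers, not real sendmail traces.
normal_trace = [4, 2, 66, 66, 4, 138, 66, 5, 4, 2, 66, 66]
attack_trace = [4, 2, 66, 66, 4, 105, 104, 104, 106, 105, 104, 5]
known_normal = set(sliding_windows(normal_trace, 3))

attack_predictions = predict_abnormal(sliding_windows(attack_trace, 3), known_normal)
attack_fraction = abnormal_region_fraction(attack_predictions, 3)
```

Voting over regions rather than single sequences reflects the stated logic: an isolated abnormal prediction is treated as noise, while a run of adjacent abnormal sequences raises the trace's score.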
29
Classification: Predicting the Category to Which a Particular Record Belongs
From the results, it is important to notice that, generally speaking, intrusion traces create much larger abnormal regions than normal traces.
The results also show that the rules that were generated can be applied to intrusion traces not included in the training dataset.
This means that the rules for normal patterns can be used to detect anomalies.
The rules from experiments C and D, on the other hand, represent the abnormal sequence patterns. These rules work very well for detecting intrusion types seen in the training data, but perform worse than A and B at detecting intrusions in traces that were not seen in the training data. The implication is that the rule set for abnormal patterns performs well on predictable intrusions, such as misuse or other repeatable events for which good baseline data is available to generate the rules, but is unreliable at flagging new types of intrusions that may occur in the future.
30
Classification: Predicting the Category to Which a Particular Record Belongs
The next approach that Lee and Stolfo attempted involved creating an anomaly detection routine using only normal traces as training data. Experiments were carried out to determine the normal correlation between system calls, i.e., the nth or the middle system call in normal sequences of length n.
Lee and Stolfo declared that “improvement in accuracy can come from adding more features, rather than just system calls, into the models of program execution.” Items such as the directories and names of touched files could be used to generate stronger rules.
31
Classification: Predicting the Category to Which a Particular Record Belongs
Lee and Stolfo further examined network intrusion detection by monitoring network traffic directly, using a packet capturing program, tcpdump, to collect data.
In conclusion, when the data is not designed specifically for security purposes (as in this case), it cannot be used to build a detection model without a certain amount of modification (or pre-processing). Because of all the changes that must be made, one needs a lot of knowledge of the domain being tested, and as such the process is not easily automated. On the other hand, it is important to note again that adding extra measures can improve the accuracy of the classification model.