Knowledge Discovery in Cyber Vulnerability Databases
Sean Tierney
A project report
submitted in partial fulfillment of the requirements for the degree of
Master of Science

University of Washington
2005
Committee Chair: Isabelle Bichindaritz, Ph.D.
Committee Member: Don McLane
Program Authorized to Offer Degree: Computing and Software Systems
University of Washington Graduate School
Abstract
This project investigates the applicability of data mining repositories of the
various vulnerabilities and exposures to abuse, misuse, and other compromises of
computing hardware and software. The discovery, disclosure, correction, and avoidance
of vulnerabilities in software and computing systems are essential to maintaining the
availability, confidentiality, and integrity of these systems. While significant effort is
invested in the above steps as well as intrusion detection, limited effort has been applied
to the study and comprehension of the cumulative body of vulnerabilities. Nonetheless,
cyber vulnerability databases contain undiscovered patterns which are novel, interesting
and of value to researchers and industry professionals. One such repository is the ICAT
Metabase maintained by the National Institute of Standards and Technology (NIST).
The goal of this project is to mine a cyber vulnerability database in order to
discover co-occurring vulnerability attributes, to predict vulnerability classes, and to
determine patterns of vulnerability. This report starts by providing background in
vulnerability research, discussing relevant machine learning concepts, and reviewing
related literature. The specific data mining activities are then presented with an
evaluation and analysis. Data mining the vulnerability database makes several important
contributions. The results demonstrate that previously undiscovered, novel, and
interesting knowledge was detected. Association mining was used to learn which
vulnerability attributes frequently occur together. This includes strong associations for
each of the severity levels, which facilitates the attribution of specific features to given
severity levels. Classification rules were mined and applied to a dataset of new examples
to determine the accuracy of class prediction. The resulting data models were
successfully used to classify high severity vulnerabilities. A comparison was made
between the machine learned rules and those used to produce the dataset, which
demonstrates that the metrics originally used to determine severity level could be
represented by machine learning algorithms. Clustering, classification, and association
results were analyzed to enumerate the novel or interesting patterns discovered in the
dataset. This characterization of vulnerabilities enables the observation of new patterns
such as the correlation between medium severity vulnerabilities and the characteristics
of denial of service attacks. This project concludes by identifying current issues in data
mining vulnerability databases and by outlining some directions for future work.
Table of Contents

List of Figures
List of Tables
1 Introduction
  1.1 Thesis and Problem Statement
  1.2 Purpose
  1.3 Goals
  1.4 Method
  1.5 Audience
  1.6 Organization
2 Background
  2.1 Vulnerabilities, Exposures and Exploits
    2.1.1 Discovery and Disclosure
    2.1.2 Tracking
    2.1.3 Detection and Assessment
  2.2 Data Mining
    2.2.1 Preprocessing
    2.2.2 Classification
    2.2.3 Association
    2.2.4 Clustering
  2.3 Literature Review
    2.3.1 Data Mining in Vulnerability Databases
    2.3.2 Windows of Vulnerability: A Case Study Analysis
    2.3.3 Data Mining for Security
3 The Investigation
  3.1 Scope
  3.2 Resources and Technologies
    3.2.1 Training/Test Dataset
      3.2.1.1 Descriptive Summary
      3.2.1.2 Preprocessing
    3.2.2 Evaluation Dataset
    3.2.3 Tools
      3.2.3.1 WEKA
      3.2.3.2 Clementine
4 Results Analysis and Evaluation
  4.1 Frequent Feature Sets
  4.2 Class Prediction
  4.3 Comparison of ML Rules to NIST Metrics
    4.3.1 Association Rules
    4.3.2 Classification Rules
  4.4 Pattern Discovery
5 Educational Statement
6 Further Work
7 Conclusion
References
Appendix A: Data Dictionary
  A.1 Purpose and Scope
Appendix B: Association Rules
Appendix C: Classification Rules
Appendix D: Severity Web Graphs
Appendix E: Presentation Slides

List of Figures

Figure 2-1. Disclosure Life-cycle
Figure 2-2. Host Life-cycle
Figure 2-3. Rate of Intrusions
Figure 3-1. Severity Web Graph
Figure 3-2. Attack Range
Figure 3-3. Loss Type
Figure 3-4. Vulnerability Type
Figure 3-5. Exposed Component
Figure 3-6. Weka GUI

List of Tables

Table 1-1. Partial Table of CRISP-DM Tasks
Table 2-1. Leading Sources of Vulnerability Research and Disclosure
Table 2-2. Leading Vulnerability Information Repositories
Table 3-1. Dataset Attributes
Table 3-2. Distribution of Vulnerability Attributes
Table A-1. Dataset Sample
Table A-2. Dataset Sample in ARFF Format
1 Introduction This project investigates the applicability of data mining cyber security
vulnerabilities. Its goal is to mine a cyber vulnerability database in order to discover co-
occurring vulnerability attributes, to predict vulnerability classes, and to determine
patterns of vulnerability. Databases such as the ICAT Metabase maintained by the
National Institute of Standards and Technology (NIST) are repositories of the various
vulnerabilities and exposures to misuse, abuse, or similar compromises impacting
computing hardware and software. There has been significant research into both
hardware and software based cyber vulnerabilities. However, past research has focused
on discovery, detection, and assessment. Thus far there has been little study of the
overall body of cyber security vulnerability. This project analyzes the corpus of cyber
security vulnerabilities as represented in such databases.
1.1 Thesis and Problem Statement
Cyber vulnerability databases contain undiscovered patterns which are novel,
interesting and of value to researchers and industry professionals. Past as well as
ongoing research into cyber security vulnerabilities has tended not to consider the
aggregate body of vulnerabilities. The majority of academic work has been directed at
detecting the various attacks, intrusions, and similar activity that makes use of these
vulnerabilities. Industry professionals and Information Security enthusiasts have focused
either on discovering new vulnerabilities or detecting and assessing those that exist
within the enterprise or institution. By focusing on other topics, a potentially large body
of knowledge has been left untapped. This data needs to be explored to determine what
new knowledge can be discovered, what information can be applied to new areas or in
new ways to existing work.
1.2 Purpose
The predominant use of vulnerability databases has been to enumerate software
and hardware vulnerabilities for use in assessing specific systems (computers,
infrastructure, services, etc.). While those activities are not trivial, it is important to study
the corpus of known vulnerability and to identify novel and interesting patterns. The
objective of this research project is to mine a cyber vulnerability database in order to
discover co-occurring vulnerability attributes, to predict vulnerability classes, and to
determine patterns of vulnerability. The focus will be to answer some questions identified
as potentially of interest to computer security professionals.
1.3 Goals
In order to achieve the purpose and objectives of this project the following goals
were established. Research and understand the purpose and uses of vulnerability
databases. Develop a comprehensive understanding of the contents and structure of the
publicly available vulnerability databases. Utilize this knowledge to select and process
an appropriate dataset. Apply association, classification and clustering techniques in
order to answer four research questions.
Q1: Which vulnerability attributes frequently occur together?
Q2: How accurately can the class of new examples be predicted?
Q3: How well do the machine learning derived rules compare to those used to
produce the dataset?
Q4: What novel or interesting patterns can be discovered in the dataset?
1.4 Method
To the maximum extent possible this project follows the methodology proposed
by the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1]. Table 1-1
describes the specific phases employed, which include business understanding, data
understanding, data preparation, modeling, and evaluation with the deployment phase
omitted.
The business understanding phase was performed at the beginning of the project.
It included the research used to assess the situation such as literature review and study of
data mining principles and practices. Related work in data mining and vulnerability
assessment were examined to identify the data mining goals. The project plan was
developed as part of the initial project proposal.
Business Understanding:
  Determine Business Objectives: Background; Objectives; Success Criteria
  Assess Situation: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine DM Goals: DM Goals; DM Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding:
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report

Data Preparation:
  Select Data: Rationale for Inclusion/Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data; Dataset; Dataset Description

Modeling:
  Select Modeling Techniques: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Descriptions
  Assess Model: Model Assessment; Revised Parameter Settings

Evaluation:
  Evaluate Results: Assessment of DM Results; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Table 1-1. Partial Table of CRISP-DM Tasks [1]
Data understanding began with the selection of the dataset. The data was
explored and quality verified. The descriptive data summary in section 3.2.1.1 was
produced as a result. Section 3.2.1.2 covers the data preparation phase in the discussion
of the preprocessing activities, which consisted of cleaning, integration, reduction,
attribute selection and construction.
The modeling and evaluation phases are discussed in section 4. These include the
selection of the modeling techniques, model building, analysis and evaluation of the
results.
1.5 Audience
The target audience for this report comprises those in the computer science and
information technology fields, whether in academia or industry. The research may be of
specific benefit to individuals working with attacks, compromises, exposures, or
vulnerabilities in computing and software systems. Background knowledge is provided
in both the cyber vulnerabilities and the data mining techniques used. However, the
background information is not exhaustive, and readers are likely to benefit from prior
knowledge or experience with this topic.
1.6 Organization
This report is organized in the following manner. Section 2 provides background
on cyber vulnerabilities, discussing the discovery, reporting, tracking and detection
aspects of the field as well as providing a review of applicable literature. Section 3
describes the investigation including the scope, merit, dataset and tools employed.
Section 4 discusses and analyzes the results. The educational statement is provided in
section 5. Future work and suggestions for additional research is covered in section 6.
Section 7 contains the concluding remarks. A detailed data dictionary and other pertinent
materials are provided in the appendices.
2 Background
2.1 Vulnerabilities, Exposures and Exploits
There is some debate regarding the distinction between vulnerabilities and
exposures. In many of the references and literature reviewed for this research,
vulnerabilities are often viewed as flaws which may be leveraged to violate the security
policy or abuse the intended purpose of the system. In contrast, an exposure is a
component which is performing as designed. However, it may be used in an unintended
or undesirable fashion, which also results in a security violation or abuse. The Mitre
Corporation, developers of the industry standard Common Vulnerability and Exposures
(CVE) Dictionary [2], offer this view. Vulnerabilities in computer hardware and
software are generally considered to be facts about the system seen as problems “under
any commonly used security policies”. Exposures are only considered to be problems
under some security policies. Taking yet another approach, NIST simply refers to both as
vulnerabilities, although they also use the term “exposed component” to specify what
would be exploited in an attack.
Consider the UNIX services chargen and echo. Chargen, when contacted via
TCP, generates a continuous stream of characters until the client disconnects. Echo
simply echoes back whatever is sent to it. Both of these may be very useful for
diagnostics and measurement. However, these two services may be exploited in a ping
pong attack, which bounces the stream from the chargen service between hosts causing a
degradation or denial of services for the hosts and the networks involved. In this
example, chargen and echo are performing as designed; however, their mere presence
opens the system to attack. Whether these services represent vulnerabilities or exposures
is academic; in either case, such abuse should be considered a violation of any reasonable security policy.
Thus, in the context of this research, vulnerability and exposure may be used
interchangeably as a reference to any entry in the dataset.
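The chargen/echo feedback loop described above can be sketched in a toy simulation. This is purely illustrative: the functions below are in-memory stand-ins for the two services, not real network code, and the fixed seven-character reply is a simplification of chargen's continuous stream.

```python
# Toy simulation of the chargen/echo "ping pong" loop: two services, each
# behaving exactly as designed, together generate unbounded traffic.

def chargen(_message):
    """chargen ignores its input and always emits more characters."""
    return "ABCDEFG"  # stand-in for the continuous character stream

def echo(message):
    """echo returns exactly what it was sent."""
    return message

def ping_pong(rounds):
    """Bounce traffic between the two services and count bytes generated."""
    traffic = 0
    payload = "start"
    for _ in range(rounds):
        payload = chargen(payload)   # chargen replies regardless of input
        traffic += len(payload)
        payload = echo(payload)      # echo reflects it straight back
        traffic += len(payload)
    return traffic
```

The point of the sketch is that neither function contains a flaw; the denial of service emerges from their interaction, which is what makes the vulnerability-versus-exposure distinction academic here.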
2.1.1 Discovery and Disclosure
In the early days of network computing, vulnerabilities were largely discovered
by skilled systems administrators, crackers, or by accident. However, that has largely
given way to what is now referred to as the security researcher. This term applies to
those working for commercial or public interest as well as more nefarious minded
individuals. It may include hackers, crackers, developers and programmers. These
researchers often employ techniques such as code auditing, input fuzzing, fault
monitoring, execution and runtime tracing. A non-exhaustive list of organizations
engaged in either research or disclosure is provided in table 2-1.

Source                      Audience     URL
CERT                        public       www.cert.org
eEye                        commercial   www.eeye.com
F-Secure                    commercial   www.f-secure.com
iDefense                    commercial   www.idefense.com
Internet Security Systems   commercial   www.iss.com
Internet Storm Center       public       www.isc.sans.com
Neohapsis                   commercial   www.neohapsis.com
Qualys                      commercial   www.qualys.com
Table 2-1. Leading Sources of Vulnerability Research and Disclosure
Once discovered, and regardless of the aims of the researcher, a vulnerability
eventually works its way to public disclosure. If the vulnerability is handled by a
reputable researcher or firm, the vendor will typically be notified prior to further
disclosure. However, in the case of the malicious hacker or with Vulnerability Sharing
Clubs (VSC), knowledge of the exposure spreads until there is either an incident or an
issue that results in acknowledgment. The disclosure life-cycle spans from discovery to
full, public disclosure (see figure 2-1).
Figure 2-1. Disclosure Life-cycle
2.1.2 Tracking
While the organizations mentioned above concentrate on discovery and
disclosure, there are others, such as those in table 2-2, which focus on keeping track of
known vulnerabilities. The CVE product from Mitre forms the hub of vulnerability
tracking by providing the commonly used definition, description, and reference or
identification number for vulnerabilities. As described on the CVE website, newly
disclosed vulnerabilities are assigned a candidate number indicating either review or
acceptance status for the CVE dictionary. The other organizations form the spokes with
each providing a repository of vulnerabilities. While the repositories may differ in the
range and depth of detail related to specific vulnerabilities, they reference the CVE ID
numbers.

Source         Audience     URL
CERIAS         academic     www.cerias.purdue.edu
MITRE          public       www.mitre.org
NIST           public       www.nist.gov
OSVDB          public       www.osvdb.org
Secunia        commercial   www.secunia.com
SPI Dynamics   commercial   www.spidynamics.com
Table 2-2. Leading vulnerability information repositories
2.1.3 Detection and Assessment
Detection and assessment are arguably the most critical uses of vulnerability data.
However, they are undertaken both by individuals desiring to secure their systems against
intrusion and those intent on exploiting vulnerabilities. The information contained in the
disclosures and repositories facilitates the development of signatures or profiles used by
the various tools to detect exploits to which specific systems are vulnerable. There
are a variety of open source and commercially available vulnerability assessment
scanners such as Nessus and SAINT which may be used by both groups. While each tool
or service provider maintains its own dataset of vulnerabilities, all work to ensure
compatibility with CVE and other vulnerability tracking organizations.
2.2 Data Mining
Data mining, as described by Hand et al. [3], is the analysis of datasets to learn
unsuspected relationships as well as to summarize the data in a novel, understandable, or
useful manner. It is often performed in the context of knowledge discovery in databases
(KDD) and involves data selection, preprocessing, identification of data mining goals,
selection of data mining tasks, and assessment of results. The data mining tasks typically
involve the application of data mining algorithms for classification, association, or
clustering. These topics are discussed further in the following sections.
2.2.1 Preprocessing
Data mining is typically applied to data which has been collected for a purpose
other than mining. As such, there are often inconsistencies or incompatibilities in the
dataset which require addressing prior to beginning the data mining process. Han and
Kamber [4] discuss data preprocessing at length covering cleaning, transformation and
reduction. Data cleaning addresses missing values, noise, and inconsistencies. Missing
values may be handled by filling in the value or ignoring the case. Noisy data is
smoothed using a variety of techniques such as binning values into groups, clustering,
and fitting data to a function. Inconsistent data may be addressed by manual or
automated routines which reference external sources. Data transformation is concerned
with converting or aggregating the data into formats useful for the data mining tasks.
This may involve combining attributes, normalization, or constructing new attributes.
Data reduction is the processing of the dataset to achieve a smaller representation of the
data while maintaining the integrity of the original data. Applicable techniques include
data cube aggregation, dimension reduction where irrelevant attributes are removed, and
data compression.
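The core cleaning and transformation steps just described can be sketched in a few lines of Python. This is a minimal, hypothetical illustration on a hand-made list of values, not the preprocessing pipeline actually applied to the ICAT data in this project:

```python
# Illustrative preprocessing sketch: fill missing values with the mean
# (cleaning), rescale to [0, 1] (transformation), and smooth by binning.

def fill_missing(values, sentinel=None):
    """Cleaning: replace missing entries with the mean of observed values."""
    observed = [v for v in values if v is not sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v is sentinel else v for v in values]

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Transformation: rescale values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def equal_width_bins(values, n_bins):
    """Smoothing by binning: map each value to a bin index 0..n_bins-1."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

scores = [4.0, None, 10.0, 7.0, None, 1.0]
cleaned = fill_missing(scores)           # missing values become the mean, 5.5
normalized = min_max_normalize(cleaned)  # rescaled to [0, 1]
binned = equal_width_bins(cleaned, 3)    # three equal-width bins
```

Dimension reduction and data cube aggregation are omitted here, since they operate on whole attributes and datasets rather than single columns.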
2.2.2 Classification
Classification seeks to both describe the current dataset and predict group
membership of new cases [5]. According to Han and Kamber, it establishes a concise
description of each class which distinguishes it from others. This is accomplished by
examining training cases and building predictive patterns such as classification rules,
decision tables, or trees. Classification rules take the IF-THEN form. An example from
Witten and Frank [11] is the following:
IF outlook = sunny and humidity = high THEN play = no
Classification is also performed by decision tree induction. This is one of the simplest
and most successful forms of learning algorithm. It accepts input in the form of attributes
and returns a decision, which is the predicted output for the input given [6]. Classifier
accuracy can be estimated through techniques such as the holdout method, random
subsampling, and cross-validation.
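Rules of the form above can be applied programmatically. The following sketch is a hypothetical, pure-Python illustration built on the Witten and Frank weather example, not the WEKA or Clementine models used in this project; the rule list and held-out cases are invented for the example:

```python
# Hypothetical rule-based classifier over the "play / don't play" weather
# data, with a holdout-style accuracy estimate on labeled test cases.

RULES = [
    # (conditions, predicted class): checked in order, first match wins
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({}, "yes"),  # default rule when nothing else matches
]

def classify(case, rules=RULES):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(case.get(attr) == val for attr, val in conditions.items()):
            return label

def holdout_accuracy(cases, rules=RULES):
    """Holdout estimate: fraction of held-out cases predicted correctly."""
    correct = sum(1 for case, label in cases if classify(case, rules) == label)
    return correct / len(cases)

test_cases = [
    ({"outlook": "sunny", "humidity": "high", "windy": "false"}, "no"),
    ({"outlook": "overcast", "humidity": "high", "windy": "true"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal", "windy": "true"}, "no"),
    ({"outlook": "sunny", "humidity": "normal", "windy": "false"}, "yes"),
]
```

Cross-validation would repeat this holdout measurement over several train/test splits and average the results.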
2.2.3 Association
Association rule mining aims to identify groups of items that frequently occur
together as the result of transactions [7]. Examples may include items consumers
purchase together or the sequential series of web pages viewed by a user. As illustrated
in Han and Kamber [4], the output knowledge is represented as a set of association rules in
the form A ⇒ B (read as A implies B), such as the following association stating that purchasers
of a computer also buy financial management software at the same time:

computer ⇒ financial_management_software
buys(X, “computer”) ⇒ buys(X, “financial_management_software”)
Three measures of interestingness of association rules are support, confidence, and lift.
Support is the percentage of instances in the dataset for which the rule holds. Confidence
indicates, among the cases containing A, the percentage that also contain B. Lift
is the ratio of the probability of A and B occurring together to the probability expected
if A and B occurred independently.
Support[A ⇒ B] = (count of cases with both A and B) / (total count of cases)

Confidence[A ⇒ B] = (count of cases with both A and B) / (count of cases with A)

Lift[A ⇒ B] = Confidence[A ⇒ B] / Support[B]
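These three measures can be computed directly from a list of transactions. The sketch below uses made-up shopping transactions echoing the computer/software example, not the vulnerability data mined in this project:

```python
# Support, confidence, and lift computed over a small transaction list.
# Item sets are Python sets; a rule A => B is a pair of sets (a, b).

def support(transactions, items):
    """Fraction of transactions containing all of `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, a, b):
    """Of the transactions containing A, the fraction also containing B."""
    return support(transactions, a | b) / support(transactions, a)

def lift(transactions, a, b):
    """Confidence of A => B relative to the baseline support of B."""
    return confidence(transactions, a, b) / support(transactions, b)

transactions = [
    {"computer", "financial_software"},
    {"computer", "financial_software", "printer"},
    {"computer"},
    {"printer"},
]
a, b = {"computer"}, {"financial_software"}
# Here support(A => B) = 2/4, confidence = 2/3, and lift = (2/3) / (2/4),
# so lift > 1 indicates the two items co-occur more often than chance.
```

Association miners such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds rather than evaluating one rule at a time.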
2.2.4 Clustering
Han and Kamber describe a cluster as a collection of objects similar to others in
the same cluster and dissimilar to objects outside the cluster. The process of grouping
these objects is called clustering. Clustering methods include partitioning, hierarchical,
and density. Given a predetermined number of partitions to create, the partitioning
methods classify the cases into that number of partitions. Next, they iteratively relocate
cases from one partition to another in an attempt to improve the partitions. Hierarchical
methods can be separated into agglomerative and divisive. In an agglomerative approach,
objects start out in separate groups and are successively merged until they form the top
level of the hierarchy.
With the divisive methods, objects start out in one group and are successively split up
until the termination condition is reached. Density-based methods, as the name implies,
form clusters based on a minimum density in the neighborhood of objects rather than the
distance between them. Unlike classification and association, clustering is difficult to
evaluate. Ultimately the evaluation comes in the form of an interestingness measure
assigned by the user or domain expert.
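The iterative-relocation idea behind partitioning methods can be sketched as a small one-dimensional k-means. This is an illustrative toy, with invented points and a naive initialization, and is not the clustering tooling used in this project:

```python
# One-dimensional k-means: a partitioning method that assigns each point
# to its nearest centroid, recomputes the centroids, and repeats.

def kmeans_1d(points, k, iterations=20):
    """Partition `points` into k clusters by iterative relocation."""
    centroids = points[:k]  # naive initialization from the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Relocate each centroid to the mean of its cluster (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, k=2)
# On this data the procedure settles into two groups, one around the low
# values and one around the high values.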
2.3 Literature Review
Several articles relevant to this research are presented here. Specifically,
they cover data mining in security, vulnerability analysis and the single available article
on data mining vulnerability databases.
2.3.1 Data Mining in Vulnerability Databases
‘Data Mining in Vulnerability Databases’ [8] is a project update covering work
conducted at the Darmstadt University of Technology in 2000. Schumacher et al. [8]
discuss their research on the Testbed for Reliable, Ubiquitous, Secure Transactional,
Event-driven and Distributed Systems (TRUSTED). The research appears to be in
several areas centered on the development of a vulnerability database (VDB).
VDB providers and collections of vulnerability information are available from
many sources. Computer Emergency Response Teams (CERTs) have been established in
various countries including the US, Germany, and Australia. They publish warnings for
highly dangerous vulnerabilities or those impacting large groups of people. Other
sources include mailing lists, newsletters, and news groups available from commercial,
government, and private organizations. The article references FIRST, ISS X-Force,
CIAC, NT Bugtraq, Phrack, Rootshell, Security Focus, INFILSEC, and others. Many of
these sources are no longer available, while others have been combined, and new sources
have emerged.
An underlying aim of research under TRUSTED is the security and confidence in
IT systems. It is not only concerned with these systems remaining free of software bugs,
but also that they remain under the control and operation of the systems' owners. To this end
much of the work is centered on assessment, prognosis, and avoidance. Work on
assessment includes the comparison of comparable compromises and recommendations
of counter-measures. Work in prognosis covers the likelihood a given vulnerability will
occur, while avoidance seeks to steer clear of known flaws in future software. The
authors observe that background information as well as specific details will be required to
complete their work. Some of this can be accomplished by experts; however, there is far
too much data for human analysis alone.
With regard to data mining in VDBs, they propose that it can suit two purposes:
discovery of new patterns and prediction of class outcome of new cases. However,
before selecting mining techniques, there are four challenges faced by this project:
sufficient training instances, known classifications, concept of “free from vulnerabilities”,
and description of vulnerabilities. The lack of a widely accepted and widely
used ontology also appears to pose a significant obstacle for the identification and
classification of vulnerabilities.
This article [8] also features a lengthy discussion of the operating model to be
used for maintaining VDBs. Proposed organizational forms considered are centralized,
federated, open source, and Balkanized. In the centralized model there would be
precisely one VDB into which all disclosures would be contributed. The federated model
would be a conglomeration of separately owned datasets horizontally partitioned by
subject. The open-source model is similar to that used for software development and
allows for all users to access or contribute to all data. The Balkanized model was in use
at the time the article was written and is predominately the case today. In this model
there is no coordination or control of the various VDBs.
Additional topics covered in this article include the population of, access to, and confidence
in the database, as well as the protection of contributors. While the title and abstract indicate
that this article covers data mining of vulnerability databases, there is no mention of the
actual data mining activities or results. It is in essence a very high-level discussion of the
authors' work. No other mention of this work or TRUSTED could be located.
Nonetheless, this article is included in the literature review for completeness.
2.3.2 Windows of Vulnerability: A Case Study Analysis
This article was featured in IEEE Computer, December 2000 [9]. The authors
claim that well after fixes are provided, systems often remain vulnerable. They present
several case studies and propose a life-cycle model.
While vulnerabilities transition through distinct states ranging from the creation of
the flaw or exposure, through discovery and exploit, and on to patching and removal,
systems also transition through distinct stages relative to vulnerabilities. These systems'
states often oscillate between hardened and vulnerable, with periodic
excursions into compromised, as expressed in the figure below. In the article hardened is
described as having all security related patches and corrections applied. Hardening is a
continuous process. A system enters the vulnerable state when security related
corrections have not been made. Compromise occurs when the system is no longer
sufficiently hardened against current threats and a vulnerability is exploited.
Figure 2-2. Host Life-cycle [9]
The life-cycle model proposed by Arbaugh and McHugh [9] is more detailed than
what is typically used in analysis, capturing all possible states through which a vulnerability
can pass during its lifetime. These states include: birth, discovery, disclosure,
correction, publicity, scripting, and death.
The authors select their study data from the CERT/CC database, which covers the
period from 1996 through 1999. While they observe that the CERT/CC is perhaps the best
possible source for data of this type, there are some issues. First, they point out that the
records are self-selecting, in that only a subset of attacks will be reported. This is due to
many factors including reluctance by an organization to divulge this type of information.
Second, exploit of vulnerabilities is influenced by human factors such as interestingness.
A given attack may be very popular for a time, and then pass into obscurity, not because
the vulnerability has been removed, rather because it is no longer interesting.
The authors derive three case studies from the CERT data based on vulnerabilities
for the Phf Common Gateway Interface (CGI), Internet Message Access Protocol
(IMAP), and Berkeley Internet Name Domain (BIND) service. Phf CGI utilized server-side
scripting to provide web server functionality based on the UNIX phonebook command
(ph) concatenated with user input. In this incident an implementation error was exploited
allowing attackers to execute arbitrary code. The vulnerability was disclosed in February
1996, followed quickly by instructions on how to correct the issues. The first scripted
exploit was published in June 1996. The majority of the activity occurred after the script
was published, attempting a simpler, less effective attack. This supported the authors'
hypothesis that scripting significantly increases the exploitation rate. The IMAP incident
resulted from an error in the source code which allowed the use of a long username to
cause a buffer overflow. The vulnerability was posted in March 1997, along with fixes
for the source code. The first known scripting of the vulnerability was in May 1997. An
additional flaw was reported, exploited and scripted a year later. However, the CERT
announcements combined the two. This incident illustrated how much network scanning
and probing are used to identify vulnerable systems. The BIND incident also involved a
flaw resulting in a buffer overflow. The flaw was announced in April 1998, and automated exploits were
available two months later. The authors note that due to the critical role of BIND in the
internet infrastructure, they would have expected much closer management and
mitigation of the vulnerability.
Figure 2-3. Rate of Intrusions [9]
The authors mention the ongoing debate over whether to disclose vulnerabilities.
They cite their research as evidence that automation of the exploit, not the disclosure
itself, is the key to large-scale use. Additionally, while patches are typically available
shortly after disclosure, large numbers of systems remain vulnerable, stating “deployment
of corrections is woefully inadequate … many systems remain vulnerable to security
flaws months or even years after corrections become available” [9].
2.3.3 Data Mining for Security
The article Data Mining for Security appeared in NEC Journal of Advanced
Technology, Winter 2005 [10]. The authors describe several new applications to perform
different aspects of intrusion detection utilizing data mining algorithms. The three
applications are SmartSifter, which is used for outlier detection; ChangeFinder, for
change-point detection; and an anomaly detection engine called AccessTracker.
The authors claim that the outlier detection engine, SmartSifter, is adaptive,
efficient, and highly accurate. It works by learning a statistical model from past
examples and applying it to the current data. The anomaly score calculated for each datum
will be high for outliers. This permits intrusion detection efforts to focus more
efficiently. A histogram density is used for calculating the statistical model of discrete
variables; while a Gaussian mixture model is used for continuous variables. The
parameters of the model are learned using an unsupervised, on-line, discounting learning
algorithm. When tested against the KDDCup99 dataset, they found that it outperformed
the Burge & Shawe-Taylor method. Additionally, they developed outlier-filters which
significantly improve the performance of SmartSifter and can be used to preprocess data
for outlier detection.
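The histogram-density idea can be illustrated with a small sketch. This is not the actual SmartSifter algorithm; the discounting rate, the smoothing constant, and the toy data stream are all invented for illustration. An on-line, discounting histogram scores each datum by its Shannon information under the model learned so far, so rare values stand out:

```python
import math
from collections import defaultdict

class DiscountingHistogram:
    """Toy sketch of an on-line, discounting histogram density estimator.

    Each new datum is scored by its Shannon information (-log p) under the
    model learned so far, then the old counts are discounted and the new
    observation is added.  High scores flag statistical outliers.
    """

    def __init__(self, discount=0.05, smoothing=0.5):
        self.r = discount           # discounting (forgetting) rate
        self.smoothing = smoothing  # Laplace-style smoothing for unseen values
        self.counts = defaultdict(float)
        self.total = 0.0

    def score_and_update(self, value):
        # probability of the value under the current (smoothed) histogram
        vocab = len(self.counts) + 1
        p = (self.counts[value] + self.smoothing) / (self.total + self.smoothing * vocab)
        score = -math.log2(p)       # anomaly score: Shannon information in bits
        # discount old evidence, then add the new observation
        for k in self.counts:
            self.counts[k] *= (1.0 - self.r)
        self.total *= (1.0 - self.r)
        self.counts[value] += 1.0
        self.total += 1.0
        return score

model = DiscountingHistogram()
stream = ["http"] * 50 + ["telnet"]     # mostly routine values, one oddity
scores = [model.score_and_update(v) for v in stream]
print(scores[-1] > max(scores[1:-1]))   # the rare value scores highest
```

The discounting step is what makes the model adaptive: old observations fade, so the density tracks the recent behavior of the stream.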
The ChangeFinder application is aimed at detecting worms or viruses by
examining the time series data in system logs. The authors report they have observed
that viruses and worms are frequently characterized by bursts of activity. A new
outbreak often causes a significant increase in access log entries. Therefore, a goal of
ChangeFinder is to detect the earliest point in time that a virus or worm emerges. During
the first stage of learning, the on-line discounting algorithm is used to build an
auto-regression model and calculate an anomaly score based on Shannon information theory.
Next, smoothing is performed using moving averages of the anomaly score. This is
followed by the building of a new auto-regression model based on the smoothed data.
AccessTracker is the anomalous behavior detection engine. It is used to detect
intrusion patterns such as those of a Trojan horse or a series of UNIX commands issued
by an intruder. It works by breaking time series data into sessions and building a
statistical model from them. The statistical model is based on a mixture of hidden
Markov models, such as those representing a user’s UNIX command history. Again, the
on-line discounting algorithm is used to learn the model followed by calculation of an
anomaly score for each session. AccessTracker dynamically tracks the changes in the
mixture of components in the model. Thus, an increase in mixture components
signals the emergence of a new pattern, while a decrease signals that a pattern
has disappeared.
The three applications differ to some degree in the data they accept, their processing, and
their intended use. At the same time, they are very similar in their functionality. SmartSifter
cannot process time series data, whereas ChangeFinder can. Both are concerned with
detecting local anomalies by measuring how anomalous a specific data point is.
AccessTracker, however, is focused on the detection of patterns of anomalous behavior.
3 The Investigation
3.1 Scope
This research is a master’s project in computing and software systems. The
research was conducted between Spring 2005 and Winter 2006. While there are a
number of cyber vulnerability datasets, this project has remained focused on a single
version of one dataset available at the outset of the project. This project utilizes
algorithms readily available in the tools specified in section 3.2.3. Both of these tools are
extensible; however, no modifications or enhancements to the algorithms have been made
beyond adjustment of parameters through the program interface, due to the richness of the
methods available. The data models produced were analyzed for interestingness and
novelty. Analysis included evaluation metrics available in the tools, my own subjective
evaluation, and my advisor’s assessment. The project scope is limited to answering the
research questions provided in section 1.3.
3.2 Resources and Technologies
3.2.1 Training/Test Dataset
Table 2-2 lists six organizations which maintain vulnerability datasets potentially
appropriate for this project. There were varying degrees of availability and utility
associated with each of them. The ICAT Metabase was selected for several reasons. The
most significant factors were its close similarity to the CVE Dictionary described in
section 2.1 and its maturity. This dataset has been available in the format used here since
1998. An additional influencing factor was ease of access to the data; it can be
downloaded from the NIST website and has no licensing restrictions on the data. It
consists of a single 15MB database file “ICAT Master.mdb” in Microsoft Access format.
It was posted to the NIST website on March 1, 2005 and downloaded March 7, 2005.
3.2.1.1 Descriptive Summary
From the original 87 attributes, 30 were selected for their value in identifying, evaluating,
or characterizing vulnerabilities by case, date, severity, range, loss, type, requirements,
and components. Attributes are categorized into case identification comprised of CVE
ID and publication date, Severity, Attack Requirements (AR), Loss Type (LT),
Vulnerability Type (VT), and Exposed Component (EC). Relevant attributes are listed in
Table 3-1 and descriptive statistics are provided in Table 3-2.
Field Name | Data Type | Description
CVE_ID | Nominal | The standard CVE name for the vulnerability.
Publish_Date | Nominal | The date on which the vulnerability is published within ICAT.
Severity | Ordinal | The severity assigned to the vulnerability.
AR_Launch_remotely | Numerical | Attack requires no previous access to the system; launched remotely.
AR_Launch_locally | Numerical | Attack requires launch on the local system; must have some previous access.
AR_Target_access_attacker | Numerical | Victim must access the attacker's resource for the attack to take place.
LT_Security_protection | Numerical | Loss Type: security protection, escalation of privileges.
LT_Obtain_all_priv | Numerical | Loss Type: security protection, administrator access gained.
LT_Obtain_some_priv | Numerical | Loss Type: security protection, user-level privilege gained.
LT_Confidentiality | Numerical | Loss Type: loss of confidentiality, theft of information.
LT_Integrity | Numerical | Loss Type: loss of integrity; attacker can change information residing on or passing through a system.
LT_Availability | Numerical | Loss Type: loss of availability.
LT_Sec_Prot_Other | Numerical | Loss Type: loss of security protection where some non-user privilege was gained by the attacker.
VT_Boundary_condition_error | Numerical | Vulnerability Type: boundary condition error.
VT_Buffer_overflow | Numerical | Vulnerability Type: buffer overflow.
VT_Access_validation_error | Numerical | Vulnerability Type: access validation error.
VT_Exceptional_condition_error | Numerical | Vulnerability Type: exceptional condition handling error.
VT_Environment_error | Numerical | Vulnerability Type: environmental error.
VT_Configuration_error | Numerical | Vulnerability Type: configuration error.
VT_Race_condition | Numerical | Vulnerability Type: race condition.
VT_Other_vulnerability_type | Numerical | Vulnerability Type: not explicitly listed as an option.
VT_Design_Error | Numerical | Vulnerability Type: design error.
EC_Operating_system | Numerical | Exposed Component: operating system.
EC_Network_protocol_stack | Numerical | Exposed Component: network protocol stack of an operating system.
EC_Non_server_application | Numerical | Exposed Component: non-server application.
EC_Server_application | Numerical | Exposed Component: server application.
EC_Hardware | Numerical | Exposed Component: hardware.
EC_Communication_protocol | Numerical | Exposed Component: communication protocol.
EC_Encryption_module | Numerical | Exposed Component: encryption module.
EC_Other | Numerical | Exposed Component: some component not explicitly listed.
EC_Specific_component | Nominal | Exposed Component: lists the specific program.
Table 3-1 Dataset Attributes
Attribute | Mean | Std. Error of Mean | Variance | Std. Deviation
AR_Launch_remotely | 0.71 | 0.005 | 0.205 | 0.452
AR_Launch_locally | 0.33 | 0.005 | 0.221 | 0.470
AR_Target_access_attacker | 0.02 | 0.002 | 0.018 | 0.135
LT_Security_protection | 0.59 | 0.006 | 0.242 | 0.492
LT_Obtain_all_priv | 0.24 | 0.005 | 0.184 | 0.429
LT_Obtain_some_priv | 0.13 | 0.004 | 0.114 | 0.338
LT_Confidentiality | 0.20 | 0.005 | 0.159 | 0.399
LT_Integrity | 0.13 | 0.004 | 0.113 | 0.336
LT_Availability | 0.27 | 0.005 | 0.195 | 0.442
LT_Sec_Prot_Other | 0.23 | 0.005 | 0.180 | 0.424
VT_Input_validation_error | 0.47 | 0.006 | 0.249 | 0.499
VT_Boundary_condition_error | 0.05 | 0.003 | 0.048 | 0.220
VT_Buffer_overflow | 0.22 | 0.005 | 0.170 | 0.412
VT_Access_validation_error | 0.11 | 0.004 | 0.095 | 0.308
VT_Exceptional_condition_error | 0.10 | 0.004 | 0.093 | 0.306
VT_Environment_error | 0.02 | 0.001 | 0.016 | 0.125
VT_Configuration_error | 0.07 | 0.003 | 0.062 | 0.250
VT_Race_condition | 0.02 | 0.002 | 0.021 | 0.147
VT_Other_vulnerability_type | 0.02 | 0.001 | 0.015 | 0.124
VT_Design_Error | 0.25 | 0.005 | 0.189 | 0.435
EC_Operating_system | 0.19 | 0.005 | 0.152 | 0.390
EC_Network_protocol_stack | 0.01 | 0.001 | 0.011 | 0.105
EC_Non_server_application | 0.28 | 0.005 | 0.203 | 0.451
EC_Server_application | 0.49 | 0.006 | 0.250 | 0.500
EC_Hardware | 0.03 | 0.002 | 0.026 | 0.160
EC_Communication_protocol | 0.00 | 0.000 | 0.001 | 0.026
EC_Encryption_module | 0.01 | 0.001 | 0.006 | 0.080
EC_Other | 0.01 | 0.001 | 0.012 | 0.110
CertAdv | 0.04 | 0.002 | 0.039 | 0.196
High_Sev | 0.50 | 0.006 | 0.250 | 0.500
Medium_Sev | 0.44 | 0.006 | 0.246 | 0.496
Low_Sev | 0.07 | 0.003 | 0.064 | 0.254
Table 3-2 Descriptive Statistics
Severity is a key attribute since it significantly influences how organizations react
to a specific vulnerability. The web graph in figure 3-1 illustrates the occurrence of other
attributes in relation to the severity levels. Additional web graphs can be found in
appendix D.
Figure 3-1 Severity Web Graph
It is important to note that a vulnerability may have multiple features of the same
type, such as AR_Launch_locally and AR_Launch_remotely, or LT_Integrity and
LT_Availability. Table 3-3 indicates the distribution of vulnerability features.
Table 3-3 – Distribution of Vulnerability Attributes
AR attributes would more accurately be termed range. These fields specify
whether the attack can be initiated remotely or requires execution from the vulnerable
system (see figure 3-2).
[Figure: bar chart of record counts for each attack range: launch remotely, launch locally, and target must access attacker.]
Figure 3-2 Attack Range
There are seven LT attributes, which are listed in figure 3-3. These fields convey
the scope of loss. Loss of security protection indicates whether admin, user, or non-user
privileges are obtained on the vulnerable system. Loss of availability is equivalent to
some form of denial of service and can run the gamut from a simple reboot to
catastrophic system failure. Loss of confidentiality means that data was accessed or
stolen during the attack. Loss of integrity means that the system has been compromised
and users can no longer trust its security and accuracy.
[Figure: bar chart of record counts for each loss type.]
Figure 3-3 Loss Type
VT attributes describe the manner in which the vulnerability manifests, which
includes boundary conditions, buffer overflow, exception handling, environment,
configuration, and race condition (see figure 3-4).
[Figure: bar chart of record counts for each vulnerability type.]
Figure 3-4 Vulnerability Type
EC attributes include nine fields and cover the particular part of the system that is
vulnerable, such as hardware, operating system, application, communication or network
protocol, and encryption module (see figure 3-5).
[Figure: bar chart of record counts for each exposed component.]
Figure 3-5 Exposed Component
3.2.1.2 Preprocessing
It was necessary to perform several preprocessing tasks on the dataset prior to
beginning the data exploration and subsequent data mining. These included data
cleaning, reduction and transformation. The majority of this activity was performed
using Microsoft Access and Excel.
The data cleaning efforts involved reconstructing missing values where
possible. This was addressed by ensuring there were entries for other attributes of that
type and then setting the missing value to false. As an example, in instances missing the
value for LT_Confidentiality, but with a true value in other fields of the Loss Type
category, the missing value was set to false. There were 39 records marked
**REJECT**; these were deleted from the dataset. Records where the missing value
could not be replaced were removed from the dataset.
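As a sketch, the cleaning rule described above might be expressed as follows. The loss-type field names follow Table 3-1, but the record layout and the placement of the **REJECT** marker in a description field are assumptions:

```python
LOSS_FIELDS = ["LT_Security_protection", "LT_Confidentiality",
               "LT_Integrity", "LT_Availability"]

def clean_record(rec):
    """Return a cleaned copy of a record, or None if it must be dropped.

    A missing loss-type value is reconstructed as False only when at least
    one sibling field in the same category carries a real value; otherwise
    the record is removed, mirroring the cleaning rule described above.
    """
    # assumption: rejected entries carry a **REJECT** marker in a text field
    if "**REJECT**" in rec.get("description", ""):
        return None
    present = [f for f in LOSS_FIELDS if rec.get(f) is not None]
    if not present:
        return None                  # nothing to reconstruct from: drop record
    cleaned = dict(rec)
    for f in LOSS_FIELDS:
        if cleaned.get(f) is None:
            cleaned[f] = False       # reconstruct the missing value as false
    return cleaned
```

For example, a record with LT_Integrity set but LT_Confidentiality missing would come back with LT_Confidentiality reconstructed as false, while a record with no loss-type entries at all would be dropped.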
Data reduction tasks included dimension reduction and attribute construction.
While the majority of the dataset was complete and consistent with expected entries,
there were a number of fields which have been deprecated by NIST and several others
which contain free form text. The 29 deprecated attributes were removed from the
dataset. The free form text attributes consisted of comments, legacy candidate numbers,
and lists of vendors and software. Exploration of this data revealed that it was highly
inconsistent in content and format. Since they were likely to induce noise in the dataset,
these fields were removed. Because NIST’s criteria for assigning high severity to a
vulnerability include the issuance of a CERT advisory, it was determined that a
new attribute indicating this was necessary. Attribute construction was used to derive a
new field from the 20 attributes used to reference associated records from other sources,
such as those discussed in section 2. The specific attribute created was a
Boolean attribute for CERT Advisory.
This dataset is largely comprised of nominal data. The majority of the attributes
are Boolean, represented as -1 for true and 0 for false. Data transformation was
performed on all of the numeric Boolean values, setting them to “T” for true and “F” for
false.
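The attribute construction and Boolean recoding described above can be sketched as follows. The reference field name ref_cert_advisory and its value are hypothetical stand-ins for the 20 reference attributes in the real dataset:

```python
def derive_cert_adv(rec, ref_fields):
    """Construct the Boolean CertAdv attribute: true when any of the
    (hypothetical) CERT reference fields is populated."""
    return any(rec.get(f) for f in ref_fields)

def recode_boolean(value):
    """Recode Access-style numeric Booleans (-1 = true, 0 = false) as T/F."""
    return "T" if value == -1 else "F"

rec = {"ref_cert_advisory": "CA-1996-06",   # hypothetical reference field
       "LT_Confidentiality": -1,
       "LT_Integrity": 0}
rec["CertAdv"] = -1 if derive_cert_adv(rec, ["ref_cert_advisory"]) else 0
symbolic = {k: recode_boolean(v) for k, v in rec.items()
            if k != "ref_cert_advisory"}
print(symbolic)   # CertAdv and LT_Confidentiality map to "T"; LT_Integrity to "F"
```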
3.2.2 Evaluation Dataset
In January 2006 a sample of new vulnerability cases was downloaded from the
NIST website for the purpose of evaluating the performance of data models generated in
the data mining phase. This dataset contained 274 new records as verified by a
comparison of the CVE_ID numbers. This dataset underwent preprocessing identical to
the training/test dataset.
3.2.3 Tools
3.2.3.1 WEKA
Weka is a popular data mining application developed at the University of Waikato
in New Zealand [11]. It contains a variety of both supervised and unsupervised machine
learning algorithms for preprocessing, classification, association, and clustering tasks.
These algorithms are implemented in Java classes and organized as packages of related
classes. Users have the option of accessing the application from a command line
interface (CLI), through the graphical user interface (GUI), or via their own Java code.
Weka is released under the GNU General Public License and can be obtained at
www.cs.waikato.ac.nz/ml/weka.
Weka requires input data to be in the Attribute-Relation File Format (ARFF).
This is essentially an ASCII text file divided into header and data sections. The header
section contains a @RELATION tag for identifying the name and relation of the data.
This is followed by a list of all attributes in the dataset. Each is required to be in the
specific format @ATTRIBUTE <attribute name> <numeric or values>. In the case of
nominal rather than numeric attribute values, all possibilities must be enumerated in braces.
An ARFF formatted sample of the dataset used for this project is included in Appendix
A.
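For illustration, a minimal ARFF file in the spirit of this dataset might look like the following. The attribute names follow Table 3-1, but the nominal value set for Severity and the @DATA rows are invented:

```
@RELATION icat_vulnerabilities

@ATTRIBUTE CVE_ID string
@ATTRIBUTE Severity {High,Medium,Low}
@ATTRIBUTE LT_Confidentiality {T,F}
@ATTRIBUTE VT_Buffer_overflow {T,F}

@DATA
CVE-1999-0001,High,F,T
CVE-1999-0002,Medium,T,F
```

Note that the nominal attributes list all legal values in braces, while CVE_ID is declared as a string since its values are open-ended.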
Figure 3-6 Weka GUI
The Weka GUI contains five components, depicted in figure 3-6: the GUI
Chooser, Simple CLI, Explorer, Experiment, and KnowledgeFlow Environments. The
first two perform just what their names imply. The Weka Explorer is an interactive data
mining environment with options to preprocess, classify, cluster, associate, and visualize
data. The user selects the dataset and any preprocessing filters desired. Next, the data
mining task is selected, configured, and started. The Experiment Environment has similar
functionality, but is not intended to be interactive. Rather, the user configures and runs the
selected algorithms, with the results saved to a file. The KnowledgeFlow Environment is
comparable to the streams paradigm available in SPSS Clementine. The user builds a
stream from the available objects consisting of data sources, filters, classifiers, clusterers,
evaluators, and visualizers. Once configured, the stream is executed and results written
to the selected visualization object.
3.2.3.2 Clementine
Clementine is a commercial data mining application from SPSS Inc. It is
described as an enterprise-strength data mining workbench designed around the
CRISP-DM standard [1]. Clementine has various features for accepting, processing, and mining
input data as well as representing and storing the results. Similar to other data mining
applications, Clementine has capabilities for building decision trees, neural networks,
statistical models, association, clustering, and text mining.
Figure 3-7 Clementine Workbench
Data mining in Clementine is based on the stream paradigm where a stream is
constructed of nodes used to access and manipulate the data, build the model, visualize,
and store results. Nodes are categorized by the functionality they provide which includes
sources, record operations, field operations, graphs, modeling, and output. Source nodes
are used to access the dataset, and include user input, databases, text files, SPSS and SAS
file formats. The record operations nodes allow the manipulation of whole records in the
datasets. They include aggregate, append, balance, distinct, merge, sample, select, and
sort. Field operations contain nodes for manipulating and transforming record attributes.
These include binning, derive, filter, partition, reclassify, and reorder. The graph nodes,
such as distribution, evaluation, histogram, and plot, may be used for descriptive data
summary and dataset characterization as well as displaying the results. The various
classification, clustering, and association algorithms are implemented in the modeling
nodes. These include nodes for decision tree and rule induction, neural networks, logistic
regression, and text extraction in addition to others. The output nodes provide the
capabilities for accessing, displaying, and storing the models and results. Table, matrix,
audit, database, and text file are available among others. Once the stream has been built
and executed in Clementine, the resulting data model may be viewed and analyzed or
incorporated into the stream as a node.
4 Results Analysis and Evaluation
This section focuses on the hypothesis that Cyber Vulnerability Databases contain
undiscovered patterns which are novel, interesting and of value to researchers and
industry professionals. An analysis and evaluation of the results is presented within the
context of each of the research questions. Association mining is applied to research
question Q1 in order to identify the frequent features sets. Research question Q2 is
addressed by classification mining to predict severity class. Research question Q3
examines both the association and classification rules, comparing them to the rules used
to produce the dataset. Research question Q4 summarizes the interesting patterns
discovered in the association and classification mining and discusses the clustering work.
4.1 Frequent Feature Sets
Q1: Which vulnerability attributes frequently occur together?
Research question Q1 seeks to further characterize the dataset as well as reveal
interesting or novel patterns in the association trends between attributes. Given the
granularity of the data models, results are generated for each of the severity levels.
Table 4-1 lists the frequent attribute sets. The Apriori and Generalized Rule Induction
(GRI) algorithms, as implemented in SPSS Clementine, were used to generate the
itemsets. Node configuration is addressed in section 4.3.1.
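For illustration, the level-wise Apriori idea can be sketched in a few lines of Python. This is not the Clementine implementation, and the toy transactions below are invented (the item names are drawn from the dataset's attributes):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori frequent-itemset miner: grow candidate itemsets level
    by level, keeping only those whose support meets the threshold."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # count support for the current candidate level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # join step: combine surviving k-itemsets into (k+1)-candidates
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

# toy transactions built from attribute names used in this dataset
txns = [frozenset(t) for t in [
    {"AR_Launch_remotely", "LT_Obtain_all_priv", "VT_Buffer_overflow"},
    {"AR_Launch_remotely", "LT_Obtain_all_priv"},
    {"AR_Launch_remotely", "LT_Confidentiality"},
    {"AR_Launch_locally", "LT_Obtain_all_priv"},
]]
freq = apriori(txns, min_support=0.5)
print(sorted(sorted(s) for s in freq))
```

The anti-monotonicity of support is what makes the level-wise search tractable: any superset of an infrequent itemset must itself be infrequent, so only survivors of one level are joined to form the next.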
There are eight high severity itemsets; 88% of them have attributes related
to the loss of security protection, whether all or some privileges were gained.
Additionally, 75% of the itemsets contain attributes indicating that the vulnerability could
be initiated remotely. The frequent attribute sets for medium severity vulnerabilities do
not have key characteristics as clearly identifiable as the other severity levels. However,
loss of any form of security protection is not a member of any of the itemsets, whereas
loss of availability and confidentiality frequently occur. The eight low severity
feature sets are comprised of attributes from the four significant feature categories
described in section 3.2.1.1, although only two of the sets have items from each group.
100% of the itemsets include loss of confidentiality. The rest of the itemsets have various
attributes from the other categories. Loss of confidentiality may be one of the most
representative characteristics. Server/non-server applications as the exposed component
comprise an additional key characteristic of low severity vulnerabilities.
High:
[AR_Launch_remotely,AR_Launch_locally,LT_Obtain_all_priv]
[AR_Launch_remotely,AR_Target_access_attacker]
[AR_Launch_remotely,EC_Server_application,LT_Sec_Prot_Other]
[AR_Launch_remotely,LT_Obtain_all_priv]
[AR_Launch_remotely,LT_Obtain_some_priv]
[AR_Launch_remotely,LT_Sec_Prot_Other,VT_Input_validation_error]
[LT_Obtain_all_priv,EC_Server_application]
[LT_Obtain_all_priv,VT_Buffer_overflow,VT_Input_validation_error]
Medium:
[AR_Launch_locally,LT_Availability]
[AR_Launch_locally,LT_Confidentiality]
[AR_Launch_locally,LT_Integrity]
[AR_Launch_remotely,LT_Availability,EC_Server_application,VT_Input_validation_error]
[AR_Launch_remotely,LT_Availability,VT_Exceptional_condition_error]
[AR_Launch_remotely,LT_Confidentiality]
[AR_Launch_remotely,LT_Integrity]
[AR_Launch_remotely,VT_Exceptional_condition_error,EC_Server_application]
[LT_Availability,EC_Non_server_application]
Low:
[AR_Launch_locally,LT_Confidentiality,EC_Non_server_application]
[AR_Launch_locally,LT_Confidentiality,EC_Operating_system]
[AR_Launch_remotely,LT_Confidentiality,AR_Launch_locally]
[AR_Launch_remotely,LT_Confidentiality,EC_Non_server_application]
[AR_Launch_remotely,LT_Confidentiality,EC_Server_application,VT_Input_validation_error]
[AR_Launch_remotely,LT_Confidentiality,EC_Server_application,VT_Design_Error]
[AR_Launch_remotely,LT_Confidentiality,VT_Design_Error]
[AR_Launch_remotely,LT_Confidentiality,VT_Input_validation_error]
Table 4-1 High, Medium, and Low Severity Frequent Attribute Sets
Much of the value for association mining lies in the generation of observations
about relationships between the attributes for each vulnerability record. The feature sets
aid in characterizing the dataset and help to gain insight. Thus, when examining new
vulnerabilities, seeing that one attribute is present may lead to the
discovery of the presence of other attributes which have a strong association with the
first. The frequent feature sets discovered in this research, as exhibited in table 4-1, are
representative of the dataset and therefore comprise interesting patterns.
4.2 Class Prediction
Q2: How accurately can the class of new examples be predicted?
Classification mining was performed using the C5.0, Classification and
Regression Tree (C&R), and Chi-Squared Automatic Interaction Detector (CHAID)
algorithms in Clementine. The classification modeling nodes were set to simple mode
and not to use partitioned data, weight, or frequency. C5.0 was configured for 10-fold
cross-validation. The resulting rule sets are discussed in section 4.3.2. Performance of
each data model is indicated in table 4-2. High severity classification performed
exceptionally well. Accuracy is above 86% on the training/test data and 94% for new
cases. Class prediction for high severity vulnerabilities is therefore considered very
accurate. Low and medium severity classification rules performed similarly, doing well
on the training/test dataset but not on the new cases. The low severity rules for each
algorithm have greater than 93% accuracy on the training/test data. However, there are
two very notable issues with the rule sets which suggest overfitting. First, all of the rules
are instances where low severity is false; there are no rules where low = true. The C5.0
algorithm did not produce any rules for low severity (refer to table C-7 in Appendix C).
Second, the high accuracy on training data is countered by poor accuracy on new
examples from the evaluation dataset discussed in section 3.2.2.
Model | Training/Test Dataset % Correct | New Case Dataset % Correct
High Severity (C&R) | 86.92% | 94.89%
High Severity (C5.0) | 88.90% | 98.18%
High Severity (CHAID) | 86.92% | 94.89%
Medium Severity (C&R) | 80.03% | 44.16%
Medium Severity (C5.0) | 83.11% | 44.16%
Medium Severity (CHAID) | 80.03% | 44.16%
Low Severity (C&R) | 93.08% | 48.91%
Low Severity (C5.0) | 93.08% | 49.91%
Low Severity (CHAID) | 93.08% | 48.91%
Table 4-2
4.3 Comparison of ML rules to NIST Metrics
Q3: How do the machine learning derived rules compare to those used to
produce the dataset?
As previously discussed, the dataset is derived from the ICAT Metabase produced
by the National Institute of Standards and Technology (NIST). Although NIST has
recently converted to the Common Vulnerability Scoring System (CVSS), the severity
rating assigned to records in the dataset is based on the following:
A vulnerability is High Severity if:
1. It allows a remote attacker to violate the security protection of a system (i.e. gain
some sort of user, root, or application account)
2. It allows a local attack that gains complete control of a system
3. It is important enough to have an associated CERT/CC advisory or US-CERT
alert.
A vulnerability is Medium Severity if:
1. It does not meet the definition of either “high” or “low” severity
A vulnerability is Low Severity if:
1. The vulnerability does not typically yield valuable information or control over a
system but instead gives the attacker knowledge that may help the attacker
find and exploit other vulnerabilities
2. NIST feels that the vulnerability is inconsequential for most organizations
4.3.1 Association Rules
Association rule mining has typically been applied to market basket analysis:
the examination of which products consumers purchase together, with the objective of
increasing sales. Early association mining in this project produced models with large
numbers of strong association rules. Unfortunately, the results were heavily skewed
towards a small number of attributes and were not inherently interesting. Figure 4-1 is a
histogram representing the record count (Y-axis) for each of the resulting association
rules (X-axis). The bars are composed of color segments representing each of the
attributes in a given feature set. This outcome was not surprising, since this is a real-world
dataset without an equal distribution of attributes throughout the records.
Figure 4-1 ARM Histogram
The Apriori and Generalized Rule Induction (GRI) algorithms, as implemented in
SPSS Clementine, are used to discover the associations and generate the rules.
Configuring the settings for support and confidence is largely a matter of trial and error.
The thresholds may be set too high, in which case the resulting rules may be too obvious to
have much value. In contrast, the thresholds may be set arbitrarily low, producing too
many rules to be accurately analyzed. Since these algorithms perform best in this tool
when the attribute values are binary (0/1, T/F), an additional transformation was
performed to divide the Severity attribute into three separate Boolean fields for High,
Medium, and Low. A Clementine data mining Stream is prepared with the appropriate
nodes to access the data source and perform the mining. Based on this refinement to the
dataset and the selected algorithms, eight data models are prepared as per table 4-3,
below.
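The severity transformation described above can be sketched as a simple record-level operation (a Python illustration, not the Clementine node; the field names follow the dataset, the function name is an assumption):

```python
def binarize_severity(records):
    """Replace the three-valued Severity attribute with three Boolean
    fields (High_Sev, Medium_Sev, Low_Sev), the binary form the Apriori
    and GRI nodes handle best."""
    out = []
    for rec in records:
        rec = dict(rec)              # leave the input records untouched
        sev = rec.pop("Severity")
        rec["High_Sev"] = sev == "High"
        rec["Medium_Sev"] = sev == "Medium"
        rec["Low_Sev"] = sev == "Low"
        out.append(rec)
    return out
```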
Each rule is presented in the form Antecedents => Consequent (Support %,
Confidence %, Lift). The rules are sorted in descending order by confidence,
then support, then lift. Since multiple algorithms were used, variations of equivalent
rules were discovered in some instances, and duplicates were removed. Appendix B
contains all rules and indicates which algorithm produced each. Additionally, some
rules have a lift value below 1.0 and are considered to be negatively correlated;
however, these rules are included for completeness.
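The support, confidence, and lift figures attached to each rule can be computed directly from Boolean-attribute records. A stdlib sketch (the function and toy attribute names are illustrative, not part of the dataset):

```python
def rule_metrics(records, antecedents, consequent):
    """Return (support, confidence, lift) for antecedents => consequent.
    Support: fraction of records containing every antecedent.
    Confidence: of those, the fraction that also contain the consequent.
    Lift: confidence divided by the consequent's overall rate; a value
    below 1.0 marks a negatively correlated rule."""
    n = len(records)
    covered = [r for r in records if all(r.get(a) for a in antecedents)]
    hits = sum(1 for r in covered if r.get(consequent))
    base = sum(1 for r in records if r.get(consequent)) / n
    support = len(covered) / n
    confidence = hits / len(covered) if covered else 0.0
    lift = confidence / base if base else 0.0
    return support, confidence, lift

# Toy data: the antecedent A carries no information about the consequent C,
# so lift comes out at exactly 1.0.
data = [{"A": 1, "C": 1}, {"A": 1, "C": 0}, {"A": 0, "C": 1}, {"A": 0, "C": 0}]
```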
Algorithm   Target Attribute   Minimum Support   Minimum Confidence
Apriori     Any                10%               80%
Apriori     High Severity      9%                80%
Apriori     Medium Severity    5%                60%
Apriori     Low Severity       2%                30%
GRI         Any                5%                85%
GRI         High Severity      1%                40%
GRI         Medium Severity    5%                40%
GRI         Low Severity       1%                20%
Table 4-3 Clementine Node Configuration
The high severity rule set contains 19 rules. Two are negatively correlated and
two more are single-item rules with less than 50% confidence. In spite of the high
confidence for most of the association rules, support percentages were generally low.
The predictive capability of these rules was poor on both the training/test dataset and the
new cases. This suggests that the machine-learned rules do not compare well to the
original metrics. Additionally, 14 of the rules contain attributes relating to loss of
security protection, with LT_Obtain_all_priv the most predominant. Rule five is a
single-item rule for that attribute with greater than 95% confidence. These patterns are
consistent with the NIST metric for high severity vulnerabilities, and as such are neither
novel nor interesting.
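Predictive capability here means treating a rule as a classifier: predict the consequent whenever all antecedents are present in a record. A sketch of that evaluation (an assumption about the procedure, not Clementine's implementation; names are illustrative):

```python
def rule_accuracy(records, antecedents, consequent):
    """Score a rule as a binary classifier: predict the consequent true
    exactly when every antecedent is present, then count matches against
    the record's actual consequent value."""
    correct = 0
    for r in records:
        predicted = all(r.get(a) for a in antecedents)
        correct += predicted == bool(r.get(consequent))
    return correct / len(records)

# Toy data: the rule A => H is right on 3 of 4 records.
cases = [{"A": 1, "H": 1}, {"A": 0, "H": 0}, {"A": 1, "H": 0}, {"A": 0, "H": 0}]
```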
1. VT_Buffer_overflow AND LT_Obtain_all_priv AND VT_Input_validation_error => High_Sev (9.53, 97.42, 1.97)
2. LT_Obtain_all_priv AND VT_Input_validation_error => High_Sev (14.15, 97.39, 1.97)
3. VT_Buffer_overflow AND LT_Obtain_all_priv => High_Sev (9.72, 97.33, 1.96)
4. LT_Obtain_all_priv AND AR_Launch_locally => High_Sev (13.42, 95.62, 1.93)
5. LT_Obtain_all_priv => High_Sev (24.29, 95.62, 1.93)
6. LT_Obtain_all_priv AND AR_Launch_remotely => High_Sev (12.10, 95.14, 1.92)
7. LT_Obtain_all_priv AND EC_Server_application => High_Sev (9.21, 94.07, 1.90)
8. AR_Launch_remotely AND AR_Launch_locally AND LT_Obtain_all_priv => High_Sev (1.46, 92.52, 1.87)
9. LT_Obtain_some_priv AND AR_Launch_remotely => High_Sev (10.22, 91.18, 1.84)
10. LT_Sec_Prot_Other AND EC_Server_application AND AR_Launch_remotely => High_Sev (10.52, 84.29, 1.70)
11. LT_Sec_Prot_Other AND VT_Input_validation_error AND AR_Launch_remotely => High_Sev (9.99, 84.13, 1.70)
12. LT_Obtain_some_priv => High_Sev (13.19, 81.55, 1.65)
13. LT_Sec_Prot_Other AND AR_Launch_remotely => High_Sev (18.00, 80.41, 1.62)
14. LT_Sec_Prot_Other AND EC_Server_application => High_Sev (11.73, 80.07, 1.62)
15. AR_Launch_remotely AND AR_Target_access_attacker => High_Sev (1.04, 55.26, 1.12)
16. AR_Launch_locally => High_Sev (32.86, 49.81, 1.01)
17. AR_Launch_remotely => High_Sev (71.30, 49.61, 1.00)
18. AR_Launch_remotely AND AR_Launch_locally => High_Sev (5.71, 49.04, 0.99)
19. AR_Target_access_attacker AND EC_Non_server_application => High_Sev (1.28, 46.81, 0.95)
Table 4-4 High Severity Association Rules
A large number of medium severity association rules were generated. It is
suspected that this is due to the broad range of vulnerabilities assigned to this class,
since the NIST metric for this severity is simply that the vulnerability is neither high
nor low. The association rules produced had moderate to high confidence (greater
than 50%). However, they did not perform well on the datasets, with accuracy less
than 44% on the training/test set and less than 12% on the new case dataset. Thus they
do not compare well to the source metrics. Nonetheless, there are interesting patterns
to be observed in this rule set. Most of the rules contain a loss of availability, and many
have an exceptional condition error as the vulnerability type. These attributes are often
related to Denial of Service (DoS) attacks. The pattern is therefore interesting in that it
represents a previously unknown relationship between medium severity vulnerabilities
and DoS attacks.
1. VT_Exceptional_condition_error AND LT_Availability => Medium_Sev (6.90, 86.73, 1.99)
2. VT_Exceptional_condition_error AND LT_Availability AND AR_Launch_remotely => Medium_Sev (6.26, 86.46, 1.98)
3. VT_Exceptional_condition_error AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (5.18, 73.09, 1.68)
4. VT_Exceptional_condition_error AND AR_Launch_remotely => Medium_Sev (9.10, 72.82, 1.67)
5. VT_Exceptional_condition_error AND EC_Server_application => Medium_Sev (5.47, 72.75, 1.67)
6. VT_Exceptional_condition_error => Medium_Sev (10.44, 71.73, 1.64)
7. AR_Launch_locally AND LT_Availability => Medium_Sev (5.08, 71.24, 1.63)
8. LT_Availability => Medium_Sev (26.65, 69.72, 1.60)
9. AR_Launch_remotely AND LT_Availability => Medium_Sev (22.88, 69.69, 1.60)
10. LT_Availability AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (13.43, 69.28, 1.59)
11. LT_Availability AND EC_Server_application => Medium_Sev (14.54, 69.08, 1.58)
12. LT_Integrity AND AR_Launch_locally => Medium_Sev (6.12, 68.08, 1.56)
13. LT_Confidentiality AND AR_Launch_locally => Medium_Sev (5.03, 63.32, 1.45)
14. LT_Integrity => Medium_Sev (13.00, 62.82, 1.44)
15. LT_Availability AND EC_Non_server_application => Medium_Sev (5.62, 62.04, 1.42)
16. LT_Availability AND VT_Input_validation_error AND AR_Launch_remotely => Medium_Sev (12.99, 61.20, 1.40)
17. LT_Availability AND VT_Input_validation_error AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (8.28, 60.89, 1.40)
18. LT_Availability AND VT_Input_validation_error => Medium_Sev (14.36, 60.80, 1.39)
19. LT_Availability AND VT_Input_validation_error AND EC_Server_application => Medium_Sev (8.79, 60.50, 1.39)
20. AR_Launch_remotely AND LT_Integrity => Medium_Sev (7.34, 58.92, 1.35)
21. LT_Confidentiality => Medium_Sev (19.81, 55.07, 1.26)
22. AR_Launch_remotely AND LT_Confidentiality => Medium_Sev (15.66, 52.48, 1.20)
23. AR_Launch_locally => Medium_Sev (32.86, 45.62, 1.05)
24. AR_Launch_remotely => Medium_Sev (71.30, 42.70, 0.98)
Table 4-5 Medium Severity Association Rules
Due to the somewhat vague metric for low severity vulnerabilities (“we feel the
vulnerability is inconsequential …”), data mining for low severity rules took into account
both true and false cases; that is, cases where an attribute was absent as well as cases
where it was present. In mining high and medium severity rules, by contrast, the
Clementine modeling nodes were configured to consider record attributes with true
values only. The low severity association rule set is comprised of 20 entries with low to
moderate confidence (21-66%). The corresponding performance on all datasets was poor,
clearly indicating that the low severity rules do not map well to the criteria used in
assigning the actual severity levels. Even though the rules performed poorly for
prediction, there are interesting patterns to be observed in the rule set. All of the low
severity rules have a loss of confidentiality, and no loss of availability, integrity, or
escalation of account privileges. This is an indication that this class of severity can be
characterized by inappropriate disclosure of confidential information without the loss
of security protection.
1. LT_Confidentiality=T and LT_Availability=F and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Integrity=F => Low_Sev=F (14.87, 65.29, 0.70)
2. LT_Confidentiality=T and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=F (14.78, 65.10, 0.70)
3. LT_Confidentiality=T and LT_Availability=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=F (14.72, 64.84, 0.70)
4. LT_Confidentiality=T and LT_Availability=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=T (14.72, 35.16, 5.08)
5. LT_Confidentiality=T and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=T (14.78, 34.90, 5.04)
6. LT_Confidentiality=T and LT_Availability=F and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Integrity=F => Low_Sev=T (14.87, 34.71, 5.01)
7. LT_Confidentiality AND EC_Server_application AND VT_Input_validation_error AND AR_Launch_remotely => Low_Sev (4.90, 34.54, 4.99)
8. LT_Confidentiality AND VT_Input_validation_error AND AR_Launch_remotely => Low_Sev (6.09, 34.30, 4.95)
9. LT_Confidentiality AND EC_Server_application AND VT_Input_validation_error => Low_Sev (4.99, 34.25, 4.95)
10. LT_Confidentiality AND EC_Non_server_application AND AR_Launch_remotely => Low_Sev (3.32, 34.16, 4.93)
11. LT_Confidentiality AND VT_Input_validation_error => Low_Sev (6.38, 33.62, 4.85)
12. LT_Confidentiality AND VT_Design_Error AND AR_Launch_remotely => Low_Sev (5.09, 31.37, 4.53)
13. LT_Confidentiality AND VT_Design_Error AND EC_Server_application AND AR_Launch_remotely => Low_Sev (2.84, 31.25, 4.51)
14. LT_Confidentiality AND EC_Non_server_application => Low_Sev (4.88, 30.81, 4.45)
15. LT_Confidentiality AND VT_Design_Error AND EC_Non_server_application => Low_Sev (2.35, 30.23, 4.37)
16. AR_Launch_remotely AND LT_Confidentiality => Low_Sev (15.66, 29.03, 4.19)
17. LT_Confidentiality => Low_Sev (19.81, 26.74, 3.86)
18. AR_Launch_locally AND LT_Confidentiality AND EC_Non_server_application => Low_Sev (1.60, 23.93, 3.46)
19. AR_Launch_remotely AND AR_Launch_locally AND LT_Confidentiality => Low_Sev (1.17, 23.26, 3.36)
20. AR_Launch_locally AND LT_Confidentiality AND EC_Operating_system => Low_Sev (1.45, 20.75, 3.00)
Table 4-6 Low Severity Association Rules
Model                      Training/Test Dataset % Correct   New Case Dataset % Correct
High Severity (Apriori)    49.50%                            37.23%
High Severity (GRI)        49.50%                            37.23%
Medium Severity (Apriori)  43.58%                            11.31%
Medium Severity (GRI)      43.58%                            11.31%
Low Severity (Apriori)     4.18%                             12.04%
Low Severity (GRI)         6.92%                             51.09%
Table 4-7 Association Rule Performance
4.3.2 Classification Rules
Each of the algorithms produced similar rule sets, although the CHAID rule sets
were the most comprehensive. The CHAID rules for the high, medium, and low severity
levels are provided in tables 4-8, 4-9, and 4-10 respectively, with the remaining rule sets
provided in Appendix C.
There were 18 rules generated for high severity vulnerabilities, and they performed
exceptionally well on both the training set and new cases. The C5.0 algorithm performed
well on the training/test dataset with 88.9% accuracy, and it performed best of all
the classification algorithms on the new cases at 98.18%. Given the solid
performance, the rules are considered to compare well to the NIST metric.
However, the patterns detected in the rule set were intuitive and therefore uninteresting.
Loss of security protection, in particular LT_Obtain_all_priv, was predominant. As with
the association rules, this is due, at least in part, to the NIST metric for high severity
vulnerabilities.
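A decision-tree rule set such as the one in table 4-8 can be applied as an ordered list of condition tests. A minimal stdlib sketch (the encoding is illustrative, not Clementine's export format, and only two of the 18 rules are transcribed):

```python
# Each rule pairs a dict of attribute conditions with the predicted class.
HIGH_SEV_RULES = [
    # cf. rule 18 of table 4-8
    ({"LT_Obtain_all_priv": True, "VT_Input_validation_error": True,
      "LT_Obtain_some_priv": True}, True),
    # cf. rule 5 of table 4-8
    ({"LT_Obtain_all_priv": False, "LT_Sec_Prot_Other": True,
      "AR_Launch_remotely": False, "EC_Server_application": False}, False),
]

def classify(record, rules, default=None):
    """Return the prediction of the first rule whose conditions all hold;
    attributes missing from a record are treated as False."""
    for conditions, label in rules:
        if all(record.get(k, False) == v for k, v in conditions.items()):
            return label
    return default
```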
1. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=F AND LT_Availability=F THEN High=F
2. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=F AND LT_Availability=T THEN High=F
3. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=T AND LT_Availability=T THEN High=F
4. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=F THEN High=F
5. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=F THEN High=F
6. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=T THEN High=F
7. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=T AND LT_Availability=F THEN High=T
8. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=F THEN High=T
9. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=T THEN High=T
10. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Server_application=F THEN High=T
11. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Server_application=T THEN High=T
12. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=T THEN High=T
13. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F THEN High=T
14. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=T THEN High=T
15. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F AND AR_Launch_locally=F THEN High=T
16. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F AND AR_Launch_locally=T THEN High=T
17. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=T THEN High=T
18. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=T THEN High=T
Table 4-8 High Severity CHAID Rule Set
Medium severity classification rule mining produced 18 rules in the CHAID rule
set. Eleven of these rules are negative instances, where the case is not rated as medium
severity; the remaining seven are true cases. Rules generated from all three algorithms
performed well on the training/test dataset, with accuracies above 80% (see table 4-2).
However, they each had accuracies below 45% on the new case dataset, performing
equally poorly.
1. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=F THEN Medium=F
2. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=T THEN Medium=F
3. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Non_server_application=F THEN Medium=F
4. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Non_server_application=T THEN Medium=F
5. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=T THEN Medium=F
6. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F AND AR_Launch_remotely=F THEN Medium=F
7. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F AND AR_Launch_remotely=T THEN Medium=F
8. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=T THEN Medium=F
9. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F THEN Medium=F
10. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=T THEN Medium=F
11. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=T THEN Medium=F
12. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=F AND LT_Integrity=F THEN Medium=T
13. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=F AND LT_Integrity=T THEN Medium=T
14. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=T AND VT_Buffer_overflow=F THEN Medium=T
15. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=T AND VT_Buffer_overflow=T THEN Medium=T
16. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=F THEN Medium=T
17. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=F THEN Medium=T
18. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=T THEN Medium=T
Table 4-9 Medium Severity CHAID Rule Set
Low severity classification rules performed best on the training/test data out of
the three severity levels; the rules for each algorithm have greater than 93% accuracy.
However, there are two notable issues with these rule sets which suggest overfitting.
First, all of the rules are instances where low severity is false; there are no rules where
low severity is true. The C5.0 algorithm did not produce any rules for low severity
(refer to Appendix C, table C-7). Second, the high accuracy on training data is countered
by poor accuracy on new examples in the evaluation dataset.
1. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=F THEN Low=F
2. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=T AND EC_Server_application=F THEN Low=F
3. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=T AND EC_Server_application=T THEN Low=F
4. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=T THEN Low=F
5. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=F AND LT_Integrity=F THEN Low=F
6. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=F AND LT_Integrity=T THEN Low=F
7. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=T THEN Low=F
8. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=T THEN Low=F
9. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=F AND VT_Design_Error=F THEN Low=F
10. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=F AND VT_Design_Error=T THEN Low=F
11. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=T THEN Low=F
12. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=T THEN Low=F
13. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=T THEN Low=F
Table 4-10 Low Severity CHAID Rule Set
4.4 Pattern Discovery
Q4: What novel or interesting patterns can be discovered from the dataset?
Hand et al. [3] describe data mining as “the analysis of (often large) observational
data sets to find unsuspected relationships and to summarize the data in novel ways that
are both understandable and useful to the data owner.” The association and classification
rules discussed above, as well as in the appendices, contain many interesting patterns:
they involve interesting and important attributes, have high confidence and support,
and reveal unsuspected or new information [12]. While the association and
classification results were previously depicted in rule or fact form, they may also be
summarized as the following patterns:
1. High severity vulnerabilities are characterized by gaining all privileges
associated with input validation error and buffer overflow.
2. Medium Severity vulnerabilities have several characteristics commonly
shared by Denial of Service Attacks. They typically consist of a loss of
availability and are remotely launched. They are frequently connected to an
exception error condition and buffer overflow.
3. Low severity vulnerabilities most frequently involve a loss of confidentiality
and do not include a loss of security protection where the attacker gained
unauthorized privileges.
Clustering was performed using the SPSS TwoStep clustering algorithm configured
to use all available attributes. Maximum tree depth was set to 5 levels, with a maximum
of 8 branches per leaf node. The number of clusters specified was 3. Clustering results
are frequently difficult to evaluate. Witten and Frank [11] observe that while
association and classification have objective criteria by which to judge success,
clustering does not; there is no intrinsic measure of right or wrong for a clustering.
Despite the lack of such criteria or indicators, some observations can still be made
from the results. Table 4-11 lists the frequency count and percentage of attribute
occurrence in each cluster. Tables 4-12 through 4-14 provide a statistical profile of
the three clusters.
1. Cluster 1 consists exclusively of high severity vulnerabilities, which
facilitates a clear decomposition of the remaining characteristics. Of the four attributes
related to loss of security protection, 75% of occurrences fall in this cluster.
2. Cluster 2 corresponds to low severity and does not contain a significant
percentage of any attribute subset.
3. The medium severity vulnerabilities are found in cluster 3. This cluster
contains the majority of the losses of confidentiality and availability.
Attributes / Cluster 1 (Frequency, Percent) / Cluster 2 (Frequency, Percent) / Cluster 3 (Frequency, Percent)
AR_Launch_remotely 2,591 49.6% 402 7.7% 2,230 42.7%
AR_Launch_locally 1,199 49.8% 110 4.6% 1,098 45.6%
AR_Target_access_attacker 69 50.4% 13 9.5% 55 40.1%
LT_Security_protection 3,430 79.6% 33 0.8% 848 19.7%
LT_Obtain_all_priv 1,701 95.6% 5 0.3% 73 4.1%
LT_Obtain_some_priv 787 81.6% 7 0.7% 171 17.7%
LT_Confidentiality 264 18.2% 388 26.7% 799 55.1%
LT_Integrity 314 33.0% 40 4.2% 598 62.8%
LT_Availability 543 27.8% 48 2.5% 1,361 69.7%
LT_Sec_Prot_Other 1,198 69.7% 22 1.3% 499 29.0%
VT_Input_validation_error 2,064 59.7% 182 5.3% 1,209 35.0%
VT_Boundary_condition_error 153 41.0% 11 2.9% 209 56.0%
VT_Buffer_overflow 1,216 76.4% 7 0.4% 369 23.2%
VT_Access_validation_error 367 47.1% 60 7.7% 353 45.3%
VT_Exceptional_condition_error 166 21.7% 50 6.5% 548 71.7%
VT_Environment_error 58 49.6% 6 5.1% 53 45.3%
VT_Configuration_error 233 47.6% 50 10.2% 207 42.2%
VT_Race_condition 70 43.5% 7 4.3% 84 52.2%
VT_Other_vulnerability_type 52 45.2% 11 9.6% 52 45.2%
VT_Design_Error 778 42.1% 184 9.9% 888 48.0%
EC_Operating_system 703 51.4% 56 4.1% 609 44.5%
EC_Network_protocol_stack 25 30.5% 3 3.7% 54 65.9%
EC_Non_server_application 1,080 51.9% 148 7.1% 851 40.9%
EC_Server_application 1,778 49.1% 283 7.8% 1,563 43.1%
EC_Hardware 66 34.2% 16 8.3% 111 57.5%
EC_Communication_protocol 5 100.0% 0 0.0% 0 0.0%
EC_Encryption_module 15 31.9% 3 6.4% 29 61.7%
EC_Other 39 43.8% 9 10.1% 41 46.1%
CertAdv 226 76.9% 3 1.0% 65 22.1%
High_Sev 3,626 100.0% 0 0.0% 0 0.0%
Medium_Sev 0 0.0% 0 0.0% 3,192 100.0%
Low_Sev 0 0.0% 507 100.0% 0 0.0%
Table 4-11 SPSS Two Step Cluster Frequency Table
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Std. Deviation / Grouped Median
AR_Launch_remotely 0.71 0.008 0.204 -1.097 0.081 -0.951 0.041 0.452 0.71
AR_Launch_locally 0.33 0.008 0.221 -1.482 0.081 0.720 0.041 0.471 0.33
AR_Target_access_attacker 0.02 0.002 0.019 47.637 0.081 7.044 0.041 0.137 0.02
LT_Security_protection 0.95 0.004 0.051 13.578 0.081 -3.946 0.041 0.226 0.95
LT_Obtain_all_priv 0.47 0.008 0.249 -1.986 0.081 0.124 0.041 0.499 0.47
LT_Obtain_some_priv 0.22 0.007 0.170 -0.114 0.081 1.373 0.041 0.412 0.22
LT_Confidentiality 0.07 0.004 0.068 8.827 0.081 3.290 0.041 0.260 0.07
LT_Integrity 0.09 0.005 0.079 6.653 0.081 2.941 0.041 0.281 0.09
LT_Availability 0.15 0.006 0.127 1.858 0.081 1.964 0.041 0.357 0.15
LT_Sec_Prot_Other 0.33 0.008 0.221 -1.480 0.081 0.721 0.041 0.470 0.33
VT_Input_validation_error 0.57 0.008 0.245 -1.923 0.081 -0.280 0.041 0.495 0.57
VT_Boundary_condition_error 0.04 0.003 0.040 18.771 0.081 4.556 0.041 0.201 0.04
VT_Buffer_overflow 0.34 0.008 0.223 -1.514 0.081 0.698 0.041 0.472 0.34
VT_Access_validation_error 0.10 0.005 0.091 5.001 0.081 2.645 0.041 0.302 0.10
VT_Exceptional_condition_error 0.05 0.003 0.044 16.916 0.081 4.348 0.041 0.209 0.05
VT_Environment_error 0.02 0.002 0.016 57.615 0.081 7.719 0.041 0.125 0.02
VT_Configuration_error 0.06 0.004 0.060 10.647 0.081 3.555 0.041 0.245 0.06
VT_Race_condition 0.02 0.002 0.019 46.886 0.081 6.990 0.041 0.138 0.02
VT_Other_vulnerability_type 0.01 0.002 0.014 64.836 0.081 8.173 0.041 0.119 0.01
VT_Design_Error 0.21 0.007 0.169 -0.065 0.081 1.391 0.041 0.411 0.21
EC_Operating_system 0.19 0.007 0.156 0.401 0.081 1.549 0.041 0.395 0.19
EC_Network_protocol_stack 0.01 0.001 0.007 140.242 0.081 11.923 0.041 0.083 0.01
EC_Non_server_application 0.30 0.008 0.209 -1.218 0.081 0.884 0.041 0.457 0.30
EC_Server_application 0.49 0.008 0.250 -2.000 0.081 0.039 0.041 0.500 0.49
EC_Hardware 0.02 0.002 0.018 50.029 0.081 7.211 0.041 0.134 0.02
EC_Communication_protocol 0.00 0.001 0.001 721.197 0.081 26.885 0.041 0.037 0.00
EC_Encryption_module 0.00 0.001 0.004 237.066 0.081 15.458 0.041 0.064 0.00
EC_Other 0.01 0.002 0.011 88.108 0.081 9.490 0.041 0.103 0.01
CertAdv 0.06 0.004 0.058 11.128 0.081 3.622 0.041 0.242 0.06
High_Sev 1.00 0.000 0.000 . . . . 0.000 1.00
Medium_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Low_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Table 4-12 SPSS Two Step Cluster 1 Profile
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Std. Deviation / Grouped Median
AR_Launch_remotely 0.79 0.018 0.165 0.103 0.217 -1.450 0.108 0.406 0.79
AR_Launch_locally 0.22 0.018 0.170 -0.103 0.217 1.377 0.108 0.413 0.22
AR_Target_access_attacker 0.03 0.007 0.025 34.376 0.217 6.020 0.108 0.158 0.03
LT_Security_protection 0.07 0.011 0.061 10.549 0.217 3.537 0.108 0.247 0.07
LT_Obtain_all_priv 0.01 0.004 0.010 97.379 0.217 9.950 0.108 0.099 0.01
LT_Obtain_some_priv 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
LT_Confidentiality 0.77 0.019 0.180 -0.425 0.217 -1.256 0.108 0.424 0.77
LT_Integrity 0.08 0.012 0.073 7.850 0.217 3.133 0.108 0.270 0.08
LT_Availability 0.09 0.013 0.086 5.735 0.217 2.777 0.108 0.293 0.09
LT_Sec_Prot_Other 0.04 0.009 0.042 18.282 0.217 4.496 0.108 0.204 0.04
VT_Input_validation_error 0.36 0.021 0.231 -1.659 0.217 0.590 0.108 0.480 0.36
VT_Boundary_condition_error 0.02 0.006 0.021 41.533 0.217 6.586 0.108 0.146 0.02
VT_Buffer_overflow 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
VT_Access_validation_error 0.12 0.014 0.105 3.632 0.217 2.370 0.108 0.323 0.12
VT_Exceptional_condition_error 0.10 0.013 0.089 5.313 0.217 2.700 0.108 0.298 0.10
VT_Environment_error 0.01 0.005 0.012 80.314 0.217 9.055 0.108 0.108 0.01
VT_Configuration_error 0.10 0.013 0.089 5.313 0.217 2.700 0.108 0.298 0.10
VT_Race_condition 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
VT_Other_vulnerability_type 0.02 0.006 0.021 41.533 0.217 6.586 0.108 0.146 0.02
VT_Design_Error 0.36 0.021 0.232 -1.680 0.217 0.572 0.108 0.481 0.36
EC_Operating_system 0.11 0.014 0.098 4.231 0.217 2.493 0.108 0.314 0.11
EC_Network_protocol_stack 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
EC_Non_server_application 0.29 0.020 0.207 -1.162 0.217 0.918 0.108 0.455 0.29
EC_Server_application 0.56 0.022 0.247 -1.952 0.217 -0.235 0.108 0.497 0.56
EC_Hardware 0.03 0.008 0.031 26.997 0.217 5.375 0.108 0.175 0.03
EC_Communication_protocol 0.00 0.000 0.000 . . . . 0.000 0.00
EC_Encryption_module 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
EC_Other 0.02 0.006 0.017 51.873 0.217 7.326 0.108 0.132 0.02
CertAdv 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
High_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Medium_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Low_Sev 1.00 0.000 0.000 . . . . 0.000 1.00
Table 4-13 SPSS Two Step Cluster 2 Profile
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Geometric Mean / Std. Deviation / Grouped Median
AR_Launch_remotely 0.70 0.008 0.211 -1.251 0.087 -0.866 0.043 0.00 0.459 0.70
AR_Launch_locally 0.34 0.008 0.226 -1.569 0.087 0.657 0.043 0.00 0.475 0.34
AR_Target_access_attacker 0.02 0.002 0.017 53.139 0.087 7.423 0.043 0.00 0.130 0.02
LT_Security_protection 0.27 0.008 0.195 -0.874 0.087 1.062 0.043 0.00 0.442 0.27
LT_Obtain_all_priv 0.02 0.003 0.022 38.812 0.087 6.387 0.043 0.00 0.150 0.02
LT_Obtain_some_priv 0.05 0.004 0.051 13.747 0.087 3.967 0.043 0.00 0.225 0.05
LT_Confidentiality 0.25 0.008 0.188 -0.670 0.087 1.153 0.043 0.00 0.433 0.25
LT_Integrity 0.19 0.007 0.152 0.571 0.087 1.603 0.043 0.00 0.390 0.19
LT_Availability 0.43 0.009 0.245 -1.912 0.087 0.298 0.043 0.00 0.495 0.43
LT_Sec_Prot_Other 0.16 0.006 0.132 1.586 0.087 1.894 0.043 0.00 0.363 0.16
VT_Input_validation_error 0.38 0.009 0.235 -1.751 0.087 0.500 0.043 0.00 0.485 0.38
VT_Boundary_condition_error 0.07 0.004 0.061 10.361 0.087 3.515 0.043 0.00 0.247 0.07
VT_Buffer_overflow 0.12 0.006 0.102 3.789 0.087 2.406 0.043 0.00 0.320 0.12
VT_Access_validation_error 0.11 0.006 0.098 4.175 0.087 2.484 0.043 0.00 0.314 0.11
VT_Exceptional_condition_error 0.17 0.007 0.142 1.036 0.087 1.742 0.043 0.00 0.377 0.17
VT_Environment_error 0.02 0.002 0.016 55.332 0.087 7.569 0.043 0.00 0.128 0.02
VT_Configuration_error 0.06 0.004 0.061 10.508 0.087 3.536 0.043 0.00 0.246 0.06
VT_Race_condition 0.03 0.003 0.026 33.081 0.087 5.921 0.043 0.00 0.160 0.03
VT_Other_vulnerability_type 0.02 0.002 0.016 56.492 0.087 7.646 0.043 0.00 0.127 0.02
VT_Design_Error 0.28 0.008 0.201 -1.020 0.087 0.990 0.043 0.00 0.448 0.28
EC_Operating_system 0.19 0.007 0.154 0.480 0.087 1.575 0.043 0.00 0.393 0.19
EC_Network_protocol_stack 0.02 0.002 0.017 54.215 0.087 7.495 0.043 0.00 0.129 0.02
EC_Non_server_application 0.27 0.008 0.196 -0.885 0.087 1.056 0.043 0.00 0.442 0.27
EC_Server_application 0.49 0.009 0.250 -2.000 0.087 0.041 0.043 0.00 0.500 0.49
EC_Hardware 0.03 0.003 0.034 23.832 0.087 5.081 0.043 0.00 0.183 0.03
EC_Communication_protocol 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
EC_Encryption_module 0.01 0.002 0.009 105.245 0.087 10.353 0.043 0.00 0.095 0.01
EC_Other 0.01 0.002 0.013 72.983 0.087 8.657 0.043 0.00 0.113 0.01
CertAdv 0.02 0.003 0.020 44.200 0.087 6.795 0.043 0.00 0.141 0.02
High_Sev 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
Medium_Sev 1.00 0.000 0.000 . . . . 1.00 0.000 1.00
Low_Sev 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
Table 4-14 SPSS Two Step Cluster 3 Profile
5 Educational Statement
This project has drawn upon my experience and studies in several areas and has also
resulted in new learning. Specific courses taken at the Institute of Technology,
University of Washington, Tacoma, that have aided my work include TCSS 435
Artificial Intelligence and Knowledge Acquisition, TCSS 555 Data Mining, and TCSS
598 Master’s Seminar. The AI and Data Mining courses formed the educational
foundation of this project by helping me to comprehend the relevant machine learning
concepts. They guided and influenced the research conducted during this project by
indicating areas where additional knowledge was needed and in selecting specific data
mining tasks. The Master’s Seminar was beneficial in conducting the research and
preparing the documentation. New knowledge gained includes a deeper understanding of
the data mining process and its algorithms, which I intend to apply to further studies. I have also
achieved a significant comprehension of vulnerability databases which I can apply to
further study as well as in my career.
6 Further Work
There are several areas proposed for further work, including the incorporation of
changes to the ICAT Metabase, additional work with cluster generation and analysis, and
text extraction. In addition, alternate datasets should be considered for related, future
data mining activities.
During the course of this research, the ICAT Metabase underwent several
changes. First, NIST updated and converted the database to XML format and re-released
it as the National Vulnerability Database. The conversion to XML presents a
challenge for the data mining tools used in this project, since neither tool can
use the data in its native format or import and convert it into a usable
format. Therefore, the development of preprocessing scripts or other automation is
proposed to address this issue. The second change introduced by NIST was the move
from the severity rating metrics discussed in section 4.3 to the Common Vulnerability
Scoring System (CVSS). The new metric is a numeric value derived from a qualitative
assessment of different aspects of the vulnerability.
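The preprocessing scripts proposed above could take the form of a small XML-to-CSV converter. The sketch below is illustrative only: the element and attribute names used here ("entry", "name", "published", "severity") are assumptions for the sake of the example, not the actual NVD schema, which a real script would follow.

```python
# Sketch of a preprocessing script for an XML vulnerability feed.
# The schema (element/attribute names) is assumed, not the real NVD format.
import csv
import io
import xml.etree.ElementTree as ET

def nvd_xml_to_csv(xml_text):
    """Flatten <entry> elements into CSV rows usable by the mining tools."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["CVE_ID", "Publish_Date", "Severity"])
    for entry in root.findall("entry"):
        # Each attribute of interest becomes one CSV column.
        writer.writerow([entry.get("name"),
                         entry.get("published"),
                         entry.get("severity")])
    return out.getvalue()

sample = """<nvd>
  <entry name="CVE-2005-0001" published="5/2/05" severity="High"/>
  <entry name="CVE-2005-0002" published="5/2/05" severity="High"/>
</nvd>"""
print(nvd_xml_to_csv(sample))
```

A production version would also flatten the loss-type, vulnerability-type, and exposed-component fields into the Boolean columns used throughout this project.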
Additional research into clustering of the vulnerability data is the second area of
proposed future work. Although a variety of clustering algorithms were available
across the tools used in this project, working with them proved problematic.
Most significantly, the tools showed shortcomings in the presentation, visualization,
and analysis of the clustering results.
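As an illustration of the kind of further cluster analysis proposed, the following sketch implements a toy k-modes style clustering over binary attribute vectors. It is not the SPSS TwoStep or Clementine algorithm used in this project, and the records shown are hypothetical three-attribute examples.

```python
# Toy k-modes style clustering of binary vulnerability attribute vectors.
# Illustration only; not the algorithms used in the project.
def hamming(a, b):
    """Number of attribute positions in which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, modes, iterations=10):
    """Assign records to the nearest mode, then recompute modes by majority vote."""
    clusters = [[] for _ in modes]
    for _ in range(iterations):
        clusters = [[] for _ in modes]
        for r in records:
            best = min(range(len(modes)), key=lambda i: hamming(r, modes[i]))
            clusters[best].append(r)
        # New mode: per-attribute majority value of the cluster's members.
        modes = [tuple(int(sum(col) >= len(c) / 2) for col in zip(*c)) if c else m
                 for c, m in zip(clusters, modes)]
    return modes, clusters

# Hypothetical records: (Launch_remotely, Loss_availability, Buffer_overflow)
records = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1), (1, 1, 1)]
modes, clusters = k_modes(records, [(1, 1, 0), (0, 0, 1)])
```

A distance measure tailored to binary attributes, such as the Hamming distance used here, sidesteps the interval-scale assumptions that made some of the tools' algorithms awkward on this dataset.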
Text extraction is the final area of future work. Each vulnerability record
contained several text fields. The information contained in these fields includes a
description of the vulnerability as well as vendor, product name, and version information.
There is a great deal of variance in the content and quality of these fields, and in many
cases multiple terms are used to reference the same subject. Thus the fields cannot be used in
their current format. The use of text extraction and possibly named entity recognition is
proposed for the derivation of new attributes which could contribute to further data
mining efforts.
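A first step toward the proposed text extraction could be a simple pattern for pulling candidate product and version tokens from the description field. The regular expression below is a naive illustration, not a full named entity recognizer; it will also capture spurious pairs (such as an ordinary word preceding a version number), which is exactly the kind of noise a real extraction pipeline would have to filter.

```python
# Naive product/version extraction from free-text vulnerability descriptions.
# A real system would use named entity recognition and a vendor dictionary.
import re

# A word-like token followed by a dotted version number, e.g. "Sendmail 8.8.3".
VERSION = re.compile(r"\b([A-Za-z][\w.-]*)\s+(\d+(?:\.\d+)+)\b")

def extract_products(description):
    """Return candidate (name, version) pairs found in a description."""
    return VERSION.findall(description)

desc = "Buffer overflow in Sendmail 8.8.3 and 8.8.4 allows remote attackers ..."
print(extract_products(desc))
```

Derived attributes such as vendor or product family could then join the Boolean attributes in further mining runs.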
7 Conclusion
This project has investigated the applicability of data mining to vulnerability
databases, namely the ICAT Metabase, whereas related work had sought to discover new
vulnerabilities or to detect and assess existing ones. It has provided a background in the
field of vulnerability, exploit, and exposure research. It has described relevant data
mining methods and presented a review of related literature. By studying a representative
corpus of known vulnerabilities and applying various data mining techniques it has
demonstrated that cyber vulnerability databases contain previously undiscovered patterns
which are novel, interesting and of value to researchers and industry professionals.
Association mining was used to learn which vulnerability attributes frequently occur
together. Results show, for example, that high severity item sets are mostly associated with
the loss of security protection and the capacity to be initiated remotely, while the low
severity item sets are associated with loss of confidentiality. Classification rules were
mined and applied to a dataset of new examples to determine the accuracy of class
prediction, showing that high severity can be predicted from the other attributes with
excellent accuracy, while low and medium severity were not predicted as accurately. A
comparison was made between the machine-learned rules and those used to produce the
dataset. Results suggest that the rules learned by the system discriminated well
among severity levels, except for low severity, which shows that the NIST rules can often
be reproduced from the database. Clustering, classification, and association results were
analyzed to enumerate the novel or interesting patterns discovered in the dataset.
Clustering grouped high severity vulnerabilities much better than medium and low
vulnerabilities, and discovered alternate interesting groupings. Finally, this report has
proposed future areas of study such as the inclusion of alternate datasets, changes in
severity metrics, and additional data mining algorithms, as well as addressing the
likelihood that additional novel or interesting patterns remain to be discovered.
References
[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, "CRISP-DM 1.0 - Step-by-step data mining guide," CRISP-DM Consortium, 1999.
[2] Mitre, "CVE - Common Vulnerabilities and Exposures," Mitre Corporation, 2005.
[3] D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining (Adaptive Computation and Machine Learning). Cambridge, MA: The MIT Press, 2001.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann, 2001.
[5] H. A. Edelstein, Introduction to Data Mining and Knowledge Discovery, Third ed. Potomac, MD: Two Crows Corporation, 1999.
[6] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Second ed. Upper Saddle River, NJ: Prentice Hall, 2002.
[7] P. Giudici, Applied Data Mining. West Sussex: Wiley, 2003.
[8] M. Schumacher, C. Haul, M. Hurler, and A. Buchmann, "Data-Mining in Vulnerability Databases," Department of Computer Science, Darmstadt University of Technology, 2000, 12 pp.
[9] W. A. Arbaugh, W. L. Fithen, and J. McHugh, "Windows of vulnerability: a case study analysis," Computer, vol. 33, pp. 52-59, 2000.
[10] K. Yamanishi, J. Takeuchi, and Y. Maruyama, "Data mining for security," NEC Journal of Advanced Technology, vol. 2, pp. 63-69, 2005.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
[12] A. Azmy, "SuperQuery: Data Mining for Everyone," White Paper WP1, 2004.
Appendix A: Data Dictionary
A.1 Purpose and Scope
This document describes the metadata elements of the dataset attributes used in this
project. The database, "ICAT Master.mdb", is a 15 MB file in Microsoft Access format,
prepared by the National Institute of Standards and Technology. The database contains
several tables and an aggregated record set of 7553 entries with 87 attributes. However,
the task-relevant data is restricted to a much smaller set of attributes while including all
records. From the original 87 attributes, 30 are selected for their value in identifying,
evaluating, or characterizing the specific vulnerabilities by case, date, severity, range,
loss type, vulnerability type, and exposed component.
A.1.2 Reference Guide
A.1.2.1 Documentation
Elements of this dictionary contain the following information:
TagName
Type: Allowable data type
Size: Size in bytes or characters
AllowZeroLength: TRUE or FALSE
Attributes: Fixed Size, Variable Length, Updatable
Description: Element description and function
Values: Range of values; when the data type is text, an example is provided
Required: TRUE or FALSE
A.1.2.2 Data Types
Data types found in the task-relevant data include the following:
Date/Time: Used to store dates and times; stores 8 bytes. Example: M/D/YYYY or MM/DD/YYYY
Long Integer: An integer, 4 bytes
Memo: Used to store lengthy text and numbers, up to 63,999 characters
Text: Used for text or numbers up to 255 characters
A.1.2.3 Entry Identification
CVE_ID Type Text
Size 255
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: The standard CVE or CAN name for the vulnerability
Values CAN-1999-0001 CVE-2002-1088
Required: FALSE
Publish_Date Type Date/Time
Size 8
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
The date on which the vulnerability is published within ICAT (except for pre-2000 vulnerabilities where we use the discovery date)
Required: FALSE
CVE_Description Type Memo
Size N/A
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: Short paragraph describing the vulnerability
Values
Required: FALSE
Severity Type Text
Size 255
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: The severity assigned to the vulnerability
Values
Low Medium High
Required: FALSE
A.1.2.4 Range
AR_Launch_remotely Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Attacker Requirements: attacker may launch the attack remotely
Values
-1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
AR_Launch_locally Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
CollatingOrder: General
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Description: Attacker Requirements: attacker may launch the attack locally
Required: FALSE
AR_Target_access_attacker Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
CollatingOrder: General
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Description: Attacker Requirements: the target must access the attacker’s resource for the attack to succeed
Required: FALSE
A.1.2.5 Loss Type
LT_Security_protection Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of security protection
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Obtain_all_priv Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where administrator access was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Obtain_some_priv Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where user level privilege was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Confidentiality Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of confidentiality
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Integrity Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of integrity
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Availability Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of availability
Values -1 or 0, a Boolean representation where -1 indicates
true and 0 is false
Required: FALSE
LT_Sec_Prot_Other
Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where some non-user privilege was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
A.1.2.6 Vulnerability Type
VT_Input_validation_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "input validation error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Boundary_condition_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "boundary condition error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Buffer_overflow Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "buffer overflow"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Access_validation_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "access validation error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Exceptional_condition_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "exceptional condition handling error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Environment_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "environmental error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Configuration_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "configuration error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Race_condition Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "race condition"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Design_Error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: Design Error
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
A.1.2.7 Exposed Component
EC_Operating_system Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within an operating system
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Network_protocol_stack Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Exposed Component: vulnerability occurs within a network protocol stack of an operating system
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Non_server_application Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a non-server application
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Server_application Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a server application
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Hardware Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within hardware
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Communication_protocol Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a communication protocol
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Encryption_module Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within an encryption module
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Other Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within some component not explicitly listed
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
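Since every Boolean attribute above is stored as a Long Integer with -1 for true and 0 for false, a preprocessing step must recode the values to the T/F symbols used in the ARFF training files. A minimal sketch, assuming the records have been exported as rows of integers:

```python
# Recode the Access Boolean convention (-1 = true, 0 = false) to 'T'/'F'.
def recode_boolean(value):
    """Map -1 to 'T' and anything else (0) to 'F'."""
    return "T" if value == -1 else "F"

def recode_row(row):
    """Recode every Boolean attribute in an exported record."""
    return [recode_boolean(v) for v in row]

print(recode_row([-1, 0, 0, -1]))  # ['T', 'F', 'F', 'T']
```

In practice this recoding is applied to the 28 Boolean attribute columns while the CVE_ID, Publish_Date, and Severity columns pass through unchanged.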
CVE_ID,Publish_Date,AR_Launch_remotely,AR_Launch_locally,AR_Target_access_attacker,LT_Security_protection,LT_Obtain_all_priv,LT_Obtain_some_priv,LT_Confidentiality,LT_Integrity,LT_Availability,LT_Sec_Prot_Other,VT_Input_validation_error,VT_Boundary_condition_error,VT_Buffer_overflow,VT_Access_validation_error,VT_Exceptional_condition_error,VT_Environment_error,VT_Configuration_error,VT_Race_condition,VT_Other_vulnerability_type,VT_Design_Error,EC_Operating_system,EC_Network_protocol_stack,EC_Non_server_application,EC_Server_application,EC_Hardware,EC_Communication_protocol,EC_Encryption_module,EC_Other
CAN-1999-0001,12/30/1999,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,T,T,F,F,F,F,F,F
CAN-1999-0004,12/16/1997,T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T
CAN-1999-0015,12/16/1997,T,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F
CAN-1999-0030,7/16/1997,F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F
CAN-1999-0061,10/2/1997,T,F,F,T,F,T,F,T,F,T,F,F,F,F,F,F,F,T,F,F,T,F,T,F,F,F,F,F
CAN-1999-0076,7/1/1997,T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T
Table A-1 Training/Test Dataset Sample
@relation vulnerability.symbolic
@attribute AR_Launch_remotely {T,F}
@attribute AR_Launch_locally {T,F}
@attribute AR_Target_access_attacker {T,F}
@attribute LT_Security_protection {T,F}
@attribute LT_Obtain_all_priv {T,F}
@attribute LT_Obtain_some_priv {T,F}
@attribute LT_Confidentiality {T,F}
@attribute LT_Integrity {T,F}
@attribute LT_Availability {T,F}
@attribute LT_Sec_Prot_Other {T,F}
@attribute VT_Input_validation_error {T,F}
@attribute VT_Boundary_condition_error {T,F}
@attribute VT_Buffer_overflow {T,F}
@attribute VT_Access_validation_error {T,F}
@attribute VT_Exceptional_condition_error {T,F}
@attribute VT_Environment_error {T,F}
@attribute VT_Configuration_error {T,F}
@attribute VT_Race_condition {T,F}
@attribute VT_Other_vulnerability_type {T,F}
@attribute VT_Design_Error {T,F}
@attribute EC_Operating_system {T,F}
@attribute EC_Network_protocol_stack {T,F}
@attribute EC_Non_server_application {T,F}
@attribute EC_Server_application {T,F}
@attribute EC_Hardware {T,F}
@attribute EC_Communication_protocol {T,F}
@attribute EC_Encryption_module {T,F}
@attribute EC_Other {T,F}
@attribute CertAdv {T,F}
@attribute Severity {Low, Medium, High}
@data
T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,T,T,F,F,F,F,F,F,T,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,T,Medium
T,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,High
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,High
T,F,F,T,F,T,F,T,F,T,F,F,F,F,F,F,F,T,F,F,T,F,T,F,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,High
F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,T,High
T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
F,F,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,High
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,High
T,F,F,T,T,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,High
F,T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F,High
F,T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
F,T,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,High
F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,Low
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,Medium
F,T,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,Medium
F,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
Table A-2 Training/Test Dataset Sample in ARFF format
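The ARFF files above were loaded with Weka during the project; purely for illustration, a minimal pure-Python reader for this layout is sketched below. It handles only the subset of ARFF shown here (nominal attributes and dense data rows), not the full format.

```python
# Minimal reader for the ARFF subset shown in Table A-2.
# Illustration only; the project used Weka to load these files.
def parse_arff(text):
    """Return (attribute_names, rows) where each row is a name->value dict."""
    attributes, data = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blanks and ARFF comments
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
        elif in_data:
            data.append(dict(zip(attributes, line.split(","))))
    return attributes, data

sample = """@relation vulnerability.symbolic
@attribute AR_Launch_remotely {T,F}
@attribute Severity {Low, Medium, High}
@data
T,High
F,Low"""
attributes, rows = parse_arff(sample)
```

Keeping each row as a name-to-value mapping makes it straightforward to filter records by attribute, for example selecting only the High severity examples.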
CVE_ID,Publish_Date,Severity,AR_Launch_remotely,AR_Launch_locally,AR_Target_access_attacker,LT_Security_protection,LT_Obtain_all_priv,LT_Obtain_some_priv,LT_Confidentiality,LT_Integrity,LT_Availability,LT_Sec_Prot_Other,VT_Input_validation_error,VT_Boundary_condition_error,VT_Buffer_overflow,VT_Access_validation_error,VT_Exceptional_condition_error,VT_Environment_error,VT_Configuration_error,VT_Race_condition,VT_Other_vulnerability_type,VT_Design_Error,EC_Operating_system,EC_Network_protocol_stack,EC_Non_server_application,EC_Server_application,EC_Hardware,EC_Communication_protocol,EC_Encryption_module,EC_Other
CVE-2005-0006,5/2/05,Low,T,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0005,5/2/05,High,T,F,F,T,F,T,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0004,4/14/05,Medium,F,T,F,T,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F
CVE-2005-0003,4/14/05,Low,F,T,F,F,F,F,F,F,T,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0002,5/2/05,High,T,F,F,T,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0001,5/2/05,High,F,T,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F
Table A-3 New Example Dataset
Appendix B: Association Rules

Rules for T - contains 13 rule(s)
Rule 1 for T (965, 0.816) if LT_Obtain_some_priv = T then T
Rule 2 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 3 for T (748, 0.912) if LT_Obtain_some_priv = T and AR_Launch_remotely = T then T
Rule 4 for T (858, 0.801) if LT_Sec_Prot_Other = T and EC_Server_application = T then T
Rule 5 for T (1,317, 0.804) if LT_Sec_Prot_Other = T and AR_Launch_remotely = T then T
Rule 6 for T (711, 0.973) if VT_Buffer_overflow = T and LT_Obtain_all_priv = T then T
Rule 7 for T (982, 0.956) if LT_Obtain_all_priv = T and AR_Launch_locally = T then T
Rule 8 for T (674, 0.941) if LT_Obtain_all_priv = T and EC_Server_application = T then T
Rule 9 for T (1,035, 0.974) if LT_Obtain_all_priv = T and VT_Input_validation_error = T then T
Rule 10 for T (885, 0.951) if LT_Obtain_all_priv = T and AR_Launch_remotely = T then T
Rule 11 for T (770, 0.843) if LT_Sec_Prot_Other = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 12 for T (731, 0.841) if LT_Sec_Prot_Other = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 13 for T (697, 0.974) if VT_Buffer_overflow = T and LT_Obtain_all_priv = T and VT_Input_validation_error = T then T
Default: T
Table B-1 High Severity Apriori

Rules for T - contains 9 rule(s)
Rule 1 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 2 for T (982, 0.956) if AR_Launch_locally = T and LT_Obtain_all_priv = T then T
Rule 3 for T (885, 0.951) if AR_Launch_remotely = T and LT_Obtain_all_priv = T then T
Rule 4 for T (107, 0.925) if AR_Launch_remotely = T and AR_Launch_locally = T and LT_Obtain_all_priv = T then T
Rule 5 for T (76, 0.553) if AR_Launch_remotely = T and AR_Target_access_attacker = T then T
Rule 6 for T (94, 0.468) if AR_Target_access_attacker = T and EC_Non_server_application = T then T
Rule 7 for T (2,407, 0.498) if AR_Launch_locally = T then T
Rule 8 for T (418, 0.49) if AR_Launch_remotely = T and AR_Launch_locally = T then T
Rule 9 for T (5,223, 0.496) if AR_Launch_remotely = T then T
Default: T
Table B-2 High Severity GRI
Rules for T - contains 19 rule(s)
Rule 1 for T (764, 0.717) if VT_Exceptional_condition_error = T then T
Rule 2 for T (952, 0.628) if LT_Integrity = T then T
Rule 3 for T (1,952, 0.697) if LT_Availability = T then T
Rule 4 for T (505, 0.867) if VT_Exceptional_condition_error = T and LT_Availability = T then T
Rule 5 for T (400, 0.728) if VT_Exceptional_condition_error = T and EC_Server_application = T then T
Rule 6 for T (666, 0.728) if VT_Exceptional_condition_error = T and AR_Launch_remotely = T then T
Rule 7 for T (448, 0.681) if LT_Integrity = T and AR_Launch_locally = T then T
Rule 8 for T (368, 0.633) if LT_Confidentiality = T and AR_Launch_locally = T then T
Rule 9 for T (411, 0.62) if LT_Availability = T and EC_Non_server_application = T then T
Rule 10 for T (372, 0.712) if LT_Availability = T and AR_Launch_locally = T then T
Rule 11 for T (1,051, 0.608) if LT_Availability = T and VT_Input_validation_error = T then T
Rule 12 for T (1,064, 0.691) if LT_Availability = T and EC_Server_application = T then T
Rule 13 for T (1,676, 0.697) if LT_Availability = T and AR_Launch_remotely = T then T
Rule 14 for T (458, 0.865) if VT_Exceptional_condition_error = T and LT_Availability = T and AR_Launch_remotely = T then T
Rule 15 for T (379, 0.731) if VT_Exceptional_condition_error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 16 for T (643, 0.605) if LT_Availability = T and VT_Input_validation_error = T and EC_Server_application = T then T
Rule 17 for T (951, 0.612) if LT_Availability = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 18 for T (983, 0.693) if LT_Availability = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 19 for T (606, 0.609) if LT_Availability = T and VT_Input_validation_error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Default: T
Table B-3 Medium Severity Apriori

Rules for T - contains 11 rule(s)
Rule 1 for T (1,952, 0.697) if LT_Availability = T then T
Rule 2 for T (1,676, 0.697) if AR_Launch_remotely = T and LT_Availability = T then T
Rule 3 for T (952, 0.628) if LT_Integrity = T then T
Rule 4 for T (372, 0.712) if AR_Launch_locally = T and LT_Availability = T then T
Rule 5 for T (448, 0.681) if AR_Launch_locally = T and LT_Integrity = T then T
Rule 6 for T (1,451, 0.551) if LT_Confidentiality = T then T
Rule 7 for T (368, 0.633) if AR_Launch_locally = T and LT_Confidentiality = T then T
Rule 8 for T (538, 0.589) if AR_Launch_remotely = T and LT_Integrity = T then T
Rule 9 for T (1,147, 0.525) if AR_Launch_remotely = T and LT_Confidentiality = T then T
Rule 10 for T (2,407, 0.456) if AR_Launch_locally = T then T
Rule 11 for T (5,223, 0.427) if AR_Launch_remotely = T then T
Default: T
Table B-4 Medium Severity GRI
Rules for T - contains 9 rule(s)
Rule 1 for T (357, 0.308) if LT_Confidentiality = T and EC_Non_server_application = T then T
Rule 2 for T (467, 0.336) if LT_Confidentiality = T and VT_Input_validation_error = T then T
Rule 3 for T (172, 0.302) if LT_Confidentiality = T and VT_Design_Error = T and EC_Non_server_application = T then T
Rule 4 for T (373, 0.314) if LT_Confidentiality = T and VT_Design_Error = T and AR_Launch_remotely = T then T
Rule 5 for T (243, 0.342) if LT_Confidentiality = T and EC_Non_server_application = T and AR_Launch_remotely = T then T
Rule 6 for T (365, 0.342) if LT_Confidentiality = T and EC_Server_application = T and VT_Input_validation_error = T then T
Rule 7 for T (446, 0.343) if LT_Confidentiality = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 8 for T (208, 0.312) if LT_Confidentiality = T and VT_Design_Error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 9 for T (359, 0.345) if LT_Confidentiality = T and EC_Server_application = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Default: T
Table B-5 Low Severity Apriori

Rules for T - contains 5 rule(s)
Rule 1 for T (1,451, 0.267) if LT_Confidentiality = T then T
Rule 2 for T (1,147, 0.29) if AR_Launch_remotely = T and LT_Confidentiality = T then T
Rule 3 for T (117, 0.239) if AR_Launch_locally = T and LT_Confidentiality = T and EC_Non_server_application = T then T
Rule 4 for T (86, 0.233) if AR_Launch_remotely = T and AR_Launch_locally = T and LT_Confidentiality = T then T
Rule 5 for T (106, 0.208) if AR_Launch_locally = T and LT_Confidentiality = T and EC_Operating_system = T then T
Default: T
Table B-6 Low Severity GRI
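Each rule above is reported with its antecedent support count and confidence. Both measures can be recomputed directly from a dataset; the sketch below does so over hypothetical records, where the antecedent and consequent are sets of attribute names required to be T.

```python
# Recompute the (support count, confidence) pair reported for a rule.
def support_confidence(records, antecedent, consequent):
    """records: list of attribute->'T'/'F' dicts; antecedent/consequent: sets of names."""
    n_ant = sum(all(r[a] == "T" for a in antecedent) for r in records)
    n_both = sum(all(r[a] == "T" for a in antecedent | consequent) for r in records)
    # Confidence: fraction of antecedent matches that also satisfy the consequent.
    return n_ant, (n_both / n_ant if n_ant else 0.0)

# Hypothetical records illustrating a rule like "if LT_Obtain_all_priv = T then High".
records = [
    {"LT_Obtain_all_priv": "T", "High_Sev": "T"},
    {"LT_Obtain_all_priv": "T", "High_Sev": "T"},
    {"LT_Obtain_all_priv": "T", "High_Sev": "F"},
    {"LT_Obtain_all_priv": "F", "High_Sev": "F"},
]
count, conf = support_confidence(records, {"LT_Obtain_all_priv"}, {"High_Sev"})
# count = 3 antecedent matches, conf = 2/3
```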
Appendix C: Classification Rules

Rules for T - contains 6 rule(s)
Rule 1 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 2 for T (748, 0.911) if AR_Launch_remotely = T and LT_Obtain_some_priv = T then T
Rule 3 for T (1,176, 0.898) if AR_Launch_remotely = T and LT_Availability = F and LT_Confidentiality = F and LT_Integrity = F and LT_Sec_Prot_Other = F then T
Rule 4 for T (5, 0.857) if EC_Communication_protocol = T then T
Rule 5 for T (941, 0.824) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Design_Error = F then T
Rule 6 for T (1,217, 0.823) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = F then T

Rules for F - contains 2 rule(s)
Rule 1 for F (76, 0.885) if LT_Obtain_all_priv = F and VT_Design_Error = T and VT_Exceptional_condition_error = T then F
Rule 2 for F (5,546, 0.653) if LT_Obtain_all_priv = F then F
Default: F
Table C-1 High Severity C5.0

Rules for F - contains 6 rule(s)
Rule 1 for F (1,862, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["F"] and LT_Availability in ["F"] then F
Rule 2 for F (1,220, 0.009) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["F"] and LT_Availability in ["T"] then F
Rule 3 for F (253, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["T"] and LT_Availability in ["T"] then F
Rule 4 for F (121, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["F"] then F
Rule 5 for F (268, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["F"] and EC_Server_application in ["F"] then F
Rule 6 for F (77, 0.006) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["F"] and EC_Server_application in ["T"] then F

Rules for T - contains 6 rule(s)
Rule 1 for T (102, 0.007) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["T"] and LT_Availability in ["F"] then T
Rule 2 for T (155, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["T"] and VT_Input_validation_error in ["F"] then T
Rule 3 for T (246, 0.009) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["T"] and VT_Input_validation_error in ["T"] then T
Rule 4 for T (1,154, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["T"] and VT_Exceptional_condition_error in ["F"] then T
Rule 5 for T (88, 0.005) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["T"] and VT_Exceptional_condition_error in ["T"] then T
Rule 6 for T (1,779, 0.01) if LT_Obtain_all_priv in ["T"] then T
Default: F
Table C-2 High Severity C&R
Rules for F - contains 6 rule(s)
Rule 1 for F (1,862, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] then F
Rule 2 for F (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“T”] then F
Rule 3 for F (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“T”] and LT_Availability in [“T”] then F
Rule 4 for F (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then F
Rule 5 for F (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then F
Rule 6 for F (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then F

Rules for T - contains 12 rule(s)
Rule 1 for T (102, 0.007) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“T”] and LT_Availability in [“F”] then T
Rule 2 for T (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then T
Rule 3 for T (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then T
Rule 4 for T (453, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Server_application in [“F”] then T
Rule 5 for T (701, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Server_application in [“T”] then T
Rule 6 for T (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then T
Rule 7 for T (512, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] then T
Rule 8 for T (232, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“T”] then T
Rule 9 for T (267, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and AR_Launch_locally in [“F”] then T
Rule 10 for T (394, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and AR_Launch_locally in [“T”] then T
Rule 11 for T (88, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] then T
Rule 12 for T (286, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“T”] then T

Default: F
Table C-0-3 High Severity CHAID
Rules for T - contains 2 rule(s)
Rule 1 for T (26, 0.821) if LT_Obtain_all_priv = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = T and VT_Design_Error = T then T
Rule 2 for T (5,546, 0.562) if LT_Obtain_all_priv = F then T

Rules for F - contains 11 rule(s)
Rule 1 for F (408, 0.961) if AR_Launch_remotely = T and LT_Confidentiality = F and LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and VT_Buffer_overflow = T and EC_Network_protocol_stack = F then F
Rule 2 for F (1,779, 0.958) if LT_Obtain_all_priv = T then F
Rule 3 for F (1,030, 0.918) if AR_Launch_remotely = T and AR_Launch_locally = F and LT_Confidentiality = F and LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and EC_Network_protocol_stack = F then F
Rule 4 for F (748, 0.912) if AR_Launch_remotely = T and LT_Obtain_some_priv = T then F
Rule 5 for F (5, 0.857) if EC_Communication_protocol = T then F
Rule 6 for F (11, 0.846) if LT_Confidentiality = T and VT_Input_validation_error = T and VT_Access_validation_error = T and EC_Non_server_application = F then F
Rule 7 for F (1,217, 0.832) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = F then F
Rule 8 for F (941, 0.832) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Design_Error = F then F
Rule 9 for F (47, 0.816) if LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and VT_Other_vulnerability_type = T then F
Rule 10 for F (751, 0.795) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Integrity = F and LT_Availability = F and EC_Non_server_application = T then F
Rule 11 for F (12, 0.786) if AR_Target_access_attacker = T and LT_Availability = T then F

Default: F
Table C-0-4 Medium Severity C5.0
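In the C5.0-style listing above, several rules can fire for the same record (for example, Rule 2 for T and Rule 4 for F both cover a remotely launched vulnerability that yields some privileges). One common resolution strategy for such overlapping rulesets is confidence-weighted voting: sum the confidences of the firing rules per class and predict the class with the largest total. The sketch below illustrates this; the rule encoding and sample record are constructed for illustration and are not the tool's exact mechanism.

```python
# Hypothetical sketch: confidence-weighted voting over overlapping rules.
# Rules are (conditions, predicted class, confidence); conditions are
# (attribute, required_value) pairs taken from the table above.
from collections import defaultdict

rules = [
    ([("LT_Obtain_all_priv", "F")], "T", 0.562),                     # Rule 2 for T
    ([("LT_Obtain_all_priv", "T")], "F", 0.958),                     # Rule 2 for F
    ([("AR_Launch_remotely", "T"),
      ("LT_Obtain_some_priv", "T")], "F", 0.912),                    # Rule 4 for F
]

def vote(record, rules, default="F"):
    """Sum confidences of firing rules per class; predict the max class."""
    totals = defaultdict(float)
    for conditions, cls, conf in rules:
        if all(record.get(attr) == value for attr, value in conditions):
            totals[cls] += conf
    if not totals:
        return default
    return max(totals, key=totals.get)

# Invented record: both a T rule (0.562) and an F rule (0.912) fire.
record = {"LT_Obtain_all_priv": "F", "AR_Launch_remotely": "T",
          "LT_Obtain_some_priv": "T"}
print(vote(record, rules))  # -> F
```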
Rules for F - contains 11 rule(s)
Rule 1 for F (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then F
Rule 2 for F (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then F
Rule 3 for F (852, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Non_server_application in [“F”] then F
Rule 4 for F (302, 0.007) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Non_server_application in [“T”] then F
Rule 5 for F (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then F
Rule 6 for F (313, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 7 for F (199, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] and AR_Launch_remotely in [“T”] then F
Rule 8 for F (232, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“T”] then F
Rule 9 for F (661, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] then F
Rule 10 for F (88, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] then F
Rule 11 for F (286, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“T”] then F

Rules for T - contains 7 rule(s)
Rule 1 for T (1,391, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then T
Rule 2 for T (573, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then T
Rule 3 for T (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“F”] then T
Rule 4 for T (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“T”] then T
Rule 5 for T (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then T
Rule 6 for T (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then T
Rule 7 for T (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then T

Default: F
Table C-0-5 Medium Severity CHAID
Rules for F - contains 5 rule(s)
Rule 1 for F (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then F
Rule 2 for F (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then F
Rule 3 for F (1,154, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] then F
Rule 4 for F (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then F
Rule 5 for F (1,779, 0.01) if LT_Obtain_all_priv in [“T”] then F

Rules for T - contains 7 rule(s)
Rule 1 for T (1,391, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then T
Rule 2 for T (573, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then T
Rule 3 for T (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“F”] then T
Rule 4 for T (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“T”] then T
Rule 5 for T (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then T
Rule 6 for T (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then T
Rule 7 for T (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then T

Default: F
Table C-0-6 Medium Severity C&R

Rules for T - contains 0 rule(s)
Rules for F - contains 0 rule(s)
Default: F
Table C-0-7 Low Severity C5.0
Rules for F - contains 13 rule(s)
Rule 1 for F (662, 0.009) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 2 for F (756, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] and EC_Server_application in [“F”] then F
Rule 3 for F (760, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] and EC_Server_application in [“T”] then F
Rule 4 for F (708, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“T”] then F
Rule 5 for F (747, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then F
Rule 6 for F (178, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then F
Rule 7 for F (510, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“T”] then F
Rule 8 for F (1,553, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“T”] then F
Rule 9 for F (383, 0.008) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“F”] and VT_Design_Error in [“F”] then F
Rule 10 for F (376, 0.007) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“F”] and VT_Design_Error in [“T”] then F
Rule 11 for F (388, 0.006) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“T”] then F
Rule 12 for F (146, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“T”] then F
Rule 13 for F (158, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“T”] then F

Default: F
Table C-0-8 Low Severity CHAID

Rules for F - contains 7 rule(s)
Rule 1 for F (662, 0.009) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 2 for F (1,516, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] then F
Rule 3 for F (708, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“T”] then F
Rule 4 for F (2,988, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] then F
Rule 5 for F (1,147, 0.007) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] then F
Rule 6 for F (146, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“T”] then F
Rule 7 for F (158, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“T”] then F

Default: F
Table C-0-9 Low Severity C&R
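Each rule in the tables above is annotated with a parenthesized pair, such as (1,779, 0.01). The first figure appears to be the number of training records the rule covers; the exact meaning of the second depends on the exporting tool. As a sketch of how a rule's coverage and class agreement would be computed from labeled records (the toy dataset below is invented for illustration and is not drawn from the ICAT data):

```python
# Hypothetical sketch: computing the support (record count) and class
# agreement of one exported rule over a toy labeled dataset.

def rule_stats(records, conditions, predicted):
    """Count records matching all conditions, and the fraction of those
    matches whose label equals the rule's predicted class."""
    matches = [r for r in records
               if all(r.get(attr) == value for attr, value in conditions)]
    if not matches:
        return 0, 0.0
    hits = sum(1 for r in matches if r["class"] == predicted)
    return len(matches), hits / len(matches)

# Invented toy data: three records, two matching the condition.
toy = [
    {"LT_Obtain_all_priv": "T", "class": "T"},
    {"LT_Obtain_all_priv": "T", "class": "F"},
    {"LT_Obtain_all_priv": "F", "class": "F"},
]
support, agreement = rule_stats(toy, [("LT_Obtain_all_priv", "T")], "T")
print(support, agreement)  # -> 2 0.5
```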
Appendix D: Severity Web Graphs
Figure D-1 High Severity Web Graph
Figure D-2 Medium Severity Web Graph
Figure D-3 Low Severity Web Graph