Knowledge Discovery in Cyber Vulnerability Databases
Sean Tierney
A project report
submitted in partial fulfillment of the requirements for the degree of
Master of Science

University of Washington
2005
Committee Chair: Isabelle Bichindaritz, Ph.D.
Committee Member: Don McLane
Program Authorized to Offer Degree: Computing and Software Systems
University of Washington Graduate School
Abstract
This project investigates the applicability of data mining repositories of the
various vulnerabilities and exposures to abuse, misuse, and other compromises of
computing hardware and software. The discovery, disclosure, correction, and avoidance
of vulnerabilities in software and computing systems are essential to maintaining the
availability, confidentiality, and integrity of these systems. While significant effort is
invested in the above steps as well as intrusion detection, limited effort has been applied
to the study and comprehension of the cumulative body of vulnerabilities. Nonetheless,
cyber vulnerability databases contain undiscovered patterns which are novel, interesting
and of value to researchers and industry professionals. One such repository is the ICAT
Metabase maintained by the National Institute of Standards and Technology (NIST).
The goal of this project is to mine a cyber vulnerability database in order to
discover co-occurring vulnerability attributes, to predict vulnerability classes, and to
determine patterns of vulnerability. This report starts by providing background in
vulnerability research, discussing relevant machine learning concepts, and reviewing
related literature. The specific data mining activities are then presented with an
evaluation and analysis. Data mining the vulnerability database makes several important
contributions. The results demonstrate that previously undiscovered, novel, and
interesting knowledge was detected. Association mining was used to learn which
vulnerability attributes frequently occur together. This includes strong associations for
each of the severity levels, which facilitates the attribution of specific features to given
severity levels. Classification rules were mined and applied to a dataset of new examples
to determine the accuracy of class prediction. The resulting data models were
successfully used to classify high severity vulnerabilities. A comparison was made
between the machine learned rules and those used to produce the dataset, which
demonstrates that the metrics originally used to determine severity level could be
represented by machine learning algorithms. Clustering, classification, and association
results were analyzed to enumerate the novel or interesting patterns discovered in the
dataset. This characterization of vulnerabilities enables the observation of new patterns
such as the correlation between medium severity vulnerabilities and the characteristics
of denial of service attacks. This project concludes by identifying current issues in data
mining vulnerability databases and by outlining some directions for future work.
Table of Contents

List of Figures
List of Tables
1 Introduction
  1.1 Thesis and Problem Statement
  1.2 Purpose
  1.3 Goals
  1.4 Method
  1.5 Audience
  1.6 Organization
2 Background
  2.1 Vulnerabilities, Exposures and Exploits
    2.1.1 Discovery and Disclosure
    2.1.2 Tracking
    2.1.3 Detection and Assessment
  2.2 Data Mining
    2.2.1 Preprocessing
    2.2.2 Classification
    2.2.3 Association
    2.2.4 Clustering
  2.3 Literature Review
    2.3.1 Data Mining in Vulnerability Databases
    2.3.2 Windows of Vulnerability: A Case Study Analysis
    2.3.3 Data Mining for Security
3 The Investigation
  3.1 Scope
  3.2 Resources and Technologies
    3.2.1 Training/Test Dataset
      3.2.1.1 Descriptive Summary
      3.2.1.2 Preprocessing
    3.2.2 Evaluation Dataset
    3.2.3 Tools
      3.2.3.1 WEKA
      3.2.3.2 Clementine
4 Results Analysis and Evaluation
  4.1 Frequent Feature Sets
  4.2 Class Prediction
  4.3 Comparison of ML Rules to NIST Metrics
    4.3.1 Association Rules
    4.3.2 Classification Rules
  4.4 Pattern Discovery
5 Educational Statement
6 Further Work
7 Conclusion
References
Appendix A: Data Dictionary
  A.1 Purpose and Scope
Appendix B: Association Rules
Appendix C: Classification Rules
Appendix D: Severity Web Graphs
Appendix E: Presentation Slides

List of Figures

Figure 2-1. Disclosure Life-cycle
Figure 2-2. Host Life-cycle
Figure 2-3. Rate of Intrusions
Figure 3-1. Severity Web Graph
Figure 3-2. Attack Range
Figure 3-3. Loss Type
Figure 3-4. Vulnerability Type
Figure 3-5. Exposed Component
Figure 3-6. Weka GUI

List of Tables

Table 1-1. Partial Table of CRISP-DM Tasks
Table 2-1. Leading Sources of Vulnerability Research and Disclosure
Table 2-2. Leading Vulnerability Information Repositories
Table 3-1. Dataset Attributes
Table 3-2. Distribution of Vulnerability Attributes
Table A-1. Dataset Sample
Table A-2. Dataset Sample in ARFF Format
1 Introduction This project investigates the applicability of data mining cyber security
vulnerabilities. Its goal is to mine a cyber vulnerability database in order to discover co-
occurring vulnerability attributes, to predict vulnerability classes, and to determine
patterns of vulnerability. Databases such as the ICAT Metabase maintained by the
National Institute of Standards and Technology (NIST) are repositories of the various
vulnerabilities and exposures to misuse, abuse, or similar compromises impacting
computing hardware and software. There has been significant research into both
hardware and software based cyber vulnerabilities. However, past research has focused
on discovery, detection, and assessment. Thus far there has been little study of the
overall body of cyber security vulnerability. This project analyzes the corpus of cyber
security vulnerabilities as represented in such databases.
1.1 Thesis and Problem Statement
Cyber vulnerability databases contain undiscovered patterns which are novel,
interesting and of value to researchers and industry professionals. Past as well as
ongoing research into cyber security vulnerabilities has tended not to consider the
aggregate body of vulnerabilities. The majority of academic work has been directed at
detecting the various attacks, intrusions, and similar activity that makes use of these
vulnerabilities. Industry professionals and Information Security enthusiasts have focused
either on discovering new vulnerabilities or detecting and assessing those that exist
within the enterprise or institution. By focusing on other topics, a potentially large body
of knowledge has been left untapped. This data needs to be explored to determine what
new knowledge can be discovered, what information can be applied to new areas or in
new ways to existing work.
1.2 Purpose
The predominant use of vulnerability databases has been to enumerate software
and hardware vulnerabilities for use in assessing specific systems (computers,
infrastructure, services, etc.). While those activities are not trivial, it is important to study
the corpus of known vulnerability and to identify novel and interesting patterns. The
objective of this research project is to mine a cyber vulnerability database in order to
discover co-occurring vulnerability attributes, to predict vulnerability classes, and to
determine patterns of vulnerability. The focus will be to answer some questions identified
as potentially of interest to computer security professionals.
1.3 Goals
In order to achieve the purpose and objectives of this project the following goals
were established. Research and understand the purpose and uses of vulnerability
databases. Develop a comprehensive understanding of the contents and structure of the
publicly available vulnerability databases. Utilize this knowledge to select and process
an appropriate dataset. Apply association, classification and clustering techniques in
order to answer four research questions.
Q1: Which vulnerability attributes frequently occur together?
Q2: How accurately can the class of new examples be predicted?
Q3: How well do the machine learning derived rules compare to those used to
produce the dataset?
Q4: What novel or interesting patterns can be discovered in the dataset?
1.4 Method
To the maximum extent possible this project follows the methodology proposed
by the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1]. Table 1-1
describes the specific phases employed, which include business understanding, data
understanding, data preparation, modeling, and evaluation with the deployment phase
omitted.
The business understanding phase was performed at the beginning of the project.
It included the research used to assess the situation such as literature review and study of
data mining principles and practices. Related work in data mining and vulnerability
assessment were examined to identify the data mining goals. The project plan was
developed as part of the initial project proposal.
Business Understanding:
  Determine Business Objectives: Background; Objectives; Success Criteria
  Assess Situation: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine DM Goals: DM Goals; DM Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding:
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report

Data Preparation:
  Select Data: Rationale for Inclusion/Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data; Dataset; Dataset Description

Modeling:
  Select Modeling Techniques: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Descriptions
  Assess Model: Model Assessment; Revised Parameter Settings

Evaluation:
  Evaluate Results: Assessment of DM Results; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Table 1-1. Partial Table of CRISP-DM Tasks [1]
Data understanding began with the selection of the dataset. The data was
explored and quality verified. The descriptive data summary in section 3.2.1.1 was
produced as a result. Section 3.2.1.2 covers the data preparation phase in the discussion
of the preprocessing activities, which consisted of cleaning, integration, reduction,
attribute selection and construction.
The modeling and evaluation phases are discussed in section 4. These include the
selection of the modeling techniques, model building, analysis and evaluation of the
results.
1.5 Audience
The target audience for this report comprises those in the computer science and
information technology fields, whether in academia or industry. The research may be of
specific benefit to individuals working with attacks, compromises, exposures, or
vulnerabilities in computing and software systems. Background knowledge is provided
in both the cyber vulnerabilities and the data mining techniques used. However, the
background information is not exhaustive, and readers are likely to benefit from prior
knowledge or experience with this topic.
1.6 Organization
This report is organized in the following manner. Section 2 provides background
on cyber vulnerabilities, discussing the discovery, reporting, tracking and detection
aspects of the field as well as providing a review of applicable literature. Section 3
describes the investigation including the scope, merit, dataset and tools employed.
Section 4 discusses and analyzes the results. The educational statement is provided in
section 5. Future work and suggestions for additional research is covered in section 6.
Section 7 contains the concluding remarks. A detailed data dictionary and other pertinent
materials are provided in the appendices.
2 Background
2.1 Vulnerabilities, Exposures and Exploits
There is some debate regarding the distinction between vulnerabilities and
exposures. In many of the references and literature reviewed for this research,
vulnerabilities are often viewed as flaws which may be leveraged to violate the security
policy or abuse the intended purpose of the system. In contrast, an exposure is a
component which is performing as designed. However, it may be used in an unintended
or undesirable fashion, which also results in a security violation or abuse. The Mitre
Corporation, developers of the industry standard Common Vulnerability and Exposures
(CVE) Dictionary [2], offer this view. Vulnerabilities in computer hardware and
software are generally considered to be facts about the system seen as problems “under
any commonly used security policies”. Exposures are only considered to be problems
under some security policies. Taking yet another approach, NIST simply refers to both as
vulnerabilities, although they also use the term “exposed component” to specify what
would be exploited in an attack.
Consider the UNIX services chargen and echo. Chargen, when contacted via
TCP, generates a continuous stream of characters until the client disconnects. Echo
simply echoes back whatever is sent to it. Both of these may be very useful for
diagnostics and measurement. However, these two services may be exploited in a ping
pong attack, which bounces the stream from the chargen service between hosts causing a
degradation or denial of services for the hosts and the networks involved. In this
example, chargen and echo are performing as designed; however, their mere presence
opens the system to attack. Whether these services represent vulnerabilities or exposures
is academic; in either case, such abuse should be considered a violation of any reasonable security policy.
Thus, in the context of this research, vulnerability and exposure may be used
interchangeably as a reference to any entry in the dataset.
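The chargen/echo feedback loop described above can be sketched in a toy simulation. This is purely illustrative: the functions below are in-memory stand-ins for the two services, not real network code, and the fixed seven-character reply is a simplification of chargen's continuous stream.

```python
# Toy simulation of the chargen/echo "ping pong" loop: two services, each
# behaving exactly as designed, together generate unbounded traffic.

def chargen(_message):
    """chargen ignores its input and always emits more characters."""
    return "ABCDEFG"  # stand-in for the continuous character stream

def echo(message):
    """echo returns exactly what it was sent."""
    return message

def ping_pong(rounds):
    """Bounce traffic between the two services and count bytes generated."""
    traffic = 0
    payload = "start"
    for _ in range(rounds):
        payload = chargen(payload)   # chargen replies regardless of input
        traffic += len(payload)
        payload = echo(payload)      # echo reflects it straight back
        traffic += len(payload)
    return traffic
```

The point of the sketch is that neither function contains a flaw; the denial of service emerges from their interaction, which is what makes the vulnerability-versus-exposure distinction academic here.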
2.1.1 Discovery and Disclosure
In the early days of network computing, vulnerabilities were largely discovered
by skilled systems administrators, crackers, or by accident. However, that has largely
given way to what is now referred to as the security researcher. This term applies to
those working for commercial or public interest as well as more nefarious minded
individuals. It may include hackers, crackers, developers and programmers. These
researchers often employ techniques such as code auditing, input fuzzing, fault
monitoring, execution and runtime tracing. A non-exhaustive list of organizations
engaged in either research or disclosure is provided in table 2-1.

Source                      Audience     URL
CERT                        public       www.cert.org
eEye                        commercial   www.eeye.com
F-Secure                    commercial   www.f-secure.com
iDefense                    commercial   www.idefense.com
Internet Security Systems   commercial   www.iss.com
Internet Storm Center       public       www.isc.sans.com
Neohapsis                   commercial   www.neohapsis.com
Qualys                      commercial   www.qualys.com
Table 2-1. Leading Sources of Vulnerability Research and Disclosure
Once discovered, and regardless of the aims of the researcher, a vulnerability
eventually works its way to public disclosure. If the vulnerability is handled by a
reputable researcher or firm, the vendor will typically be notified prior to further
disclosure. However, in the case of the malicious hacker or with Vulnerability Sharing
Clubs (VSC), knowledge of the exposure spreads until there is either an incident or an
issue that results in acknowledgment. The disclosure life-cycle spans from discovery to
full, public disclosure (see figure 2-1).
Figure 2-1. Disclosure Life-cycle
2.1.2 Tracking
While the organizations mentioned above concentrate on discovery and
disclosure, there are others, such as those in table 2-2, which focus on keeping track of
known vulnerabilities. The CVE product from Mitre forms the hub of vulnerability
tracking by providing the commonly used definition, description, and reference or
identification number for vulnerabilities. As described on the CVE website, newly
disclosed vulnerabilities are assigned a candidate number indicating either review or
acceptance status for the CVE dictionary. The other organizations form the spokes with
each providing a repository of vulnerabilities. While the repositories may differ in the
range and depth of detail related to specific vulnerabilities, they reference the CVE ID
numbers.

Source         Audience     URL
CERIAS         academic     www.cerias.purdue.edu
MITRE          public       www.mitre.org
NIST           public       www.nist.gov
OSVDB          public       www.osvdb.org
Secunia        commercial   www.secunia.com
SPI Dynamics   commercial   www.spidynamics.com
Table 2-2. Leading vulnerability information repositories
2.1.3 Detection and Assessment
Detection and assessment are arguably the most critical uses of vulnerability data.
However, they are undertaken both by individuals desiring to secure their systems against
intrusion and those intent on exploiting vulnerabilities. The information contained in the
disclosures and repositories facilitates the development of signatures or profiles used by
the various tools to detect exploits to which specific systems are vulnerable. There
are a variety of open source and commercially available vulnerability assessment
scanners such as Nessus and SAINT which may be used by both groups. While each tool
or service provider maintains its own dataset of vulnerabilities, all work to ensure
compatibility with CVE and other vulnerability tracking organizations.
2.2 Data Mining
Data mining, as described by Hand et al. [3], is the analysis of datasets to learn
unsuspected relationships as well as to summarize the data in a novel, understandable, or
useful manner. It is often performed in the context of knowledge discovery in databases
(KDD) and involves data selection, preprocessing, identification of data mining goals,
selection of data mining tasks, and assessment of results. The data mining tasks typically
involve the application of data mining algorithms for classification, association, or
clustering. These topics are discussed further in the following sections.
2.2.1 Preprocessing
Data mining is typically applied to data which has been collected for a purpose
other than mining. As such, there are often inconsistencies or incompatibilities in the
dataset which require addressing prior to beginning the data mining process. Han and
Kamber [4] discuss data preprocessing at length covering cleaning, transformation and
reduction. Data cleaning addresses missing values, noise, and inconsistencies. Missing
values may be handled by filling in the value or ignoring the case. Noisy data is
smoothed using a variety of techniques such as binning values into groups, clustering,
and fitting data to a function. Inconsistent data may be addressed by manual or
automated routines which reference external sources. Data transformation is concerned
with converting or aggregating the data into formats useful for the data mining tasks.
This may involve combining attributes, normalization, or constructing new attributes.
Data reduction is the processing of the dataset to achieve a smaller representation of the
data while maintaining the integrity of the original data. Applicable techniques include
data cube aggregation, dimension reduction where irrelevant attributes are removed, and
data compression.
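The core cleaning and transformation steps just described can be sketched in a few lines of Python. This is a minimal, hypothetical illustration on a hand-made list of values, not the preprocessing pipeline actually applied to the ICAT data in this project:

```python
# Illustrative preprocessing sketch: fill missing values with the mean
# (cleaning), rescale to [0, 1] (transformation), and smooth by binning.

def fill_missing(values, sentinel=None):
    """Cleaning: replace missing entries with the mean of observed values."""
    observed = [v for v in values if v is not sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v is sentinel else v for v in values]

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Transformation: rescale values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def equal_width_bins(values, n_bins):
    """Smoothing by binning: map each value to a bin index 0..n_bins-1."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

scores = [4.0, None, 10.0, 7.0, None, 1.0]
cleaned = fill_missing(scores)           # missing values become the mean, 5.5
normalized = min_max_normalize(cleaned)  # rescaled to [0, 1]
binned = equal_width_bins(cleaned, 3)    # three equal-width bins
```

Dimension reduction and data cube aggregation are omitted here, since they operate on whole attributes and datasets rather than single columns.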
2.2.2 Classification
Classification seeks to both describe the current dataset and predict group
membership of new cases [5]. According to Han and Kamber, it establishes a concise
description of each class which distinguishes it from others. This is accomplished by
examining training cases and building predictive patterns such as classification rules,
decision tables, or trees. Classification rules take the IF-THEN form. An example from
Witten and Frank [11] is the following:
IF outlook = sunny and humidity = high THEN play = no
Classification is also performed by decision tree induction. This is one of the simplest
and most successful forms of learning algorithm. It accepts input in the form of attributes
and returns a decision, which is the predicted output for the input given [6]. Classifier
accuracy can be estimated through techniques such as the holdout method, random
subsampling, and cross-validation.
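Rules of the form above can be applied programmatically. The following sketch is a hypothetical, pure-Python illustration built on the Witten and Frank weather example, not the WEKA or Clementine models used in this project; the rule list and held-out cases are invented for the example:

```python
# Hypothetical rule-based classifier over the "play / don't play" weather
# data, with a holdout-style accuracy estimate on labeled test cases.

RULES = [
    # (conditions, predicted class): checked in order, first match wins
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({}, "yes"),  # default rule when nothing else matches
]

def classify(case, rules=RULES):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(case.get(attr) == val for attr, val in conditions.items()):
            return label

def holdout_accuracy(cases, rules=RULES):
    """Holdout estimate: fraction of held-out cases predicted correctly."""
    correct = sum(1 for case, label in cases if classify(case, rules) == label)
    return correct / len(cases)

test_cases = [
    ({"outlook": "sunny", "humidity": "high", "windy": "false"}, "no"),
    ({"outlook": "overcast", "humidity": "high", "windy": "true"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal", "windy": "true"}, "no"),
    ({"outlook": "sunny", "humidity": "normal", "windy": "false"}, "yes"),
]
```

Cross-validation would repeat this holdout measurement over several train/test splits and average the results.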
2.2.3 Association
Association rule mining aims to identify groups of items that frequently occur
together as the result of transactions [7]. Examples may include items consumers
purchase together or the sequential series of web pages viewed by a user. As illustrated
in Han and Kamber [4], the output knowledge is represented as a set of association rules in
the form A ⇒ B (read as A implies B), such as the following association stating that purchasers
of a computer also buy financial management software at the same time:

computer ⇒ financial_management_software
buys(X, “computer”) ⇒ buys(X, “financial_management_software”)
Three measures of interestingness of association rules are support, confidence, and lift.
Support is the percentage of instances in the dataset for which the rule holds. Confidence
indicates, among the cases containing A, the percentage that also contain B. Lift
is the ratio of the probability of A and B occurring together to the probability expected
if A and B occurred independently.
Support[A ⇒ B] = (count of cases with both A and B) / (total count of cases)

Confidence[A ⇒ B] = (count of cases with both A and B) / (count of cases with A)

Lift[A ⇒ B] = Confidence[A ⇒ B] / Support[B]
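These three measures can be computed directly from a list of transactions. The sketch below uses made-up shopping transactions echoing the computer/software example, not the vulnerability data mined in this project:

```python
# Support, confidence, and lift computed over a small transaction list.
# Item sets are Python sets; a rule A => B is a pair of sets (a, b).

def support(transactions, items):
    """Fraction of transactions containing all of `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, a, b):
    """Of the transactions containing A, the fraction also containing B."""
    return support(transactions, a | b) / support(transactions, a)

def lift(transactions, a, b):
    """Confidence of A => B relative to the baseline support of B."""
    return confidence(transactions, a, b) / support(transactions, b)

transactions = [
    {"computer", "financial_software"},
    {"computer", "financial_software", "printer"},
    {"computer"},
    {"printer"},
]
a, b = {"computer"}, {"financial_software"}
# Here support(A => B) = 2/4, confidence = 2/3, and lift = (2/3) / (2/4),
# so lift > 1 indicates the two items co-occur more often than chance.
```

Association miners such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds rather than evaluating one rule at a time.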
2.2.4 Clustering
Han and Kamber describe a cluster as a collection of objects similar to others in
the same cluster and dissimilar to objects outside the cluster. The process of grouping
these objects is called clustering. Clustering methods include partitioning, hierarchical,
and density. Given a predetermined number of partitions to create, the partitioning
methods classify the cases into that number of partitions. Next, they iteratively relocate
cases from one partition to another in an attempt to improve the partitions. Hierarchical
methods can be separated into agglomerative and divisive. In an agglomerative approach,
objects start out in separate groups and are successively merged until they form the top
level of the hierarchy.
With the divisive methods, objects start out in one group and are successively split up
until the termination condition is reached. Density-based methods, as the name implies,
form clusters based on a minimum density in the neighborhood of objects rather than the
distance between them. Unlike classification and association, clustering is difficult to
evaluate. Ultimately the evaluation comes in the form of an interestingness measure
assigned by the user or domain expert.
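The iterative-relocation idea behind partitioning methods can be sketched as a small one-dimensional k-means. This is an illustrative toy, with invented points and a naive initialization, and is not the clustering tooling used in this project:

```python
# One-dimensional k-means: a partitioning method that assigns each point
# to its nearest centroid, recomputes the centroids, and repeats.

def kmeans_1d(points, k, iterations=20):
    """Partition `points` into k clusters by iterative relocation."""
    centroids = points[:k]  # naive initialization from the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Relocate each centroid to the mean of its cluster (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, k=2)
# On this data the procedure settles into two groups, one around the low
# values and one around the high values.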
2.3 Literature Review
Several articles relevant to this research are presented here. Specifically,
they cover data mining in security, vulnerability analysis and the single available article
on data mining vulnerability databases.
2.3.1 Data Mining in Vulnerability Databases
‘Data Mining in Vulnerability Databases’ [8] is a project update covering work
conducted at the Darmstadt University of Technology in 2000. Schumacher et al. [8]
discuss their research on the Testbed for Reliable, Ubiquitous, Secure Transactional,
Event-driven and Distributed Systems (TRUSTED). The research appears to be in
several areas centered on the development of a vulnerability database (VDB).
VDB providers and collections of vulnerability information are available from
many sources. Computer Emergency Response Teams (CERTs) have been established in
various countries including the US, Germany, and Australia. They publish warnings for
highly dangerous vulnerabilities or those impacting large groups of people. Other
sources include mailing lists, newsletters, and news groups available from commercial,
government, and private organizations. The article references FIRST, ISS X-Force,
CIAC, NT Bugtraq, Phrack, Rootshell, Security Focus, INFILSEC, and others. Many of
these sources are no longer available, while others have been combined, and new sources
have emerged.
An underlying aim of research under TRUSTED is the security and confidence in
IT systems. It is not only concerned with these systems remaining free of software bugs,
but also that they remain under the control and operation of the systems' owners. To this end
much of the work is centered on assessment, prognosis, and avoidance. Work on
assessment includes the comparison of comparable compromises and recommendations
of counter-measures. Work in prognosis covers the likelihood a given vulnerability will
occur, while avoidance seeks to steer clear of known flaws in future software. The
authors observe that background information as well as specific details will be required to
complete their work. Some of this can be accomplished by experts; however, there is far
too much data for human analysis alone.
With regard to data mining in VDBs, they propose that it can suit two purposes:
discovery of new patterns and prediction of class outcome of new cases. However,
before selecting mining techniques, there are four challenges faced by this project:
sufficient training instances, known classifications, concept of “free from vulnerabilities”,
and description of vulnerabilities. The lack of a widely accepted and widely
used ontology also appears to pose a significant obstacle for the identification and
classification of vulnerabilities.
This article [8] also features a lengthy discussion of the operating model to be
used for maintaining VDBs. Proposed organizational forms considered are centralized,
federated, open source, and Balkanized. In the centralized model there would be
precisely one VDB into which all disclosures would be contributed. The federated model
would be a conglomeration of separately owned datasets horizontally partitioned by
subject. The open-source model is similar to that used for software development and
allows for all users to access or contribute to all data. The Balkanized model was in use
at the time the article was written and is predominately the case today. In this model
there is no coordination or control of the various VDBs.
Additional topics covered in this article include the population of, access to, and confidence
in the database, as well as the protection of contributors. While the title and abstract indicate
that this article covers data mining of vulnerability databases, there is no mention of the
actual data mining activities or results. It is in essence a very high-level discussion of the
authors' work. No other mention of this work or TRUSTED could be located.
Nonetheless, this article is included in the literature review for completeness.
2.3.2 Windows of Vulnerability: A Case Study Analysis
This article was featured in IEEE Computer, December 2000 [9]. The authors
claim that well after fixes are provided, systems often remain vulnerable. They present
several case studies and propose a life-cycle model.
While vulnerabilities transition through distinct states ranging from the creation of
the flaw or exposure, through discovery and exploit, and on to patching and removal,
systems also transition through distinct stages relative to vulnerabilities. These systems'
states often oscillate between hardened and vulnerable, with periodic
excursions into compromised, as expressed in the figure below. In the article hardened is
described as having all security related patches and corrections applied. Hardening is a
continuous process. A system enters the vulnerable state when security related
corrections have not been made. Compromise occurs when the system is no longer
sufficiently hardened against current threats and a vulnerability is exploited.
Figure 2-2. Host Life-cycle [9]
The life-cycle model proposed by Arbaugh and McHugh [9] is more detailed than
what is typically used in analysis, capturing all possible states through which a vulnerability
can pass during its lifetime. These states include: birth, discovery, disclosure,
correction, publicity, scripting, and death.
The authors select their study data from the CERT/CC database, which covers the
period from 1996 through 1999. While they observe that the CERT/CC is perhaps the best
possible source for data of this type, there are some issues. First, they point out that the
records are self-selecting, in that only a subset of attacks will be reported. This is due to
many factors including reluctance by an organization to divulge this type of information.
Second, exploit of vulnerabilities is influenced by human factors such as interestingness.
A given attack may be very popular for a time, and then pass into obscurity, not because
the vulnerability has been removed, rather because it is no longer interesting.
The authors derive three case studies from the CERT data based on vulnerabilities
for the Phf Common Gateway Interface (CGI), Internet Message Access Protocol
(IMAP), and Berkeley Internet Name Domain (BIND) service. Phf CGI utilized server-side
scripting to provide web server functionality based on the UNIX phonebook command
(ph) concatenated with user input. In this incident an implementation error was exploited
allowing attackers to execute arbitrary code. The vulnerability was disclosed in February
1996, followed quickly by instructions on how to correct the issues. The first scripted
exploit was published in June 1996. The majority of the activity occurred after the script
was published, attempting a simpler, less effective attack. This supported the authors'
hypothesis that scripting significantly increases the exploitation rate. The IMAP incident
resulted from an error in the source code which allowed the use of a long username to
cause a buffer overflow. The vulnerability was posted in March 1997, along with fixes
for the source code. The first known scripting of the vulnerability was in May 1997. An
additional flaw was reported, exploited and scripted a year later. However, the CERT
announcements combined the two. This incident illustrated how much network scanning
and probing are used to identify vulnerable systems. The BIND incident also involved a
flaw resulting in a buffer overflow. The flaw was announced in April 1998, and automated exploits were
available two months later. The authors note that due to the critical role of BIND in the
internet infrastructure, they would have expected much closer management and
mitigation of the vulnerability.
Figure 2-3. Rate of Intrusions [9]
The authors mention the ongoing debate over whether to disclose vulnerabilities.
They cite their research as evidence that automation of the exploit, not the disclosure
itself, is the key to large-scale use. Additionally, while patches are typically available
shortly after disclosure, large numbers of systems remain vulnerable, stating “deployment
of corrections is woefully inadequate … many systems remain vulnerable to security
flaws months or even years after corrections become available” [9].
2.3.3 Data Mining for Security
The article Data Mining for Security appeared in NEC Journal of Advanced
Technology, Winter 2005 [10]. The authors describe several new applications to perform
different aspects of intrusion detection utilizing data mining algorithms. The three
applications are SmartSifter, which is used for outlier detection; ChangeFinder, for
change-point detection; and an anomaly detection engine called AccessTracker.
The authors claim that the outlier detection engine, SmartSifter, is adaptive,
efficient, and highly accurate. It works by learning a statistical model from past
examples and applying it to the current data. The anomaly score calculated for each datum
will be high for outliers. This permits intrusion detection efforts to focus more
efficiently. A histogram density is used for calculating the statistical model of discrete
variables; while a Gaussian mixture model is used for continuous variables. The
parameters of the model are learned using an unsupervised, on-line, discounting learning
algorithm. When tested against the KDDCup99 dataset, they found that it outperformed
the Burge & Shawe-Taylor method. Additionally, they developed outlier-filters which
significantly improve the performance of SmartSifter and can be used to preprocess data
for outlier detection.
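The histogram-density idea can be illustrated with a small sketch. This is not the actual SmartSifter algorithm; the discounting rate, the smoothing constant, and the toy data stream are all invented for illustration. An on-line, discounting histogram scores each datum by its Shannon information under the model learned so far, so rare values stand out:

```python
import math
from collections import defaultdict

class DiscountingHistogram:
    """Toy sketch of an on-line, discounting histogram density estimator.

    Each new datum is scored by its Shannon information (-log p) under the
    model learned so far, then the old counts are discounted and the new
    observation is added.  High scores flag statistical outliers.
    """

    def __init__(self, discount=0.05, smoothing=0.5):
        self.r = discount           # discounting (forgetting) rate
        self.smoothing = smoothing  # Laplace-style smoothing for unseen values
        self.counts = defaultdict(float)
        self.total = 0.0

    def score_and_update(self, value):
        # probability of the value under the current (smoothed) histogram
        vocab = len(self.counts) + 1
        p = (self.counts[value] + self.smoothing) / (self.total + self.smoothing * vocab)
        score = -math.log2(p)       # anomaly score: Shannon information in bits
        # discount old evidence, then add the new observation
        for k in self.counts:
            self.counts[k] *= (1.0 - self.r)
        self.total *= (1.0 - self.r)
        self.counts[value] += 1.0
        self.total += 1.0
        return score

model = DiscountingHistogram()
stream = ["http"] * 50 + ["telnet"]     # mostly routine values, one oddity
scores = [model.score_and_update(v) for v in stream]
print(scores[-1] > max(scores[1:-1]))   # the rare value scores highest
```

The discounting step is what makes the model adaptive: old observations fade, so the density tracks the recent behavior of the stream.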
The ChangeFinder application is aimed at detecting worms or viruses by
examining the time series data in system logs. The authors report they have observed
that viruses and worms are frequently characterized by bursts of activity. A new
outbreak often causes a significant increase in access log entries. Therefore, a goal of
ChangeFinder is to detect the earliest point in time that a virus or worm emerges. During
the first stage of learning, the on-line discounting algorithm is used to build an
auto-regression model and calculate an anomaly score based on Shannon information theory.
Next, smoothing is performed using moving averages of the anomaly score. This is
followed by the building of a new auto-regression model based on the smoothed data.
AccessTracker is the anomalous behavior detection engine. It is used to detect
intrusion patterns such as those of a Trojan horse or a series of UNIX commands issued
by an intruder. It works by breaking time series data into sessions and building a
statistical model from them. The statistical model is based on a mixture of hidden
Markov models, such as those representing a user’s UNIX command history. Again, the
on-line discounting algorithm is used to learn the model followed by calculation of an
anomaly score for each session. AccessTracker dynamically tracks the changes in the
mixture of components in the model. Thus, an increase in mixture components
signals the emergence of a new pattern, while a decrease signals that a pattern
has disappeared.
The three applications differ to some degree in the data they accept, their processing, and
their intended use. At the same time, they are very similar in their functionality. SmartSifter
cannot process time series data, whereas ChangeFinder can. Both are concerned with
detecting local anomalies by measuring how anomalous a specific data point is.
AccessTracker, however, is focused on the detection of patterns of anomalous behavior.
3 The Investigation
3.1 Scope
This research is a master’s project in computing and software systems. The
research was conducted between Spring 2005 and Winter 2006. While there are a
number of cyber vulnerability datasets, this project has remained focused on a single
version of one dataset available at the outset of the project. This project utilizes
algorithms readily available in the tools specified in section 3.2.3. Both of these tools are
extensible; however, no modifications or enhancements to the algorithms have been made
beyond adjustment of parameters through the program interface, due to the richness of the
methods available. The data models produced were analyzed for interestingness and
novelty. Analysis included evaluation metrics available in the tools, my own subjective
evaluation, and my advisor’s assessment. The project scope is limited to answering the
research questions provided in section 1.3.
3.2 Resources and Technologies
3.2.1 Training/Test Dataset
Table 2-2 lists six organizations which maintain vulnerability datasets potentially
appropriate for this project. There were varying degrees of availability and utility
associated with each of them. The ICAT Metabase was selected for several reasons. The
most significant factors were its close similarity to the CVE Dictionary described in
section 2.1 and its maturity. This dataset has been available in the format used here since
1998. An additional influencing factor was ease of access to the data; it can be
downloaded from the NIST website and has no licensing restrictions on the data. It
consists of a single 15MB database file “ICAT Master.mdb” in Microsoft Access format.
It was posted to the NIST website on March 1, 2005 and downloaded March 7, 2005.
3.2.1.1 Descriptive Summary
From the original 87 attributes, 30 were selected for their value in identifying, evaluating,
or characterizing vulnerabilities by case, date, severity, range, loss, type, requirements,
and components. Attributes are categorized into case identification comprised of CVE
ID and publication date, Severity, Attack Requirements (AR), Loss Type (LT),
Vulnerability Type (VT), and Exposed Component (EC). Relevant attributes are listed in
Table 3-1 and descriptive statistics are provided in Table 3-2.
Field Name | Data Type | Description
CVE_ID | Nominal | The standard CVE name for the vulnerability.
Publish_Date | Nominal | The date on which the vulnerability is published within ICAT.
Severity | Ordinal | The severity assigned to the vulnerability.
AR_Launch_remotely | Numerical | Attack requires no previous access to the system; launched remotely.
AR_Launch_locally | Numerical | Attack requires launch on the local system; must have some previous access.
AR_Target_access_attacker | Numerical | Victim must access the attacker's resource for the attack to take place.
LT_Security_protection | Numerical | Loss Type: security protection, escalation of privileges.
LT_Obtain_all_priv | Numerical | Loss Type: security protection, administrator access gained.
LT_Obtain_some_priv | Numerical | Loss Type: security protection, user-level privilege gained.
LT_Confidentiality | Numerical | Loss Type: loss of confidentiality, theft of information.
LT_Integrity | Numerical | Loss Type: loss of integrity; attacker can change information residing on or passing through a system.
LT_Availability | Numerical | Loss Type: loss of availability.
LT_Sec_Prot_Other | Numerical | Loss Type: loss of security protection where some non-user privilege was gained by the attacker.
VT_Boundary_condition_error | Numerical | Vulnerability Type: boundary condition error.
VT_Buffer_overflow | Numerical | Vulnerability Type: buffer overflow.
VT_Access_validation_error | Numerical | Vulnerability Type: access validation error.
VT_Exceptional_condition_error | Numerical | Vulnerability Type: exceptional condition handling error.
VT_Environment_error | Numerical | Vulnerability Type: environmental error.
VT_Configuration_error | Numerical | Vulnerability Type: configuration error.
VT_Race_condition | Numerical | Vulnerability Type: race condition.
VT_Other_vulnerability_type | Numerical | Vulnerability Type: not explicitly listed as an option.
VT_Design_Error | Numerical | Vulnerability Type: design error.
EC_Operating_system | Numerical | Exposed Component: operating system.
EC_Network_protocol_stack | Numerical | Exposed Component: network protocol stack of an operating system.
EC_Non_server_application | Numerical | Exposed Component: non-server application.
EC_Server_application | Numerical | Exposed Component: server application.
EC_Hardware | Numerical | Exposed Component: hardware.
EC_Communication_protocol | Numerical | Exposed Component: communication protocol.
EC_Encryption_module | Numerical | Exposed Component: encryption module.
EC_Other | Numerical | Exposed Component: some component not explicitly listed.
EC_Specific_component | Nominal | Exposed Component: lists the specific program.
Table 3-1 Dataset Attributes
Attribute | Mean | Std. Error of Mean | Variance | Std. Deviation
AR_Launch_remotely | 0.71 | 0.005 | 0.205 | 0.452
AR_Launch_locally | 0.33 | 0.005 | 0.221 | 0.470
AR_Target_access_attacker | 0.02 | 0.002 | 0.018 | 0.135
LT_Security_protection | 0.59 | 0.006 | 0.242 | 0.492
LT_Obtain_all_priv | 0.24 | 0.005 | 0.184 | 0.429
LT_Obtain_some_priv | 0.13 | 0.004 | 0.114 | 0.338
LT_Confidentiality | 0.20 | 0.005 | 0.159 | 0.399
LT_Integrity | 0.13 | 0.004 | 0.113 | 0.336
LT_Availability | 0.27 | 0.005 | 0.195 | 0.442
LT_Sec_Prot_Other | 0.23 | 0.005 | 0.180 | 0.424
VT_Input_validation_error | 0.47 | 0.006 | 0.249 | 0.499
VT_Boundary_condition_error | 0.05 | 0.003 | 0.048 | 0.220
VT_Buffer_overflow | 0.22 | 0.005 | 0.170 | 0.412
VT_Access_validation_error | 0.11 | 0.004 | 0.095 | 0.308
VT_Exceptional_condition_error | 0.10 | 0.004 | 0.093 | 0.306
VT_Environment_error | 0.02 | 0.001 | 0.016 | 0.125
VT_Configuration_error | 0.07 | 0.003 | 0.062 | 0.250
VT_Race_condition | 0.02 | 0.002 | 0.021 | 0.147
VT_Other_vulnerability_type | 0.02 | 0.001 | 0.015 | 0.124
VT_Design_Error | 0.25 | 0.005 | 0.189 | 0.435
EC_Operating_system | 0.19 | 0.005 | 0.152 | 0.390
EC_Network_protocol_stack | 0.01 | 0.001 | 0.011 | 0.105
EC_Non_server_application | 0.28 | 0.005 | 0.203 | 0.451
EC_Server_application | 0.49 | 0.006 | 0.250 | 0.500
EC_Hardware | 0.03 | 0.002 | 0.026 | 0.160
EC_Communication_protocol | 0.00 | 0.000 | 0.001 | 0.026
EC_Encryption_module | 0.01 | 0.001 | 0.006 | 0.080
EC_Other | 0.01 | 0.001 | 0.012 | 0.110
CertAdv | 0.04 | 0.002 | 0.039 | 0.196
High_Sev | 0.50 | 0.006 | 0.250 | 0.500
Medium_Sev | 0.44 | 0.006 | 0.246 | 0.496
Low_Sev | 0.07 | 0.003 | 0.064 | 0.254
Table 3-2 Descriptive Statistics
Severity is a key attribute since it significantly influences how organizations react
to a specific vulnerability. The web graph in figure 3-1 illustrates the occurrence of other
attributes in relation to the severity levels. Additional web graphs can be found in
appendix D.
Figure 3-1 Severity Web Graph
It is important to note that a vulnerability may have multiple features of the same
type, such as AR_Launch_locally and AR_Launch_remotely, or LT_Integrity and
LT_Availability. Table 3-3 indicates the distribution of vulnerability features.
Table 3-3 – Distribution of Vulnerability Attributes
AR attributes would more accurately be termed range. These fields specify
whether the attack can be initiated remotely or requires execution from the vulnerable
system (see figure 3-2).
[Figure: bar chart of record counts for each attack range: launch remotely, launch locally, and target must access attacker.]
Figure 3-2 Attack Range
There are seven LT attributes, which are listed in figure 3-3. These fields convey
the scope of loss. Loss of security protection indicates whether admin, user, or non-user
privileges are obtained on the vulnerable system. Loss of availability is equivalent to
some form of denial of service and can run the gamut from a simple reboot to
catastrophic system failure. Loss of confidentiality means that data was accessed or
stolen during the attack. Loss of integrity means that the system has been compromised
and users can no longer trust its security and accuracy.
[Figure: bar chart of record counts for each loss type.]
Figure 3-3 Loss Type
VT attributes describe the manner in which the vulnerability manifests, which
includes boundary conditions, buffer overflow, exception handling, environment,
configuration, and race condition (see figure 3-4).
[Figure: bar chart of record counts for each vulnerability type.]
Figure 3-4 Vulnerability Type
EC attributes include nine fields and cover the particular part of the system that is
vulnerable, such as hardware, operating system, application, communication or network
protocol, and encryption module (see figure 3-5).
[Figure: bar chart of record counts for each exposed component.]
Figure 3-5 Exposed Component
3.2.1.2 Preprocessing
It was necessary to perform several preprocessing tasks on the dataset prior to
beginning the data exploration and subsequent data mining. These included data
cleaning, reduction and transformation. The majority of this activity was performed
using Microsoft Access and Excel.
The data cleaning efforts involved reconstructing missing values where
possible. This was addressed by ensuring there were entries for other attributes of that
type and then setting the missing value to false. As an example, in instances missing the
value for LT_Confidentiality, but with a true value in other fields of the Loss Type
category, the missing value was set to false. There were 39 records marked
**REJECT**; these were deleted from the dataset. Records where the missing value
could not be replaced were removed from the dataset.
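As a sketch, the cleaning rule described above might be expressed as follows. The loss-type field names follow Table 3-1, but the record layout and the placement of the **REJECT** marker in a description field are assumptions:

```python
LOSS_FIELDS = ["LT_Security_protection", "LT_Confidentiality",
               "LT_Integrity", "LT_Availability"]

def clean_record(rec):
    """Return a cleaned copy of a record, or None if it must be dropped.

    A missing loss-type value is reconstructed as False only when at least
    one sibling field in the same category carries a real value; otherwise
    the record is removed, mirroring the cleaning rule described above.
    """
    # assumption: rejected entries carry a **REJECT** marker in a text field
    if "**REJECT**" in rec.get("description", ""):
        return None
    present = [f for f in LOSS_FIELDS if rec.get(f) is not None]
    if not present:
        return None                  # nothing to reconstruct from: drop record
    cleaned = dict(rec)
    for f in LOSS_FIELDS:
        if cleaned.get(f) is None:
            cleaned[f] = False       # reconstruct the missing value as false
    return cleaned
```

For example, a record with LT_Integrity set but LT_Confidentiality missing would come back with LT_Confidentiality reconstructed as false, while a record with no loss-type entries at all would be dropped.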
Data reduction tasks included dimension reduction and attribute construction.
While the majority of the dataset was complete and consistent with expected entries,
there were a number of fields which have been deprecated by NIST and several others
which contain free form text. The 29 deprecated attributes were removed from the
dataset. The free form text attributes consisted of comments, legacy candidate numbers,
and lists of vendors and software. Exploration of this data revealed that it was highly
inconsistent in content and format. Since they were likely to induce noise in the dataset,
these fields were removed. Because NIST’s criteria for assigning high severity to a
vulnerability include the issuance of a CERT advisory, it was determined that a
new attribute indicating this was necessary. Attribute construction was used to derive a
new field from the 20 attributes used to reference associated records from other sources,
such as those discussed in section 2. The specific attribute created was a
Boolean attribute for CERT Advisory.
This dataset is largely comprised of nominal data. The majority of the attributes
are Boolean, represented as -1 for true and 0 for false. Data transformation was
performed on all of the numeric Boolean values, setting them to “T” for true and “F” for
false.
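The attribute construction and Boolean recoding described above can be sketched as follows. The reference field name ref_cert_advisory and its value are hypothetical stand-ins for the 20 reference attributes in the real dataset:

```python
def derive_cert_adv(rec, ref_fields):
    """Construct the Boolean CertAdv attribute: true when any of the
    (hypothetical) CERT reference fields is populated."""
    return any(rec.get(f) for f in ref_fields)

def recode_boolean(value):
    """Recode Access-style numeric Booleans (-1 = true, 0 = false) as T/F."""
    return "T" if value == -1 else "F"

rec = {"ref_cert_advisory": "CA-1996-06",   # hypothetical reference field
       "LT_Confidentiality": -1,
       "LT_Integrity": 0}
rec["CertAdv"] = -1 if derive_cert_adv(rec, ["ref_cert_advisory"]) else 0
symbolic = {k: recode_boolean(v) for k, v in rec.items()
            if k != "ref_cert_advisory"}
print(symbolic)   # CertAdv and LT_Confidentiality map to "T"; LT_Integrity to "F"
```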
3.2.2 Evaluation Dataset
In January 2006 a sample of new vulnerability cases was downloaded from the
NIST website for the purpose of evaluating the performance of data models generated in
the data mining phase. This dataset contained 274 new records as verified by a
comparison of the CVE_ID numbers. This dataset underwent preprocessing identical to
the training/test dataset.
3.2.3 Tools
3.2.3.1 WEKA
Weka is a popular data mining application developed at the University of Waikato
in New Zealand [11]. It contains a variety of both supervised and unsupervised machine
learning algorithms for preprocessing, classification, association, and clustering tasks.
These algorithms are implemented in Java classes and organized as packages of related
classes. Users have the option of accessing the application from a command line
interface (CLI), through the graphical user interface (GUI), or via their own Java code.
Weka is released under the GNU General Public License and can be obtained at
www.cs.waikato.ac.nz/ml/weka.
Weka requires input data to be in the Attribute-Relation File Format (ARFF).
This is essentially an ASCII text file divided into header and data sections. The header
section contains a @RELATION tag for identifying the name and relation of the data.
This is followed by a list of all attributes in the dataset. Each is required to be in the
specific format @ATTRIBUTE <attribute name> <numeric or values>. In the case of
nominal rather than numeric attribute values, all possibilities must be enumerated in braces.
An ARFF formatted sample of the dataset used for this project is included in Appendix
A.
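For illustration, a minimal ARFF file in the spirit of this dataset might look like the following. The attribute names follow Table 3-1, but the nominal value set for Severity and the @DATA rows are invented:

```
@RELATION icat_vulnerabilities

@ATTRIBUTE CVE_ID string
@ATTRIBUTE Severity {High,Medium,Low}
@ATTRIBUTE LT_Confidentiality {T,F}
@ATTRIBUTE VT_Buffer_overflow {T,F}

@DATA
CVE-1999-0001,High,F,T
CVE-1999-0002,Medium,T,F
```

Note that the nominal attributes list all legal values in braces, while CVE_ID is declared as a string since its values are open-ended.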
Figure 3-6 Weka GUI
The Weka GUI contains five components, depicted in figure 3-6: the GUI
Chooser, Simple CLI, Explorer, Experiment, and KnowledgeFlow Environments. The
first two perform just what their names imply. The Weka Explorer is an interactive data
mining environment with options to preprocess, classify, cluster, associate, and visualize
data. The user selects the dataset and any preprocessing filters desired. Next, the data
mining task is selected, configured, and started. The Experiment Environment has similar
functionality, but is not intended to be interactive. Rather, the user configures and runs the
selected algorithms, with the results saved to a file. The KnowledgeFlow Environment is
comparable to the streams paradigm available in SPSS Clementine. The user builds a
stream from the available objects consisting of data sources, filters, classifiers, clusterers,
evaluators, and visualizers. Once configured, the stream is executed and results written
to the selected visualization object.
3.2.3.2 Clementine
Clementine is a commercial data mining application from SPSS Inc. It is
described as an enterprise-strength data mining workbench designed around the
CRISP-DM standard [1]. Clementine has various features for accepting, processing, and mining
input data as well as representing and storing the results. Similar to other data mining
applications, Clementine has capabilities for building decision trees, neural networks,
statistical models, association, clustering, and text mining.
Figure 3-7 Clementine Workbench
Data mining in Clementine is based on the stream paradigm where a stream is
constructed of nodes used to access and manipulate the data, build the model, visualize,
and store results. Nodes are categorized by the functionality they provide which includes
sources, record operations, field operations, graphs, modeling, and output. Source nodes
are used to access the dataset, and include user input, databases, text files, SPSS and SAS
file formats. The record operations nodes allow the manipulation of whole records in the
datasets. They include aggregate, append, balance, distinct, merge, sample, select, and
sort. Field operations contain nodes for manipulating and transforming record attributes.
These include binning, derive, filter, partition, reclassify, and reorder. The graph nodes,
such as distribution, evaluation, histogram, and plot, may be used for descriptive data
summary and dataset characterization as well as displaying the results. The various
classification, clustering, and association algorithms are implemented in the modeling
nodes. These include nodes for decision tree and rule induction, neural networks, logistic
regression, and text extraction in addition to others. The output nodes provide the
capabilities for accessing, displaying, and storing the models and results. Table, matrix,
audit, database, and text file are available among others. Once the stream has been built
and executed in Clementine, the resulting data model may be viewed and analyzed or
incorporated into the stream as a node.
4 Results Analysis and Evaluation
This section focuses on the hypothesis that Cyber Vulnerability Databases contain
undiscovered patterns which are novel, interesting and of value to researchers and
industry professionals. An analysis and evaluation of the results is presented within the
context of each of the research questions. Association mining is applied to research
question Q1 in order to identify the frequent features sets. Research question Q2 is
addressed by classification mining to predict severity class. Research question Q3
examines both the association and classification rules, comparing them to the rules used
to produce the dataset. Research question Q4 summarizes the interesting patterns
discovered in the association and classification mining and discusses the clustering work.
4.1 Frequent Feature Sets
Q1: Which vulnerability attributes frequently occur together?
Research question Q1 seeks to further characterize the dataset as well as reveal
interesting or novel patterns in the association trends between attributes. Given the
granularity of the data models, results are generated for each of the severity levels.
Table 4-1 lists the frequent attribute sets. The Apriori and Generalized Rule Induction
(GRI) algorithms, as implemented in SPSS Clementine, were used to generate the
itemsets. Node configuration is addressed in section 4.3.1.
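For illustration, the level-wise Apriori idea can be sketched in a few lines of Python. This is not the Clementine implementation, and the toy transactions below are invented (the item names are drawn from the dataset's attributes):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori frequent-itemset miner: grow candidate itemsets level
    by level, keeping only those whose support meets the threshold."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # count support for the current candidate level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # join step: combine surviving k-itemsets into (k+1)-candidates
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

# toy transactions built from attribute names used in this dataset
txns = [frozenset(t) for t in [
    {"AR_Launch_remotely", "LT_Obtain_all_priv", "VT_Buffer_overflow"},
    {"AR_Launch_remotely", "LT_Obtain_all_priv"},
    {"AR_Launch_remotely", "LT_Confidentiality"},
    {"AR_Launch_locally", "LT_Obtain_all_priv"},
]]
freq = apriori(txns, min_support=0.5)
print(sorted(sorted(s) for s in freq))
```

The anti-monotonicity of support is what makes the level-wise search tractable: any superset of an infrequent itemset must itself be infrequent, so only survivors of one level are joined to form the next.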
There are eight high severity itemsets; 88% of them have attributes related
to the loss of security protection, whether all or some privileges were gained.
Additionally, 75% of the itemsets contain attributes indicating that the vulnerability could
be initiated remotely. The frequent attribute sets for medium severity vulnerabilities do
not have key characteristics as clearly identifiable as the other severity levels. However,
loss of any form of security protection is not a member of any of the itemsets, whereas
loss of availability and confidentiality frequently occur. The eight low severity
feature sets are comprised of attributes from the four significant feature categories
described in section 3.2.1.1, although only two of the sets have items from each group.
100% of the itemsets include loss of confidentiality. The rest of the itemsets have various
attributes from the other categories. Loss of confidentiality may be one of the most
representative characteristics. Server/non-server applications as the exposed component
comprise an additional key characteristic of low severity vulnerabilities.
High:
[AR_Launch_remotely,AR_Launch_locally,LT_Obtain_all_priv]
[AR_Launch_remotely,AR_Target_access_attacker]
[AR_Launch_remotely,EC_Server_application,LT_Sec_Prot_Other]
[AR_Launch_remotely,LT_Obtain_all_priv]
[AR_Launch_remotely,LT_Obtain_some_priv]
[AR_Launch_remotely,LT_Sec_Prot_Other,VT_Input_validation_error]
[LT_Obtain_all_priv,EC_Server_application]
[LT_Obtain_all_priv,VT_Buffer_overflow,VT_Input_validation_error]
Medium:
[AR_Launch_locally,LT_Availability]
[AR_Launch_locally,LT_Confidentiality]
[AR_Launch_locally,LT_Integrity]
[AR_Launch_remotely,LT_Availability,EC_Server_application,VT_Input_validation_error]
[AR_Launch_remotely,LT_Availability,VT_Exceptional_condition_error]
[AR_Launch_remotely,LT_Confidentiality]
[AR_Launch_remotely,LT_Integrity]
[AR_Launch_remotely,VT_Exceptional_condition_error,EC_Server_application]
[LT_Availability,EC_Non_server_application]
Low:
[AR_Launch_locally,LT_Confidentiality,EC_Non_server_application]
[AR_Launch_locally,LT_Confidentiality,EC_Operating_system]
[AR_Launch_remotely,LT_Confidentiality,AR_Launch_locally]
[AR_Launch_remotely,LT_Confidentiality,EC_Non_server_application]
[AR_Launch_remotely,LT_Confidentiality,EC_Server_application,VT_Input_validation_error]
[AR_Launch_remotely,LT_Confidentiality,EC_Server_application,VT_Design_Error]
[AR_Launch_remotely,LT_Confidentiality,VT_Design_Error]
[AR_Launch_remotely,LT_Confidentiality,VT_Input_validation_error]
Table 4-1 High, Medium, and Low Severity Frequent Attribute Sets
Much of the value for association mining lies in the generation of observations
about relationships between the attributes for each vulnerability record. The feature sets
aid in characterizing the dataset and help to gain insight. Thus, when examining new
vulnerabilities, seeing that one attribute is present may lead to the
discovery of the presence of other attributes which have a strong association with the
first. The frequent feature sets discovered in this research, as exhibited in table 4-1, are
representative of the dataset and therefore comprise interesting patterns.
4.2 Class Prediction
Q2: How accurately can the class of new examples be predicted?
Classification mining was performed using the C5.0, Classification and
Regression Tree (C&R), and Chi-Squared Automatic Interaction Detector (CHAID)
algorithms in Clementine. The classification modeling nodes were set to simple mode
and not to use partitioned data, weight, or frequency. C5.0 was configured for 10-fold
cross-validation. The resulting rule sets are discussed in section 4.3.2. Performance of
each data model is indicated in table 4-2. High severity classification performed
exceptionally well. Accuracy is above 86% on the training/test data and 94% for new
cases. Class prediction for high severity vulnerabilities is therefore considered very
accurate. Low and medium severity classification rules performed similarly, doing well
on the training/test dataset but not on the new cases. The low severity rules for each
algorithm have greater than 93% accuracy on the training/test data. However, there are
two very notable issues with the rule sets which suggest overfitting. First, all of the rules
are instances where low severity is false; there are no rules where low = true. The C5.0
algorithm did not produce any rules for low severity (refer to table C-7 in Appendix C).
Second, the high accuracy on training data is countered by poor accuracy on new
examples from the evaluation dataset discussed in section 3.2.2.
Model | Training/Test Dataset % Correct | New Case Dataset % Correct
High Severity (C&R) | 86.92% | 94.89%
High Severity (C5.0) | 88.90% | 98.18%
High Severity (CHAID) | 86.92% | 94.89%
Medium Severity (C&R) | 80.03% | 44.16%
Medium Severity (C5.0) | 83.11% | 44.16%
Medium Severity (CHAID) | 80.03% | 44.16%
Low Severity (C&R) | 93.08% | 48.91%
Low Severity (C5.0) | 93.08% | 49.91%
Low Severity (CHAID) | 93.08% | 48.91%
Table 4-2
4.3 Comparison of ML rules to NIST Metrics
Q3: How do the machine learning derived rules compare to those used to
produce the dataset?
As previously discussed, the dataset is derived from the ICAT Metabase produced
by the National Institute of Standards and Technology (NIST). Although NIST has
recently converted to the Common Vulnerability Scoring System (CVSS), the severity
rating assigned to records in the dataset is based on the following:
A vulnerability is High Severity if:
1. It allows a remote attacker to violate the security protection of a system (i.e. gain
some sort of user, root, or application account)
2. It allows a local attack that gains complete control of a system
3. It is important enough to have an associated CERT/CC advisory or US-CERT
alert.
A vulnerability is Medium Severity if:
1. It does not meet the definition of either “high” or “low” severity
A vulnerability is Low Severity if:
1. The vulnerability does not typically yield valuable information or control over a
system but instead gives the attacker knowledge that may help the attacker
find and exploit other vulnerabilities
2. NIST feels that the vulnerability is inconsequential for most organizations
4.3.1 Association Rules
Association rule mining has typically been applied to market basket analysis:
the examination of which products consumers purchase together, with the objective of
increasing sales. Early association mining in this project produced models with large
numbers of strong association rules. Unfortunately, the results were heavily skewed
towards a small number of attributes and were not inherently interesting. Figure 4-1 is a
histogram representing the record count (Y-axis) for each of the resulting association
rules (X-axis). The bars are composed of color segments representing each of the
attributes in a given feature set. This outcome was not surprising, since this is a real-world
dataset without an equal distribution of attributes throughout the records.
Figure 4-1 ARM Histogram
The Apriori and Generalized Rule Induction (GRI) algorithms, as implemented in
SPSS Clementine, are used to discover the associations and generate the rules.
Configuring the settings for support and confidence is largely a matter of trial and error.
The thresholds may be set too high, in which case the resulting rules may be too obvious to
have much value. In contrast, the thresholds may be set arbitrarily low, producing too
many rules to be accurately analyzed. Since these algorithms perform best in this tool
when the attribute values are binary (0/1, T/F), an additional transformation was
performed to divide the Severity attribute into three separate Boolean fields for High,
Medium, and Low. A Clementine data mining Stream is prepared with the appropriate
nodes to access the data source and perform the mining. Based on this refinement to the
dataset and the selected algorithms, eight data models are prepared as per table 4-3,
below.
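The severity transformation described above can be sketched as a simple record-level operation (a Python illustration, not the Clementine node; the field names follow the dataset, the function name is an assumption):

```python
def binarize_severity(records):
    """Replace the three-valued Severity attribute with three Boolean
    fields (High_Sev, Medium_Sev, Low_Sev), the binary form the Apriori
    and GRI nodes handle best."""
    out = []
    for rec in records:
        rec = dict(rec)              # leave the input records untouched
        sev = rec.pop("Severity")
        rec["High_Sev"] = sev == "High"
        rec["Medium_Sev"] = sev == "Medium"
        rec["Low_Sev"] = sev == "Low"
        out.append(rec)
    return out
```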
Each rule is presented in the form Antecedents => Consequent (Support %,
Confidence %, Lift). The rules are sorted in descending order by confidence,
then support, then lift. Since multiple algorithms were used, variations of equivalent
rules were discovered in some instances, and duplicates were removed. Appendix B
contains all rules and indicates which algorithm produced each. Additionally, some
rules have a lift value below 1.0 and are considered to be negatively correlated;
however, these rules are included for completeness.
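The support, confidence, and lift figures attached to each rule can be computed directly from Boolean-attribute records. A stdlib sketch (the function and toy attribute names are illustrative, not part of the dataset):

```python
def rule_metrics(records, antecedents, consequent):
    """Return (support, confidence, lift) for antecedents => consequent.
    Support: fraction of records containing every antecedent.
    Confidence: of those, the fraction that also contain the consequent.
    Lift: confidence divided by the consequent's overall rate; a value
    below 1.0 marks a negatively correlated rule."""
    n = len(records)
    covered = [r for r in records if all(r.get(a) for a in antecedents)]
    hits = sum(1 for r in covered if r.get(consequent))
    base = sum(1 for r in records if r.get(consequent)) / n
    support = len(covered) / n
    confidence = hits / len(covered) if covered else 0.0
    lift = confidence / base if base else 0.0
    return support, confidence, lift

# Toy data: the antecedent A carries no information about the consequent C,
# so lift comes out at exactly 1.0.
data = [{"A": 1, "C": 1}, {"A": 1, "C": 0}, {"A": 0, "C": 1}, {"A": 0, "C": 0}]
```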
Algorithm   Target Attribute   Minimum Support   Minimum Confidence
Apriori     Any                10%               80%
Apriori     High Severity      9%                80%
Apriori     Medium Severity    5%                60%
Apriori     Low Severity       2%                30%
GRI         Any                5%                85%
GRI         High Severity      1%                40%
GRI         Medium Severity    5%                40%
GRI         Low Severity       1%                20%
Table 4-3 Clementine Node Configuration
The high severity rule set contains 19 rules. Two are negatively correlated and
two more are single-item rules with less than 50% confidence. In spite of the high
confidence for most of the association rules, support percentages were generally low.
The predictive capability of these rules was poor on both the training/test dataset and the
new cases. This suggests that the machine-learned rules do not compare well to the
original metrics. Additionally, 14 of the rules contain attributes relating to loss of
security protection, with LT_Obtain_all_priv the most predominant. Rule five is a
single-item rule for that attribute with greater than 95% confidence. These patterns are
consistent with the NIST metric for high severity vulnerabilities, and as such are neither
novel nor interesting.
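Predictive capability here means treating a rule as a classifier: predict the consequent whenever all antecedents are present in a record. A sketch of that evaluation (an assumption about the procedure, not Clementine's implementation; names are illustrative):

```python
def rule_accuracy(records, antecedents, consequent):
    """Score a rule as a binary classifier: predict the consequent true
    exactly when every antecedent is present, then count matches against
    the record's actual consequent value."""
    correct = 0
    for r in records:
        predicted = all(r.get(a) for a in antecedents)
        correct += predicted == bool(r.get(consequent))
    return correct / len(records)

# Toy data: the rule A => H is right on 3 of 4 records.
cases = [{"A": 1, "H": 1}, {"A": 0, "H": 0}, {"A": 1, "H": 0}, {"A": 0, "H": 0}]
```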
1. VT_Buffer_overflow AND LT_Obtain_all_priv AND VT_Input_validation_error => High_Sev (9.53, 97.42, 1.97)
2. LT_Obtain_all_priv AND VT_Input_validation_error => High_Sev (14.15, 97.39, 1.97)
3. VT_Buffer_overflow AND LT_Obtain_all_priv => High_Sev (9.72, 97.33, 1.96)
4. LT_Obtain_all_priv AND AR_Launch_locally => High_Sev (13.42, 95.62, 1.93)
5. LT_Obtain_all_priv => High_Sev (24.29, 95.62, 1.93)
6. LT_Obtain_all_priv AND AR_Launch_remotely => High_Sev (12.10, 95.14, 1.92)
7. LT_Obtain_all_priv AND EC_Server_application => High_Sev (9.21, 94.07, 1.90)
8. AR_Launch_remotely AND AR_Launch_locally AND LT_Obtain_all_priv => High_Sev (1.46, 92.52, 1.87)
9. LT_Obtain_some_priv AND AR_Launch_remotely => High_Sev (10.22, 91.18, 1.84)
10. LT_Sec_Prot_Other AND EC_Server_application AND AR_Launch_remotely => High_Sev (10.52, 84.29, 1.70)
11. LT_Sec_Prot_Other AND VT_Input_validation_error AND AR_Launch_remotely => High_Sev (9.99, 84.13, 1.70)
12. LT_Obtain_some_priv => High_Sev (13.19, 81.55, 1.65)
13. LT_Sec_Prot_Other AND AR_Launch_remotely => High_Sev (18.00, 80.41, 1.62)
14. LT_Sec_Prot_Other AND EC_Server_application => High_Sev (11.73, 80.07, 1.62)
15. AR_Launch_remotely AND AR_Target_access_attacker => High_Sev (1.04, 55.26, 1.12)
16. AR_Launch_locally => High_Sev (32.86, 49.81, 1.01)
17. AR_Launch_remotely => High_Sev (71.30, 49.61, 1.00)
18. AR_Launch_remotely AND AR_Launch_locally => High_Sev (5.71, 49.04, 0.99)
19. AR_Target_access_attacker AND EC_Non_server_application => High_Sev (1.28, 46.81, 0.95)
Table 4-4 High Severity Association Rules
A large number of medium severity association rules were generated. It is
suspected that this is due to the broad range of vulnerabilities assigned to this class,
since the NIST metric for this severity is simply that the vulnerability is neither high
nor low. The association rules produced had moderate to high confidence (greater
than 50%). However, they did not perform well on the datasets, with accuracy less
than 44% on the training/test set and less than 12% on the new case dataset. Thus they
do not compare well to the source metrics. Nonetheless, there are interesting patterns
to be observed in this rule set. Most of the rules contain a loss of availability, and many
have an exceptional condition error as the vulnerability type. These attributes are often
related to Denial of Service (DoS) attacks. The pattern is therefore interesting in that it
represents a previously unknown relationship between medium severity vulnerabilities
and DoS attacks.
1. VT_Exceptional_condition_error AND LT_Availability => Medium_Sev (6.90, 86.73, 1.99)
2. VT_Exceptional_condition_error AND LT_Availability AND AR_Launch_remotely => Medium_Sev (6.26, 86.46, 1.98)
3. VT_Exceptional_condition_error AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (5.18, 73.09, 1.68)
4. VT_Exceptional_condition_error AND AR_Launch_remotely => Medium_Sev (9.10, 72.82, 1.67)
5. VT_Exceptional_condition_error AND EC_Server_application => Medium_Sev (5.47, 72.75, 1.67)
6. VT_Exceptional_condition_error => Medium_Sev (10.44, 71.73, 1.64)
7. AR_Launch_locally AND LT_Availability => Medium_Sev (5.08, 71.24, 1.63)
8. LT_Availability => Medium_Sev (26.65, 69.72, 1.60)
9. AR_Launch_remotely AND LT_Availability => Medium_Sev (22.88, 69.69, 1.60)
10. LT_Availability AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (13.43, 69.28, 1.59)
11. LT_Availability AND EC_Server_application => Medium_Sev (14.54, 69.08, 1.58)
12. LT_Integrity AND AR_Launch_locally => Medium_Sev (6.12, 68.08, 1.56)
13. LT_Confidentiality AND AR_Launch_locally => Medium_Sev (5.03, 63.32, 1.45)
14. LT_Integrity => Medium_Sev (13.00, 62.82, 1.44)
15. LT_Availability AND EC_Non_server_application => Medium_Sev (5.62, 62.04, 1.42)
16. LT_Availability AND VT_Input_validation_error AND AR_Launch_remotely => Medium_Sev (12.99, 61.20, 1.40)
17. LT_Availability AND VT_Input_validation_error AND EC_Server_application AND AR_Launch_remotely => Medium_Sev (8.28, 60.89, 1.40)
18. LT_Availability AND VT_Input_validation_error => Medium_Sev (14.36, 60.80, 1.39)
19. LT_Availability AND VT_Input_validation_error AND EC_Server_application => Medium_Sev (8.79, 60.50, 1.39)
20. AR_Launch_remotely AND LT_Integrity => Medium_Sev (7.34, 58.92, 1.35)
21. LT_Confidentiality => Medium_Sev (19.81, 55.07, 1.26)
22. AR_Launch_remotely AND LT_Confidentiality => Medium_Sev (15.66, 52.48, 1.20)
23. AR_Launch_locally => Medium_Sev (32.86, 45.62, 1.05)
24. AR_Launch_remotely => Medium_Sev (71.30, 42.70, 0.98)
Table 4-5 Medium Severity Association Rules
Due to the somewhat vague metric for low severity vulnerabilities (“we feel the
vulnerability is inconsequential …”), data mining for low severity rules took into account
both true and false cases; that is, cases where an attribute was absent as well as cases
where it was present. In mining high and medium severity rules, by contrast, the
Clementine modeling nodes were configured to consider record attributes with true
values only. The low severity association rule set is comprised of 20 entries with low to
moderate confidence (21-66%). The corresponding performance on all datasets was poor,
clearly indicating that the low severity rules do not map well to the criteria used in
assigning the actual severity levels. Even though the rules performed poorly for
prediction, there are interesting patterns to be observed in the rule set. All of the low
severity rules have a loss of confidentiality, and no loss of availability, integrity, or
escalation of account privileges. This is an indication that this class of severity can be
characterized by inappropriate disclosure of confidential information without the loss
of security protection.
1. LT_Confidentiality=T and LT_Availability=F and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Integrity=F => Low_Sev=F (14.87, 65.29, 0.70)
2. LT_Confidentiality=T and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=F (14.78, 65.10, 0.70)
3. LT_Confidentiality=T and LT_Availability=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=F (14.72, 64.84, 0.70)
4. LT_Confidentiality=T and LT_Availability=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=T (14.72, 35.16, 5.08)
5. LT_Confidentiality=T and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Obtain_some_priv=F and LT_Integrity=F => Low_Sev=T (14.78, 34.90, 5.04)
6. LT_Confidentiality=T and LT_Availability=F and LT_Obtain_all_priv=F and LT_Sec_Prot_Other=F and LT_Integrity=F => Low_Sev=T (14.87, 34.71, 5.01)
7. LT_Confidentiality AND EC_Server_application AND VT_Input_validation_error AND AR_Launch_remotely => Low_Sev (4.90, 34.54, 4.99)
8. LT_Confidentiality AND VT_Input_validation_error AND AR_Launch_remotely => Low_Sev (6.09, 34.30, 4.95)
9. LT_Confidentiality AND EC_Server_application AND VT_Input_validation_error => Low_Sev (4.99, 34.25, 4.95)
10. LT_Confidentiality AND EC_Non_server_application AND AR_Launch_remotely => Low_Sev (3.32, 34.16, 4.93)
11. LT_Confidentiality AND VT_Input_validation_error => Low_Sev (6.38, 33.62, 4.85)
12. LT_Confidentiality AND VT_Design_Error AND AR_Launch_remotely => Low_Sev (5.09, 31.37, 4.53)
13. LT_Confidentiality AND VT_Design_Error AND EC_Server_application AND AR_Launch_remotely => Low_Sev (2.84, 31.25, 4.51)
14. LT_Confidentiality AND EC_Non_server_application => Low_Sev (4.88, 30.81, 4.45)
15. LT_Confidentiality AND VT_Design_Error AND EC_Non_server_application => Low_Sev (2.35, 30.23, 4.37)
16. AR_Launch_remotely AND LT_Confidentiality => Low_Sev (15.66, 29.03, 4.19)
17. LT_Confidentiality => Low_Sev (19.81, 26.74, 3.86)
18. AR_Launch_locally AND LT_Confidentiality AND EC_Non_server_application => Low_Sev (1.60, 23.93, 3.46)
19. AR_Launch_remotely AND AR_Launch_locally AND LT_Confidentiality => Low_Sev (1.17, 23.26, 3.36)
20. AR_Launch_locally AND LT_Confidentiality AND EC_Operating_system => Low_Sev (1.45, 20.75, 3.00)
Table 4-6 Low Severity Association Rules
Model                      Training/Test Dataset % Correct   New Case Dataset % Correct
High Severity (Apriori)    49.50%                            37.23%
High Severity (GRI)        49.50%                            37.23%
Medium Severity (Apriori)  43.58%                            11.31%
Medium Severity (GRI)      43.58%                            11.31%
Low Severity (Apriori)     4.18%                             12.04%
Low Severity (GRI)         6.92%                             51.09%
Table 4-7 Association Rule Performance
4.3.2 Classification Rules
Each of the algorithms produced similar rule sets, although the CHAID rule sets
were the most comprehensive. The CHAID rules for the high, medium, and low severity
levels are provided in tables 4-8, 4-9, and 4-10 respectively, with the remaining rule sets
provided in Appendix C.
There were 18 rules generated for high severity vulnerabilities, and they performed
exceptionally well on both the training set and new cases. The C5.0 algorithm performed
well on the training/test dataset with 88.9% accuracy, and it performed best of all
the classification algorithms on the new cases at 98.18%. Given the solid
performance, the rules are considered to compare well to the NIST metric.
However, the patterns detected in the rule set were intuitive and therefore uninteresting.
Loss of security protection, in particular LT_Obtain_all_priv, was predominant. As with
the association rules, this is due, at least in part, to the NIST metric for high severity
vulnerabilities.
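A decision-tree rule set such as the one in table 4-8 can be applied as an ordered list of condition tests. A minimal stdlib sketch (the encoding is illustrative, not Clementine's export format, and only two of the 18 rules are transcribed):

```python
# Each rule pairs a dict of attribute conditions with the predicted class.
HIGH_SEV_RULES = [
    # cf. rule 18 of table 4-8
    ({"LT_Obtain_all_priv": True, "VT_Input_validation_error": True,
      "LT_Obtain_some_priv": True}, True),
    # cf. rule 5 of table 4-8
    ({"LT_Obtain_all_priv": False, "LT_Sec_Prot_Other": True,
      "AR_Launch_remotely": False, "EC_Server_application": False}, False),
]

def classify(record, rules, default=None):
    """Return the prediction of the first rule whose conditions all hold;
    attributes missing from a record are treated as False."""
    for conditions, label in rules:
        if all(record.get(k, False) == v for k, v in conditions.items()):
            return label
    return default
```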
1. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=F AND LT_Availability=F THEN High=F
2. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=F AND LT_Availability=T THEN High=F
3. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=T AND LT_Availability=T THEN High=F
4. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=F THEN High=F
5. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=F THEN High=F
6. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=T THEN High=F
7. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND VT_Buffer_overflow=T AND LT_Availability=F THEN High=T
8. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=F THEN High=T
9. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=T THEN High=T
10. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Server_application=F THEN High=T
11. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Server_application=T THEN High=T
12. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=T THEN High=T
13. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F THEN High=T
14. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=T THEN High=T
15. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F AND AR_Launch_locally=F THEN High=T
16. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F AND AR_Launch_locally=T THEN High=T
17. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=T THEN High=T
18. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=T THEN High=T
Table 4-8 High Severity CHAID Rule Set
Medium severity classification rule mining produced 18 rules in the CHAID rule
set. Eleven of these rules are negative instances, where the case is not rated as medium
severity; the remaining seven are true cases. Rules generated from all three algorithms
performed well on the training/test dataset, with accuracies above 80% (see table 4-2).
However, they each had accuracies below 45% on the new case dataset, performing
equally poorly.
1. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=F THEN Medium=F
2. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=T AND VT_Input_validation_error=T THEN Medium=F
3. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Non_server_application=F THEN Medium=F
4. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=F AND EC_Non_server_application=T THEN Medium=F
5. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=T AND VT_Exceptional_condition_error=T THEN Medium=F
6. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F AND AR_Launch_remotely=F THEN Medium=F
7. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=F AND AR_Launch_remotely=T THEN Medium=F
8. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=F AND EC_Non_server_application=T THEN Medium=F
9. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=F THEN Medium=F
10. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=F AND LT_Availability=T THEN Medium=F
11. IF LT_Obtain_all_priv=T AND VT_Input_validation_error=T AND LT_Obtain_some_priv=T THEN Medium=F
12. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=F AND LT_Integrity=F THEN Medium=T
13. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=F AND LT_Integrity=T THEN Medium=T
14. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=T AND VT_Buffer_overflow=F THEN Medium=T
15. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=F AND LT_Availability=T AND VT_Buffer_overflow=T THEN Medium=T
16. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=F AND LT_Obtain_some_priv=T AND AR_Launch_remotely=F THEN Medium=T
17. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=F THEN Medium=T
18. IF LT_Obtain_all_priv=F AND LT_Sec_Prot_Other=T AND AR_Launch_remotely=F AND EC_Server_application=T THEN Medium=T
Table 4-9 Medium Severity CHAID Rule Set
Low severity classification rules performed best on the training/test data out of
the three severity levels; the rules for each algorithm have greater than 93% accuracy.
However, there are two notable issues with these rule sets which suggest overfitting.
First, all of the rules are instances where low severity is false; there are no rules where
low severity is true. The C5.0 algorithm did not produce any rules for low severity
(refer to Appendix C, table C-7). Second, the high accuracy on training data is countered
by poor accuracy on new examples in the evaluation dataset.
1. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=F THEN Low=F
2. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=T AND EC_Server_application=F THEN Low=F
3. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=F AND AR_Launch_remotely=T AND EC_Server_application=T THEN Low=F
4. IF LT_Confidentiality=F AND VT_Input_validation_error=F AND LT_Obtain_all_priv=T THEN Low=F
5. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=F AND LT_Integrity=F THEN Low=F
6. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=F AND LT_Integrity=T THEN Low=F
7. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=F AND LT_Availability=T THEN Low=F
8. IF LT_Confidentiality=F AND VT_Input_validation_error=T AND VT_Buffer_overflow=T THEN Low=F
9. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=F AND VT_Design_Error=F THEN Low=F
10. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=F AND VT_Design_Error=T THEN Low=F
11. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=F AND VT_Input_validation_error=T THEN Low=F
12. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=F AND LT_Integrity=T THEN Low=F
13. IF LT_Confidentiality=T AND LT_Sec_Prot_Other=T THEN Low=F
Table 4-10 Low Severity CHAID Rule Set
4.4 Pattern Discovery
Q4: What novel or interesting patterns can be discovered from the dataset?
Hand et al. [3] describe data mining as “the analysis of (often large) observational
data sets to find unsuspected relationships and to summarize the data in novel ways that
are both understandable and useful to the data owner.” The association and classification
rules discussed above, as well as in the appendices, contain many interesting patterns:
they involve interesting and important attributes, have high confidence and support,
and reveal unsuspected or new information [12]. While the association and
classification results were previously depicted in rule or fact form, they may also be
summarized as the following patterns:
1. High severity vulnerabilities are characterized by gaining all privileges
associated with input validation error and buffer overflow.
2. Medium Severity vulnerabilities have several characteristics commonly
shared by Denial of Service Attacks. They typically consist of a loss of
availability and are remotely launched. They are frequently connected to an
exception error condition and buffer overflow.
3. Low severity vulnerabilities most frequently involve a loss of confidentiality
and do not include a loss of security protection where the attacker gained
unauthorized privileges.
Clustering was performed using the SPSS TwoStep clustering algorithm configured
to use all available attributes. Maximum tree depth was set to 5 levels, with a maximum
of 8 branches per leaf node. The number of clusters specified was 3. Clustering results
are frequently difficult to evaluate. Witten and Frank [11] observe that while
association and classification have objective criteria by which to judge success,
clustering does not; there is no intrinsic measure of right or wrong for a clustering.
Despite the lack of such criteria or indicators, some observations can still be made
from the results. Table 4-11 lists the frequency count and percentage of attribute
occurrence in each cluster. Tables 4-12 through 4-14 provide a statistical profile of
the three clusters.
1. Cluster 1 consists exclusively of high severity vulnerabilities, which
facilitates a clear decomposition of the remaining characteristics. Of the four attributes
related to loss of security protection, 75% of occurrences fall in this cluster.
2. Cluster 2 corresponds to low severity and does not contain a significant
percentage of any attribute subset.
3. The medium severity vulnerabilities are found in cluster 3. This cluster
contains the majority of the losses of confidentiality and availability.
Attributes / Cluster 1 (Frequency, Percent) / Cluster 2 (Frequency, Percent) / Cluster 3 (Frequency, Percent)
AR_Launch_remotely 2,591 49.6% 402 7.7% 2,230 42.7%
AR_Launch_locally 1,199 49.8% 110 4.6% 1,098 45.6%
AR_Target_access_attacker 69 50.4% 13 9.5% 55 40.1%
LT_Security_protection 3,430 79.6% 33 0.8% 848 19.7%
LT_Obtain_all_priv 1,701 95.6% 5 0.3% 73 4.1%
LT_Obtain_some_priv 787 81.6% 7 0.7% 171 17.7%
LT_Confidentiality 264 18.2% 388 26.7% 799 55.1%
LT_Integrity 314 33.0% 40 4.2% 598 62.8%
LT_Availability 543 27.8% 48 2.5% 1,361 69.7%
LT_Sec_Prot_Other 1,198 69.7% 22 1.3% 499 29.0%
VT_Input_validation_error 2,064 59.7% 182 5.3% 1,209 35.0%
VT_Boundary_condition_error 153 41.0% 11 2.9% 209 56.0%
VT_Buffer_overflow 1,216 76.4% 7 0.4% 369 23.2%
VT_Access_validation_error 367 47.1% 60 7.7% 353 45.3%
VT_Exceptional_condition_error 166 21.7% 50 6.5% 548 71.7%
VT_Environment_error 58 49.6% 6 5.1% 53 45.3%
VT_Configuration_error 233 47.6% 50 10.2% 207 42.2%
VT_Race_condition 70 43.5% 7 4.3% 84 52.2%
VT_Other_vulnerability_type 52 45.2% 11 9.6% 52 45.2%
VT_Design_Error 778 42.1% 184 9.9% 888 48.0%
EC_Operating_system 703 51.4% 56 4.1% 609 44.5%
EC_Network_protocol_stack 25 30.5% 3 3.7% 54 65.9%
EC_Non_server_application 1,080 51.9% 148 7.1% 851 40.9%
EC_Server_application 1,778 49.1% 283 7.8% 1,563 43.1%
EC_Hardware 66 34.2% 16 8.3% 111 57.5%
EC_Communication_protocol 5 100.0% 0 0.0% 0 0.0%
EC_Encryption_module 15 31.9% 3 6.4% 29 61.7%
EC_Other 39 43.8% 9 10.1% 41 46.1%
CertAdv 226 76.9% 3 1.0% 65 22.1%
High_Sev 3,626 100.0% 0 0.0% 0 0.0%
Medium_Sev 0 0.0% 0 0.0% 3,192 100.0%
Low_Sev 0 0.0% 507 100.0% 0 0.0%
Table 4-11 SPSS Two Step Cluster Frequency Table
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Std. Deviation / Grouped Median
AR_Launch_remotely 0.71 0.008 0.204 -1.097 0.081 -0.951 0.041 0.452 0.71
AR_Launch_locally 0.33 0.008 0.221 -1.482 0.081 0.720 0.041 0.471 0.33
AR_Target_access_attacker 0.02 0.002 0.019 47.637 0.081 7.044 0.041 0.137 0.02
LT_Security_protection 0.95 0.004 0.051 13.578 0.081 -3.946 0.041 0.226 0.95
LT_Obtain_all_priv 0.47 0.008 0.249 -1.986 0.081 0.124 0.041 0.499 0.47
LT_Obtain_some_priv 0.22 0.007 0.170 -0.114 0.081 1.373 0.041 0.412 0.22
LT_Confidentiality 0.07 0.004 0.068 8.827 0.081 3.290 0.041 0.260 0.07
LT_Integrity 0.09 0.005 0.079 6.653 0.081 2.941 0.041 0.281 0.09
LT_Availability 0.15 0.006 0.127 1.858 0.081 1.964 0.041 0.357 0.15
LT_Sec_Prot_Other 0.33 0.008 0.221 -1.480 0.081 0.721 0.041 0.470 0.33
VT_Input_validation_error 0.57 0.008 0.245 -1.923 0.081 -0.280 0.041 0.495 0.57
VT_Boundary_condition_error 0.04 0.003 0.040 18.771 0.081 4.556 0.041 0.201 0.04
VT_Buffer_overflow 0.34 0.008 0.223 -1.514 0.081 0.698 0.041 0.472 0.34
VT_Access_validation_error 0.10 0.005 0.091 5.001 0.081 2.645 0.041 0.302 0.10
VT_Exceptional_condition_error 0.05 0.003 0.044 16.916 0.081 4.348 0.041 0.209 0.05
VT_Environment_error 0.02 0.002 0.016 57.615 0.081 7.719 0.041 0.125 0.02
VT_Configuration_error 0.06 0.004 0.060 10.647 0.081 3.555 0.041 0.245 0.06
VT_Race_condition 0.02 0.002 0.019 46.886 0.081 6.990 0.041 0.138 0.02
VT_Other_vulnerability_type 0.01 0.002 0.014 64.836 0.081 8.173 0.041 0.119 0.01
VT_Design_Error 0.21 0.007 0.169 -0.065 0.081 1.391 0.041 0.411 0.21
EC_Operating_system 0.19 0.007 0.156 0.401 0.081 1.549 0.041 0.395 0.19
EC_Network_protocol_stack 0.01 0.001 0.007 140.242 0.081 11.923 0.041 0.083 0.01
EC_Non_server_application 0.30 0.008 0.209 -1.218 0.081 0.884 0.041 0.457 0.30
EC_Server_application 0.49 0.008 0.250 -2.000 0.081 0.039 0.041 0.500 0.49
EC_Hardware 0.02 0.002 0.018 50.029 0.081 7.211 0.041 0.134 0.02
EC_Communication_protocol 0.00 0.001 0.001 721.197 0.081 26.885 0.041 0.037 0.00
EC_Encryption_module 0.00 0.001 0.004 237.066 0.081 15.458 0.041 0.064 0.00
EC_Other 0.01 0.002 0.011 88.108 0.081 9.490 0.041 0.103 0.01
CertAdv 0.06 0.004 0.058 11.128 0.081 3.622 0.041 0.242 0.06
High_Sev 1.00 0.000 0.000 . . . . 0.000 1.00
Medium_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Low_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Table 4-12 SPSS Two Step Cluster 1 Profile
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Std. Deviation / Grouped Median
AR_Launch_remotely 0.79 0.018 0.165 0.103 0.217 -1.450 0.108 0.406 0.79
AR_Launch_locally 0.22 0.018 0.170 -0.103 0.217 1.377 0.108 0.413 0.22
AR_Target_access_attacker 0.03 0.007 0.025 34.376 0.217 6.020 0.108 0.158 0.03
LT_Security_protection 0.07 0.011 0.061 10.549 0.217 3.537 0.108 0.247 0.07
LT_Obtain_all_priv 0.01 0.004 0.010 97.379 0.217 9.950 0.108 0.099 0.01
LT_Obtain_some_priv 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
LT_Confidentiality 0.77 0.019 0.180 -0.425 0.217 -1.256 0.108 0.424 0.77
LT_Integrity 0.08 0.012 0.073 7.850 0.217 3.133 0.108 0.270 0.08
LT_Availability 0.09 0.013 0.086 5.735 0.217 2.777 0.108 0.293 0.09
LT_Sec_Prot_Other 0.04 0.009 0.042 18.282 0.217 4.496 0.108 0.204 0.04
VT_Input_validation_error 0.36 0.021 0.231 -1.659 0.217 0.590 0.108 0.480 0.36
VT_Boundary_condition_error 0.02 0.006 0.021 41.533 0.217 6.586 0.108 0.146 0.02
VT_Buffer_overflow 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
VT_Access_validation_error 0.12 0.014 0.105 3.632 0.217 2.370 0.108 0.323 0.12
VT_Exceptional_condition_error 0.10 0.013 0.089 5.313 0.217 2.700 0.108 0.298 0.10
VT_Environment_error 0.01 0.005 0.012 80.314 0.217 9.055 0.108 0.108 0.01
VT_Configuration_error 0.10 0.013 0.089 5.313 0.217 2.700 0.108 0.298 0.10
VT_Race_condition 0.01 0.005 0.014 68.124 0.217 8.358 0.108 0.117 0.01
VT_Other_vulnerability_type 0.02 0.006 0.021 41.533 0.217 6.586 0.108 0.146 0.02
VT_Design_Error 0.36 0.021 0.232 -1.680 0.217 0.572 0.108 0.481 0.36
EC_Operating_system 0.11 0.014 0.098 4.231 0.217 2.493 0.108 0.314 0.11
EC_Network_protocol_stack 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
EC_Non_server_application 0.29 0.020 0.207 -1.162 0.217 0.918 0.108 0.455 0.29
EC_Server_application 0.56 0.022 0.247 -1.952 0.217 -0.235 0.108 0.497 0.56
EC_Hardware 0.03 0.008 0.031 26.997 0.217 5.375 0.108 0.175 0.03
EC_Communication_protocol 0.00 0.000 0.000 . . . . 0.000 0.00
EC_Encryption_module 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
EC_Other 0.02 0.006 0.017 51.873 0.217 7.326 0.108 0.132 0.02
CertAdv 0.01 0.003 0.006 165.647 0.217 12.923 0.108 0.077 0.01
High_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Medium_Sev 0.00 0.000 0.000 . . . . 0.000 0.00
Low_Sev 1.00 0.000 0.000 . . . . 0.000 1.00
Table 4-13 SPSS Two Step Cluster 2 Profile
Attributes / Mean / Std. Error of Mean / Variance / Kurtosis / Std. Error of Kurtosis / Skewness / Std. Error of Skewness / Geometric Mean / Std. Deviation / Grouped Median
AR_Launch_remotely 0.70 0.008 0.211 -1.251 0.087 -0.866 0.043 0.00 0.459 0.70
AR_Launch_locally 0.34 0.008 0.226 -1.569 0.087 0.657 0.043 0.00 0.475 0.34
AR_Target_access_attacker 0.02 0.002 0.017 53.139 0.087 7.423 0.043 0.00 0.130 0.02
LT_Security_protection 0.27 0.008 0.195 -0.874 0.087 1.062 0.043 0.00 0.442 0.27
LT_Obtain_all_priv 0.02 0.003 0.022 38.812 0.087 6.387 0.043 0.00 0.150 0.02
LT_Obtain_some_priv 0.05 0.004 0.051 13.747 0.087 3.967 0.043 0.00 0.225 0.05
LT_Confidentiality 0.25 0.008 0.188 -0.670 0.087 1.153 0.043 0.00 0.433 0.25
LT_Integrity 0.19 0.007 0.152 0.571 0.087 1.603 0.043 0.00 0.390 0.19
LT_Availability 0.43 0.009 0.245 -1.912 0.087 0.298 0.043 0.00 0.495 0.43
LT_Sec_Prot_Other 0.16 0.006 0.132 1.586 0.087 1.894 0.043 0.00 0.363 0.16
VT_Input_validation_error 0.38 0.009 0.235 -1.751 0.087 0.500 0.043 0.00 0.485 0.38
VT_Boundary_condition_error 0.07 0.004 0.061 10.361 0.087 3.515 0.043 0.00 0.247 0.07
VT_Buffer_overflow 0.12 0.006 0.102 3.789 0.087 2.406 0.043 0.00 0.320 0.12
VT_Access_validation_error 0.11 0.006 0.098 4.175 0.087 2.484 0.043 0.00 0.314 0.11
VT_Exceptional_condition_error 0.17 0.007 0.142 1.036 0.087 1.742 0.043 0.00 0.377 0.17
VT_Environment_error 0.02 0.002 0.016 55.332 0.087 7.569 0.043 0.00 0.128 0.02
VT_Configuration_error 0.06 0.004 0.061 10.508 0.087 3.536 0.043 0.00 0.246 0.06
VT_Race_condition 0.03 0.003 0.026 33.081 0.087 5.921 0.043 0.00 0.160 0.03
VT_Other_vulnerability_type 0.02 0.002 0.016 56.492 0.087 7.646 0.043 0.00 0.127 0.02
VT_Design_Error 0.28 0.008 0.201 -1.020 0.087 0.990 0.043 0.00 0.448 0.28
EC_Operating_system 0.19 0.007 0.154 0.480 0.087 1.575 0.043 0.00 0.393 0.19
EC_Network_protocol_stack 0.02 0.002 0.017 54.215 0.087 7.495 0.043 0.00 0.129 0.02
EC_Non_server_application 0.27 0.008 0.196 -0.885 0.087 1.056 0.043 0.00 0.442 0.27
EC_Server_application 0.49 0.009 0.250 -2.000 0.087 0.041 0.043 0.00 0.500 0.49
EC_Hardware 0.03 0.003 0.034 23.832 0.087 5.081 0.043 0.00 0.183 0.03
EC_Communication_protocol 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
EC_Encryption_module 0.01 0.002 0.009 105.245 0.087 10.353 0.043 0.00 0.095 0.01
EC_Other 0.01 0.002 0.013 72.983 0.087 8.657 0.043 0.00 0.113 0.01
CertAdv 0.02 0.003 0.020 44.200 0.087 6.795 0.043 0.00 0.141 0.02
High_Sev 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
Medium_Sev 1.00 0.000 0.000 . . . . 1.00 0.000 1.00
Low_Sev 0.00 0.000 0.000 . . . . 0.00 0.000 0.00
Table 4-14 SPSS Two Step Cluster 3 Profile
5 Educational Statement
This project has drawn upon my experience and studies in several areas and has also
resulted in new learning. Specific courses taken at the Institute of Technology,
University of Washington, Tacoma, that have aided my work include TCSS 435
Artificial Intelligence and Knowledge Acquisition, TCSS 555 Data Mining, and TCSS
598 Master’s Seminar. The AI and Data Mining courses formed the educational
foundation of this project by helping me to comprehend the relevant machine learning
concepts. They guided and influenced the research conducted during this project by
indicating areas where additional knowledge was needed and in selecting specific data
mining tasks. The Master’s Seminar was beneficial in conducting the research and
preparing the documentation. New knowledge gained includes a deeper understanding of
the data mining process and its algorithms, which I intend to apply to further studies. I have also
achieved a significant comprehension of vulnerability databases which I can apply to
further study as well as in my career.
6 Further Work
There are several areas proposed for further work, including the incorporation of
changes to the ICAT Metabase, additional work with cluster generation and analysis, and
text extraction. In addition, alternate datasets should be considered for related, future
data mining activities.
During the course of this research, the ICAT Metabase underwent several
changes. First, NIST updated and converted the database to XML format and re-released
it as the National Vulnerability Database. The conversion to XML presents a
challenge for the data mining tools used in this project, since neither tool can
use the data in its native format or import and convert it into a usable
format. Therefore, the development of preprocessing scripts or other automation is
proposed to address this issue. The second change introduced by NIST was the move
from the severity rating metrics discussed in section 4.3 to the Common Vulnerability
Scoring System (CVSS). The new metric is a numeric value derived from a qualitative
assessment of different aspects of the vulnerability.
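The preprocessing scripts proposed above could take the form of a small XML-to-CSV converter. The sketch below is illustrative only: the element and attribute names used here ("entry", "name", "published", "severity") are assumptions for the sake of the example, not the actual NVD schema, which a real script would follow.

```python
# Sketch of a preprocessing script for an XML vulnerability feed.
# The schema (element/attribute names) is assumed, not the real NVD format.
import csv
import io
import xml.etree.ElementTree as ET

def nvd_xml_to_csv(xml_text):
    """Flatten <entry> elements into CSV rows usable by the mining tools."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["CVE_ID", "Publish_Date", "Severity"])
    for entry in root.findall("entry"):
        # Each attribute of interest becomes one CSV column.
        writer.writerow([entry.get("name"),
                         entry.get("published"),
                         entry.get("severity")])
    return out.getvalue()

sample = """<nvd>
  <entry name="CVE-2005-0001" published="5/2/05" severity="High"/>
  <entry name="CVE-2005-0002" published="5/2/05" severity="High"/>
</nvd>"""
print(nvd_xml_to_csv(sample))
```

A production version would also flatten the loss-type, vulnerability-type, and exposed-component fields into the Boolean columns used throughout this project.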
Additional research into clustering of the vulnerability data is the second area of
proposed future work. Although a variety of clustering algorithms were available
across the tools used in this project, working with them proved problematic.
Most significantly, the tools showed shortcomings in the presentation, visualization,
and analysis of the clustering results.
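As an illustration of the kind of further cluster analysis proposed, the following sketch implements a toy k-modes style clustering over binary attribute vectors. It is not the SPSS TwoStep or Clementine algorithm used in this project, and the records shown are hypothetical three-attribute examples.

```python
# Toy k-modes style clustering of binary vulnerability attribute vectors.
# Illustration only; not the algorithms used in the project.
def hamming(a, b):
    """Number of attribute positions in which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, modes, iterations=10):
    """Assign records to the nearest mode, then recompute modes by majority vote."""
    clusters = [[] for _ in modes]
    for _ in range(iterations):
        clusters = [[] for _ in modes]
        for r in records:
            best = min(range(len(modes)), key=lambda i: hamming(r, modes[i]))
            clusters[best].append(r)
        # New mode: per-attribute majority value of the cluster's members.
        modes = [tuple(int(sum(col) >= len(c) / 2) for col in zip(*c)) if c else m
                 for c, m in zip(clusters, modes)]
    return modes, clusters

# Hypothetical records: (Launch_remotely, Loss_availability, Buffer_overflow)
records = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1), (1, 1, 1)]
modes, clusters = k_modes(records, [(1, 1, 0), (0, 0, 1)])
```

A distance measure tailored to binary attributes, such as the Hamming distance used here, sidesteps the interval-scale assumptions that made some of the tools' algorithms awkward on this dataset.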
Text extraction is the final area of future work. Each vulnerability record
contained several text fields. The information contained in these fields includes a
description of the vulnerability as well as vendor, product name, and version information.
There is a great deal of variance in the content and quality of these fields, and in many
cases multiple terms are used to reference the same subject. Thus the fields cannot be used in
their current format. The use of text extraction and possibly named entity recognition is
proposed for the derivation of new attributes which could contribute to further data
mining efforts.
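A first step toward the proposed text extraction could be a simple pattern for pulling candidate product and version tokens from the description field. The regular expression below is a naive illustration, not a full named entity recognizer; it will also capture spurious pairs (such as an ordinary word preceding a version number), which is exactly the kind of noise a real extraction pipeline would have to filter.

```python
# Naive product/version extraction from free-text vulnerability descriptions.
# A real system would use named entity recognition and a vendor dictionary.
import re

# A word-like token followed by a dotted version number, e.g. "Sendmail 8.8.3".
VERSION = re.compile(r"\b([A-Za-z][\w.-]*)\s+(\d+(?:\.\d+)+)\b")

def extract_products(description):
    """Return candidate (name, version) pairs found in a description."""
    return VERSION.findall(description)

desc = "Buffer overflow in Sendmail 8.8.3 and 8.8.4 allows remote attackers ..."
print(extract_products(desc))
```

Derived attributes such as vendor or product family could then join the Boolean attributes in further mining runs.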
7 Conclusion
This project has investigated the applicability of data mining to vulnerability
databases, namely the ICAT Metabase, whereas related work had sought to discover new
vulnerabilities or to detect and assess existing ones. It has provided a background in the
field of vulnerability, exploit, and exposure research. It has described relevant data
mining methods and presented a review of related literature. By studying a representative
corpus of known vulnerabilities and applying various data mining techniques it has
demonstrated that cyber vulnerability databases contain previously undiscovered patterns
which are novel, interesting and of value to researchers and industry professionals.
Association mining was used to learn which vulnerability attributes frequently occur
together. Results show, for example, that high severity item sets are mostly associated with
the loss of security protection and the capacity to be initiated remotely, while the low
severity item sets are associated with loss of confidentiality. Classification rules were
mined and applied to a dataset of new examples to determine the accuracy of class
prediction, showing that high severity can be predicted from the other attributes with
excellent accuracy, while low and medium severity were not predicted as accurately. A
comparison was made between the machine-learned rules and those used to produce the
dataset. Results suggest that the rules learned by the system discriminated well
among severity levels, except for low severity, which shows that the NIST rules can often
be reproduced from the database. Clustering, classification, and association results were
analyzed to enumerate the novel or interesting patterns discovered in the dataset.
Clustering grouped high severity vulnerabilities much better than medium and low
vulnerabilities, and discovered alternate interesting groupings. Finally, this report has
proposed future areas of study such as the inclusion of alternate datasets, changes in
severity metrics, and additional data mining algorithms, as well as addressing the
likelihood that additional novel or interesting patterns remain to be discovered.
References
[1] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, "CRISP-DM 1.0 - Step-by-step data mining guide," CRISP-DM Consortium, 1999.
[2] Mitre, "CVE - Common Vulnerabilities and Exposures," Mitre Corporation, 2005.
[3] D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining (Adaptive Computation and Machine Learning). Cambridge, MA: The MIT Press, 2001.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann, 2001.
[5] H. A. Edelstein, Introduction to Data Mining and Knowledge Discovery, Third ed. Potomac, MD: Two Crows Corporation, 1999.
[6] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Second ed. Upper Saddle River, NJ: Prentice Hall, 2002.
[7] P. Giudici, Applied Data Mining. West Sussex: Wiley, 2003.
[8] M. Schumacher, C. Haul, M. Hurler, and A. Buchmann, "Data-Mining in Vulnerability Databases," Department of Computer Science, Darmstadt University of Technology, 2000, 12 pp.
[9] W. A. Arbaugh, W. L. Fithen, and J. McHugh, "Windows of vulnerability: a case study analysis," Computer, vol. 33, pp. 52-59, 2000.
[10] K. Yamanishi, J. Takeuchi, and Y. Maruyama, "Data mining for security," NEC Journal of Advanced Technology, vol. 2, pp. 63-69, 2005.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
[12] A. Azmy, "SuperQuery: Data Mining for Everyone," White Paper WP1, 2004.
Appendix A: Data Dictionary
A.1 Purpose and Scope
This document describes the metadata elements of the dataset attributes used in this
project. The database, "ICAT Master.mdb", is a 15 MB file in Microsoft Access format,
prepared by the National Institute of Standards and Technology. The database contains
several tables and an aggregated record set of 7553 entries with 87 attributes. However,
the task-relevant data is restricted to a much smaller set of attributes while including all
records. From the original 87 attributes, 30 are selected for their value in identifying,
evaluating, or characterizing the specific vulnerabilities by case, date, severity, range,
loss type, vulnerability type, and exposed component.
A.1.2 Reference Guide
A.1.2.1 Documentation
Elements of this dictionary contain the following information:
TagName
Type: Allowable data type
Size: Size in bytes or characters
AllowZeroLength: TRUE or FALSE
Attributes: Fixed Size, Variable Length, Updatable
Description: Element description and function
Values: Range of values; when the data type is text, an example is provided
Required: TRUE or FALSE
A.1.2.2 Data Types
Data types found in the task-relevant data include the following:
Date/Time: Used to store dates and times; stores 8 bytes. Example: M/D/YYYY or MM/DD/YYYY
Long Integer: An integer, 4 bytes
Memo: Used to store lengthy text and numbers, up to 63,999 characters
Text: Used for text or numbers up to 255 characters
A.1.2.3 Entry Identification
CVE_ID Type Text
Size 255
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: The standard CVE or CAN name for the vulnerability
Values CAN-1999-0001 CVE-2002-1088
Required: FALSE
Publish_Date Type Date/Time
Size 8
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
The date on which the vulnerability is published within ICAT (except for pre-2000 vulnerabilities where we use the discovery date)
Required: FALSE
CVE_Description Type Memo
Size N/A
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: Short paragraph describing the vulnerability
Values
Required: FALSE
Severity Type Text
Size 255
AllowZeroLength: FALSE
Attributes: Variable Length, Updatable
Description: The severity assigned to the vulnerability
Values
Low Medium High
Required: FALSE
A.1.2.4 Range
AR_Launch_remotely Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Attacker Requirements: attacker may launch the attack remotely
Values
-1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
AR_Launch_locally Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
CollatingOrder: General
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Description: Attacker Requirements: attacker may launch the attack locally
Required: FALSE
AR_Target_access_attacker Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
CollatingOrder: General
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Description: Attacker Requirements: the target must access the attacker’s resource for the attack to succeed
Required: FALSE
A.1.2.5 Loss Type
LT_Security_protection Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of security protection
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Obtain_all_priv Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where administrator access was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Obtain_some_priv Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where user level privilege was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Confidentiality Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of confidentiality
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Integrity Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of integrity
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
LT_Availability Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Loss Type: the vulnerability causes a loss of availability
Values -1 or 0, a Boolean representation where -1 indicates
true and 0 is false
Required: FALSE
LT_Sec_Prot_Other
Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Loss Type: the vulnerability causes a loss of security protection where some non-user privilege was gained by the attacker
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
A.1.2.6 Vulnerability Type
VT_Input_validation_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "input validation error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Boundary_condition_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "boundary condition error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Buffer_overflow Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "buffer overflow"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Access_validation_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "access validation error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Exceptional_condition_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "exceptional condition handling error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Environment_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "environmental error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Configuration_error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "configuration error"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Race_condition Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: the vulnerability is of the type "race condition"
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
VT_Design_Error Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Vulnerability Type: Design Error
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
A.1.2.7 Exposed Component
EC_Operating_system Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within an operating system
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Network_protocol_stack Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description:
Exposed Component: vulnerability occurs within a network protocol stack of an operating system
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Non_server_application Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a non-server application
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Server_application Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a server application
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Hardware Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within hardware
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Communication_protocol Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within a communication protocol
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Encryption_module Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within an encryption module
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
EC_Other Type Long Integer
Size 4
AllowZeroLength: FALSE
Attributes: Fixed Size, Updatable
Description: Exposed Component: vulnerability occurs within some component not explicitly listed
Values -1 or 0, a Boolean representation where -1 indicates true and 0 is false
Required: FALSE
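Since every Boolean attribute above is stored as a Long Integer with -1 for true and 0 for false, a preprocessing step must recode the values to the T/F symbols used in the ARFF training files. A minimal sketch, assuming the records have been exported as rows of integers:

```python
# Recode the Access Boolean convention (-1 = true, 0 = false) to 'T'/'F'.
def recode_boolean(value):
    """Map -1 to 'T' and anything else (0) to 'F'."""
    return "T" if value == -1 else "F"

def recode_row(row):
    """Recode every Boolean attribute in an exported record."""
    return [recode_boolean(v) for v in row]

print(recode_row([-1, 0, 0, -1]))  # ['T', 'F', 'F', 'T']
```

In practice this recoding is applied to the 28 Boolean attribute columns while the CVE_ID, Publish_Date, and Severity columns pass through unchanged.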
CVE_ID,Publish_Date,AR_Launch_remotely,AR_Launch_locally,AR_Target_access_attacker,LT_Security_protection,LT_Obtain_all_priv,LT_Obtain_some_priv,LT_Confidentiality,LT_Integrity,LT_Availability,LT_Sec_Prot_Other,VT_Input_validation_error,VT_Boundary_condition_error,VT_Buffer_overflow,VT_Access_validation_error,VT_Exceptional_condition_error,VT_Environment_error,VT_Configuration_error,VT_Race_condition,VT_Other_vulnerability_type,VT_Design_Error,EC_Operating_system,EC_Network_protocol_stack,EC_Non_server_application,EC_Server_application,EC_Hardware,EC_Communication_protocol,EC_Encryption_module,EC_Other
CAN-1999-0001,12/30/1999,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,T,T,F,F,F,F,F,F
CAN-1999-0004,12/16/1997,T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T
CAN-1999-0015,12/16/1997,T,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F
CAN-1999-0030,7/16/1997,F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F
CAN-1999-0061,10/2/1997,T,F,F,T,F,T,F,T,F,T,F,F,F,F,F,F,F,T,F,F,T,F,T,F,F,F,F,F
CAN-1999-0076,7/1/1997,T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T
Table A-1 Training/Test Dataset Sample
@relation vulnerability.symbolic
@attribute AR_Launch_remotely {T,F}
@attribute AR_Launch_locally {T,F}
@attribute AR_Target_access_attacker {T,F}
@attribute LT_Security_protection {T,F}
@attribute LT_Obtain_all_priv {T,F}
@attribute LT_Obtain_some_priv {T,F}
@attribute LT_Confidentiality {T,F}
@attribute LT_Integrity {T,F}
@attribute LT_Availability {T,F}
@attribute LT_Sec_Prot_Other {T,F}
@attribute VT_Input_validation_error {T,F}
@attribute VT_Boundary_condition_error {T,F}
@attribute VT_Buffer_overflow {T,F}
@attribute VT_Access_validation_error {T,F}
@attribute VT_Exceptional_condition_error {T,F}
@attribute VT_Environment_error {T,F}
@attribute VT_Configuration_error {T,F}
@attribute VT_Race_condition {T,F}
@attribute VT_Other_vulnerability_type {T,F}
@attribute VT_Design_Error {T,F}
@attribute EC_Operating_system {T,F}
@attribute EC_Network_protocol_stack {T,F}
@attribute EC_Non_server_application {T,F}
@attribute EC_Server_application {T,F}
@attribute EC_Hardware {T,F}
@attribute EC_Communication_protocol {T,F}
@attribute EC_Encryption_module {T,F}
@attribute EC_Other {T,F}
@attribute CertAdv {T,F}
@attribute Severity {Low, Medium, High}
@data
T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,T,T,F,F,F,F,F,F,T,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,T,Medium
T,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,High
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,High
T,F,F,T,F,T,F,T,F,T,F,F,F,F,F,F,F,T,F,F,T,F,T,F,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,High
F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,T,High
T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
F,F,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,High
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,High
T,F,F,T,T,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,High
F,T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F,High
F,T,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F,High
T,F,F,F,F,F,F,F,T,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
F,T,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F,High
F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,Low
F,T,F,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,Medium
F,T,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F,F,F,F,Medium
F,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,High
Table A-2 Training/Test Dataset Sample in ARFF format
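The ARFF files above were loaded with Weka during the project; purely for illustration, a minimal pure-Python reader for this layout is sketched below. It handles only the subset of ARFF shown here (nominal attributes and dense data rows), not the full format.

```python
# Minimal reader for the ARFF subset shown in Table A-2.
# Illustration only; the project used Weka to load these files.
def parse_arff(text):
    """Return (attribute_names, rows) where each row is a name->value dict."""
    attributes, data = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blanks and ARFF comments
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
        elif in_data:
            data.append(dict(zip(attributes, line.split(","))))
    return attributes, data

sample = """@relation vulnerability.symbolic
@attribute AR_Launch_remotely {T,F}
@attribute Severity {Low, Medium, High}
@data
T,High
F,Low"""
attributes, rows = parse_arff(sample)
```

Keeping each row as a name-to-value mapping makes it straightforward to filter records by attribute, for example selecting only the High severity examples.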
CVE_ID,Publish_Date,Severity,AR_Launch_remotely,AR_Launch_locally,AR_Target_access_attacker,LT_Security_protection,LT_Obtain_all_priv,LT_Obtain_some_priv,LT_Confidentiality,LT_Integrity,LT_Availability,LT_Sec_Prot_Other,VT_Input_validation_error,VT_Boundary_condition_error,VT_Buffer_overflow,VT_Access_validation_error,VT_Exceptional_condition_error,VT_Environment_error,VT_Configuration_error,VT_Race_condition,VT_Other_vulnerability_type,VT_Design_Error,EC_Operating_system,EC_Network_protocol_stack,EC_Non_server_application,EC_Server_application,EC_Hardware,EC_Communication_protocol,EC_Encryption_module,EC_Other
CVE-2005-0006,5/2/05,Low,T,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0005,5/2/05,High,T,F,F,T,F,T,F,F,F,F,T,F,T,F,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0004,4/14/05,Medium,F,T,F,T,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,T,F,F,F,F,F
CVE-2005-0003,4/14/05,Low,F,T,F,F,F,F,F,F,T,F,F,F,F,F,T,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0002,5/2/05,High,T,F,F,T,T,T,F,F,F,F,F,F,F,T,F,F,F,F,F,F,F,F,T,F,F,F,F,F
CVE-2005-0001,5/2/05,High,F,T,F,T,T,F,F,F,F,F,F,F,F,F,F,F,F,T,F,F,F,F,T,F,F,F,F,F
Table A-3 New Example Dataset
Appendix B: Association Rules

Rules for T - contains 13 rule(s)
Rule 1 for T (965, 0.816) if LT_Obtain_some_priv = T then T
Rule 2 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 3 for T (748, 0.912) if LT_Obtain_some_priv = T and AR_Launch_remotely = T then T
Rule 4 for T (858, 0.801) if LT_Sec_Prot_Other = T and EC_Server_application = T then T
Rule 5 for T (1,317, 0.804) if LT_Sec_Prot_Other = T and AR_Launch_remotely = T then T
Rule 6 for T (711, 0.973) if VT_Buffer_overflow = T and LT_Obtain_all_priv = T then T
Rule 7 for T (982, 0.956) if LT_Obtain_all_priv = T and AR_Launch_locally = T then T
Rule 8 for T (674, 0.941) if LT_Obtain_all_priv = T and EC_Server_application = T then T
Rule 9 for T (1,035, 0.974) if LT_Obtain_all_priv = T and VT_Input_validation_error = T then T
Rule 10 for T (885, 0.951) if LT_Obtain_all_priv = T and AR_Launch_remotely = T then T
Rule 11 for T (770, 0.843) if LT_Sec_Prot_Other = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 12 for T (731, 0.841) if LT_Sec_Prot_Other = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 13 for T (697, 0.974) if VT_Buffer_overflow = T and LT_Obtain_all_priv = T and VT_Input_validation_error = T then T
Default: T
Table B-1 High Severity Apriori

Rules for T - contains 9 rule(s)
Rule 1 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 2 for T (982, 0.956) if AR_Launch_locally = T and LT_Obtain_all_priv = T then T
Rule 3 for T (885, 0.951) if AR_Launch_remotely = T and LT_Obtain_all_priv = T then T
Rule 4 for T (107, 0.925) if AR_Launch_remotely = T and AR_Launch_locally = T and LT_Obtain_all_priv = T then T
Rule 5 for T (76, 0.553) if AR_Launch_remotely = T and AR_Target_access_attacker = T then T
Rule 6 for T (94, 0.468) if AR_Target_access_attacker = T and EC_Non_server_application = T then T
Rule 7 for T (2,407, 0.498) if AR_Launch_locally = T then T
Rule 8 for T (418, 0.49) if AR_Launch_remotely = T and AR_Launch_locally = T then T
Rule 9 for T (5,223, 0.496) if AR_Launch_remotely = T then T
Default: T
Table B-2 High Severity GRI
Rules for T - contains 19 rule(s)
Rule 1 for T (764, 0.717) if VT_Exceptional_condition_error = T then T
Rule 2 for T (952, 0.628) if LT_Integrity = T then T
Rule 3 for T (1,952, 0.697) if LT_Availability = T then T
Rule 4 for T (505, 0.867) if VT_Exceptional_condition_error = T and LT_Availability = T then T
Rule 5 for T (400, 0.728) if VT_Exceptional_condition_error = T and EC_Server_application = T then T
Rule 6 for T (666, 0.728) if VT_Exceptional_condition_error = T and AR_Launch_remotely = T then T
Rule 7 for T (448, 0.681) if LT_Integrity = T and AR_Launch_locally = T then T
Rule 8 for T (368, 0.633) if LT_Confidentiality = T and AR_Launch_locally = T then T
Rule 9 for T (411, 0.62) if LT_Availability = T and EC_Non_server_application = T then T
Rule 10 for T (372, 0.712) if LT_Availability = T and AR_Launch_locally = T then T
Rule 11 for T (1,051, 0.608) if LT_Availability = T and VT_Input_validation_error = T then T
Rule 12 for T (1,064, 0.691) if LT_Availability = T and EC_Server_application = T then T
Rule 13 for T (1,676, 0.697) if LT_Availability = T and AR_Launch_remotely = T then T
Rule 14 for T (458, 0.865) if VT_Exceptional_condition_error = T and LT_Availability = T and AR_Launch_remotely = T then T
Rule 15 for T (379, 0.731) if VT_Exceptional_condition_error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 16 for T (643, 0.605) if LT_Availability = T and VT_Input_validation_error = T and EC_Server_application = T then T
Rule 17 for T (951, 0.612) if LT_Availability = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 18 for T (983, 0.693) if LT_Availability = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 19 for T (606, 0.609) if LT_Availability = T and VT_Input_validation_error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Default: T
Table B-3 Medium Severity Apriori

Rules for T - contains 11 rule(s)
Rule 1 for T (1,952, 0.697) if LT_Availability = T then T
Rule 2 for T (1,676, 0.697) if AR_Launch_remotely = T and LT_Availability = T then T
Rule 3 for T (952, 0.628) if LT_Integrity = T then T
Rule 4 for T (372, 0.712) if AR_Launch_locally = T and LT_Availability = T then T
Rule 5 for T (448, 0.681) if AR_Launch_locally = T and LT_Integrity = T then T
Rule 6 for T (1,451, 0.551) if LT_Confidentiality = T then T
Rule 7 for T (368, 0.633) if AR_Launch_locally = T and LT_Confidentiality = T then T
Rule 8 for T (538, 0.589) if AR_Launch_remotely = T and LT_Integrity = T then T
Rule 9 for T (1,147, 0.525) if AR_Launch_remotely = T and LT_Confidentiality = T then T
Rule 10 for T (2,407, 0.456) if AR_Launch_locally = T then T
Rule 11 for T (5,223, 0.427) if AR_Launch_remotely = T then T
Default: T
Table B-4 Medium Severity GRI
Rules for T - contains 9 rule(s)
Rule 1 for T (357, 0.308) if LT_Confidentiality = T and EC_Non_server_application = T then T
Rule 2 for T (467, 0.336) if LT_Confidentiality = T and VT_Input_validation_error = T then T
Rule 3 for T (172, 0.302) if LT_Confidentiality = T and VT_Design_Error = T and EC_Non_server_application = T then T
Rule 4 for T (373, 0.314) if LT_Confidentiality = T and VT_Design_Error = T and AR_Launch_remotely = T then T
Rule 5 for T (243, 0.342) if LT_Confidentiality = T and EC_Non_server_application = T and AR_Launch_remotely = T then T
Rule 6 for T (365, 0.342) if LT_Confidentiality = T and EC_Server_application = T and VT_Input_validation_error = T then T
Rule 7 for T (446, 0.343) if LT_Confidentiality = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Rule 8 for T (208, 0.312) if LT_Confidentiality = T and VT_Design_Error = T and EC_Server_application = T and AR_Launch_remotely = T then T
Rule 9 for T (359, 0.345) if LT_Confidentiality = T and EC_Server_application = T and VT_Input_validation_error = T and AR_Launch_remotely = T then T
Default: T
Table B-5 Low Severity Apriori

Rules for T - contains 5 rule(s)
Rule 1 for T (1,451, 0.267) if LT_Confidentiality = T then T
Rule 2 for T (1,147, 0.29) if AR_Launch_remotely = T and LT_Confidentiality = T then T
Rule 3 for T (117, 0.239) if AR_Launch_locally = T and LT_Confidentiality = T and EC_Non_server_application = T then T
Rule 4 for T (86, 0.233) if AR_Launch_remotely = T and AR_Launch_locally = T and LT_Confidentiality = T then T
Rule 5 for T (106, 0.208) if AR_Launch_locally = T and LT_Confidentiality = T and EC_Operating_system = T then T
Default: T
Table B-6 Low Severity GRI
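Each rule above is reported with its antecedent support count and confidence. Both measures can be recomputed directly from a dataset; the sketch below does so over hypothetical records, where the antecedent and consequent are sets of attribute names required to be T.

```python
# Recompute the (support count, confidence) pair reported for a rule.
def support_confidence(records, antecedent, consequent):
    """records: list of attribute->'T'/'F' dicts; antecedent/consequent: sets of names."""
    n_ant = sum(all(r[a] == "T" for a in antecedent) for r in records)
    n_both = sum(all(r[a] == "T" for a in antecedent | consequent) for r in records)
    # Confidence: fraction of antecedent matches that also satisfy the consequent.
    return n_ant, (n_both / n_ant if n_ant else 0.0)

# Hypothetical records illustrating a rule like "if LT_Obtain_all_priv = T then High".
records = [
    {"LT_Obtain_all_priv": "T", "High_Sev": "T"},
    {"LT_Obtain_all_priv": "T", "High_Sev": "T"},
    {"LT_Obtain_all_priv": "T", "High_Sev": "F"},
    {"LT_Obtain_all_priv": "F", "High_Sev": "F"},
]
count, conf = support_confidence(records, {"LT_Obtain_all_priv"}, {"High_Sev"})
# count = 3 antecedent matches, conf = 2/3
```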
Appendix C: Classification Rules

Rules for T - contains 6 rule(s)
Rule 1 for T (1,779, 0.956) if LT_Obtain_all_priv = T then T
Rule 2 for T (748, 0.911) if AR_Launch_remotely = T and LT_Obtain_some_priv = T then T
Rule 3 for T (1,176, 0.898) if AR_Launch_remotely = T and LT_Availability = F and LT_Confidentiality = F and LT_Integrity = F and LT_Sec_Prot_Other = F then T
Rule 4 for T (5, 0.857) if EC_Communication_protocol = T then T
Rule 5 for T (941, 0.824) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Design_Error = F then T
Rule 6 for T (1,217, 0.823) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = F then T

Rules for F - contains 2 rule(s)
Rule 1 for F (76, 0.885) if LT_Obtain_all_priv = F and VT_Design_Error = T and VT_Exceptional_condition_error = T then F
Rule 2 for F (5,546, 0.653) if LT_Obtain_all_priv = F then F
Default: F
Table C-1 High Severity C5.0

Rules for F - contains 6 rule(s)
Rule 1 for F (1,862, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["F"] and LT_Availability in ["F"] then F
Rule 2 for F (1,220, 0.009) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["F"] and LT_Availability in ["T"] then F
Rule 3 for F (253, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["T"] and LT_Availability in ["T"] then F
Rule 4 for F (121, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["F"] then F
Rule 5 for F (268, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["F"] and EC_Server_application in ["F"] then F
Rule 6 for F (77, 0.006) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["F"] and EC_Server_application in ["T"] then F

Rules for T - contains 6 rule(s)
Rule 1 for T (102, 0.007) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["F"] and VT_Buffer_overflow in ["T"] and LT_Availability in ["F"] then T
Rule 2 for T (155, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["T"] and VT_Input_validation_error in ["F"] then T
Rule 3 for T (246, 0.009) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["F"] and LT_Obtain_some_priv in ["T"] and AR_Launch_remotely in ["T"] and VT_Input_validation_error in ["T"] then T
Rule 4 for T (1,154, 0.008) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["T"] and VT_Exceptional_condition_error in ["F"] then T
Rule 5 for T (88, 0.005) if LT_Obtain_all_priv in ["F"] and LT_Sec_Prot_Other in ["T"] and AR_Launch_remotely in ["T"] and VT_Exceptional_condition_error in ["T"] then T
Rule 6 for T (1,779, 0.01) if LT_Obtain_all_priv in ["T"] then T
Default: F
Table C-2 High Severity C&R
Rules for F - contains 6 rule(s)
Rule 1 for F (1,862, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] then F
Rule 2 for F (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“T”] then F
Rule 3 for F (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“T”] and LT_Availability in [“T”] then F
Rule 4 for F (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then F
Rule 5 for F (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then F
Rule 6 for F (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then F

Rules for T - contains 12 rule(s)
Rule 1 for T (102, 0.007) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and VT_Buffer_overflow in [“T”] and LT_Availability in [“F”] then T
Rule 2 for T (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then T
Rule 3 for T (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then T
Rule 4 for T (453, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Server_application in [“F”] then T
Rule 5 for T (701, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Server_application in [“T”] then T
Rule 6 for T (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then T
Rule 7 for T (512, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] then T
Rule 8 for T (232, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“T”] then T
Rule 9 for T (267, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and AR_Launch_locally in [“F”] then T
Rule 10 for T (394, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and AR_Launch_locally in [“T”] then T
Rule 11 for T (88, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] then T
Rule 12 for T (286, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“T”] then T

Default: F
Table C-0-3 High Severity CHAID
Rules for T - contains 2 rule(s)
Rule 1 for T (26, 0.821) if LT_Obtain_all_priv = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = T and VT_Design_Error = T then T
Rule 2 for T (5,546, 0.562) if LT_Obtain_all_priv = F then T

Rules for F - contains 11 rule(s)
Rule 1 for F (408, 0.961) if AR_Launch_remotely = T and LT_Confidentiality = F and LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and VT_Buffer_overflow = T and EC_Network_protocol_stack = F then F
Rule 2 for F (1,779, 0.958) if LT_Obtain_all_priv = T then F
Rule 3 for F (1,030, 0.918) if AR_Launch_remotely = T and AR_Launch_locally = F and LT_Confidentiality = F and LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and EC_Network_protocol_stack = F then F
Rule 4 for F (748, 0.912) if AR_Launch_remotely = T and LT_Obtain_some_priv = T then F
Rule 5 for F (5, 0.857) if EC_Communication_protocol = T then F
Rule 6 for F (11, 0.846) if LT_Confidentiality = T and VT_Input_validation_error = T and VT_Access_validation_error = T and EC_Non_server_application = F then F
Rule 7 for F (1,217, 0.832) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Exceptional_condition_error = F then F
Rule 8 for F (941, 0.832) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Sec_Prot_Other = T and VT_Design_Error = F then F
Rule 9 for F (47, 0.816) if LT_Integrity = F and LT_Availability = F and LT_Sec_Prot_Other = F and VT_Other_vulnerability_type = T then F
Rule 10 for F (751, 0.795) if AR_Launch_remotely = T and AR_Target_access_attacker = F and LT_Integrity = F and LT_Availability = F and EC_Non_server_application = T then F
Rule 11 for F (12, 0.786) if AR_Target_access_attacker = T and LT_Availability = T then F

Default: F
Table C-0-4 Medium Severity C5.0
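In the C5.0-style listing above, several rules can fire for the same record (for example, Rule 2 for T and Rule 4 for F both cover a remotely launched vulnerability that yields some privileges). One common resolution strategy for such overlapping rulesets is confidence-weighted voting: sum the confidences of the firing rules per class and predict the class with the largest total. The sketch below illustrates this; the rule encoding and sample record are constructed for illustration and are not the tool's exact mechanism.

```python
# Hypothetical sketch: confidence-weighted voting over overlapping rules.
# Rules are (conditions, predicted class, confidence); conditions are
# (attribute, required_value) pairs taken from the table above.
from collections import defaultdict

rules = [
    ([("LT_Obtain_all_priv", "F")], "T", 0.562),                     # Rule 2 for T
    ([("LT_Obtain_all_priv", "T")], "F", 0.958),                     # Rule 2 for F
    ([("AR_Launch_remotely", "T"),
      ("LT_Obtain_some_priv", "T")], "F", 0.912),                    # Rule 4 for F
]

def vote(record, rules, default="F"):
    """Sum confidences of firing rules per class; predict the max class."""
    totals = defaultdict(float)
    for conditions, cls, conf in rules:
        if all(record.get(attr) == value for attr, value in conditions):
            totals[cls] += conf
    if not totals:
        return default
    return max(totals, key=totals.get)

# Invented record: both a T rule (0.562) and an F rule (0.912) fire.
record = {"LT_Obtain_all_priv": "F", "AR_Launch_remotely": "T",
          "LT_Obtain_some_priv": "T"}
print(vote(record, rules))  # -> F
```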
Rules for F - contains 11 rule(s)
Rule 1 for F (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then F
Rule 2 for F (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then F
Rule 3 for F (852, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Non_server_application in [“F”] then F
Rule 4 for F (302, 0.007) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] and EC_Non_server_application in [“T”] then F
Rule 5 for F (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then F
Rule 6 for F (313, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 7 for F (199, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“F”] and AR_Launch_remotely in [“T”] then F
Rule 8 for F (232, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“F”] and EC_Non_server_application in [“T”] then F
Rule 9 for F (661, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] then F
Rule 10 for F (88, 0.009) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] then F
Rule 11 for F (286, 0.01) if LT_Obtain_all_priv in [“T”] and VT_Input_validation_error in [“T”] and LT_Obtain_some_priv in [“T”] then F

Rules for T - contains 7 rule(s)
Rule 1 for T (1,391, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then T
Rule 2 for T (573, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then T
Rule 3 for T (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“F”] then T
Rule 4 for T (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“T”] then T
Rule 5 for T (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then T
Rule 6 for T (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then T
Rule 7 for T (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then T

Default: F
Table C-0-5 Medium Severity CHAID
Rules for F - contains 5 rule(s)
Rule 1 for F (155, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“F”] then F
Rule 2 for F (246, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“T”] and VT_Input_validation_error in [“T”] then F
Rule 3 for F (1,154, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“F”] then F
Rule 4 for F (88, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“T”] and VT_Exceptional_condition_error in [“T”] then F
Rule 5 for F (1,779, 0.01) if LT_Obtain_all_priv in [“T”] then F

Rules for T - contains 7 rule(s)
Rule 1 for T (1,391, 0.005) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then T
Rule 2 for T (573, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then T
Rule 3 for T (1,220, 0.009) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“F”] then T
Rule 4 for T (253, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“F”] and LT_Availability in [“T”] and VT_Buffer_overflow in [“T”] then T
Rule 5 for T (121, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“F”] and LT_Obtain_some_priv in [“T”] and AR_Launch_remotely in [“F”] then T
Rule 6 for T (268, 0.008) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“F”] then T
Rule 7 for T (77, 0.006) if LT_Obtain_all_priv in [“F”] and LT_Sec_Prot_Other in [“T”] and AR_Launch_remotely in [“F”] and EC_Server_application in [“T”] then T

Default: F
Table C-0-6 Medium Severity C&R

Rules for T - contains 0 rule(s)
Rules for F - contains 0 rule(s)
Default: F
Table C-0-7 Low Severity C5.0
Rules for F - contains 13 rule(s)
Rule 1 for F (662, 0.009) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 2 for F (756, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] and EC_Server_application in [“F”] then F
Rule 3 for F (760, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] and EC_Server_application in [“T”] then F
Rule 4 for F (708, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“T”] then F
Rule 5 for F (747, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“F”] then F
Rule 6 for F (178, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“F”] and LT_Integrity in [“T”] then F
Rule 7 for F (510, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“F”] and LT_Availability in [“T”] then F
Rule 8 for F (1,553, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] and VT_Buffer_overflow in [“T”] then F
Rule 9 for F (383, 0.008) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“F”] and VT_Design_Error in [“F”] then F
Rule 10 for F (376, 0.007) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“F”] and VT_Design_Error in [“T”] then F
Rule 11 for F (388, 0.006) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] and VT_Input_validation_error in [“T”] then F
Rule 12 for F (146, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“T”] then F
Rule 13 for F (158, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“T”] then F

Default: F
Table C-0-8 Low Severity CHAID

Rules for F - contains 7 rule(s)
Rule 1 for F (662, 0.009) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“F”] then F
Rule 2 for F (1,516, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“F”] and AR_Launch_remotely in [“T”] then F
Rule 3 for F (708, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“F”] and LT_Obtain_all_priv in [“T”] then F
Rule 4 for F (2,988, 0.01) if LT_Confidentiality in [“F”] and VT_Input_validation_error in [“T”] then F
Rule 5 for F (1,147, 0.007) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“F”] then F
Rule 6 for F (146, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“F”] and LT_Integrity in [“T”] then F
Rule 7 for F (158, 0.01) if LT_Confidentiality in [“T”] and LT_Sec_Prot_Other in [“T”] then F

Default: F
Table C-0-9 Low Severity C&R
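Each rule in the tables above is annotated with a parenthesized pair, such as (1,779, 0.01). The first figure appears to be the number of training records the rule covers; the exact meaning of the second depends on the exporting tool. As a sketch of how a rule's coverage and class agreement would be computed from labeled records (the toy dataset below is invented for illustration and is not drawn from the ICAT data):

```python
# Hypothetical sketch: computing the support (record count) and class
# agreement of one exported rule over a toy labeled dataset.

def rule_stats(records, conditions, predicted):
    """Count records matching all conditions, and the fraction of those
    matches whose label equals the rule's predicted class."""
    matches = [r for r in records
               if all(r.get(attr) == value for attr, value in conditions)]
    if not matches:
        return 0, 0.0
    hits = sum(1 for r in matches if r["class"] == predicted)
    return len(matches), hits / len(matches)

# Invented toy data: three records, two matching the condition.
toy = [
    {"LT_Obtain_all_priv": "T", "class": "T"},
    {"LT_Obtain_all_priv": "T", "class": "F"},
    {"LT_Obtain_all_priv": "F", "class": "F"},
]
support, agreement = rule_stats(toy, [("LT_Obtain_all_priv", "T")], "T")
print(support, agreement)  # -> 2 0.5
```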
Appendix D: Severity Web Graphs
Figure D-1 High Severity Web Graph
Figure D-2 Medium Severity Web Graph
Figure D-3 Low Severity Web Graph