OPTIMIZING WEB APPLICATION FUZZING WITH GENETIC ALGORITHMS AND LANGUAGE THEORY
BY
SCOTT MICHAEL SEAL
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
Computer Science
May 2016
Winston-Salem, North Carolina
Approved By:
Errin Fulp, Ph.D., Advisor
William Turkett, Ph.D., Chair
David John, Ph.D.
Acknowledgments
There are many people who helped to make this thesis possible, and they deserve thanks and gratitude I could never fully provide. First and foremost, thanks to my family for being supportive over the past—what should have been two but then became three—years. To Errin Fulp, whose patience in the beginning, middle and end of my academic career kept me afloat, who sparked my interest in computer security, who set me up with career and academic opportunities I did not deserve, who bailed me out of moderately serious (albeit laughable, and ridiculous) trouble...thank you. This research would have never happened if your door were not always open. Thanks to the Wake Forest Computer Science Department, in (no) particular (order): Jennifer Burg, Daniel Canas, Sam Cho, Don Gage, David John, Paul Pauca, Stan Thomas, and William Turkett. Thanks to Todd Torgersen for patiently teaching me things I was already supposed to know, and for spending his free time helping me flesh out the ideas that took this research from “eh” to worthwhile. Finally, a special personal thank-you to Sarah Reehl for putting up with my complaining, reading numerous iterations of this document, and for not supporting my near-daily urge to leave this work unfinished.
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Fuzz Testing for Vulnerability Discovery . . . . . . . . . . . . . . . . 3
1.2 Fuzzing with Evolutionary Algorithms . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2 Fuzzing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 History of Fuzzing and Its Fundamental Components . . . . . . . . . 9
2.2 Present Day Fuzzing . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 General Purpose Tools and Techniques . . . . . . . . . . . . . 15
2.2.2 Modern Fuzzing . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Web Application Fuzzing . . . . . . . . . . . . . . . . . . . . . 19
2.3 Fuzzing and Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . 23
2.4 Grammar Fuzzing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 3 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Genetic Algorithm Components . . . . . . . . . . . . . . . . . 33
3.3 CHC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Problem Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4 Evolutionary Algorithm Web Fuzzing Framework . . . . . . . 39
4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Attack Grammars . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.3 Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.4 Niche-penalty Heuristics-based Genetic Algorithm . . . . . . . 46
4.1.5 CHC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Testing Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Benchmark Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Random . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Markov Model Monte Carlo . . . . . . . . . . . . . . . . . . . 52
5.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Result and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Fitness and Diversity . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.2 Exploits Found . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 6 Conclusion and Future Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
List of Figures
1.1 OWASP Top 10 web vulnerabilities shows the frequency of web-based injection attacks, and the importance of defending against them . . . 3
1.2 The steps of the Microsoft Secure Development Life-cycle . . . . . . . 4
2.1 The general steps involved in fuzzing campaigns . . . . . . . . . . . . 11
2.2 An example excerpt of a Peach pit used for generation-based fuzzing [11] 17
2.3 Boundary testing recommendations according to Sutton et al. [44] . . 18
2.4 A vulnerable input form that can be exploited using SQL injection . . 22
2.5 Fitness heuristic categories considered by Seagle [53] . . . . . . . . . 25
2.6 An excerpt of a manually-written attack grammar for finding Cross-Site Scripting vulnerabilities [26] . . . . . . . . . . . . . . . . . . . . . 26
3.1 Three traditional crossover methods for creating new chromosomes [3] 34
4.1 Flow graph of preprocessing stage . . . . . . . . . . . . . . . . . . . . 42
4.2 Example extract of a parse tree derived from positive examples of SQL injection tokens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 A flowchart of the heuristics-based Evolutionary Algorithm fuzzing framework proof-of-concept . . . . . . . . . . . . . . . . . . . . . . . 49
5.1 The front page of the testbed used for measuring the effectiveness of niche-penalty GA-based web fuzzing [15] . . . . . . . . . . . . . . . . 52
5.2 Mean fitness per generation . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Median diversity of value and symbol representations per generation for GA and CHC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 Median diversity of value and symbol representations per generation for Random and Markov Model Monte Carlo . . . . . . . . . . . . . . 57
5.5 Total unique exploits per simulation and average number of exploits per trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6 Average number of exploits found in each generation . . . . . . . . . 59
List of Tables
4.1 An example preprocessing of a positive example . . . . . . . . . . . . 44
5.1 Total exploits per simulation . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Number of trials per simulation type without an exploit . . . . . . . . 60
5.3 Highest number of exploit strings found throughout a singular trial . 60
Abstract
The widespread availability and use of computing and internet resources require software developers to implement secure development standards and rigorous testing to prevent vulnerabilities. Due to human fallibility, programming errors and logical inconsistencies abound—thus, conventions for testing software are required to ensure Confidentiality, Integrity, and Availability of sensitive user data. A combination of manual inspection and automated analysis of programs is necessary to achieve this goal. Because of the massive size of many codebases, especially considering the incorporation of third-party software and infrastructure, thorough manual code review by security experts is not always an option. Therefore, effective automated methods for testing software systems are essential.
Fuzz testing is a popular technique for automating the discovery of bugs and security errors in software systems ranging from UNIX utilities to web applications. Although mutation and generation-based fuzzing have been in use for many years, fuzzers that intelligently manage test case generation are actively being researched. In particular, optimally testing web applications with limited feedback remains elusive. This research presents a use of Evolutionary Algorithms to generate test cases which expose vulnerabilities in web applications. This thesis utilizes grammatically analyzed positive examples of injection strings related to a common web vulnerability in order to build a set of attack grammars that guide fitness metrics and test case generation. In lieu of a manually written, exhaustive attack grammar, the set of attack grammars is automatically derived from positive examples. The efficacy of this algorithm is compared to other methods of solution generation, such as Markov Model Monte Carlo. Finally, two types of Evolutionary Algorithms (a Genetic Algorithm with heuristic-based repopulation criteria and CHC) are implemented in the fuzzing framework and evaluated according to their ability to effectively narrow the search space. The results demonstrate that Evolutionary Algorithms with grammar-based heuristics are able to find solutions that are grammatically similar to a corpus of positive examples, yet still unique.
Chapter 1: Introduction
As computing technologies broaden their reach to individual consumers, commu-
nities, and corporate industries, providing security and privacy of those resources
becomes an important priority. Because of the expansive reach of internet services,
and the growing interconnectivity of our world, users expect software companies to
provide Confidentiality, Integrity, and Availability of sensitive data. According to a
US census report on computer and internet use in the United States, in 2013, 83.8
percent of American households owned a personal computer, and 73.5
percent had a high-speed internet connection [23]. The International Telecommuni-
cations Union, an agency under the supervision of the United Nations, found that in
2015, 3.2 billion people were connected to the internet—a 600 percent increase since
the year 2000 [28]. With the advent of cloud computing infrastructures and browser
based applications replacing traditional desktop applications, security in the Internet
application sphere is a paramount concern of a corporation’s security posture. Thus,
the development of strategies for discovering errors and vulnerabilities in software is
an open area of both academic and industry research.
Computer security concerns have long been a part of the equation in delivering
useful and reliable software systems to end users. The first class of vulnerabilities
to emerge in consumer software systems were related to memory corruption bugs
in programs, such as Buffer/Heap Overflows and format-string vulnerabilities [22].
These vulnerabilities were the result of logical programming errors that allowed an
attacker to arbitrarily write to or read from memory which should be unavailable,
potentially leading to a full compromise of the system [14]. Despite the severity of such
bugs, early on, the impact of these errors was not substantial to end users. However,
as time progressed, and users began to trust software companies with private and
commercial data, those vulnerabilities proved very damaging, and forced developers
to adopt secure coding practices.
The modern landscape of cybersecurity involves many of the same vulnerability
pitfalls of the past 30 years, as well as a litany of new attack surfaces due to the
ubiquity of Internet activity and mobile devices. Remote code execution vulner-
abilities in web services, such as SQL injection and Cross Site Scripting (XSS), have
exploded in frequency in the past decade, and have dire implications for users and cor-
porations. Figure 1.1 shows the top four most common threats to web applications as
determined by the Open Web Application Security Project (OWASP) [41]. Because
of the widespread availability of software services, and the integration of third-party
tools and libraries, it becomes a nontrivial problem to manage the security posture of
a given application. In order to maintain a tenable consumer market base, software
companies must devote resources and manpower to develop standard procedures for
securing software. One such standard, the Microsoft Secure Development Life-cycle,
shown in Figure 1.2, demonstrates that security awareness is required in all stages
of the development process—from training developers and technicians to the design,
implementation, and maintenance of production software [8].
Software companies and service providers are responsible for rigorously uncov-
ering and fixing bugs in software, and rely on a variety of tools and techniques to
accomplish this task. Testing software for errors is vital to the software development
life-cycle. Techniques such as manual code review and static analysis provide some
insight into the behavior of an application. Unfortunately, they are insufficient when
codebases become too large, or if visibility into source code or program internals is
limited. Automated testing seeks to fill this void by quickly testing input vectors of
applications and monitoring their handling of that input. This process faces some
Figure 1.1: OWASP Top 10 web vulnerabilities shows the frequency of web-based injection attacks, and the importance of defending against them
unique challenges—first, it must seek out all the possible branches of execution (cov-
erage). Second, it should be equipped to determine if a program reaches an
unsafe state (or crashes). Third, it should develop a process for crafting the input in
an intelligent way to stress test input vectors without exhausting too much time.
1.1 Fuzz Testing for Vulnerability Discovery
One of the most effective strategies for auditing software services for vulnerabilities
is fuzz testing. Fuzzing is a technique for automatically generating crafted input,
sending it to an application, and monitoring the behavior of that application in order
to ascertain if a given input causes undefined (or nefarious) behavior [44]. The tech-
nique was developed by Barton Miller et al. at the University of Wisconsin. Miller’s
research team developed programs that crafted randomized input to test how com-
mand line utilities commonly found in UNIX-based operating systems would react.
They were able to uncover errors in 25-33 percent of UNIX utilities [47], ushering in
a new paradigm in software testing.
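The random-input approach pioneered by Miller's students can be sketched in a few lines of Python. This is a minimal illustration only; the target command, trial count, payload sizes, and timeout are arbitrary choices for the sketch, not details taken from [47]:

```python
import random
import subprocess

def random_fuzz(target_cmd, trials=10, max_len=512, seed=0):
    """Send random byte strings to a command's stdin (Miller-style fuzzing)
    and record any crashes or hangs observed."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
        try:
            proc = subprocess.run(
                target_cmd, input=payload,
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                timeout=5,
            )
            # A negative return code means the process died on a signal (a crash).
            if proc.returncode < 0:
                failures.append((payload, proc.returncode))
        except subprocess.TimeoutExpired:
            failures.append((payload, "hang"))
    return failures

# 'cat' simply echoes its input, so no crashes are expected from this target.
crashes = random_fuzz(["cat"], trials=5)
```

A real campaign would swap in the utility under test and log the failing payloads for reproduction, as described in the monitoring discussion of Chapter 2.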
Figure 1.2: The steps of the Microsoft Secure Development Life-cycle
From its inception in the late 1980s as a technique for testing UNIX utilities for
defects [47], fuzz testing has become an established technique for discovering bugs
and security vulnerabilities in software. Fuzzers have been able to uncover serious
vulnerabilities in everything from file parsers and language interpreters to network
protocols and binaries [44]. Different fuzzing strategies have advantages based on the
target in question—mutation-based fuzzing, in which various portions of a valid
test case are mutated (such as bit flipping for binary data), can uncover different sorts
of software errors than generation-based fuzzing, which creates new test cases
based on a model of expected input. Although simple fuzzing strategies still uncover
software bugs today, the advent of modern security measures for various vulnerabil-
ity classes has forced researchers to abandon simplistic fuzzing strategies in favor of
intelligent systems that optimally traverse the input search space. Modern fuzzing
strategies involve taking steps to reverse engineer the target, in order to discover
possible execution paths [42]. This situation is not always tenable—because of the
feedback limitations of black-box fuzzing for targets such as web applications, alter-
native instrumentation techniques are required. Because the search space involved in
fuzz testing is vast and often difficult to define, guided search using heuristics is a
reasonable alternative.
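The bit-flipping mutation mentioned above can be illustrated with a short Python sketch. The flip rate, seed input, and function name are illustrative assumptions for this example, not part of the thesis:

```python
import random

def bit_flip_mutate(data: bytes, flip_rate: float = 0.01, seed: int = 0) -> bytes:
    """Mutation-based fuzzing primitive: flip one random bit in roughly
    flip_rate of the bytes of a valid sample input."""
    rng = random.Random(seed)
    out = bytearray(data)
    for i in range(len(out)):
        if rng.random() < flip_rate:
            out[i] ^= 1 << rng.randrange(8)  # flip one random bit of this byte
    return bytes(out)

# Start from a valid sample so most of the structure survives mutation.
seed_input = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
mutated = bit_flip_mutate(seed_input, flip_rate=0.1)
```

Because most of the original bytes are preserved, the mutated input usually passes superficial validation while still probing the parser's assumptions.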
1.2 Fuzzing with Evolutionary Algorithms
This thesis explores a fuzzing strategy that addresses this problem using Evolutionary
Algorithms, guided by fitness metrics based on the lexical and semantic structures
represented in a corpus of positive examples. Classical fuzzing methods seek to gen-
erate fuzz data that targets boundary and format assumptions of input parsed by a
target program. For this problem, the fundamental structures of known-bad attack
strings are analyzed, and their components are used for the creation of new test cases.
Evolutionary Algorithms are a good fit for optimizing fuzzing, because they can be
easily incorporated in the process of input generation, and aid in the reduction of the
search space. Scoring candidate solutions based on semantic structure gives an Evo-
lutionary Algorithm enough structure to avoid perpetuating nonsensical candidate
solutions while allowing for freedom to discover new payloads that uncover software
vulnerabilities. The goals and advantages of the framework developed by this thesis
include:
1. An intelligent reduction of search space and combinatoric complexity
Enumerating all possible permutations representing positive examples for even
reasonably sized EA candidates quickly becomes infeasible. Grouping n-tuples
of symbols into production rules of a grammar that accept candidates of a given
type allows for more intuitive evaluation of evolved payloads.
2. The generation of unique exploit candidate solutions This system uses
an Evolutionary Algorithm to generate and select candidates derived from previously
discovered solutions. Using the searching conventions at the core of Evolution-
ary Algorithms, coupled with the language-based fitness metrics derived from a
corpus, this system attempts to generate payloads that are unique, but seman-
tically similar to positive examples of known bad payloads.
3. Development of a language-theoretic basis for guiding an EA fuzzing
framework The fitness function at the core of the proof of concept fuzzing
framework uses grammars representative of curated positive examples. These
productions approximately describe the entire corpus of known-bad injection
strings. The utilization of these grammars combined with traditional application
response monitoring provides a formal manner by which to verify the fitness of a
given solution, and can be extended to other languages and frameworks capable
of semantic analysis.
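As a hypothetical illustration of goals 1 and 3, the following Python sketch shows how a toy attack grammar (far smaller than the automatically derived grammars this thesis builds, and invented purely for illustration) can both generate candidate injection strings and score a candidate's structural similarity to the grammar:

```python
import random

# A toy attack grammar for SQL-injection-like strings (illustrative only).
GRAMMAR = {
    "<attack>":  [["<quote>", "<bool>", "<comment>"]],
    "<quote>":   [["'"], ["\""]],
    "<bool>":    [["OR", "1=1"], ["OR", "'a'='a'"]],
    "<comment>": [["--"], ["#"]],
}

def derive(symbol, rng):
    """Randomly expand a nonterminal into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])  # pick one production rule
    tokens = []
    for sym in production:
        tokens.extend(derive(sym, rng))
    return tokens

def grammar_fitness(tokens):
    """Score a candidate by the fraction of its tokens that appear as
    terminals anywhere in the grammar (a crude structural-similarity proxy)."""
    terminals = {t for prods in GRAMMAR.values() for p in prods for t in p
                 if t not in GRAMMAR}
    return sum(t in terminals for t in tokens) / max(len(tokens), 1)

rng = random.Random(1)
candidate = derive("<attack>", rng)   # e.g. a quote, a boolean clause, a comment
score = grammar_fitness(candidate)    # tokens drawn from the grammar score 1.0
```

The same production rules thus serve double duty, generating candidates and grounding the fitness metric, which is the essence of the language-theoretic guidance described above.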
The research described in this thesis is significant for several reasons. First, the
generation of semantic-structure groups learned from positive examples will guide the
Evolutionary Algorithm based on lexical tendencies of known exploitative input. Sec-
ond, the approach can use those semantic structure groups to both measure fitness
and generate payloads, creating more intuitive search guidance for the Evolutionary
Algorithm fuzzing framework. Lastly, this research represents a step towards auto-
matically building formally expressible grammars that can generate good candidate
solutions, and can be extended to test other software systems and protocols.
Consider a large scale end-user web application service that utilizes both front-
end browser-based technologies (Javascript) and back-end data stores (e.g., MySQL).
As the size of the code base increases, manual code review may fail to identify even
commonplace errors. Although static analysis and manual code review can reme-
diate some vulnerabilities, the source code of an application is not
always available. Black-box fuzzing—the term used to describe a testing scenario in
which the source code of an application is unavailable—requires less overhead than
grey/white-box methods (which have some or complete visibility into application in-
ternals, respectively). The key limitation of black-box fuzzing is its lack of source code
knowledge, and limited visibility into application internals. This research intends to
optimize black-box fuzzing by using the learned grammar structures of positive exam-
ples to promote semantically intuitive solutions and suppress non-conforming input
generated by the Evolutionary Algorithm.
The fuzz testing campaign is managed by an Evolutionary Algorithm (EA), a
searching strategy inspired by principles of biological evolution. Evolutionary Algo-
rithms search for better solutions by discovering new candidates through the recom-
bination of “fit” solutions in a population. Each chromosome (a possible solution)
consists of either a sequence of grammar symbols representing an injection string,
or grammar transitions that generate a given injection string. The Evolutionary Al-
gorithm utilizes selection, crossover, and mutation to perpetuate new generations of
chromosomes. The central idea that makes evolutionary algorithms effective is that
chromosomes with higher fitness scores will be more likely to produce offspring (i.e.,
fitter chromosomes are probabilistically more likely to be selected for creating the
next generation). Crossover involves combining two chromosomes in a manner to
produce offspring for the next generation. In the context of our system, chromosomes
will recombine grammar symbol groups or transitions in order to guide searching the
input space based on the semantic information the chromosomes encode. Mutation of
chromosomes is modeled by randomly changing symbols or grammar productions
in a given chromosome. Mutation is a necessary convention for maintaining diversity
within a population to avoid stagnation or convergence on a suboptimal plateau in
the search space.
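The selection, crossover, and mutation operations described above can be sketched as a minimal generational Genetic Algorithm in Python. The symbol alphabet and fitness function here are toy stand-ins for the grammar-derived versions this thesis uses, and all names are illustrative:

```python
import random

def evolve(population, fitness, generations=20, mutation_rate=0.05,
           alphabet=("'", "\"", "OR", "1=1", "--", "#", "ADMIN"), seed=0):
    """A minimal generational GA: fitness-proportional selection,
    single-point crossover, and per-symbol mutation.
    Chromosomes are symbol sequences of length >= 2."""
    rng = random.Random(seed)
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        total = sum(scores) or 1.0
        def select():
            # Roulette-wheel selection: fitter chromosomes are likelier parents.
            r = rng.uniform(0, total)
            acc = 0.0
            for chrom, s in zip(population, scores):
                acc += s
                if acc >= r:
                    return chrom
            return population[-1]
        next_gen = []
        while len(next_gen) < len(population):
            a, b = select(), select()
            point = rng.randrange(1, min(len(a), len(b)))  # single-point crossover
            child = a[:point] + b[point:]
            child = [rng.choice(alphabet) if rng.random() < mutation_rate else sym
                     for sym in child]                      # mutation keeps diversity
            next_gen.append(child)
        population = next_gen
    return population

# Toy fitness: reward chromosomes containing tokens typical of SQL injection.
fit = lambda c: sum(tok in ("'", "OR", "1=1", "--") for tok in c)
pop = [["ADMIN", "#"], ["'", "ADMIN"], ["OR", "#"], ["ADMIN", "ADMIN"]]
final = evolve(pop, fit)
```

In the framework of Chapter 4 the chromosomes encode grammar symbols or transitions, and the fitness function is the grammar-based heuristic rather than this hand-coded token count.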
1.3 Contributions
The research outlined in this document contributes in the following ways:
1. Produces a framework for fuzzing at the application-level using Evolutionary
Algorithms (EA) to evaluate and create input.
2. Explores the utilization of grammar-based fitness evaluation for guiding an Evo-
lutionary Algorithm-based fuzzer.
3. Analyzes the effectiveness of the proof of concept solution compared to other
fuzzing methods, such as brute force, hand-selected payloads, and Markov
Model Monte Carlo input generation.
The thesis proceeds as follows: the second chapter discusses the history and compo-
nents of fuzz testing, various techniques for uncovering vulnerabilities with automated
testing, and the advantages and limitations of those methods. The third chapter de-
scribes Evolutionary Algorithms through discussion of their various manifestations
and use cases. The fourth chapter outlines the incorporation of EAs into fuzzing
frameworks, and the strategies that represent the proof of concept approach to EA-
based application fuzzing. Chapter five explains the testing environment by which
our methods are evaluated, including an analysis of the results. Finally, chapter six
draws conclusions based on the results of our testing, and discusses avenues of future
research.
Chapter 2: Fuzzing
2.1 History of Fuzzing and Its Fundamental Components
The process of testing software for logical errors and security bugs has long been a
part of the recommended software development life-cycle [38]. Most nontrivial systems
depend on a vast number of moving parts. From the design and development stages
all the way to production deployment and maintenance, many different people write
many lines of code, creating a situation in which errors are unavoidable. In order to
alleviate the impact of serious bugs in actively developed software, rigorous testing
of software is vital. For example, regression analysis ensures that new features and
changes made to an existing codebase do not introduce new (or previously observed)
software errors. Static code analysis performs source code checking to detect common
errors that introduce security-related vulnerabilities. Both of these approaches are
necessary weapons in the testing arsenal, but sometimes fail to account for unexpected
input, or detect incorrect implementations of correct processes. Automated input
testing fills this void by sending crafted input to a running instance of the program,
monitoring the program’s response for undefined behavior or traversal into an unsafe
or inoperable state. Although dynamic analysis of software can be time consuming
and resource intensive, it allows developers to cover the spectrum of necessary tests
for ensuring software security when used in concert with static and manual analysis.
One such methodology of this testing ilk is referred to as fuzzing.
Fuzz testing was officially formalized in an academic setting by Barton Miller
et al. at the University of Wisconsin–Madison [47]. The idea of sending random
input to popular UNIX utilities was first explored when a thunderstorm tampered
with Miller’s remote connection, sending random character input to his terminal and
crashing the UNIX programs in use. Because the thunderstorm interfered with the
modem used for the remote session, Miller called the idea “fuzzing” [50]. Inspired
by the phenomenon, Dr. Miller designed a lab for the graduate students in his oper-
ating systems course, instructing them to write programs which send random input
to common UNIX utilities and monitor the results. One student’s submission un-
covered parsing vulnerabilities in many command line utilities that were at the time
considered stable. Thereafter, a software testing group formally explored fuzzing
and demonstrated the widespread input handling errors that plagued many UNIX
programs [47]. Today, fuzzing is an indispensable tool for security researchers and
software development teams responsible for testing and quality assurance. In order
to successfully deliver a fuzz testing campaign, fundamental components must be
in place for creating input data and monitoring program behavior. General
steps involved in fuzz testing methods are shown in Figure 2.1. Sutton et al. in
their comprehensive text on fuzzing methodologies, stipulate that although fuzzing
methodologies vary widely, following certain guidelines is more likely to yield
results [44].
The first step involved in the fuzzing process is to identify a target for the testing
campaign. In the literature, the target in question is often referred to as a “System
Under Test” (SUT) [39]. Although it is fair to say that any software system that
receives and processes user input is fair game, fuzzing is most effective when input
directly affects the program’s state in a measurable way. For example, fuzzing is
less effective for identifying problems in cryptographic methods, because monitoring
vulnerabilities is difficult, as opposed to fuzzing file format parsers, where the effects
of the input (a crafted file) on the application are easy to monitor [44]. Fuzzers
are especially good at identifying implementation problems in parsers—programmers
Figure 2.1: The general steps involved in fuzzing campaigns
often make assumptions that expose applications to risks in edge cases, which lead
to problems in everything from memory corruption to remote code execution of web
apps [43,51]. The best targets for fuzzing require knowledge of the protocol, of how
input is interpreted, and a reliable way of determining an unsafe state.
The second step in the fuzzing process requires one to identify input channels
of a target application [44]. Identifying the input channels of a target application is
vital to reliably carry out a fuzzing campaign—otherwise, there would be no reason to
conduct fuzz testing in the first place. Although this point is obvious, the more subtle
implication of this step is that when designing a fuzzer, it is vital to have knowledge
of the application’s points of input (which are, in essence, its attack surface). Often,
software errors are the result of incomplete or incorrect assumptions about
input data, and fuzzers are an effective technique for exposing those errors. To this
point, model-based security testing is often combined with fuzzing, since modeling
the flow of an application reveals its attack surface effectively [20]. Identifying the
set of inputs for a given target also informs the manner in which data generation
is conducted—knowing the protocols involved or the structure of input allows for
intelligent design decisions (e.g., one would fuzz file format parsers in a completely
different way than a login form for a web page). Furthermore, even with
simple fuzzing techniques such as random mutation, it is vital to preserve fields of
input that an application interprets in order to be effective [44].
In order to make certain that a given testing campaign is properly searching the
input space within a reasonable amount of time and computing resources, the task of
data generation is at the heart of an effective fuzzing strategy. Exhaustively
enumerating all possible inputs for a given target would be untenable from
the standpoint of time and computing power. Target information and a knowledge of
the attack surface (determined by the inputs available for a given target application)
guide the components available for data input generation [44]. As an example, for
file format parsers and network protocols, specifications for the format of data will
be agreed upon as standard and (hopefully) documented in an RFC or equivalent
text. Even for proprietary protocols, it is possible to capture traffic or examples and
make assumptions about the underlying structure [44]. It is important to generate
negative test cases that the SUT will evaluate as structurally expected input—Holler
et al., in their framework for fuzzing web browser Javascript interpreters, were able to
uncover numerous bugs in Javascript interpreters by requiring their fuzzer to generate
test cases that were syntactically valid [36]. The efficacy of protocol fuzzers falls in
the same category, as the preservation of packet structure is vital for testing the
underlying processing frameworks instead of being cast aside by basic error checking.
In order to achieve quality test results, the data generation stage must be sensitive
to the protocol at hand. For network protocols in particular, RFC information will
allow a developer to determine the structure of a packet (and a stream of packets). The
various structural components of the protocol will be the places in which to insert
generated data, so it is vital for the fuzzer to comply with the standard set for field
widths and sequential order.
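As an illustration of structure-preserving data generation, the following Python sketch fuzzes only the payload of a hypothetical packet format while keeping field widths and order intact, so the packet survives basic validation. The "PK" magic, version byte, and big-endian length field are invented for this example, not drawn from any real protocol:

```python
import random
import struct

def fuzz_packet(payload: bytes, seed: int = 0) -> bytes:
    """Build a packet for a hypothetical protocol with a fixed header
    (2-byte magic, 1-byte version, 2-byte big-endian length), fuzzing only
    the payload so the header fields remain standard-conforming."""
    rng = random.Random(seed)
    fuzzed = bytes(rng.randrange(256) for _ in range(len(payload)))
    # Header widths follow the (invented) specification: ">2sBH" = 5 bytes.
    header = struct.pack(">2sBH", b"PK", 1, len(fuzzed))
    return header + fuzzed

pkt = fuzz_packet(b"hello")  # 5-byte header + 5-byte fuzzed payload
```

Because the magic, version, and length fields remain valid, a receiver's basic error checking passes the packet through to the deeper parsing logic that the fuzzer is actually trying to stress.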
After those steps are completed, the crafted data is sent to the application. Sutton
et al. explain in great detail a variety of input types where fuzz-generated data
should be injected for a given target [44]. For local applications such as desktop
binaries, command-line arguments and environmental variables are prime channels
by which to submit fuzzed data. Remote fuzzing campaigns, such as fuzzing an
FTP session or web applications, involve slightly more setup—fuzzer-generated input
sent over networks is often delivered by a tool such as scapy or by a combination of a
browser emulator and an HTML parsing library [1,12,13]. Effective fuzzing campaigns
require information gathered from target application analysis in order ensure the
application does not ignore the crafted input because of failure to meet the required
data specifications. In fact, when the feedback loop is limited, as in remote fuzz testing, it is important to carefully format test cases. In this research, instead of relying on application feedback to develop properly structured test cases, semantic structures from positive examples are evaluated and used to intelligently guide test case generation. The efficiency of this approach is discussed in later chapters.
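For web targets in particular, delivering a test case usually reduces to encoding the payload into an HTTP request. The sketch below uses only the standard library; the target URL and parameter name are placeholders, not real endpoints.

```python
from urllib import parse, request

def build_request_url(base: str, field: str, payload: str) -> str:
    """URL-encode a crafted payload so the transport layer does not
    mangle it before it reaches the target's parameter parser."""
    return base + "?" + parse.urlencode({field: payload})

def send_test_case(url: str, timeout: float = 5.0) -> bytes:
    """Deliver one test case over HTTP and capture the response body
    for the monitoring stage."""
    with request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

url = build_request_url("http://target.example/vuln.php", "id", "1' OR '1'='1")
```

Encoding matters here: an unescaped quote or space can cause the request to be rejected before the payload ever reaches the application's input handling.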
Once the fuzzer’s crafted input has been sent to the application, it is necessary
to monitor the SUT for its response and to ensure that the results are thoroughly recorded
for further analysis. In the ideal environment, a monitoring harness is watching the
execution of the target system, and is alerted when exceptions, crashes, or interrupts
are invoked. In the event that a fuzzer sends a negative test case that causes an
error, for the sake of reproducibility, metadata regarding the program’s state as well
as the crafted input itself should be logged for analysis [44]. The human element
of fuzzing in this stage is involved in analyzing these events to identify whether or
not the root cause of an error has security implications. Although utilities such as
!exploitable are useful for attempting to measure exceptions and crash states according
to their potential exploitability, they are still limited [52]. These methods are adept at
discovering exploit potential in memory corruption errors, but there is a fundamental
limit to what computer programs can determine regarding the security posture of
another system. At this stage, the results of the fuzzing campaign are best left for
human judgment and inspection.
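A minimal monitoring harness along these lines can be sketched with the standard library. The crash criterion below (a signal-terminated process on POSIX) and the log format are simplifying assumptions, not a general-purpose triage tool.

```python
import json
import subprocess

def run_and_monitor(target_cmd, test_case: bytes, log_path: str) -> bool:
    """Run the SUT once with a crafted input on stdin; on abnormal
    termination, log the input and exit status for reproducibility."""
    proc = subprocess.run(target_cmd, input=test_case,
                          capture_output=True, timeout=10)
    crashed = proc.returncode < 0  # on POSIX, negative means killed by a signal
    if crashed:
        with open(log_path, "a") as log:
            # hex-encode the input so arbitrary bytes survive the JSON log
            log.write(json.dumps({"returncode": proc.returncode,
                                  "input": test_case.hex()}) + "\n")
    return crashed
```

Logging the exact input alongside the exit status is what makes a crash reproducible for the later human analysis step.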
The efficiency of a fuzzer is typically measured according to how well it finds bugs,
and this is directly correlated to a fuzzer’s ability to explore the execution paths of a
target application. This is referred to as “code coverage” and is a common metric for
measuring the success of a fuzzer. Sutton et al. remark that this is a measurement
of the “amount of process state a fuzzer induces a target’s process to reach and
execute” [44]. Fuzzers attempt to find vulnerabilities and crash states according to
mishandling of user input, and the best way to measure if a fuzzer is likely to uncover
such a bug is to measure the percentage of execution paths that are covered. Another
important piece of measuring the efficacy of a fuzz testing campaign or technique
is at the monitoring stage—error detection is vital to determining if a given input
causes a crash via a debugger or other heuristics. Finally, resource constraints are
ever-present, and require the developers of fuzzers to design and implement efficient
code. All of these things together are used as a measuring stick for fuzzing tools and
techniques.
2.2 Present Day Fuzzing
2.2.1 General Purpose Tools and Techniques
Miller et al.'s first paper on testing UNIX utility reliability with random input demonstrated the simplicity of software testing at that time. Before then, software systems were primarily tested for accuracy and efficiency down “happy paths”, or under execution environments that were expected according to the specifications of the software itself. Although these testing methodologies are able to determine whether or not a piece of code performs calculations correctly and efficiently with expected input, they do nothing to examine the manner in which a program handles
malformed (or malicious) data. The arrival of fuzz testing brought with it a new
mindset on testing approaches, and the responsibility of programs to reliably and
safely handle input.
In concert with Miller et al.’s approach, the first fuzzing tools were concerned
with injecting random data as input to applications, monitoring for exceptions and
crashes [44]. In the early stages, this method was very effective: most software was
not written defensively, or with any sort of security mindset. Therefore, fuzz testing
was adept at triggering errors such as memory access violations because they placed
programmer assumptions under stress through unexpected input. Although today
these methods would be considered elementary, they set the stage for academic and
industry research for intelligent, informed software testing. Mutation-based fuzzing is
the name for the process of taking a valid input sample (one which would be correctly parsed by the program in question), mutating it semi-randomly, and using the result as a negative test case against a system under test (SUT). The primary advantage of
this method is speed: minimal setup is required, and because little or no time is
spent modeling the structure of the data, fuzzing campaigns are executed relatively
quickly. File formats and plaintext protocols with easily identifiable field values are
prime targets for mutation fuzzing. One such manifestation of Mutation-based fuzzing tools is zzuf : this fuzzer intercepts network traffic and file input and performs “bit flipping” on program input, which is literally the act of randomly changing a variable's bit from “0” to “1” (or vice versa) [18]. The disadvantage of this system is that its success is contingent upon the quality of the available samples [6].
Because of the chaotic nature of randomly flipping bits of input, the fuzzer runs the risk of mangling the test case beyond the point at which the target program will even accept it. The deficiencies of pure Mutation-based fuzzing prompted researchers to
develop fuzzers that made use of data modeling and other analytical approaches.
The other main category of fuzz testing strategies is Generation-based fuzzers.
Generation-based fuzzers seek to make a model of the data accepted by a target
application, and comply with its specification (or violate it, but intelligently) when
injecting crafted input. Sutton et al. refer to Mutation-based fuzzers as “dumb brute
force”, and Generation-based fuzzers as “intelligent brute force”: although the input
crafting mechanism for both methods relies on randomly changing input, Generation-
based fuzzers go to great lengths to ensure a test case follows the specification of a data
model [44]. Peach, a popular Generation-based fuzzing tool, requires a user to create
data models for the framework to use as guides for its fuzzing engines [11]. These
“peach pits” are the basis for the input generation phase of the fuzzing campaign.
An example is shown in Figure 2.2. The Peach fuzzing framework requires an XML-
structured description of the data model, as well as type-specific information useful
for fuzz testing campaigns (e.g., field length, expected content type, delimiters, etc.).
Although clearly Generation-based fuzzing requires more work up front to model the data in question, its efforts are not without results. Miller and Peterson demonstrated that for modern targets, Generation-based fuzzing was much more efficient, able to
Figure 2.2: An example excerpt of a peach pit used for Generation-based fuzzing [11]
achieve coverage 76 percent better than mutation-based fuzzing of the same targets
[48].
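The idea behind such data models can be illustrated without Peach itself. The sketch below uses an invented, dict-based stand-in for a peach pit: each field declares its kind and constraints, and the generator honors field order and widths while randomizing only the content inside each field.

```python
import random

# A tiny stand-in for a Peach-pit-style data model. The field names
# and record layout are invented for illustration.
MODEL = [
    {"name": "verb",   "kind": "choice", "values": ["GET", "PUT", "DELETE"]},
    {"name": "path",   "kind": "string", "max_len": 32},
    {"name": "length", "kind": "number", "width": 16},
]

def generate(model, rng):
    """Emit one record that follows the model's field order and
    declared widths, fuzzing only the content within each field."""
    parts = []
    for field in model:
        if field["kind"] == "choice":
            parts.append(rng.choice(field["values"]))
        elif field["kind"] == "string":
            n = rng.randint(1, field["max_len"])
            parts.append("".join(chr(rng.randint(33, 126)) for _ in range(n)))
        else:  # number constrained to its declared bit width
            parts.append(str(rng.randrange(2 ** field["width"])))
    return " ".join(parts)

record = generate(MODEL, random.Random(7))
```

Every generated record is structurally valid by construction, which is the property that lets Generation-based test cases penetrate past a target's initial input validation.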
One of the most effective methods of boosting these algorithms involves boundary checking. Boundary checking is the act of testing values at the edges of expressible values in order to trigger vulnerabilities such as Integer Overflows, which lead to memory access violations and can lead to a full compromise of the target system [44]. When these tests are included alongside random mutations, inside fields specified by a data model in Generation-based fuzzing, the likelihood of triggering vulnerabilities increases substantially. A figure from Sutton et
al. demonstrating common boundaries to test is shown in Figure 2.3. This example
shows a group of interesting values for testing the boundaries of certain data type
widths (MAX32, for example, referring to the maximum value for a 32-bit integer).
Test case generation for classical fuzzing seeks to create values with high potential to
cause problems or expose improper programmer assumptions, and boundary testing
with the values shown in Figure 2.3 is an effective means to this end.
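These recommendations can also be generated programmatically. The helper below is a simplified take on the same idea: it emits the edges of a given integer width plus their off-by-one neighbors, which stress signed/unsigned and overflow assumptions.

```python
def boundary_cases(bits: int):
    """Return interesting test values around an integer width:
    zero, the signed maximum, the signed rollover point, and the
    unsigned maximum, each with its off-by-one neighbors."""
    umax = 2 ** bits - 1        # e.g., MAX32 for bits=32
    smax = 2 ** (bits - 1) - 1  # largest signed value
    cases = set()
    for anchor in (0, smax, smax + 1, umax):
        for delta in (-1, 0, 1):
            cases.add(anchor + delta)
    return sorted(cases)

MAX32_CASES = boundary_cases(32)
```

A Generation-based fuzzer would substitute these values into numeric fields of the data model rather than drawing them uniformly at random.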
Figure 2.3: Boundary testing recommendations according to Sutton et al. [44]
2.2.2 Modern Fuzzing
Modern fuzzing techniques still follow the same basic process described above, but
with more refined metrics for providing intelligent searching of execution paths and
data generation. Reverse engineering binaries and file format parsers to enumerate their execution paths has recently been a standard method for optimizing fuzzing. For
local binaries, it is possible to attach debuggers to running processes not only to
monitor changes to the target application, but also to make sense of its execution paths [44].
Seagle, in his framework for file format fuzzing, made use of a reverse engineering
framework which found the execution paths of his target and fed that information
back to the fitness function of his Genetic Algorithm [53]. The class of fuzz testing
techniques that attach debuggers to running processes and use that channel to inject test cases is called in-memory fuzzing. In-memory fuzzers are very effective because of their visibility into application internals: it is possible to set a breakpoint, save the machine state, execute crafted input, and return to the saved state, allowing fine-grained control over executing the program with sent
data and monitoring its response [44]. Vincenzo Iozzo demonstrates that expensive
reverse engineering operations for preprocessing can be avoided by applying function
hooking during the fuzzing process, and pruning execution paths explored through
the combination of real-time debugging and heuristics. By measuring the cyclomatic
complexity of a given function and performing loop detection, fuzzers can use these heuristics to search program execution paths more efficiently, thereby achieving more effective code coverage [40]. Other approaches to fuzzing involve measuring the
influence of injected data on program state, known as taint analysis. This method is
simply the process of marking process states where untrusted data has been injected
or evaluated, in order to help fuzzing heuristics make better decisions regarding data
generation and code coverage. Bekrar et al. demonstrate that traditional fuzzing frameworks informed by metaheuristics and taint analysis allow for more efficient determination of exploitable bugs [19]. Iozzo also relies on this method in his framework, which allows for measuring data's propagation through the execution of a program [40].
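The core idea of taint tracking, marking untrusted data and noticing when it reaches a sensitive operation, can be sketched at the source level. Real taint analysis engines operate on binaries or intermediate representations; this toy string-level version is only illustrative.

```python
class Tainted(str):
    """A string subclass that marks user-controlled data and keeps
    the mark when the value is concatenated into larger strings."""
    def __add__(self, other):
        return Tainted(str(self) + str(other))
    def __radd__(self, other):
        return Tainted(str(other) + str(self))

def sink(query: str) -> bool:
    """A stand-in for a sensitive operation (e.g., SQL execution);
    reports whether untrusted data has propagated into its argument."""
    return isinstance(query, Tainted)

user_input = Tainted("1' OR '1'='1")
query = "SELECT * FROM users WHERE id = '" + user_input + "'"
```

Because concatenation with a `Tainted` value yields another `Tainted` value, the mark survives the journey from input to sink, which is precisely the propagation property taint analysis exploits.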
2.2.3 Web Application Fuzzing
Application level fuzzing for web applications has uncovered bugs ranging from mem-
ory corruption vulnerabilities in underlying system software, to data exfiltration and
session hijacking via SQL injection and Cross-Site Scripting (XSS) [44]. For the pur-
poses of this thesis, a web application or web service (used interchangeably in this
document) is a computer program that is executed on a remote server which responds
to clients that connect via the HTTP protocol. The web application targets of inter-
est in this research are those which receive and process input from a client. Fuzzing
web applications for injection vulnerabilities is intuitive, because the process of identifying a target's attack surface is trivially easy. Furthermore, although application internals are not usually available, much work has been done in curating sets of effective injection strings for a wide range of web vulnerabilities [49]. Despite its intuitive
nature from the standpoint of target identification and input generation, application-
level fuzzing of web services suffers from a fatal flaw—most of the time, visibility into
a target application’s internals is limited or nonexistent. This means that measuring a
fuzzer based on code coverage requires monitoring capabilities that are not available.
Even so, the amount of work required to set up such a system forces testers and researchers to question whether or not other testing methods (or manual review) are more suited to the task at hand. Thus, the monitoring requirement of fuzzing must
be modified in order to determine whether or not a vulnerability has been uncovered.
At the application level, this could simply involve parsing the resulting HTML for a
desired response, such as in Duchene et al. [26].
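One simple realization of such response parsing is to check whether an injected marker survives into a script context of the returned HTML, which suggests the payload escaped output encoding. The sketch below uses the standard library parser; the marker string is arbitrary.

```python
from html.parser import HTMLParser

class ReflectionDetector(HTMLParser):
    """Scan a response body for an injected marker appearing inside
    a <script> element, i.e., in an executable context."""
    def __init__(self, marker: str):
        super().__init__()
        self.marker = marker
        self.in_script = False
        self.reflected = False
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if self.in_script and self.marker in data:
            self.reflected = True

def payload_reflected(body: str, marker: str) -> bool:
    detector = ReflectionDetector(marker)
    detector.feed(body)
    return detector.reflected
```

The distinction between the marker appearing as inert text and appearing inside a script context is what separates a harmless echo from a likely XSS finding.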
Most web application fuzzing frameworks contain a crawler that finds webpages
and potentially vulnerable input forms, and a set of known bad payloads that encom-
pass typical exploit vectors, such as directory traversal, SQL injection and Cross-Site
Scripting (XSS) [44]. Tools such as Burp Proxy have the ability to spider an application and apply a set of test cases to a given input form [2]. Another widely popular web application security tool is w3af, a web service scanner that uses a curated set of well-known injections to test for common web application vulnerabilities [17]. In the
same vein, the now inactive JBroFuzz provided a general purpose GUI-based fuzzing
framework that covered a range of scanning techniques for discovering web application vulnerabilities [9]. Most of these tools are adept at covering low-hanging fruit, and do not provide any exploitation information—they can only determine if a given
entry point is potentially exploitable.
A large majority of research questions explored regarding fuzzing and web appli-
cation targets involve testing a web service for client-side Javascript execution bugs.
Tripp et al., in their research on optimizing Cross-Site Scripting vulnerability test-
ing, parsed the output following the execution of a negative test case, and used the
information to prune their set of test cases, culling a large set of positive examples based on the web application's response [57]. Their research shows that the previously significant problem of a limited feedback loop in web application fuzzing can be mitigated by using heuristics and response analysis to guide input generation. By charac-
terizing their large corpus of Cross-Site Scripting (XSS) examples according to their
tokens, injected payloads that were filtered or otherwise rejected inform the next test
cases attempted. Tripp et al. demonstrate the ability to uncover vulnerabilities by
e�ciently pruning the space of test cases from their original corpus [57].
The technique of using heuristics to generate test cases is very similar to taint
analysis, and is a popular method of information gathering in the monitoring stage of
the fuzzing cycle. Model inference testing in the fuzz testing space attempts to determine the impact of a negative test case upon a target application. This
information can be utilized as feedback for the intelligent generation of new test cases.
Duchene et al. used taint analysis along with a grammar-based genetic algorithm to
uncover Cross-Site Scripting (XSS) vulnerabilities [26]. Wang et al. used a hidden
Markov model based on Bayesian probability distributions to generate test cases
for uncovering Cross-Site Scripting (XSS) vulnerabilities. Their work theorizes that
injections are the combination of attack vector and payload, making the primary goal
to determine the attack vector necessary for injecting a payload. Similar to Tripp et
al.’s research focus of tree pruning based on application response to injection, Wang
et al. attempts to learn from the target application’s response to crafted input and
probabilistically generate new injection payload according to Bayesian probability
distribution [59]. Although the results of their methods contained numerous false
positives, their approach has merit—building a probabilistic model for generating
tokens allows for search space flexibility not offered by Tripp et al. [59]. This covers one
of the main disadvantages of common fuzzing frameworks, which were not historically
designed to intelligently guide the manner in which test cases are generated.
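A toy version of this probabilistic token assembly might look as follows. The token graph and its weights are invented here, whereas in the cited work they would be learned from the target's responses to earlier injections.

```python
import random

# Weighted token transitions for assembling an injection payload.
# Tokens and weights are invented for illustration; a real system
# would update these from how the target filtered earlier attempts.
TRANSITIONS = {
    "START":    [("<script>", 0.6), ("<img ", 0.4)],
    "<script>": [("alert(1)</script>", 1.0)],
    "<img ":    [("src=x onerror=alert(1)>", 1.0)],
}

def generate_payload(rng: random.Random) -> str:
    """Walk the weighted token graph from START until a terminal
    token (one with no outgoing transitions) is emitted."""
    token, out = "START", []
    while token in TRANSITIONS:
        choices, weights = zip(*TRANSITIONS[token])
        token = rng.choices(choices, weights=weights)[0]
        out.append(token)
    return "".join(out)

payload = generate_payload(random.Random(3))
```

Raising or lowering a transition weight after observing the target's response is the feedback mechanism that distinguishes this approach from uniform random generation.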
Figure 2.4: A vulnerable input form that can be exploited using SQL injection
An example of a PHP script vulnerable to SQL injection is listed in Figure 2.4,
courtesy of a purposely vulnerable web application made for testing and training
called DVWA [5]. In this case, a PHP script receives the id parameter from a GET
request. The user-controlled data in the $id parameter is evaluated on the server machine as raw SQL, allowing a malicious user to execute arbitrary commands. This
can lead to data exfiltration, or even complete compromise of the backend machine.
In particular, this case demonstrates a system oblivious to security concerns. Most
modern web applications contain some measure of security, in the form of blacklists or
regular expressions that attempt to filter out and/or detect malicious input. Hansen
and Patterson show the ineffectiveness of using regular languages and pattern-based filters to defeat malicious input that is by nature context-free [34]. In the spirit of
their development of a language theoretic basis for security, this research attempts
to guide fuzzing based on the lexical structure of positive examples. By focusing on
approximating the “attack language” for a given class of vulnerability, this research
explores the use of language theory to optimize fuzz testing, and move towards a
verifiable language-based reasoning for the e↵ectiveness of certain test cases.
2.3 Fuzzing and Genetic Algorithms
Evolutionary Algorithms are a prime candidate for optimizing fuzz testing by in-
fluencing input creation in an intelligent way. Many researchers have successfully
incorporated Genetic Algorithms into their fuzzing frameworks for targets ranging
from file formats to network services and web applications [26, 42, 56]. Over the past
decade, fuzz testing of applications which utilize evolutionary algorithms to intelli-
gently guide input generation has seen tremendous success in both academic and
applied settings [32, 53]. Thanks in no small part to the power of modern hardware and the availability of distributed systems, the complexity concerns that once rendered the use of genetic algorithms inefficient are no longer prohibitive [53].
Sherri Sparks et al. proved the value of Genetic Algorithms in optimizing solution
space searching for fuzzers in the development of their program called SIDEWINDER
[56]. After disassembling the System Under Test, execution paths of interest are enumerated based on whether or not they contain an unsafe function call [56]. Then subgraphs containing these functions of interest (because of their propensity to be involved in unsafe operations) are separated for further analysis. The next step involves
their Genetic Algorithm—each chromosome encodes production rules of a Context-
Free Grammar (CFG), which are then used in conjunction with probabilities of path
traversal across known problematic subgraphs [56]. At every point of execution for
a given negative test case, the probabilities of going to the next node in the graph
are calculated. Fitness is boosted if new edges are explored, and the Markov Model
heuristic used at the core of their fitness function is updated [56]. This research
demonstrates the efficacy of using Genetic Algorithms and Context-Free Grammars
to create new test cases based on program feedback. Similarly, DeMott et al. performed grey-box evolutionary fuzzing on targets, but used a more traditional Genetic Algorithm search heuristic [25]. Grey-box fuzzing assumes that source code for a target application is unknown, but that binary internals (including assembly code) are available for analysis. In the same manner as Sparks et al., DeMott et al. first reverse engineer the target application to locate and categorize its execution paths [25].
Based on a valid sample test case, their framework builds a population and performs traditional
Genetic Algorithm operations on individual chromosomes. Fitness is scored based on
how many branches of execution a given test case follows, and its distance between
the current node of execution and a desired target (determined during static analysis).
Test cases that find new branches are especially promoted within a pool of candidate
solutions [25]. The work of Sparks et al. and DeMott et al. represents an important
step in using Genetic Algorithms and a heuristics feedback loop to optimize fuzzing
strategies.
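A fitness function in the spirit of this description could be sketched as follows; the weights are illustrative and are not taken from the cited paper.

```python
def fitness(branches_hit: set, all_seen: set, distance: int,
            max_distance: int = 100) -> float:
    """Score a test case by execution breadth, with a bonus for
    branches no earlier case reached and a reward for closeness to
    a statically chosen target node. Weights are invented."""
    new_branches = branches_hit - all_seen
    closeness = (max_distance - min(distance, max_distance)) / max_distance
    return len(branches_hit) + 5.0 * len(new_branches) + closeness

seen = {"b1", "b2"}
score_novel = fitness({"b1", "b3"}, seen, distance=10)  # found a new branch
score_stale = fitness({"b1", "b2"}, seen, distance=10)  # only known branches
```

The heavy weight on previously unseen branches encodes the promotion of novelty-finding test cases described above, while the distance term steers the search toward the statically chosen target.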
Roger Seagle explored the efficacy of incorporating a nonstandard Genetic Algo-
rithm called CHC to perform file format fuzzing [53]. His fitness heuristics combine
execution graph heuristics as well as considerations regarding function characteristics
(e.g., the number of arguments and local variables, number of assembly instructions,
etc.) [53]. A catalogue of the fitness function considerations is shown in Figure 2.5.
The resulting work, a distributed fuzzing framework entitled “Mamba”, is a collection of Genetic Algorithm-based fuzzing strategies that were able to find more unique defects than comparable file format fuzzing tools [53].
Figure 2.5: Fitness heuristic categories considered by Seagle [53]

As mentioned in the previous section, Fabien Duchene et al. demonstrate the ability of Genetic Algorithms alongside model inference and taint analysis to produce fuzzed data that uncovered Cross-Site Scripting (XSS) vulnerabilities in a reliable manner. Cross-Site Scripting vulnerabilities emerge when front-end web code does not safely escape dynamic
HTML and Javascript. Using a crafted input string, an attacker can execute code that
can be used to perform session hijacking and remote code execution [4]. Duchene et al.
used a Genetic Algorithm and taint-based heuristics to perform fuzzing on a variety
of purposefully vulnerable testing websites [26]. Similar to Sparks et al., this research
encoded chromosomes of a population as productions of a manually written “attack
grammar” tailored to uncovering Cross-Site Scripting (XSS) vulnerabilities. In this
document, the term “attack grammar” describes a grammar used for the creation of
strings with a propensity towards uncovering a class of vulnerabilities related to input
injection (e.g., SQL or LDAP injection, Cross-site Scripting, etc.). This grammar can
be developed manually by an expert, or inferred from positive examples. Although
it is impossible to exhaustively account for all the strings of a given language which
uncover injection vulnerabilities, human intuition (or automatic inference from posi-
tive examples) can approximate the fundamental grammatical components involved.
Duchene et al. succeeded in outperforming such tools as w3af and JBroFuzz [26].
An excerpt of the attack grammar is shown in Figure 2.6 [26]. The open problem
of automatically deriving attack grammars discussed in this work is the basis for the
research discussed in this thesis.
Figure 2.6: An excerpt of a manually-written attack grammar for finding Cross-Site Scripting vulnerabilities [26]
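Random derivation from such a grammar is straightforward to sketch. The productions below are invented for illustration and are far smaller than a real attack grammar.

```python
import random

# A toy attack grammar in the spirit of a manually written XSS
# grammar. Nonterminals map to alternative productions; any symbol
# absent from the map is a terminal token.
GRAMMAR = {
    "ATTACK": [["OPEN", "JS", "CLOSE"]],
    "OPEN":   [["<script>"], ["<ScRiPt>"]],
    "JS":     [["alert(", "NUM", ")"]],
    "NUM":    [["1"], ["1337"]],
    "CLOSE":  [["</script>"]],
}

def derive(symbol: str, rng: random.Random) -> str:
    """Expand a symbol by recursively choosing productions at random;
    terminals are returned as-is."""
    if symbol not in GRAMMAR:
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return "".join(derive(s, rng) for s in production)

test_case = derive("ATTACK", random.Random(0))
```

Every string derived this way is syntactically shaped like an attack, so an evolutionary search can encode its chromosomes as choices of productions rather than raw characters.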
2.4 Grammar Fuzzing
The term Grammar Fuzzing refers to the subset of fuzzers whose targets are language
parsers, compilers, and runtimes. Grammar fuzzing has been successful in uncovering
a high number of web browser bugs and vulnerabilities due to incorrect Javascript
parser implementations. Fuzz testing against language interpreters has a long history
of success at finding parser vulnerabilities in software. One of the most frequent
targets of this type of fuzzing is the web browser. Zalewski's mangleme browser fuzzer is one of many tools aimed at testing a browser's Javascript parser for implementation bugs [44]. The mangleme fuzzer was designed to cause crash states
as a result of improper handling of HTML input. Guo et al. developed a technique
for testing Javascript parser engines by taking fragments of valid Javascript code,
and reorganizing them to produce negative test cases [33]. Holler et al. developed
a similar tool for testing web browser Javascript parsing engines in Firefox [36].
Their fuzzing framework was incorporated into a regression testing suite for Firefox.
The fuzzer ran against each new release, generating negative test cases by recombining fragments of Javascript code into new inputs for the interpreter [36]. Using this method, their team uncovered
160 bugs in the Mozilla Firefox browser's Javascript parsing engine [36]. In parallel but unrelated work, Yang et al. developed a language fuzzing tool called “blendfuzz”,
which used “grammar aware mutation” to take valid test cases and rearrange valid
subgraphs to produce test cases that test a language interpreter’s correctness [61].
2.5 Advantages and Limitations
Fuzz testing is an effective technique for uncovering errors in software systems that
process user-controlled input. This has special gravity in the context of a program’s
security posture—oftentimes, fuzzers find bugs that lead to an attacker being able to
craft input that can exfiltrate sensitive data, cause denial of service, or lead to remote code
execution.
One of its most important advantages is the speed with which fuzzing frameworks
can find software bugs and vulnerabilities. Despite the potential for long
execution times, it is still much faster than employing humans to comb through huge
codebases in search of bugs. Furthermore, operating tests on a live manifestation of
an application uncovers bugs that are lost in manual code review and other testing
tasks subject to human fallibility. The modern day fuzzing landscape brings with it
a variety of frameworks and tools with specialized targets, allowing for developers to
easily incorporate fuzzing into their software testing methodologies. Fuzzers are adept
at finding the “low-hanging fruit” of vulnerability classes, and are useful for general purpose
vulnerability assessment. Fuzzers also add a dimension to software testing by creating
inputs that humans would not be likely to conjure up themselves. This research
demonstrates the usefulness of software testing that combines heuristics which encode
“intuition”, with the freedom of Genetic Algorithm solution searching.
Fuzzing is not without its limitations, however. For starters, fuzzing techniques
are only capable of alerting when an exception or crash state is triggered—not whether or not a vulnerability is present. This makes fuzzers ineffective regarding the
discovery of complicated, multi-step vulnerabilities [44]. Another limiting aspect of
fuzzing is the fact that general purpose crash analysis, for the most part, remains
a manual human endeavor. Despite humans being able to intuit the exploitability
of a given software bug, there is a fundamental limit to the ability of computer
programs to determine whether or not a software error represents a vulnerability
that can be exploited. With respect to Generation-based fuzzers, or any fuzzers dependent on a data model, complexity grows rapidly as a function of input specification. In other words, the more complex the data model becomes, the more design and computing resources a fuzzer will consume. Oftentimes, ensuring an efficient
fuzz testing campaign can be impossible because the search space can be too large to
enumerate. Thus, search space optimizers such as Genetic Algorithms have recently become en vogue.
Fuzz testing for software defects and vulnerabilities has been proven to work across
a variety of targets and protocol specifications since the late 1980s [44]. Although it
has limitations—not the least of which, the monitoring of applications for undefined
behavior and the analysis of negative test cases post-campaign—fuzzing has secured
itself as a mainstay technique for security researchers and software testers. Rudimentary fuzzing techniques, such as random Mutation-based and Generation-based
fuzzing are surprisingly effective at uncovering critical vulnerability classes. Modern
research focuses on informing fuzzing methods with heuristics to intelligently guide
test case generation, in order to combat dumb fuzzing’s inability to learn from the
SUT’s response to previous test cases.
Chapter 3: Evolutionary Algorithms
3.1 History
Evolutionary Algorithms describe the set of algorithms—typically, focused on func-
tion optimization and search space reduction—whose fundamental components are
inspired by phenomena observable in evolutionary processes in the natural world.
The ideas and inspiration that ushered in the emergence of evolutionary computing go back to the mid-20th century. Alan Turing, in “Computing Machinery and Intelligence”, the same text in which he famously proposes his “imitation game”, speculates a scenario in which a machine would be modeled after “the mind of a child”, with
the ability to receive sensory input, learn from stimuli, and use that prior information
to make inferences and conclusions regarding new encounters [58]. His description
of the learning capabilities of machines is steeped in the language of evolution, and
the fact that his analogy uses this language helps explain the rise of evolutionary
algorithms, and frames the way computer scientists conceived problem spaces in the
mid-20th century. Evolutionary Algorithms describe the subset of optimization algo-
rithms that attempt to solve search problems by modeling them after processes found
in natural evolution. Inspired by the works of Charles Darwin, computer scientists
began to research methods by which to mimic the evolutionary processes for solv-
ing mathematical problems. The evolutionary processes which promote good qualities in species and suppress undesirable traits can be emulated in computing, and can be used to efficiently optimize algorithms—especially those concerning complex search spaces.
The first recorded examples of modeling evolutionary principles to solve computational problems are found in work by Friedberg et al. in the late 1950s, which was
concerned with “finding a program that calculates a given input-output function” [24].
Bremermann’s work in 1962 showed early use of “simulated evolution” for the task
of numerical optimization functions [24]. In the mid-1960s, Lawrence Fogel and John
Holland published groundbreaking research in evolutionary programming
and genetic algorithms (respectively), setting the stage for the formalization of this
subfield of machine learning [24]. Since then, Evolutionary Algorithms have been ap-
plied to a wide range of optimization problems in numerical methods, engineering,
and computer security.
3.2 Genetic Algorithms
The term Genetic Algorithm describes the subset of Evolutionary Algorithms that
mimic evolutionary conventions to solve optimization problems. John H. Holland,
widely considered the “Father of Genetic Algorithms”, was inspired by the works of
Darwin, and the ability of natural evolution processes to find solutions to biolog-
ical problems. Genetic Algorithms attempt to solve problems by first establishing
a group of solutions (population), which in essence, represent the “gene pool” of a
given solution space. For each individual candidate solution (chromosome), a fitness function evaluates how well it solves the problem at hand (or whether or not a correct solution has been found) and assigns a fitness value. The fitness scores
(numerical representations, calculated by the fitness function, of how well a given chromosome solves the problem in question) determine selection, the process by
which a chromosome is chosen for the creation of the next generation’s chromosomes.
A new population is then created by the crossover operation, where the members
of the current population are selected and recombined with other chromosomes. In
typical scenarios, chromosomes with high fitness scores are more likely to be selected
for crossover (“survival of the fittest”), to ensure the genotype of a high-scoring candidate will propagate to the next generation [45]. Pseudocode for the algorithm is
shown in Algorithm 1.
Algorithm 1 Genetic Algorithm
1: procedure GeneticAlgorithm(popsize, numgens)   ▷ GA run with popsize chromosomes and numgens generations
2:     initialize_population()
3:     calculate_fitness()
4:     while n ≠ numgens do
5:         selected_parents ← select(population)
6:         CPOP ← crossover(selected_parents)
7:         mutate_operator(CPOP)
8:         population ← CPOP
9:         calculate_fitness()
10:        n ← n + 1
11:    end while
12: end procedure
The success of Genetic Algorithms is explained by Holland’s Schema
Theorem, sometimes referred to as the “Fundamental Theorem of Genetic Algo-
rithms”. Mitchell remarks that its popular interpretation states that “short, low-
order schemas”, which are groups of characteristics found in chromosomes, “whose
average fitness remains above the mean will receive exponentially increasing numbers
of samples...over time” [45]. Schemas describe a set of strings with common values at
certain positions, and represent the presence (or absence) of sub-components within a
set of chromosomes. This theorem describes the power of the crossover and mutation
operators of Genetic Algorithms to propagate good information, and undergo enough
deviation via mutation (referred to as population diversity) to guide search space in
an intelligent manner [45]. Although the Genetic Algorithm is designed to calculate
fitness for entire chromosomes, the implication is that the building blocks of those
chromosomes (schemas) are being evaluated as well, in a phenomenon referred to by
Holland as “implicit parallelism” [35, 45]. Mitchell clarifies that the effect of selection
based on fitness leads to a gradual preference towards instances of schemas with above
average implicit fitness scores [45]. This is the basic explanation for why Genetic Algorithms excel at optimizing certain search space problems: by managing a pool of
solutions with enough schemata (i.e., building blocks) represented, the propagation
of new chromosomes via selection and crossover (based on fitness scores) will pass on
schemas with high fitness scores. When these high-performing schemas are combined with other high-performing schemas, the likelihood of happening upon an optimal
solution is increased. Finally, mutation ensures that a gene pool properly satisfies
diversity requirements necessary to explore possible solutions. In this way, Genetic
Algorithms are useful for performing intelligent test case generation for fuzzers—by
implicitly recombining groups of substrings (schema), it is possible to generate unique
solutions, even in a multimodal search space.
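For concreteness, a schema over a binary alphabet can be checked with a short predicate. This is an illustrative sketch only, using the conventional '*' wildcard notation for schemas:

```python
def matches(schema, chrom):
    # A chromosome instantiates a schema if it agrees at every
    # fixed (non-'*') position.
    return len(schema) == len(chrom) and all(
        s == "*" or s == c for s, c in zip(schema, chrom))
```

For example, "1**0" is a short, low-order schema: only two positions are fixed, so many chromosomes instantiate it.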
3.2.1 Genetic Algorithm Components
The basic operators of Genetic Algorithms seek to mimic biological phenomena ob-
served in natural life processes. The selection operator determines the manner in
which chromosomes are selected for reproduction operations that create the following
generation of candidate solutions [45]. Selection processes are typically informed by
fitness scores of individual chromosomes—chromosomes that have high fitness scores
are more likely to reproduce, which follows the “survival of the fittest” motif in biolog-
ical evolution [45]. For some nonstandard Genetic Algorithms such as CHC, selection
is performed purely at random [29]. However, most algorithms let fitness in-
fluence the manner in which chromosomes are selected for reproduction—a common
method (and the one used in the proof of concept for this thesis) is called elitism,
which means that the strongest chromosomes are always selected for reproduction [45].
This method ensures that the schemata found in high-fitness chromosomes live on to
future generations.
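A minimal sketch of elitist selection, assuming a caller-supplied fitness function:

```python
def elitist_select(population, fitness, k):
    # Always carry the k fittest chromosomes into the parent pool.
    return sorted(population, key=fitness, reverse=True)[:k]
```

Because the top-k chromosomes are deterministically retained, the schemata they carry are guaranteed to survive into the next generation.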
Once selection chooses a pair of parents, the crossover operator is the method by
33
Figure 3.1: Three traditional crossover methods for creating new chromosomes [3]
which new children are created for the next population [45]. The efficiency of crossover
methods is largely contingent upon the task at hand—single-point crossover, for ex-
ample, merely selects a spot between two chromosomes, and builds two children from
the combination of one parent’s first half and the other’s second [45]. Other methods
include multipoint crossover and uniform crossover, which simply involves taking half of the differing bits between two parents. These methods are shown in pictographic form in Figure 3.1. Holland’s Schema Theorem supposes the power of Genetic Algorithms
comes from crossover’s way of propagating building blocks of above-average fitness
schemata to future generations [35].
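The three crossover methods of Figure 3.1 can be sketched as follows (illustrative Python over lists of genes):

```python
import random

def single_point(p1, p2):
    # Split both parents at one point and exchange tails.
    i = random.randrange(1, len(p1))
    return p1[:i] + p2[i:], p2[:i] + p1[i:]

def two_point(p1, p2):
    # Exchange the middle segment between two cut points.
    i, j = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]

def uniform(p1, p2):
    # Each position independently comes from either parent.
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2
```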
Finally, mutation is the process by which a given value in a chromosome is ran-
domly changed, analogous to chromosome mutations found in biological processes [45].
Mutation can involve flipping the values at certain bit positions of a chromosome, or
values for other types of chromosome encodings. Mitchell describes mutation as an
“insurance policy” against particular chromosome values being fixed and never being
evaluated as a candidate for change [45]. Holland posits that mutation is required to
maintain diversity across positions in a given chromosome representation.
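A bit-flip mutation operator can be sketched as follows (illustrative; a binary encoding is assumed):

```python
import random

def mutate(chrom, rate=0.01):
    # Each gene flips independently with a small probability.
    return [1 - g if random.random() < rate else g for g in chrom]
```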
3.3 CHC
CHC is a nonstandard version of the pure Genetic Algorithm developed by Eshel-
man [29]. The method was developed to counteract the main disadvantage to which
Genetic Algorithms are disposed: in multimodal search spaces, GAs will often fixate on a local optimum and cease searching. The steps of the CHC algorithm
are slightly different from those of regular Genetic Algorithms: crossover only occurs when the difference between two selected parents is high enough [29]. The CHC algorithm re-
quires the crossover operation technique to be Half-Uniform Crossover, which means
that half of the differing bits of two parents will be swapped during crossover. New
generations are created from the highest n chromosomes between the parent popula-
tion and the children. Over time, the chromosomes will all begin to have the same
encoding, and no more children will be created. When that threshold is hit enough
times, a cataclysmic mutation operator is invoked [29]. This form of mutation takes
the chromosome with the highest fitness and, using it as a template, creates new
chromosomes by mutating 35 percent of the selected chromosome’s encoded bits [29].
Pseudocode for the algorithm is shown in Algorithm 2. Because CHC has a built-in
convention by which to escape plateaus in local minima or maxima, it tends to ex-
haust more search space than traditional Genetic Algorithms. This makes it a good candidate for finding solutions in multimodal search spaces. A distinct disadvantage of
CHC, however, lies in its tendency to spend less time in a given search area than
traditional Genetic Algorithms, leading to potentially missed solutions.
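The Half-Uniform Crossover (HUX) step at the heart of CHC can be sketched as follows (illustrative Python; a binary encoding is assumed):

```python
import random

def hamming(p1, p2):
    # Number of positions at which the two chromosomes differ.
    return sum(a != b for a, b in zip(p1, p2))

def hux(p1, p2):
    # Swap exactly half of the positions where the parents differ.
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    swap = set(random.sample(diff, len(diff) // 2))
    c1, c2 = list(p1), list(p2)
    for i in swap:
        c1[i], c2[i] = p2[i], p1[i]
    return c1, c2
```

Each child is thus maximally distant from its nearer parent, which preserves diversity until the population converges.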
3.4 Problem Domain
Evolutionary algorithms have been applied as an optimization strategy for a wide
variety of computing tasks. Holland’s seminal work demonstrates the use of Ge-
Algorithm 2 CHC Algorithm
1: procedure CHC(population_size, numgens)
2:     initialize_population()
3:     threshold ← L/4            ▷ L is chromosome length
4:     while n ≠ numgens do
5:         for i in population_size/2 do
6:             select parents p1, p2 without replacement
7:             if Hamming(p1, p2) > threshold then
8:                 CPOP ← HUX(p1, p2)   ▷ Half-Uniform Crossover of p1, p2
9:             end if
10:        end for
11:        if sizeof(CPOP) == 0 then
12:            threshold ← threshold − 1
13:        else
14:            calculate_fitness(CPOP)
15:            population ← best N individuals from (population + CPOP)
16:        end if
17:        if threshold < 0 then
18:            population ← cataclysmic_mutation(population)
19:            threshold ← L/4
20:        end if
21:        n ← n + 1
22:    end while
23: end procedure
netic Algorithms to solve the famous Prisoner’s Dilemma [35, 45]. In the Prisoner’s
Dilemma, two individuals are detained for colluding in criminal activity, and are
held in two separate cells with no means of communication [45]. The authorities offer
each prisoner the following deal: if a confession is given, and cooperation to testify
against one’s partner is consented, then the punishment doled out for the crime is
lessened [45]. However, if both parties admit to their crime and testify, the leniency previously offered is nullified. If neither testifies against the other, they will each
receive a moderately intense jail sentence [45]. Axelrod sought to determine whether
or not Genetic Algorithms can help decide the best strategy for each individual pris-
oner (which many tournaments showed was simply “TIT FOR TAT”, or a repetition
of the choice made by the other prisoner) [45]. Given the proper conditions, Axelrod
showed that Genetic Algorithms were able to find solutions which scored higher than
“TIT FOR TAT” [45]. This demonstrates the somewhat inexplicable ability of Genetic Algorithms to propagate building blocks of good solutions to create new ones
which humans may not consider.
An example use of Genetic Algorithms to solve engineering problems is found in
Hornby et al.’s implementation of the algorithm to automatically perform antenna
design [37]. Previously, antenna design was done manually and consumed a great deal of
human design resources—the design of antennae requires an expert because of the vast
amount of knowledge necessary to produce quality designs [37]. In response, Hornby
et al. implemented an Evolutionary Algorithm which found novel antenna designs
that outperformed human-generated solutions, according to the voltage standing wave
ratio and gain values of frequencies [37]. Evolutionary Algorithms have been applied
to problems as varied as financial portfolio optimization, game-theoretic problems as
described in the Prisoner’s Dilemma, and even the development of walking methods
for computer figures [21, 31].
3.5 Advantages and Limitations
Genetic Algorithms are useful for optimizations for a wide variety of problems and
domains. One of the main advantages of these algorithms is described in the Schema
Theorem previously discussed [35,45]. Genetic Algorithms excel at search space opti-
mization with nondeterministic solutions. Genetic Algorithms are also easy to concep-
tualize and implement, so once design decisions are established, Genetic Algorithms
are simple to incorporate into a variety of optimization schemes. The vast set of
parameters involved in tuning Genetic Algorithms is both a blessing and a curse. De
Jong remarks that, oftentimes, poorly tuned parameters do not create suboptimal
results [24]. This can lead to a great deal of frustration, however, when underperfor-
mance is observed, as parameter tuning does not necessarily map deterministically to
improved results.
The main disadvantage of Evolutionary Algorithms—and, really, any class of op-
timization algorithms—is based on the “No Free Lunch” (NFL) theorem [60]. Simply
put, the NFL theorem states that there “cannot exist any algorithm for solving all
(e.g. optimization) problems that is generally (on average) superior” to any other op-
timization algorithm [24,60]. Another disadvantage concerns the fact that stochastic
processes, which are at the center of many Evolutionary Algorithms, rely on random
number generation, and the bias associated with potentially incorrect pseudorandom
number generation can lead to problems. Furthermore, many search landscapes are
multimodal, meaning more than one optimal solution exists. Evolutionary algorithms
often have trouble in multimodal search spaces [24]. However, heuristics-based measures can be taken to ensure reasonably good search performance. The complexity of
chromosome representation and Genetic Algorithm operators can become unwieldy
and ineffectual without proper constraints [45]. This research demonstrates the effects of that consideration, as CHC underperforms because of its inefficient implementation for the constraints of variable-length chromosomes with complex representations. Finally, accurate and precise fitness functions can prove difficult to formulate given the nature of many real-world problems [45].
All told, Evolutionary Algorithms are a useful optimization strategy for certain
types of search space problems, and have direct, positive results when incorporated
with the data generation aspect of fuzz testing frameworks. The remainder of this
research will concern their application to fuzz testing of web applications, with specific
focus on evolving payloads to exploit SQL injections.
Chapter 4: Evolutionary Algorithm Web Fuzzing Framework
4.1 Approach
As previously discussed, the use of Evolutionary Algorithms to intelligently reduce
the search space for fuzzing campaigns has been proven effective across a wide range
of targets [25, 42, 56]. Genetic Algorithms are useful for guiding the manner in which input is crafted, a task that can be modeled well as a search problem, making it a good candidate for optimization. Sparks et al. modeled their chromosomes as
productions of a grammar which created a series of opcodes, used to uncover vulnera-
bilities in an FTP program [56]. In the web application sphere, Duchene et al. found
success revealing Cross-Site Scripting (XSS) vulnerabilities through a combination of
taint analysis and an evolutionary algorithm whose chromosome representations were
grammar productions of an “attack grammar” for Cross-Site Scripting (XSS) [26].
One of the limitations of this approach, however, is that their strategy required an
expert to manually write the “attack grammar” used to generate payloads for their
Evolutionary Algorithm [26]. This research explores techniques by which to auto-
matically derive grammars for an attack language by analyzing the lexical structures
of positive examples, and curating a set of productions which represent every string
found in the corpus. The goal is to amalgamate a group of grammar production
rules—which are grouped together based on a “fingerprinting” [30] algorithm for iden-
tifying SQL Injections examples—and use those to score fitness and/or to represent
chromosomes according to production rules.
4.1.1 Preprocessing
The purpose of the preprocessing phase of the EA fuzzing framework is to build a set
of attack grammars which encompass the lexical structure of the positive examples,
record the frequency of n-tuple groups of SQL tokens in the corpus, and to find the
frequency of transitions between n-tuple groups of tokens in the positive examples.
Analysis of positive examples of SQL injections allows the set of attack grammars
available to our Genetic Algorithm to be constructed. First, positive examples are
procured: the sample corpus for this set of experiments comes from Søen’s “Forced
Evolution” database, and from Click Security’s “Data Hacking” repository [46, 55].
The elements of the corpus were chosen in order to cover a wide range of different SQL injection attacks, including boolean- and UNION-based [10]. Boolean-based
SQL injections attempt to insert a boolean statement into a SQL query that will
always evaluate to true, thereby returning (exfiltrating) data from SQL queries that
should not be returned. UNION-based SQL injections, on the other hand, attempt to
match the output structure of a given SQL query to exfiltrate data from other tables,
server-specific values, or other sensitive information. Galbreath’s training set of SQL injections was used in early stages of the project, but was not used for the experiments outlined in this document [30]. Although the set of examples from Galbreath’s
libinjection library were high quality, they favored UNION-based attacks too heavily
for our purposes, and represented more fingerprints for which the framework could
perform fitness calculations than could be used in a reasonable amount of time.
Once the corpus has been curated, the preprocessing stage kicks off by lexing each
positive example into its SQL token representation. This research uses sqlparse, a non-
validating SQL lexer/parser [16]. Instead of writing a parser for our purposes, sqlparse
was chosen because it does not require a valid SQL string, and has robust tokenization
capabilities. The first quality mentioned is especially important because our positive
examples are merely fragments of SQL statements that represent malicious intent.
Tokens are arranged into one-, two-, or three-tuple groups, and assigned a production
rule in the attack grammar according to their position within the original positive
example (a figure explaining this process is shown below). The frequencies of the n-tuple groups are recorded and used for fitness metrics. In addition, the frequencies of transitions between n-tuple groups are recorded for use by the fitness function
as well as the Markov Model Monte Carlo algorithm. The information used by the
Evolutionary Algorithms tested in this research is grouped into one-, two-, and three-tuples of SQL tokens. A SQL token merely represents the symbolic value of a literal string
according to the SQL language specification. The reason three-tuples were chosen as
the maximum group of lexical tokens for a given terminal is based on research by
Mike Sconzo and Brian Wylie, whose work on data science for security was shown in
proceedings at Shmoocon in 2014 [46]. They demonstrate that 3-gram groupings of
SQL tokens carry enough information to determine whether or not a given string has
malicious intent [46]. Although they were approaching the problem of SQL injection
detection, the idea pertains to fuzzing as well: instead of relying on a human to write
an attack grammar based on known types of injections, the approach of this paper
requires tokenization, and then a grouping of these tokens based on their position.
The idea is that the Genetic Algorithm will be able to move these n-tuple groups
in different orders (via crossover) while still preserving attack grammar information.
A visual representation of an extract of the production tree is in Figure 4.2. After
the positive examples have been broken down into their semantic tokens, n-tuple
groupings, and n-tuple transition densities, the attack grammars that represent the
corpus are constructed. A visual manifestation of this process is shown in Figure 4.1
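The n-tuple counting and transition-recording steps can be sketched as follows. The token names below are a hypothetical pre-lexed stand-in for the symbolic token stream that sqlparse would produce in the real pipeline:

```python
from collections import Counter

# Hypothetical pre-lexed token stream (stand-in for sqlparse output).
tokens = ["Single", "Name", "Keyword", "Integer", "Comparison", "Integer"]

def ngram_counts(tokens, n):
    # Frequency of every sliding n-tuple group of tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def transitions(tokens, n):
    # Frequency of transitions between successive non-overlapping n-tuples,
    # as used by the fitness function and the Markov model.
    grams = [tuple(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, n)]
    return Counter(zip(grams, grams[1:]))
```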
[Figure 4.1: Flow graph of the preprocessing stage: Corpus of Positive Examples → Lex and Tokenize Positive Example → Build n-tuples Table → Record Markov Transitions → Apply Fingerprint → Construct/Update Grammars]
4.1.2 Attack Grammars
The final phase of preprocessing involves creating a set of grammars whose produc-
tions end in terminals represented as n-tuple groups of SQL tokens. This forms
the basis of the proposed method’s chromosome representation and fitness function
calculations. The attack grammars are separated based on the “fingerprint” value
of the positive example, as calculated by Galbreath’s libinjection software [30]. A
fingerprint is calculated by approximating the type of SQL injection attack based on
the tokens that are present in a given string [30]. This ensures that exploit strings
that have similar structural components are grouped together, and their grammar
productions are grouped accordingly.
More formally, the algorithm derives a set of grammars that represent our corpus
of positive examples:
Attack_G = {G_0, G_1, ..., G_{n-1}},    (4.1)
where each grammar G contains production rules that generate strings that have the
same fingerprint, which classifies them according to the semantic structure of one
or more positive examples. Formally, each grammar is a 4-tuple:
G_i = (V, Σ, R, S)    (4.2)
V is a finite set of non-terminals (variables). In this research’s proof of concept im-
plementation, the variables correspond to positional indices where groups of n-tuples
are represented. For a given index of a fingerprint, multiple n-tuples are potential
productions for the grammar. Σ is the set of terminals, which are the actual components that comprise a valid string of a given language described by the grammar. The terminals in this implementation are one-, two-, or three-tuples of SQL tokens. S is the
start variable, and R is the set of production rules from S that derive terminals [54].
The production rules are purposefully crude in order to limit the time spent deriving
[Figure 4.2: Example extract of the parse tree derived from positive examples of SQL injection tokens. The start symbol branches into fingerprints fp_0 through fp_{n-1}; each fingerprint branches into positional indices 0_fp0 through (m-1)_fp0, which derive token tuples such as (SINGLE, DDL, PUNCT), (INT, COMPARISON, INT), and (DML, ERROR).]
strings of a given grammar, and for use in exploring the efficacy of using sets of simple
grammars to approximate an attack language. Each grammar can be described as
follows:
G_fp = { G_fp(s) | s ∈ L_{G_fp} and G accepts s }    (4.3)
The set of grammars themselves do not seek to accurately encompass the SQL lan-
guage specification—instead, the goal is to approximate the structures of positive
examples well enough to codify the semantic components of an attack language—in
the language of genetics, the phenotypical information available.
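One plausible data structure for the per-fingerprint grammar store can be sketched as follows; the fingerprint string and token groups below are taken from Table 4.1 for illustration:

```python
from collections import defaultdict

# fingerprint -> positional index (non-terminal) -> set of token
# n-tuples (terminals) observed at that index in positive examples.
grammars = defaultdict(lambda: defaultdict(set))

def add_example(fingerprint, ngram_groups):
    for index, group in enumerate(ngram_groups):
        grammars[fingerprint][index].add(group)

# Groups follow the layout of Table 4.1 (fingerprint "sn&10").
add_example("sn&10", [("Error", "Name", "Keyword"),
                      ("Integer", "Comparison", "Integer"),
                      ("Single",)])
```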
Value: ’ x OR 1 = 1’
Fingerprint: sn&10
SQL Token Representation: Error Name Keyword Integer Comparison Integer Single
Grammar Productions: S → 0 1 2; 0 → Error Name Keyword; 1 → Integer Comparison Integer; 2 → Single

Table 4.1: An example preprocessing of a positive example
At the cost of more refined expression of a given “Attack Grammar”, such as those
found in Duchene et al., this technique aims to collect various permutations of lexical
symbols which represent “known bad” injection attempts (i.e., the semantic struc-
ture of examples of SQL injections) and place them in their positional context [27].
In use with the fuzzing framework, chromosomes are modeled as productions for a
given fingerprint, producing a group of one, two, or three sequential tokens. This
design decision ensures that the Genetic Algorithm searches the input space with
genotypic components (tuples of SQL tokens derived from positive examples) that
are well enough preserved for focused searching. The heart of this research involves
exploring whether or not a precise attack grammar is required, or if it is sufficient to encode shallow productions of grammars, grouped by a common lexical structure, and allow the algorithm to recombine productions of different fingerprints to produce
new exploit strings. These new exploit strings will sometimes not have a representative fingerprint, and other times will conform to ones available in libinjection. The results demonstrate that there is value in this approach, especially since it is completely
automated. In theory, provided a lexer for a given target language is available, and
a method for codifying similar examples into fingerprints, it is possible to use this
framework for any type of fuzz testing campaign. Future research will explore us-
ing this framework to find Cross-Site Scripting (XSS) vulnerabilities and memory
corruption vulnerabilities in local binaries.
4.1.3 Fitness Evaluation
For this technique, the fitness function is a combination of three characteristics of a
given candidate solution. First, if a given chromosome successfully achieves a SQL
injection, it is heavily promoted within the population. Relatedly, fitness scores of chromosomes which result in an invalid SQL statement are suppressed. This raises the question of the proof of concept’s tenability against real-world systems.
While most web fuzzing campaigns are purely black box, it is not unreasonable to an-
alyze input forms and determine the type of SQL statement executed. Furthermore,
the current implementation scores this condition very weakly, to the point where it
could be removed without lasting effect. Second, the chromosome in question is scored based
on how well it conforms to the attack grammars built by positive examples. The
tokens of the chromosome in question are compared against the terminals of each
grammar at the corresponding positional indices. Instead of only denoting if a chro-
mosome is accepted or rejected by a grammar, if a given chromosome’s token groups
match a successive sequence of a grammar’s token groups, the fitness score is com-
pounded exponentially. In short, a chromosome that matches a contiguous grouping
of tokens fits a high portion of a grammar’s potential terminals, and is exponentially
promoted within the population (i.e., a given chromosome does a good job of ap-
proximately representing an attack language). This idea can be summarized in the following formula:
∑_{i=0}^{n-1} ∑_{j=0}^{m-1} (x_j == G_{i,j}) * k^2    (4.4)
where n represents the total number of fingerprints, m represents the positional token
groups in the symbol representation of the chromosome, and k represents the number
of sequential matches found. k is reset to 0 in the event that a mismatch is found. This
formula, compounded with the other two metrics, comprises the fitness calculation
used for the experiments of this research.
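One plausible reading of Equation 4.4 (adding k^2 for every position that extends a contiguous run of matches, resetting k on a mismatch) can be sketched as:

```python
def grammar_fitness(chromosome, grammars):
    # grammars: one sequence of positional token groups per fingerprint.
    score = 0
    for grammar in grammars:
        k = 0  # length of the current run of sequential matches
        for x, g in zip(chromosome, grammar):
            if x == g:
                k += 1
                score += k ** 2  # contiguous matches compound quadratically
            else:
                k = 0  # reset on mismatch
    return score
```

The quadratic term in k is what rewards long contiguous matches far more than the same number of scattered matches.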
4.1.4 Niche-penalty Heuristics-based Genetic Algorithm
The preprocessing previously discussed is used as heuristic information for intelligently guiding input generation for a web application fuzzer. To that end, this thesis proposes the use of a niche-penalty, heuristics-based Genetic Algorithm. The chromosomes of this approach are represented as sequences of 2-tuples, each of which corresponds to an attack grammar (fingerprint) and a positional index
by which to select an n-tuple of SQL tokens. The selection method chosen for this
approach is pure elitism, which probabilistically selects parents based on probability
densities informed by fitness scores. Three crossover methods are evaluated (single-
point, two-point, and uniform), and compared against each other. The results show
uniform crossover to be the most effective, which follows considering that the in-
dividual elements of a chromosome codify a great deal of information (up to three
tokens). A mutation rate of 0.1 was chosen in order to provide the algorithm with
enough chaos to avoid reaching a plateau too quickly, while still remaining within
conventional limits. Therefore, the results of this algorithm using uniform crossover
and a 0.1 mutation probability are the represented candidate in comparative analysis.
The tendency for Genetic Algorithms to plateau on local optima is especially
concerning for the search space in question—valid injection strings create a multi-
modal search landscape, so it is vital to encode the Genetic Algorithm with the tools
necessary to reinitialize its search direction. Based on trial and error, this proof of
concept measures the number of times in which the mean fitness for the population
demonstrates a downward trend, and after an accumulation of enough “strikes”, the
population is reinstantiated via the grammars created during preprocessing. This behavior is visible in Figure 5.2: the mean fitness per population takes a sudden dive at
distinct intervals. These events occur when the algorithm determines that a niche has
caused the Genetic Algorithm to plateau on a given value (or set of similar values).
Figure 4.3 shows a top-level flow-chart graph of this method.
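The strike-counting restart heuristic can be sketched as follows (the strike threshold of 3 is an assumed parameter for illustration, not a value taken from the experiments):

```python
class PlateauDetector:
    """Counts consecutive generations with non-increasing mean fitness."""

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = 0
        self.last_mean = float("-inf")

    def update(self, mean_fitness):
        # Returns True when the population should be re-seeded from
        # the attack grammars (niche-penalty restart).
        if mean_fitness <= self.last_mean:
            self.strikes += 1
        else:
            self.strikes = 0
        self.last_mean = mean_fitness
        if self.strikes >= self.max_strikes:
            self.strikes = 0
            self.last_mean = float("-inf")
            return True
        return False
```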
4.1.5 CHC
In theory, CHC is a perfect candidate for use in guiding test case generation for
fuzzing frameworks because it has the same structured operators as standard Genetic Algorithms, but also includes a built-in mechanism by which to escape search
space plateaus. Seagle used CHC with tremendous success guiding test case gen-
eration for file format fuzzing [53]. The common implementation of CHC assumes
each chromosome uses binary encoding with fixed-length chromosomes [29, 53]. This
creates a natural situation in which a population will plateau on a given value, and
stop producing new children (referred to as “incest penalty” in the literature) [29].
The proof of concept for this research uses variable-length, multi-value encodings for
chromosomes, presenting a problem when implementing CHC. The limitation is that there is an encoding disconnect, in that two productions of different fingerprints
produce the same n-tuple of SQL tokens. This occurs because the productions of a
grammar generate one, two, or three-tuples of SQL tokens, and there can be matches
of these tokens at certain positions across various grammars. Previous experiments
with CHC using conventional methods on chromosomes with grammar production
representation showed that cataclysmic mutation rarely occurred, despite a plateau
being reached.
The workaround for this involves the following steps. First, in order to combat the
penalty of Hamming distance metrics on variable-length chromosomes, the Hamming
distance is calculated with the tokens of the minimum length (e.g., if parent 1 has
10 tokens, and parent 2 has 15, the first 10 tokens are compared in the Hamming
distance). This is subtracted from the Hamming distance calculated between the pro-
duction encoding of the two chromosomes. The production encodings are considered
the same if they represent the same positional index, and if the two fingerprints share
similar encoded components. This is measured by calculating the set intersection
between the individual characters of the fingerprint. Lastly, the threshold is given a
fixed value, and a “countdown” variable records the number of generations where no
children were generated. Once the countdown is less than zero, cataclysmic mutation
is executed. Although the modifications to CHC deviate from the algorithm’s con-
ventional conditions, the implementation supported by this research was able to find
valid exploit strings against the vulnerable service.
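The truncated Hamming distance and fingerprint set-intersection measures described above can be sketched as follows (illustrative; the subtraction of the production-encoding distance is omitted):

```python
def truncated_hamming(tokens1, tokens2):
    # Compare only the first min(len) positions of the two chromosomes.
    n = min(len(tokens1), len(tokens2))
    return sum(a != b for a, b in zip(tokens1[:n], tokens2[:n]))

def fingerprint_similarity(fp1, fp2):
    # Number of characters the two fingerprint strings share.
    return len(set(fp1) & set(fp2))
```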
Figure 4.3: A flowchart of the heuristics-based Evolutionary Algorithm fuzzing framework proof of concept
4.2 Advantages and Limitations
The chief advantage of using Evolutionary Algorithms with grammar-based
heuristics for chromosome representation and fitness calculation is that it provides
the fuzzer with guidance for input generation in a situation that has a very limited
feedback loop. Instead of relying on the search space covered by a manually created
corpus, this algorithm allows for intelligent searching of new exploit strings based on
the fundamental components of known malicious test cases. Another advantage not
explored by this research is the ease by which this approach can be parallelized. A
distributed system would remove resource constraints of the current implementation,
allowing for much larger population sizes, and more refined fitness calculations.
The most significant limitation of the proof of concept implementation is the run-
ning time. Because of the expensive fitness calculation, and the network-bound limits
imposed by the input execution process between a program and a web service, the
algorithm takes a significant amount of time to complete execution. While the latter
limit can be minimized by replicating the target in a virtual machine and testing
locally, the former is unavoidable—in order to assess how well a given chromosome
“fits” the approximated attack language, it must be compared against the produc-
tions represented by the grammars. A significant portion of this research involves
determining the heuristics involved in recognizing a plateau state reached by CHC or
the Genetic Algorithm. The current manifestation of those heuristics could be con-
sidered too crude—an area of open research would involve refining those heuristics
based on each particular System Under Test (SUT).
Chapter 5: Experimental Results
5.1 Testing Environment
In order to assess the effectiveness of an EA-based web application fuzzer, whose
fitness metrics and chromosome representation center on positive examples of known
SQL injection attempts, an intentionally vulnerable web application was instantiated
for the purpose of testing. Trustwave’s SpiderLabs security research group created
a testbed of vulnerable web applications called the “Magical Code Injection Rain-
bow” [15]. For these experiments, the process of crawling an application seeking a
potentially vulnerable input vector is bypassed to focus on the e�cacy of the algo-
rithms for evolving payloads. That said, Duchene et al. proved the usefulness of
model-taint analysis for guiding fitness metrics, and it will be an actionable goal for
future work [26]. All the results were derived from tests run on a Macbook Air, in
a closed loop against MCIR’s “SQLol” vulnerable testbed, running as a service pro-
vided in OWASP’s “Broken Web Application” Virtual Machine [7]. This testbed was
chosen because of the simplicity of modeling di↵erent types of vulnerabilities related
to SQL injections, and for its highly configurable parameters [15]. The front page of
the vulnerable service in discussion is shown in Figure 5.1.
5.2 Benchmark Simulation
5.2.1 Random
The lower baseline used for comparing the efficiency of the niche-penalty, heuristic-based
Genetic Algorithm and CHC is a simulation in which each token of a chromosome is
chosen at random. Chromosomes in pure random searching are modeled solely
Figure 5.1: The front page of the testbed used for measuring the effectiveness of niche-penalty GA-based web fuzzing [15]
on the SQL token representation. SQL tokens produce values by randomly selecting
from the set of values observed for that token (based on preprocessing of the positive
examples). Each chromosome is subject to the same fitness metrics as the other
simulation types. This benchmark is necessary to ensure that the intelligent
search-space algorithms are not vastly underperforming, and that the grammatical
information used in the fitness calculation reasonably outperforms chaotic token
selection.
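As a concrete illustration, the random benchmark described above can be sketched as follows. The token alphabet and value sets here are hypothetical stand-ins for the structures produced during preprocessing, and the function name is illustrative.

```python
import random

# Illustrative sketch of the Random benchmark: tokens are drawn uniformly
# from the token alphabet, and each token is realized by picking one of the
# values observed for it during preprocessing. The token names and value
# sets below are hypothetical, not taken from the thesis corpus.
def random_chromosome(token_values, length, rng=random):
    tokens = rng.choices(list(token_values), k=length)  # uniform token choice
    return [(tok, rng.choice(token_values[tok])) for tok in tokens]

token_values = {"UNION": ["UNION", "UNION ALL"], "COMPARISON": ["=", "LIKE"]}
chrom = random_chromosome(token_values, length=4)
```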
5.2.2 Markov Model Monte Carlo
The final comparative method for web application SQL fuzzing is based on a Markov
Model Monte Carlo implementation of population building. Chromosomes of variable
length are instantiated according to a Markov-transition lookup table generated
during preprocessing: the n-tuple transitions of tokens in the positive examples are
weighted according to frequency. The steps of chromosome instantiation are as
follows. First, a given transition from
SQL TupleA → SQL TupleB    (5.1)
is selected in pure random fashion. The following transitions, from
SQL TupleB → SQL TupleC through SQL Tuple(n−2) → SQL Tuple(n−1),    (5.2)
are selected according to Monte Carlo probabilistic selection based on densities of the
transition frequencies. Chromosome instantiation terminates when a given tuple has
no more transitions, or the maximum chromosome length for the trial run is reached.
The value representations of chromosomes are chosen the same way in which the GA
and CHC implementations derive values: each SQL token has a set of values with
frequencies, and for each token a weighted random choice selects a value. This
approach to input generation is chiefly concerned with generating chromosomes
that represent popular transitions between n-tuples from the corpus. The idea
is that by conjoining a sequence of well-represented transitions, the probability of
finding a valid SQL injection is increased. The results demonstrate that this is a
well-reasoned observation, as it outperforms the proof-of-concept CHC implementation
by a wide margin.
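The instantiation steps above can be sketched as follows. The shape of the transition lookup table (a mapping from a token tuple to its successor frequencies) and the function name are assumptions made for illustration.

```python
import random

# Sketch of the Markov Model Monte Carlo instantiation described above.
# `table` maps a token tuple to {next_tuple: observed frequency}; this
# structure is assumed for illustration, not taken from the thesis code.
def mmmc_chromosome(table, max_len, rng=random):
    # Step 1: the first transition is selected in pure random (uniform) fashion.
    current = rng.choice(list(table))
    chromosome = [current]
    # Step 2: subsequent transitions are drawn with probability proportional
    # to observed frequency, until a tuple has no successors or max_len is hit.
    while len(chromosome) < max_len and table.get(current):
        successors, freqs = zip(*table[current].items())
        current = rng.choices(successors, weights=freqs, k=1)[0]
        chromosome.append(current)
    return chromosome
```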
5.3 Evaluation Metrics
The heuristics-based GA and CHC are compared to the Random simulation, Markov
Model Monte Carlo, and a corpus-based simulation that simply sends each positive
example to the testbed (a baseline measuring stick, of sorts). Each experiment
contains 20 runs of a given simulation and parameter setting, using the same 20 seed
values across each different experiment. The population for each simulation type
includes 20 chromosomes, and each trial run executes the corresponding simulation
type for 200 generations. The number of trials per simulation type was selected in
order to reduce the variability introduced by the proof of concept’s use of randomness
to generate literal values from SQL tokens. The heuristics-based GA, CHC, and
Markov Model Monte Carlo methods all operate with SQL tokens instead of raw
values, so variability regarding the manifestation of a given test case leads to some
inconsistencies. Because the notion of code coverage is untenable given the scenario
environment, each algorithm is evaluated according to two metrics:
1. the number of exploits found during the trial run
2. the average fitness per generation
Fitness is calculated for each chromosome the same way across each simulation type.
Different crossover types (single-point, two-point, and uniform) for the heuristics-based
GA were evaluated, and Half Uniform Crossover was used for the CHC implementation.
It is determined that an exploit string has been found when the contents of
the database table in question are dumped. This requires knowledge that would not
be available in a black-box setting. However, it could be simulated: if a certain
number of rows is returned, and the result is significantly different from the contents
of normal queries, one can ascertain that an injection string was found.
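A minimal sketch of that simulated black-box check follows; the function name and the threshold multiplier are assumptions, not part of the thesis implementation.

```python
# Hedged sketch of the black-box detection heuristic suggested above: flag a
# response as a likely table dump when it returns far more rows than a
# baseline of benign queries. The factor of 10 is an illustrative assumption.
def looks_like_dump(row_count, baseline_counts, factor=10):
    """True when the response row count greatly exceeds normal queries."""
    baseline = max(baseline_counts) if baseline_counts else 0
    return row_count > factor * max(baseline, 1)
```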
5.4 Results and Analysis
5.4.1 Fitness and Diversity
Figure 5.2 shows the average fitness of each simulation type across the 20 trial runs
of 200 generations. Each line corresponds to the mean fitness per generation for each
simulation type. The CHC function had the highest overall fitness score, followed by
the heuristics-based Genetic Algorithm, Markov Model Monte Carlo, and Random.
The flat line in the middle of the graph is the fitness score of the corpus of positive
examples, i.e., the score obtained by sending the positive examples themselves as
input to the fuzzing framework. The high mean fitness score of CHC can be attributed
to the cataclysmic mutation operator, which instantiates an entire population based on a copy
of the highest performing chromosome, whereas the heuristic-based GA reinstantiates
the population by building chromosomes based on attack grammar productions. The
fitness scores of the simulation types across the generations indicate that the EA-based
methods have a higher chance of finding valid exploit strings. The observations
indicate that this is only partially true: Markov Model Monte Carlo proved to be
a very consistent method for finding valid exploit strings, yet does not demonstrate
very high mean fitness scores. This suggests that the fitness calculations need further
refinement in order to accurately assess the true fitness of a chromosome. This
will be a primary subject of future work, as the efficacy of the EA-based algorithms
is directly influenced by the fitness function’s validity. The diversity of symbol
Figure 5.2: Mean fitness per generation
and value representations per population for both CHC and the heuristics-based GA
are shown in Figure 5.3, and individually for Random and Markov Model Monte Carlo
Figure 5.3: Median diversity of value and symbol representations per generation for GA and CHC
(a) Median diversity of values and symbols per population for the Random simulation
(b) Median diversity of values and symbols per population for Markov Model Monte Carlo
Figure 5.4: Median diversity of value and symbol representations per generation for Random and Markov Model Monte Carlo
(a) Total number of unique exploits found in 3 experimental trials (20 runs per trial)
(b) Average number of exploits per trial simulation
Figure 5.5: Total unique exploits per simulation and average number of exploits per trial
process in Figure 5.4. Diversity was measured for each generation using Python’s
difflib module, on both the value representation (the actual payload generated) and
the symbol representation (the SQL tokens that represent a payload). The smoothing
trend shown in Figure 5.3 demonstrates the phenomenon of the algorithm focusing in
on a particular solution for generating exploit strings. The pattern of fluctuation is
caused by the restart heuristics for the GA and CHC methods: in order to escape
plateaus, a necessary condition for searching multimodal spaces, conventions for
reinstantiating populations are required. Figure 5.4 demonstrates the fluctuation in
diversity when search-space guidance is not in place.
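A diversity measure in this spirit can be sketched with difflib. The aggregation chosen here (median of pairwise dissimilarities) and the function name are assumptions, since the exact reduction used in the experiments is not reproduced here.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import median

# Sketch of a difflib-based population diversity measure: compute the
# SequenceMatcher similarity ratio for every chromosome pair, then take
# the median dissimilarity. Applicable to either the value strings or the
# symbol (SQL token) strings of a population.
def population_diversity(population):
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(population, 2)]
    return median(1.0 - r for r in ratios)
```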
5.4.2 Exploits Found
The Genetic Algorithm with niche-penalty heuristics is shown to produce a high
number of unique exploit strings for a given experiment of 20 runs. Across
each simulation, no trial generated an exploit string identical to one found in
the original corpus of positive examples. Figure 5.5 shows the total number of exploits
found per simulation type across the trials, as well as the average number of exploits
found per trial for each simulation. Although the GA method produced the most
Figure 5.6: Average number of exploits found in each generation
valid exploit strings for our testbed, the Markov Model Monte Carlo algorithm for
producing payloads was more consistent—every single trial run of the Markov Model
Monte Carlo simulation found a valid exploit string, whereas the other methods each
had at least one trial finish unsuccessfully. Figure 5.6 shows the average number of
exploit strings found per generation, which is an average of all the trials for a given
simulation. It is clear that both CHC and the heuristic-based GA, when happening upon a
valid injection string, do a good job of zeroing in on the structure of that chromosome,
producing a high number of exploit strings in successive generations. The dearth of
exploits found between pockets of high success in CHC demonstrates the lack of
consistency with which the algorithm triggers cataclysmic mutation. This may be
insurmountable given the conditions of variable-length chromosomes which encode
production rules. While CHC found more exploit payloads than the random method
trials, the CHC trials produced 20 total duplicate strings from previous generations,
likely because our modified Hamming distance metric and threshold countdown were
too sensitive, allowing high-performing strings to survive into the next generation.
Simulation Type   | Population Initializer | Chromosome Representation | Crossover Method | Total Unique Exploits
GA                | Grammar                | Production                | uniform          | 272
MMMC              | NA                     | Symbol                    | NA               | 157
CHC               | Grammar                | Production                | NA               | 70
Positive Examples | NA                     | NA                        | NA               | 39
Random            | Random                 | Symbol                    | NA               | 6

Table 5.1: Total exploits per simulation
Simulation Type | Population Initializer | Chromosome Representation | Crossover Method | Trials Without Exploit
GA              | Grammar                | Production                | uniform          | 2
MMMC            | NA                     | NA                        | NA               | 0
CHC             | Grammar                | Production                | NA               | 17
Random          | Random                 | Symbol                    | NA               | 16

Table 5.2: Number of trials per simulation type without an exploit
Simulation Type | Population Initializer | Chromosome Representation | Crossover Method | Max Exploits Found
GA              | Grammar                | Production                | uniform          | 184
MMMC            | NA                     | Symbol                    | NA               | 13
CHC             | Grammar                | Production                | NA               | 27
Random          | Random                 | Symbol                    | NA               | 2

Table 5.3: Highest number of exploit strings found in a single trial
For the GA and CHC algorithms, the average number of exploits per trial is
skewed: a small number of trial runs account for most of the exploit strings found
in a given experiment, and many of the runs did not find any at all. This has less
impact upon the results for the heuristics-based niche-penalty GA, as most of the
seeded trial runs found at least one unique exploit string. Markov Model Monte
Carlo was consistent, as every run found at least one valid injection string. Table 5.1
shows the total number of unique exploits found across all trials for a given simulation
type. As previously mentioned, the experimental simulations did not generate any
valid exploit strings identical to one found in the original corpus of positive examples.
The number of trials with zero exploits found per simulation type is summarized
in Table 5.2. A reasonable objection to the efficacy of the niche-penalty GA method
can be raised: such a high-performing outlier skews the assessment of the algorithm’s
overall effectiveness, and when that trial is removed, the simulation performs no
better than Markov Model Monte Carlo. While this is a reasonable claim, the fact
that the proposed GA method zeroes in so well upon a set of semantically similar yet
unique exploit strings demonstrates the potential of this method for efficient
application-level fuzzing, and points to the necessity of developing this method further.
Chapter 6: Conclusion and Future Work
The research presented in this thesis demonstrates the effectiveness of fuzzing
frameworks guided by Evolutionary Algorithms, and the improvements to be gained
by using fitness metrics and chromosome representations modeled after the structure
of positive injection examples. As opposed to manually writing “attack grammars” for
a given input language class, generating a set of shallow grammatical representations
of known nefarious injection strings is shown to improve the Evolutionary Algorithm’s
search, despite the multimodal search space of valid exploit strings.
Although the heuristics-based, niche-penalty Genetic Algorithm found the most
valid SQL injections, it had some inconsistencies: for some of the trials, no exploit
strings were found despite a vulnerability being present. This indicates that the
fitness function metrics should be revisited, or the framework itself should rely upon
a more informative feedback loop. It is possible that the reinitialization heuristics
were too simplistic and/or too sensitive, creating a situation in which the GA does
not have enough time to explore a subspace that contains a valid injection. This point
is further evidenced by the fact that, although CHC had the highest mean fitness
scores per generation, it had the lower efficacy of the two evaluated evolutionary algorithms.
Another key limitation of the current proof-of-concept code is run time: the
average execution time is long enough to raise questions of production-level quality.
A distributed system for calculating chromosome fitness and handling the process of
sending input and monitoring results would yield a significant performance speedup.
Theoretically, this framework can be extended to any sort of search space with
positive examples that can be tokenized into lexical groups. Therefore, an open area
of research extending outward from this work would involve ensuring extensibility to
web-based attacks such as Cross-Site Scripting (XSS). Other target scenarios, such
as file format parsers and language interpreters, are also fertile areas for research and
testing.
The most important next step is to fit this framework for testing web applications
for Cross-Site Scripting (XSS) vulnerabilities: they are much more common today,
and their feedback loop allows for higher-quality fuzzing heuristics and fitness function
metrics. Further research will also make use of modern in-memory fuzzing methods
and GA heuristics based on execution paths, in conjunction with our grammar-based
methods.
In summary, the use of grammar-related heuristics in Evolutionary Algorithms
to intelligently guide payload generation for application-level fuzzers is shown to
produce unique exploit strings based on the lexical structure of positive examples.
These results confirm that using semantic-level information for encoding chromosomes
has the desired effect of propagating injection information while still searching
multimodal spaces with refined information at hand. The further refinement of
fitness heuristics will be the next step in the maturation of this process, as well as
further testing with different target languages and systems.
Bibliography
[1] Beautiful soup html/xml parsing library for python. https://www.crummy.com/
software/BeautifulSoup/. Accessed:2014/12/09.
[2] Burp suite web intercept proxy. https://portswigger.net/burp/. Accessed:
2016/04/21.
[3] Creationwiki genetic algorithm. http://creationwiki.org/Genetic_
algorithm. Accessed: 2016/03/15.
[4] Cross-site scripting (xss). https://www.owasp.org/index.php/Cross-site_
Scripting_(XSS). Accessed:2016/04/23.
[5] Damn vulnerable web application. http://www.dvwa.co.uk/. Ac-
cessed:2013/10/23.
[6] Dan guido fuzzing introduction fall 2010. https://fuzzinginfo.files.
wordpress.com/2012/05/fuzzingintro_fall2010.pdf. Accessed:2015/12/6.
[7] Homepage for owasp broken web application. https://www.owasp.org/index.
php/OWASP_Broken_Web_Applications_Project. Accessed:2015/09/10.
[8] The microsoft sdl. https://blogs.msdn.com/blogfiles/publicsector/
WindowsLiveWriter/ReferenceMicrosoftSecurityDevelopmentLif_7279/
image_2.png. Accessed: 2016/04/17.
[9] Owasp jbrofuzz web application fuzzer. https://www.owasp.org/index.php/
JBroFuzz. Accessed: 2016/04/21.
[10] Owasp sql injection explanation. https://www.owasp.org/index.php/SQL_
Injection. Accessed:2016/04/24.
[11] Peach fuzzer. http://www.peachfuzzer.com/. Accessed: 2016/02/08.
[12] Python network packet crafting library. http://www.secdev.org/projects/
scapy/. Accessed:2014/09/22.
[13] Selenium browser automation. http://www.seleniumhq.org/. Ac-
cessed:2015/01/10.
[14] Smashing the stack for fun and profit. http://insecure.org/stf/smashstack.
html. Accessed: 2016/02/04.
[15] Spiderlabs magical code injection rainbow testbed. https://github.com/
SpiderLabs/MCIR. Accessed: 2016/04/21.
[16] sqlparse non-validating sql parser module for python. https://github.com/
andialbrecht/sqlparse. Accessed:2016/07/18.
[17] w3af web application security scanner. http://w3af.org/. Accessed:
2016/04/21.
[18] zzuf mutation-based fuzzer. http://caca.zoy.org/wiki/zzuf. Accessed:
2016/04/24.
[19] Sofia Bekrar, Chaouki Bekrar, Roland Groz, and Laurent Mounier. Finding
software vulnerabilities by smart fuzzing. In Software Testing, Verification and
Validation (ICST), 2011 IEEE Fourth International Conference on, pages 427–
430. IEEE, 2011.
[20] Josip Bozic and Franz Wotawa. Model-based testing-from safety to security. In
Proceedings of the 9th Workshop on Systems Testing and Validation (STV’12),
pages 9–16, 2012.
[21] Chi-Cheong. Genetic algorithms in portfolio optimization. Computing in Eco-
nomics and Finance 2001 204, Society for Computational Economics, 2001.
[22] Crispin Cowan, Perry Wagle, Calton Pu, Steve Beattie, and Jonathan Walpole.
Buffer overflows: Attacks and defenses for the vulnerability of the decade.
In DARPA Information Survivability Conference and Exposition, 2000. DIS-
CEX’00. Proceedings, volume 2, pages 119–129. IEEE, 2000.
[23] ICT Data and Statistics Division. ICT Facts and Figures 2015. 2015.
[24] Kenneth De Jong, D Fogel, and Hans-Paul Schwefel. Handbook of evolutionary
computation. IOP Publishing Ltd and Oxford University Press, 1997.
[25] Jared DeMott, Richard Enbody, and William F Punch. Revolutionizing the
field of grey-box attack surface testing with evolutionary fuzzing. BlackHat and
Defcon, 2007.
[26] Fabien Duchene. How I evolved your fuzzer: Techniques for black-box evolutionary
fuzzing.
[27] Fabien Duchene, Sanjay Rawat, Jean-Luc Richier, and Roland Groz. Kameleon-
fuzz: evolutionary fuzzing for black-box xss detection. In Proceedings of the
4th ACM conference on Data and application security and privacy, pages 37–48.
ACM, 2014.
[28] Verizon Enterprise. Data breach investigations report. Technical report, Verizon
Communications, Inc., 2015.
[29] Larry J Eshelman. The CHC adaptive search algorithm: How to have safe search
when engaging in nontraditional genetic recombination. Foundations of Genetic
Algorithms 1991 (FOGA 1), 1:265, 2014.
[30] Nick Galbreath. libinjection software. https://github.com/client9/
libinjection. Accessed:2015/09/03.
[31] Thomas Geijtenbeek, Michiel van de Panne, and A Frank van der Stappen.
Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on
Graphics (TOG), 32(6):206, 2013.
[32] Liu Guang-Hong, Wu Gang, Zheng Tao, Shuai Jian-Mei, and Tang Zhuo-Chun.
Vulnerability analysis for x86 executables using genetic algorithm and fuzzing.
In Convergence and Hybrid Information Technology, 2008. ICCIT’08. Third In-
ternational Conference on, volume 2, pages 491–497. IEEE, 2008.
[33] Tao Guo, Puhan Zhang, Xin Wang, and Qiang Wei. Gramfuzz: fuzzing testing of
web browsers based on grammar analysis and structural mutation. In Informatics
and Applications (ICIA), 2013 Second International Conference on, pages 212–
215. IEEE, 2013.
[34] R Hansen and M Patterson. Stopping injection attacks with computational the-
ory. In Black Hat Briefings Conference, 2005.
[35] John H Holland. Adaptation in natural and artificial systems: an introductory
analysis with applications to biology, control, and artificial intelligence. U Michi-
gan Press, 1975.
[36] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments.
In Presented as part of the 21st USENIX Security Symposium (USENIX Security
12), pages 445–458, 2012.
[37] Gregory S Hornby, Al Globus, Derek S Linden, and Jason D Lohn. Automated
antenna design with evolutionary algorithms. In AIAA Space, pages 19–21, 2006.
[38] Michael Howard and Steve Lipner. The security development lifecycle. O’Reilly
Media, Incorporated, 2009.
[39] Yating Hsu, Guoqiang Shu, and David Lee. A model-based approach to security
flaw detection of network protocol implementations. In Network Protocols, 2008.
ICNP 2008. IEEE International Conference on, pages 114–123. IEEE, 2008.
[40] Vincenzo Iozzo. 0-knowledge fuzzing. Black Hat DC, 2010.
[41] Jeff Williams and Dave Wichers. OWASP Top 10 2013. http://www.owasp.org.
Accessed: 2015/11/4.
[42] Michal Zalewski (lcamtuf). american fuzzy lop (2.10b). http://lcamtuf.
coredump.cx/afl/. Accessed: 2016/04/18.
[43] Li Li, Qiu Dong, Dan Liu, and Leilei Zhu. The application of fuzzing in web soft-
ware security vulnerabilities test. In Information Technology and Applications
(ITA), 2013 International Conference on, pages 130–133. IEEE, 2013.
[44] M. Sutton, A. Greene, and P. Amini. Fuzzing: Brute Force Vulnerability Discovery.
Addison-Wesley, Boston, MA, 2007.
[45] Melanie Mitchell. An introduction to genetic algorithms. Cambridge, Massachusetts;
London, England, fifth printing, 3:62–75, 1999.
[46] Mike Sconzo and Brian Wylie. Data hacking sql injection exercise. https:
//github.com/ClickSecurity/data_hacking. Accessed: 2014/8/22; video:
https://www.youtube.com/watch?v=8lF5rBmKhWk.
[47] Barton P Miller, Louis Fredriksen, and Bryan So. An empirical study of the
reliability of unix utilities. Communications of the ACM, 33(12):32–44, 1990.
[48] Charlie Miller and Zachary N. J. Peterson. Mobile Systems IV. Technical report,
Independent Security Evaluators, 03 2007.
[49] Adam Muntner.
[50] John Neystadt. Automated penetration testing with white-box fuzzing. MSDN
Library, 2008.
[51] Sanjay Rawat and Laurent Mounier. An evolutionary computing approach for
hunting bu↵er overflow vulnerabilities: A case of aiming in dim light. In Com-
puter Network Defense (EC2ND), 2010 European Conference on, pages 37–45.
IEEE, 2010.
[52] Andy Renk. !exploitable crash analyzer - msec debugger extensions. https:
//msecdbg.codeplex.com/. Accessed: 2016/03/12.
[53] Roger Lee Seagle Jr. A framework for file format fuzzing with genetic algorithms.
2012.
[54] Michael Sipser. Introduction to the Theory of Computation, volume 2. Thomson
Course Technology Boston, 2006.
[55] Soen. Evolving exploits through genetic algorithms, 2013. DEFCON 21.
[56] Sherri Sparks, Shawn Embleton, Ryan Cunningham, and Cliff Zou. Automated
vulnerability analysis: Leveraging control flow for evolutionary input crafting. In
Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third
Annual, pages 477–486. IEEE, 2007.
[57] Omer Tripp, Omri Weisman, and Lotem Guy. Finding your way in the test-
ing jungle: a learning approach to web security testing. In Proceedings of the
2013 International Symposium on Software Testing and Analysis, pages 347–357.
ACM, 2013.
[58] Alan M Turing. Computing machinery and intelligence. Mind, 59(236):433–460,
1950.
[59] Yi-Hsun Wang, Ching-Hao Mao, and Hahn-Ming Lee. Structural learning of at-
tack vectors for generating mutated xss attacks. arXiv preprint arXiv:1009.3711,
2010.
[60] David H Wolpert and William G Macready. No free lunch theorems for opti-
mization. Evolutionary Computation, IEEE Transactions on, 1(1):67–82, 1997.
[61] Dingning Yang, Yuqing Zhang, and Qixu Liu. Blendfuzz: A model-based frame-
work for fuzz testing programs with grammatical inputs. In Trust, Security and
Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th In-
ternational Conference on, pages 1070–1076. IEEE, 2012.
SCOTT M. SEAL
[email protected] · github.com/sseal
EDUCATION
Wake Forest University, May 2016
Master of Science in Computer Science
Overall GPA: 3.166

Wake Forest University, May 2013
Bachelor of Arts in English & Computer Science
Overall GPA: 3.352

Technical Coursework: Network and Computer Security, Internet Protocols, Algorithms, Artificial Intelligence, Operating Systems, Linux Administration, Discrete Mathematics, Calculus, Linear Algebra
EXPERIENCE
Wake Forest University, September 2013 - May 2016
Research and Teaching Assistant, Winston-Salem, NC

· Supported research which implemented a “Moving Target” security configuration system for network hosts

· Conducted thesis research that explored the application of machine learning techniques, language theory, and evolutionary algorithms to optimize SQL-injection and XSS auditing approaches

· Organized and taught undergraduate lectures on introductory topics related to operating systems and computer security, involving attacker life-cycle, security vulnerability auditing, and secure software practices
Pacific Northwest National Laboratory, June 2014 - September 2014
Masters Intern, Richland, WA

· Developed auto-refresh functionality for a network traffic visualization application written in Java

· Provided operational assistance for a company-wide Capture the Flag competition, which involved instructing new participants on attack classes, exploitation techniques, and general secure development practices

· Developed Capture the Flag challenges, including a firewall rules testing application which utilized the Flask microframework, Scapy packet manipulation software, nginx reverse-proxy and gunicorn HTTP server
B/E Aerospace, June 2013 - August 2013
Operations Security Intern, Winston-Salem, NC

· Supported company-wide security operations and incident response handling

· Developed automated tools and workflow procedures that increased the efficiency of incident management and mitigation
Cisco Systems, Inc., June 2012 - August 2012
Software Engineering Intern: R&D, Knoxville, TN

· Developed an analytic web application using the Ruby on Rails framework

· Learned and developed software security analysis skills through independent study and participation in CTF challenges within penetration testing environments
· Studied and analyzed secure software development practices and related vulnerability classes/attack vectors
TECHNICAL SKILLS
Computer Languages and Technologies: Python, Ruby, Java, C/C++, R
Frameworks, Protocols & APIs: Rails and REST APIs, Flask, Scapy, Nginx, Apache HTTP Server, JSON, pandas, numpy, scikit-learn
Databases & Developer Tools: MySQL, PostgreSQL, SQLite, Git, SVN, Vim, NetBeans
Security Tools: Burp Suite, Metasploit Framework, Backtrack/Kali Linux, gdb, IDA Disassembler