a metric for evaluating static analysis tools katrina tsipenyuk, fortify software brian chess,...

A Metric for Evaluating Static Analysis Tools

Katrina Tsipenyuk, Fortify Software

Brian Chess, Fortify Software

2

Four Perspectives on the Problem General

How good are software security tools today? Tools vendor

Is my static analysis product getting better over time? (What is better?) How much has it improved since the last release? What should I focus on to improve my tool in the future? If I make my tool detect a new kind of security bug, will an auditor or a developer

thank me? Or both? Tools user: auditor

Is the tool finding all the important types of security bugs? Tools user: developer

Is the tool producing a lot of noise?

Auditors and developers have different criteria for security tools, so we need a way to answer posed questions on two scales – “Auditor” and “Developer”

3

Proposed Solution Define metrics that model tool characteristics and conjecture a formula for calculating

the score for each tool version Counts of true positives (t), false positives (p), and false negatives (n) 100 * t / (t + p + n) augmented by the weights and penalties (score out of 100)

Define weights & penalties for reported results Results with different reported severities should be weighed differently

High (h), Medium (m), and Low (l) False negatives penalties per bug category should differ depending on whether the tool

claims to detect this kind of bug or not Define weights & penalties for “Auditor” and “Developer” scale

Auditors tolerate false positives while developers tolerate false negatives – make false positive and false negative weights different to reflect this

Importance & value of a vulnerability category (vc) for auditors & developers should affect the weights of the results

Conduct an experiment and collect the necessary data to prove or disprove the conjecture

4

Experiment Analyzed three different projects: wuftpd (C),

webgoat (Java), and securibench (Java) Ran four versions of Fortify tool Did a full audit of reported results for all product /

version combinations (time consuming)

TP (t) FP (p) FN (n)

Important 2 0.5 2

Not important 0.5 2 0.5

FP (p) FN (n)

Auditor 0.5 2

Developer 2 0.5

High (h) Medium (m) Low (l)

4 2 1

Claims to detect Doesn’t detect

1 0.5

Table 1. Penalties with respect to category importance

Defined weights based on our experiences with auditors and developers Table 1 presents chosen weights & penalties for true positives (t), false positives (p), and false negatives (n) based on high-value (high vc) and low-value (low vc) categories Table 2 presents false negatives penalty per bug category based on whether the tool claims to detect the category or not Table 3 presents High (h), Medium (m), and Low (l) severity weights Table 4 presents false positives (p) and false negatives (n) penalties for “Auditor” and “Developer” scales

Table 4. “Auditor” vs. “Developer” scales penalties

Table 2. False negatives penalty based on whether the tool claims

to detect the category or not

Table 3. Severity weights

5

Experimental Results & Analysis Collected data seems to indicate that we are headed in the right direction Both scores for wuftpd get higher until version 3.1

The number of false positives decreases, but in version 3.1 it increases wuftpd “Developer” score is lower than “Auditor” score for all four versions

“Developer” false positives penalty is higher -- tool is tuned better for Java than for C After all, Fortify is a security company

webgoat “Developer” score drops between versions 3.1 and 3.5 With the addition of multiple auditor-oriented categories

Both scores are best for latest release examined (whew)Version 2.1 Version 3.0 Version 3.1 Version 3.5

Auditor score 7 8 38 94

Developer score 2 16 78 72





webgoat

securibench

wuftpd

(complete set of data for one experiment is available as a handout)

6

Conclusions & Future Work Proposed approach is useful for our purposes – measuring

improvements of Fortify static analyzer It is unclear whether the same approach would be useful for comparing two

different tools Determining an “answer key” to grade the results of the tool with is

still a hard problem On our to-do list:

Do more audits of various projects to collect more data to adjust the weights and penalties Include projects written for other languages the tool supports

Experiment with additional weights and penalties Introduce penalty for incorrectly reporting severity of results

Define a good visual representation of the collected data Make it intuitive to determine the area that needs improvement

a metric for evaluating static analysis tools katrina tsipenyuk, fortify software brian chess,...

Documents