

On the “Naturalness” of Buggy Code

Baishakhi Ray*§, Vincent Hellendoorn*†, Saheel Godhane†, Zhaopeng Tu♮, Alberto Bacchelli‡, Premkumar Devanbu†

§ University of Virginia, [email protected]
† University of California, Davis, {vjhellendoorn,srgodhane,ptdevanbu}@ucdavis.edu
♮ Huawei Technologies Co. Ltd., [email protected]
‡ Delft University of Technology, [email protected]

ABSTRACT

Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be “natural”, like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is “unnatural” in some sense, and thus possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 7,139), from 10 different Java projects, and focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic (i.e. unnatural), becoming less so as bugs are fixed. Ordering files for inspection by their average entropy yields cost-effectiveness scores comparable to popular defect prediction methods. At a finer granularity, focusing on highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs), and ordering warnings from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid, simple way to complement the effectiveness of PMD or FindBugs, and that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.

1. INTRODUCTION

Our work begins with the observation by Hindle et al. [22] that “natural” code in repositories is highly repetitive, and that this repetition can be usefully captured by language models originally developed in the field of statistical natural language processing (NLP). Following this work, language models have been used to good effect in code suggestion [22, 48, 53, 15], cross-language porting [38, 37, 39, 24], coding standards [2], idiom mining [3], and code de-obfuscation [47]. Since language models are useful in these tasks, they are capturing some property of how code is supposed to be.

* Baishakhi Ray and Vincent Hellendoorn are both first authors, and contributed equally to the work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICSE '16, May 14-22, 2016, Austin, TX, USA
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-3900-1/16/05 ... $15.00
DOI: http://dx.doi.org/10.1145/2884781.2884848

This raises an interesting question: What does it mean when a code fragment is considered improbable by these models?

Language models assign higher naturalness to code (tokens, syntactic forms, etc.) frequently encountered during training, and lower naturalness to code rarely or never seen. In fact, prior work [7] showed that syntactically incorrect code is flagged as improbable by language models. However, by restricting ourselves to code that occurs in repositories, we still encounter unnatural, yet syntactically correct code; why? We hypothesize that unnatural code is more likely to be wrong; thus, language models actually help zero in on potentially defective code.

This notion appears plausible; highly experienced programmers can often intuitively zero in on “funny-looking” code when trying to diagnose a failure. If statistical language models could capture this capability, then they could be a useful adjunct in a variety of settings: they could improve defect prediction; help provide an improved priority ordering for static analysis warnings; improve the performance of fault-localization algorithms; or even recommend “more natural” code to replace buggy code.

To investigate this phenomenon, we consider a large corpus of 7,139 bug fix commits from 10 different projects and focus on its language statistics, evaluating the naturalness of defective code and whether fixes increase naturalness. Language models can rate probabilities of linguistic events at any granularity, even at the level of characters. We focus on line-level defect analysis, giving far finer granularity of prediction than typical statistical defect prediction methods, which most often operate at the granularity of files or modules. In fact, this approach is more commensurate with static analysis or static bug-finding tools, which also indicate potential bugs at line level. For this reason, we also investigate our language model approach in contrast and in conjunction with two well-known static bug finders (namely, PMD [10] and FindBugs [14]).

Overall, our results corroborate our initial hypothesis that code with bugs tends to be more unnatural. In particular, the main findings of this paper are:

1. Buggy code is rated as significantly more “unnatural” (improbable) by language models.

2. This unnaturalness drops significantly when buggy code is replaced by fix code.

3. Furthermore, we find that the above effects are substantially stronger when:

   • the buggy code fragment is shorter (fewer lines), and
   • the bug is “short-lived”, viz. more quickly fixed.

4. Using cost-sensitive measures, inspecting “unnatural” code indicated by language models works quite well: performance is comparable to that of the static bug finders FindBugs and PMD.


5. Ordering warnings produced by the FindBugs and PMD tools, using the “unnaturalness” of associated code, significantly improves the performance of these tools.

Our experiments are mostly done with Java projects, but we have strong empirical evidence indicating that the first two findings above generalize to C as well; we hope to confirm the rest in future work.

2. BACKGROUND

Our main goal is evaluating the degree to which defective code appears “unnatural” to language models and the extent to which this can enable programmers to zero in on bugs during inspections. Furthermore, if language models can help pinpoint buggy lines, we want to identify how their performance and applicability relate to commonly used fault-detection methods. To this end, we explore the application of language models, first to file-level defect prediction (comparing with statistical defect prediction methods) and then to line-level defect prediction, comparing their performance with popular Static Bug Finders (SBF). In this section, we present the relevant technical background and the main research questions.

2.1 Language Modeling

Language models assign a probability to every sequence of words. Given a code token sequence S = t_1 t_2 ... t_N, a language model estimates the probability of this sequence occurring as a product of a series of conditional probabilities for each token's occurrence:

P(S) = P(t_1) \cdot \prod_{i=2}^{N} P(t_i \mid t_1, \ldots, t_{i-1})    (1)

P(t_i | t_1, ..., t_{i-1}) denotes the chance that the token t_i follows the previous tokens, the prefix h = t_1, ..., t_{i-1}. These probabilities are impractical to estimate, due to the huge number of possible prefixes. A common fix is the ngram language model, which uses the Markov assumption to condition on just the preceding n-1 tokens:

P_{ngram}(t_i \mid h) = P(t_i \mid t_{i-n+1}, \ldots, t_{i-1})    (2)

This we estimate from the training corpus as the fraction of times that t_i follows the prefix t_{i-n+1}, ..., t_{i-1}. This is reversible: we can also compute the probability of each token given its suffix (the subsequent tokens). We compute entropies based on both prefix and suffix token sequences to better identify the buggy lines (Section 3.3).
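To make the computation concrete, the following is a minimal sketch (our own illustration, not the authors' tool) of a token-level ngram model with per-token cross-entropy. The tokenizer, training corpus, and add-one smoothing are simplifying assumptions; the paper instead uses the back-off $gram cache model of Tu et al.

    # Minimal n-gram entropy sketch (illustrative only).
    import math
    from collections import defaultdict

    def train_ngram(token_lines, n=3):
        """Count n-grams and their (n-1)-token prefixes over a training corpus."""
        ngrams, prefixes, vocab = defaultdict(int), defaultdict(int), set()
        for tokens in token_lines:
            padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
            vocab.update(padded)
            for i in range(n - 1, len(padded)):
                prefix = tuple(padded[i - n + 1:i])
                ngrams[prefix + (padded[i],)] += 1
                prefixes[prefix] += 1
        return ngrams, prefixes, len(vocab)

    def token_entropies(tokens, model, n=3):
        """Per-token cross-entropy, -log2 P(t_i | t_{i-n+1} ... t_{i-1})."""
        ngrams, prefixes, V = model
        padded = ["<s>"] * (n - 1) + tokens
        out = []
        for i in range(n - 1, len(padded)):
            prefix = tuple(padded[i - n + 1:i])
            # Add-one smoothing so unseen events get a finite entropy.
            p = (ngrams[prefix + (padded[i],)] + 1) / (prefixes[prefix] + V)
            out.append(-math.log2(p))
        return out

    # A line seen often in training is less entropic than a rare one:
    corpus = [["if", "(", "x", "==", "null", ")", "return", ";"]] * 50
    model = train_ngram(corpus)
    print(sum(token_entropies(["if", "(", "x", "==", "null", ")", "return", ";"], model)) / 8)
    print(sum(token_entropies(["x", "=", "frobnicate", "(", "y", ")", ";"], model)) / 7)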

Ngram language models can effectively capture the regularities in source code and have been applied to code suggestion tasks [22, 2]. Tu et al. [53] improved such language models by considering that software tends to be repetitive in a local context. They introduced a cache language model ($gram) that deploys an additional cache (a list of ngrams extracted from the local context) to capture local regularities. The ngrams extracted from each file under test form its local context in the cache model. We use the state-of-the-art $gram model to judge the “improbability” (measured as cross-entropy) of lines of code.
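The cache idea can be illustrated as an interpolation between the corpus-trained model and ngram counts gathered from the file under test. This is a loose rendering of $gram, which actually backs off over several cache orders; the fixed mixing weight and the reuse of the model tuple from the sketch above are assumptions.

    def cached_probability(prefix, token, global_model, cache_ngrams, cache_prefixes, lam=0.5):
        """Interpolate the corpus n-gram estimate with counts from the file's own cache.
        cache_ngrams / cache_prefixes are n-gram and prefix counts collected from the
        file under test (its "local context")."""
        ngrams, prefixes, V = global_model
        p_global = (ngrams[prefix + (token,)] + 1) / (prefixes[prefix] + V)
        if cache_prefixes.get(prefix, 0) > 0:
            p_cache = cache_ngrams.get(prefix + (token,), 0) / cache_prefixes[prefix]
            return lam * p_cache + (1 - lam) * p_global
        return p_global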

2.2 Line-level defect detection: SBF

Static Bug-Finders (SBF) use syntactic and semantic properties of source code to locate common errors, such as null pointer dereferencing and buffer overflows. They rely on methods ranging from informal heuristic pattern-matching to formal algorithms with proven properties; they typically report warnings at build time. Most of the pattern-matching tools [10, 14, 13, 9] require users to specify the buggy templates. Others [51, 32] can automatically infer rules by mining existing software; they raise warnings if violations of the rules occur. Most (e.g., PMD and FindBugs) are unsound, yet fast and widely used, compared to more formal approaches. Generally, SBF produce false positives and false negatives, which reduce their cost-effectiveness [23, 26].

Both SBF and our model (fairly imperfectly) indicate potential defect locations; our goal is to compare these approaches and see whether they can be combined.

2.3 Evaluating Defect Predictions

We take the simplified view that SBF and $gram are comparable, in that they both select suspicious lines of code for manual review. We therefore refer to language model based bug prediction as NBF (“Naturalness Bug Finder”). With either SBF or NBF, human code review effort, spent on lines identified as bug-prone, will hopefully find some defects. Comparing the two approaches requires a performance measure. We adopt a cost-based measure that has become standard: AUCEC (Area Under the Cost-Effectiveness Curve) [4]. Like ROC, AUCEC is a non-parametric measure that does not depend on the defects' distribution. AUCEC assumes that cost is inspection effort and payoff is the number of bugs found.

Given a model (SBF or NBF) that predicts buggy lines, we rank all the lines in decreasing order of their defect-proneness score. Thus, the best possible model would place the buggiest lines at the top of the list. This approach helps reviewers to inspect a smaller portion of code (i.e. cost), while finding a disproportionately larger fraction of defects (i.e. payoff). We normalize both cost and payoff to 100% and visualize the improvement that a prediction model provides when compared against a random guess using a ‘lift-chart’ [55]. In this chart, cost (on the x-axis) refers to the percentage of the code-base inspected at prediction time, and payoff (on the y-axis) indicates the portion of the known bugs (discovered by data gathering) already in the code that are covered by warned lines. AUCEC is the area under this curve. Under a uniform bug distribution across SLOC, inspecting x% of lines of code at random will, in expectation, also yield x% of the bugs, i.e., random selection produces a diagonal line on the lift chart. The corresponding AUCEC when inspecting 5% of lines at random is 0.00125 (footnote 1).
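As an illustration (our own sketch, not the evaluation scripts used in the paper), AUCEC over a given inspection budget can be computed from a ranked list of lines as follows. The per-line credit scheme here is the simplest one; the full and partial credit variants are discussed below.

    def aucec(scores, is_buggy, budget=1.0):
        """Area under the normalized lift chart up to the given budget (fraction of lines).
        scores: per-line suspiciousness; is_buggy: per-line 0/1 labels."""
        ranked = sorted(zip(scores, is_buggy), key=lambda x: -x[0])
        total_lines, total_bugs = len(ranked), sum(b for _, b in ranked)
        if total_bugs == 0:
            return 0.0
        area, found, prev_x, prev_y = 0.0, 0, 0.0, 0.0
        for i, (_, buggy) in enumerate(ranked, start=1):
            x = i / total_lines
            if x > budget:
                break
            found += buggy
            y = found / total_bugs
            area += (x - prev_x) * (prev_y + y) / 2   # trapezoidal rule
            prev_x, prev_y = x, y
        return area

    # Random selection of 5% of lines yields, in expectation, an AUCEC of about
    # 0.5 * 0.05 * 0.05 = 0.00125, matching the baseline mentioned in the text.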

Inspecting 100% of SLOC in a project is probably unrealistic. Prior research has assumed that 5% (sometimes 20%) of the code could realistically be inspected under deadline [44]. Additionally, Rahman et al. compare SBF with DP (a file-level statistical defect predictor) by allowing the number of warnings from SBF to set the inspection budget (denoted AUCECL) [43]. They assign the DP the same budget and compare the resulting AUCEC scores. We extend this approach to our comparison of SBF and NBF. To understand how NBF's payoff varies with cost, we first measure its performance for both 5% and 20% inspection budgets. We then compare the AUCECs of NBF and SBF at both the 5% budget and under the AUCECL budget.

Finally, we investigate defect prediction performance under several credit criteria. A prediction model is awarded credit, ranging from 0 to 1, for each (ipso facto eventually) buggy line flagged as suspicious. Previous work by Rahman et al. has compared SBF and DP models using two types of credit: full (or optimistic) and partial (or scaled) credit [43], which we adapt to line-level defect prediction. The former metric awards a model one credit point for each bug iff at least one line of the bug was marked buggy by the model. Thus, it assumes that a programmer will spot a bug as soon as one of its lines is identified as such. Partial credit is more conservative: for each bug, the credit is awarded in proportion to the fraction of the bug's defective lines that the model marked. Hence, partial credit assumes that the probability of a developer finding a bug is proportional to the portion of the bug that is marked by the model.

Footnote 1: Calculated as 0.5 * 0.05 * 0.05. This could be normalized differently, but we consistently use this measurement, so our comparisons work.


Table 1: Summary data per project used in Phase-I

Ecosystem | Project | Description | Study Period | #Files | NCSL | #Unique bug-fixes
Github | Atmosphere | Web socket framework | Oct-11 to Oct-12 | 4,073 | 427,901 | 664
Github | Elasticsearch | Distributed search engine | Jul-14 to Jul-15 | 30,977 | 5,962,716 | 498
Github | Facebook-android-sdk (facebook) | Android SDK for Facebook application | Dec-11 to Dec-12 | 1,792 | 172,695 | 77
Github | Netty | Network application framework | Aug-12 to Aug-13 | 8,618 | 1,078,493 | 530
Github | Presto | SQL query engine | Jul-14 to Jul-15 | 10,769 | 2,869,799 | 346
Apache | Derby | Relational database | Jul-06 to Jul-07 | 7,332 | 6,053,966 | 1,352
Apache | Lucene | Text search engine library | Jan-12 to Jan-13 | 29,870 | 7,172,714 | 1,639
Apache | OpenJPA | Java Persistence API | Jul-09 to Jul-10 | 3,849 | 4,869,620 | 567
Apache | Qpid | Messaging system | Apr-14 to Apr-15 | 7,350 | 4,665,159 | 277
Apache | Wicket | Web framework | Jul-10 to Jul-11 | 9,132 | 2,070,365 | 1,189
Overall | | | Sep-01 to Jul-14 | 113,762 | 35,343,428 | 7,139

It should be noted that AUCEC is non-parametric under partial credit, but not under full credit, as it depends on the defect distribution; however, we get the same overall result under both regimes.
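For concreteness, the two credit schemes can be sketched as follows; the data layout (a map from bug id to the set of its defective lines, and a set of flagged lines) is our own assumption.

    def full_credit(bugs, flagged):
        # One point per bug if at least one of its lines was flagged for inspection.
        return sum(1.0 for lines in bugs.values() if lines & flagged)

    def partial_credit(bugs, flagged):
        # Credit proportional to the fraction of each bug's defective lines flagged.
        return sum(len(lines & flagged) / len(lines) for lines in bugs.values())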

2.4 Research Questions

Our central question is whether “unnaturalness” (measured as entropy, or improbability) indicates poor code quality. The abundant history of changes (including bug fixes) in OSS projects allows the use of standard methods [50] to find code that was implicated in bug fixes (“buggy code”).

RQ1. Are buggy lines less “natural” than non-buggy lines?

Project histories contain many bug fixes, where buggy code is modified to correct defects. Do language models rate bug-fix code as more natural than the buggy code that was replaced (i.e., was bug-fix code rated more probable than the buggy code)? Such a finding would also have implications for automatic, search-based bug repair: if fixes tend to be more probable, then a good language model might provide an effective organizing principle for the search, or (if the model is generative) even generate possible candidate repairs.

RQ2. Are buggy lines less “natural" than bug-fix lines?

Even if defective lines are indeed more often rated improbable by language models, this may be an unreliable indicator; there may be many false positives (correct lines marked improbable) and false negatives (buggy lines indicated as natural). Therefore, we investigate whether naturalness (i.e. entropy) can provide a good ordering principle for directing inspection.

RQ3. Is “naturalness" a good way to direct inspection effort?

One can view ordering lines of code for inspection by ‘naturalness’ as a sort of defect-prediction technique; we are inspecting lines in a certain order because prior experience suggests that certain code is very improbable and thus possibly defective. Traditional defect-prediction techniques typically rely on historical process data (e.g., number of authors, previous changes or bugs); however, defectiveness is predicted at the granularity of files (or methods). Thus, we may reasonably compare naturalness as an ordering principle with SBF, which provide warnings at the line level.

RQ4. How do SBF and NBF compare in terms of ability to direct inspection effort?

Finally, if SBF provides a warning on a line and it appears unnatural to a language model, we may expect that this line is even more likely a mistake. We therefore investigate whether naturalness is a good ordering for warnings provided by static bug-finders.

RQ5. Is “naturalness” a useful way to focus the inspection effort on warnings produced by SBF?

3. METHODOLOGY

We now describe the projects we studied, and how we gathered and analyzed the data.

3.1 Study Subject

We studied 10 OSS Java projects, as shown in Table 1: among these are five projects from Github, while the others are from the Apache Software Foundation. We chose the projects from different domains to measure NBF's performance in various types of systems. All projects are under active development.

We analyzed NBF's performance in two settings. Phase-I considers NBF's ability to find bugs, based on continuous usage during active development (see Section 3.2). We chose to analyze each project for the one-year period that contained the most bug-fixes in that project's history; here, we considered both development-time and post-release bugs. Then, for the chosen one-year duration, we extracted snapshots at 1-month intervals. A snapshot captures the state of the project at a given point in time. Thus, for each project we studied 12 snapshots, in total analyzing 120 snapshots across 10 projects, including 113,762 distinct file versions and 35.3 million total non-commented source code lines (NCSL). Overall, we studied 7,139 distinct bug-fix commits comprising 2.2 million total buggy lines. Subsequently, we confirmed our results across the entire history of each studied project, using snapshots at 6-month intervals (157 snapshots, 63,301 commits, 23,664 bug-fixes). Due to page limitations, we present results from our study of 1-month snapshots only.

Next, for Phase-II (see Section 3.2), we focused only on post-release bugs to evaluate NBF's performance as a release-time bug prediction tool. We used the data set from Rahman et al. [43], in which snapshots of the five Apache projects were taken at selected project releases.

Table 2: Summary data per project used in Phase-II. The dataset is taken from Rahman et al. [43]

Project (#releases) | NCSL (K) | FindBugs warnings | PMD warnings | #Issues
Derby (7) | 420–630 | 1527–1688 | 140–192K | 89–147
Lucene (7) | 68–178 | 137–300 | 12–31K | 24–83
OpenJPA (7) | 152–454 | 51–340 | 62–171K | 36–104
Qpid (5) | 212–342 | 32–66 | 69–80K | 74–127
Wicket (4) | 138–178 | 45–86 | 23–30K | 47–194


The project snapshot sizes vary between 68 and 630K NCSL. The bugs were extracted from Apache's JIRA issue tracking system; the bug count per release, across all the projects, varies from 24 to 194 (see Table 2). We further used warnings produced by two static bug finding tools, FINDBUGS [5] and PMD [10], as collected by Rahman et al. PMD operates on source code and produces line-level warnings; FINDBUGS operates on Java bytecode and reports warnings at line, method, and class level.

3.2 Data Collection

Phase-I

Here we describe how we identified the buggy lines in a snapshot corresponding to the bugs that developers fixed in an ongoing development process.

Estimating bug-fixing commits. Development-time bug fixes are often not recorded in an issue database. Thus, to estimate bug fixing activities during an ongoing development process, we analyzed the commit message associated with each commit over the entire project evolution and looked for error related keywords. First, we converted each commit message to a bag-of-words and then stemmed the bag-of-words using standard natural language processing (NLP) techniques. Then, similar to Mockus et al. [33], we marked a commit as a bug-fix if the corresponding stemmed bag-of-words contains at least one of the error related keywords: 'error', 'bug', 'fix', 'issue', 'mistake', 'incorrect', 'fault', 'defect', 'flaw', and 'type'. This method was adapted from our previous work [46]. For the Apache projects, as well as Atmosphere and Netty, we further improved the classification with information available from the JIRA issue database.
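A sketch of this classification step (our own simplification: the choice of the Porter stemmer via NLTK is an assumption, and the JIRA cross-check is omitted):

    # Keyword-based tagging of bug-fix commits over stemmed commit messages.
    import re
    from nltk.stem import PorterStemmer

    BUG_KEYWORDS = {'error', 'bug', 'fix', 'issue', 'mistake',
                    'incorrect', 'fault', 'defect', 'flaw', 'type'}

    def is_bug_fix(commit_message):
        stemmer = PorterStemmer()
        words = re.findall(r"[a-zA-Z]+", commit_message.lower())
        stems = {stemmer.stem(w) for w in words}             # stemmed bag-of-words
        keyword_stems = {stemmer.stem(k) for k in BUG_KEYWORDS}
        return bool(stems & keyword_stems)

    # Example: is_bug_fix("Fixed NPE in Session#close") -> True ("fixed" stems to "fix")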

To evaluate the accuracy of the above classification, we manually verified the result for 300 commits (30 from each project, chosen randomly). Here, we only evaluated whether the author of a presumed bug-fix commit really marked their commit as a bug-fix. Out of these 300 commits, 288 were classified correctly (96%), 3 commits (1%) were described as a potential "issue" (thus may have developed into a bug later), and 9 commits (3%) were classified incorrectly: 5 false negatives and 4 false positives. Thus, our approach achieves 96% accuracy (95% conf. int.: 93.8% to 98.2%).

Selecting snapshots. To evaluate NBF in continuous active development, ideally we need to study all the commits made to a project over its full history. But using git blame to get line-level bug data at this scale is not feasible. Thus, we chose 1-year evaluation periods for each project. Since our focus is studying bugs, we considered the one-year period of each project that contained the most bug fixes. Within these periods, we looked at monthly snapshots, thereby simulating near-continuous usage of our tool. For instance, snapshots were taken at 1-month intervals between 2006-07-06 and 2007-07-06 for project Derby (see Table 1).

Identifying buggy lines in a snapshot. This part consists of three steps: (1) identifying lines related to a bug, (2) identifying commits that introduced the bug, and (3) mapping the buggy lines to the snapshots of interest.

In step (1), we assumed that all the lines deleted (or changed) in a bug-fix commit were buggy lines. To find these lines, we looked at the versions of a file before and after a bug-fix. We used git diff to identify the deleted/changed lines in the old version and marked them as 'buggy'; the added/changed lines in the new version are marked as 'fixed'. Next, in step (2), we used git-blame to locate the commits that had introduced these buggy lines in the system. The first two steps are analogous to the SZZ algorithm [50]. Once we know where the buggy lines originated, we used git-blame with the '--reverse' option to locate these lines in the snapshots of interest. This step 'maps' the buggy lines to specific snapshots.

Figure 1: Collecting bug data: vertical dashed lines are snapshots (S1...S3) and triangles are commits (c1...c5) that occurred between these snapshots. For every bug-fix commit (e.g. c4), we first git-blame the buggy lines (blue arrowed lines) and then map them to the corresponding snapshots (red arrowed lines).

Figure 1 explains the procedure, where commit c4 is a bug-fix commit. The corresponding buggy lines (marked red in the old version) are found to originate from two earlier commits, c1 and c2. We then map the buggy lines from c1 to both S1 and S2, whereas the buggy lines from c2 are mapped only to S2.

Note that we considered all the bugs that appeared at any time in the entire evolution and mapped them back to the snapshots of interest. However, we lose the buggy lines that were fixed before, or arose after, our study period, as we cannot map these to any snapshots of interest. We also miss some transient bugs that appeared and were fixed within a snapshot interval (thus lasting less than a month). At the end of this step, we know exactly which lines in each of our snapshots are buggy (and were fixed in some future commit) and which ones are benign, modulo time-window censoring effects.
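The git plumbing behind steps (1) and (2) can be sketched roughly as follows; the paths, revision handling, and helper names are illustrative, and step (3) is only indicated in a comment.

    import subprocess

    def run(args, repo):
        return subprocess.run(args, cwd=repo, capture_output=True, text=True).stdout

    def deleted_lines(repo, fix_commit, path):
        """Step (1): old-version line numbers deleted/changed by the fix, read from diff hunks."""
        diff = run(["git", "diff", "-U0", fix_commit + "^", fix_commit, "--", path], repo)
        buggy = []
        for line in diff.splitlines():
            if line.startswith("@@"):                    # e.g. "@@ -120,2 +120,3 @@"
                old = line.split()[1].lstrip("-")        # "120,2"
                start, _, count = old.partition(",")
                buggy.extend(range(int(start), int(start) + int(count or 1)))
        return buggy

    def introducing_commit(repo, fix_commit, path, line_no):
        """Step (2): blame one pre-fix line to find the commit that introduced it."""
        out = run(["git", "blame", "--porcelain", "-L", f"{line_no},{line_no}",
                   fix_commit + "^", "--", path], repo)
        return out.split()[0] if out else None           # first token is the commit id

    # Step (3) uses `git blame --reverse <introducing>..<snapshot> -- <path>` to locate
    # the same lines in each later snapshot of interest; that mapping is omitted here.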

Phase-II

In Phase-II, we studied post-release bugs for the Apache projects, using the dataset of Rahman et al. [43]. Rahman et al. selected a number of release versions of each Apache project and, for each release, identified post-release bug-fix commits from the JIRA issue tracking system. They then identified buggy and non-buggy lines for each release version, similar to steps 2 and 3 of the previous section.

3.3 Entropy Measurement

Choice of language model. We measured entropy using Tu et al.'s cache-based language model ($gram) tool [53], as described in Section 2.1. We computed the entropy of each lexical token in all the Java files of all the snapshots. For a given file in a snapshot, the tool estimates a language model on all the other files of the same snapshot. It then builds a cache by running the language model on the given file, computing the entropy of each token based on both prolog (preceding tokens) and epilog (succeeding tokens). Finally, based on the training set and the locally built cache, the tool computes the entropy of each token of the file; line and file entropies are computed by averaging over all the tokens belonging to a line and all lines belonging to a file, respectively.
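Aggregating from tokens to lines to files is then straightforward; a minimal sketch, assuming per-token entropies from the language model are already available:

    # Line entropy = mean of its tokens' entropies; file entropy = mean of its lines' entropies.
    def line_entropy(token_entropies):
        return sum(token_entropies) / len(token_entropies)

    def file_entropy(lines_token_entropies):
        line_scores = [line_entropy(t) for t in lines_token_entropies if t]
        return sum(line_scores) / len(line_scores)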

To generate entropies of the fixed lines, we leveraged the data set gathered for the entire evolution period at 6-month intervals, as mentioned in Section 3.1. This was necessary because a bug may get fixed after our studied period. For each bug-fix commit, we trained the model on the immediately preceding snapshot and tested it on the new file version corresponding to the bug-fix.

Determining parameters for the cache language model. Several factors of locality can affect the performance of the cache language model: cache context, cache scope, cache size, and cache order. In this work, we built the cache on the entire file under investigation.


In this light, we only needed to tune the cache order (i.e. the maximum and minimum order of ngrams stored in the cache). In general, longer ngrams are more reliable but quite rare, thus we backed off to shorter matching prefixes (or suffixes) [25] when needed. We followed Tu et al. [53] to set the maximum order of cache ngrams to 10. To determine the minimum back-off order, we performed experiments on the Elasticsearch and Netty projects, looking for optimal performance measured in terms of the entropy difference between buggy and non-buggy lines. The maximum difference was found at a minimum backoff order of 4 with no change in the backoff weight. Thus, we set the minimum backoff order to 4 and the backoff weight to 1.0.

Adjusting entropy scores. Language models could work for defect prediction at line granularity if bug-prone lines are more entropic. For instance, a non-buggy but high-entropy line would be a false positive and worsen the language model's performance at the prediction task. For example, lines with previously unseen identifiers, such as package, class and method declarations, have substantially higher entropy scores on average. Vice versa, for-loop statements and catch clauses, being often repetitive, have much lower entropy scores. Such inter-type entropy differences do not necessarily reflect their true bug-proneness. In fact, for-statements, though less entropic, are often more bug-prone than the more entropic import-declarations.

This observation led us to using abstract-syntax-based line-types and computing a syntax-sensitive entropy score. First, we used Eclipse's JDT (footnote 2) to parse an Abstract Syntax Tree (AST) of all files under consideration. Any line in a Java file is always contained by either a single AST node (e.g., the Compilation Unit root node) or several AST nodes in hierarchical order, e.g., a nested line with if-statement, method-declaration, and class-declaration. For each line, its syntax-type is the grammatical entity associated with the lowest AST node encompassing the full line. Examples include statements (e.g., if, for, while, return), declarations (e.g., variable, structure, method, import) or other AST nodes that tend to span one line, such as switch cases and annotations. We then computed how much a line's entropy deviated from the mean entropy of its line-type using a normalized z-score:

z_{line,type} = (entropy_{line} - \mu_{type}) / SD_{type}

where \mu_{type} denotes the mean $gram entropy of all the lines of a given type, and SD_{type} denotes the standard deviation. This gave us a syntax-sensitive entropy model, $gram+type.

The above normalization essentially uses the extent to which a line is “unnatural” w.r.t. other lines of the same type. In addition, based on the fact that all line-types are not equally buggy, we computed the relative bug-proneness of a type based on the fraction of bugs and total lines (LOC) it had in all previous snapshots. Here we used the previous snapshots as training set and computed the bug-weight of a line-type as:

w_{type} = (bug_{type} / LOC_{type}) / \sum_{t \in types} (bug_t / LOC_t)

where the bugs and LOCs per type were counted over all previous snapshots. We then scaled the z-score of each line by its weight w to achieve our final model, which we name $gram+wType.
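A sketch of these two adjustments, assuming line entropies, line-types, and per-type bug history are already available (the data layout, and the use of Python's statistics module, are our own):

    # $gram+wType sketch: z-score each line's entropy within its syntactic line-type,
    # then scale by the type's historical bug-proneness weight w_type.
    from statistics import mean, stdev

    def wtype_scores(lines, history):
        """lines: list of (entropy, line_type);
        history: {line_type: (bug_lines, total_lines)} over all previous snapshots."""
        by_type = {}
        for entropy, t in lines:
            by_type.setdefault(t, []).append(entropy)
        mu = {t: mean(v) for t, v in by_type.items()}
        sd = {t: (stdev(v) if len(v) > 1 else 1.0) for t, v in by_type.items()}

        rates = {t: b / loc for t, (b, loc) in history.items() if loc}
        total = sum(rates.values())
        weight = {t: r / total for t, r in rates.items()}        # w_type

        # Types with no history get weight 0 here; a simplification of the paper's setup.
        return [weight.get(t, 0.0) * (e - mu[t]) / (sd[t] or 1.0) for e, t in lines]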

The Phase-I and Phase-II data sets and the entropy generation tool are available at http://odd-code.github.io/.

4. EVALUATION

This section discusses the answers to the research questions introduced in Section 2.4. The first two RQs are primarily based on the Phase-I data set. RQ3 uses Phase-II data to evaluate NBF's capability as a file-level defect predictor. Line-level defect prediction is evaluated using both data sets (RQ3, RQ4, and RQ5). We begin with the question that is at the core of this paper:

Footnote 2: Java development tools, http://www.eclipse.org/jdt/

Table 3: Entropy difference and effect size between buggy vs. non-buggy and buggy vs. fixed lines

Bug-fix threshold | Entropy diff. (bug > non-bug, RQ1) | Cohen's d effect | Entropy diff. (bug > fix, RQ2) | Cohen's d effect | % unique bugs detected
1 | 1.95 to 2.00 | 0.61 | 1.58 to 1.67 | 0.60 | 14.92
2 | 1.68 to 1.72 | 0.53 | 1.43 to 1.51 | 0.52 | 23.93
4 | 1.37 to 1.40 | 0.43 | 1.16 to 1.23 | 0.41 | 34.65
7 | 1.15 to 1.17 | 0.36 | 0.98 to 1.04 | 0.34 | 43.94
15 | 0.91 to 0.93 | 0.29 | 0.75 to 0.81 | 0.26 | 55.91
20 | 0.81 to 0.83 | 0.25 | 0.62 to 0.68 | 0.21 | 60.16
50 | 0.58 to 0.59 | 0.18 | 0.39 to 0.44 | 0.13 | 71.68
100 | 0.55 to 0.57 | 0.17 | 0.36 to 0.41 | 0.12 | 80.96
Overall | 0.86 to 0.87 | 0.26 | 0.69 to 0.74 | 0.19 | 100.00

Buggy lines have higher entropy than non-buggy lines. Buggy lines also have higher entropy than fixed lines. The entropy difference decreases as the bug-fix threshold increases. Entropy differences are measured with a t-test at the 95% confidence interval and are statistically significant (p-value << 0.05). Cohen's d effect size: 0.2 = 'small', 0.5 = 'medium', and 0.8 = 'large'.

RQ1. Are buggy lines less “natural” than non-buggy lines?

To evaluate this question, we compare the entropies of buggy and non-buggy lines for all the studied projects. A Wilcoxon non-parametric test confirms that buggy lines are indeed more entropic than non-buggy lines, with statistical significance (p < 2.2 * 10^-16). The average entropy of buggy lines is 6.21, while that of non-buggy lines is 5.34. However, the Cohen's d effect size between the two is 0.26 (see the last row in Table 3), which is considered small.

One explanation for the small effect size across bugs of all sizes is the impact of tangled bug fixes: lines that are changed in a bug-fix commit but are not directly related to the bug. Herzig et al. [21] showed that around 17% of all source files are incorrectly associated with bugs due to such tangled changes. The impact of tangled changes on bug entropy is more visible for larger bug-fix commits. Some of the lines changed in a larger bug fix may not be directly associated with the erroneous lines and thus may not be “unnatural”. In contrast, for smaller bug-fix commits, say with 1 or 2 lines of fix, the fixed lines (i.e. the lines deleted from the older version) are most likely to be buggy.

Figure 2: (a) Entropy difference between non-buggy, buggy, and fixed lines at bug-fix threshold 7. (b) Entropy difference between buggy lines of different bug duration. (y-axis: line entropy.)

To understand the effect of tangled changes on the naturalness of buggy code, we compute the entropy difference between buggy and non-buggy lines at various bug-fix size thresholds. We define the bug-fix threshold as the number of lines in a file that are deleted from the older version in a bug-fix commit. Table 3 shows the result. Both the entropy difference and the effect size decrease as the bug-fix threshold increases. For example, for a threshold size of 1, entropies of buggy lines are on average 1.95 to 2.00 bits higher than those of non-buggy lines (95% confidence interval). Cohen's d effect size between the two lies between medium and large (0.61); 14.92% of the bug-fixes lie below this threshold. For a threshold size of 7 (footnote 3), buggy lines have 1.15 to 1.17 bits higher entropy than their non-buggy counterparts. In this case, we see a small to medium effect (effect size = 0.36).

Footnote 3: 75% of all changes (both bug-fixes and feature implementations) in our data set contain no more than 7 lines of deletion.


This is also shown in Figure 2(a). At bug-fix threshold 100, we see only a 0.55 to 0.57 bit entropy difference, with a small effect (0.17). These results indicate that the lines that are indirectly associated with real buggy lines in tangled bug-fix commits may have lower entropy, and thus would diminish the overall entropy of buggy lines.

We further observe that bugs that stay longer in a repository tend to have lower entropy than short-lived bugs. The bug duration of a buggy line is measured as the number of months until the buggy line is fixed, starting from the day of its introduction (bug_duration = bug-fix date minus bug-introduction date). The following table summarizes bug duration in our data set (in months):

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
0.03 | 8.33 | 23.40 | 29.03 | 39.83 | 125.6

Based on bug duration, we divide all the buggy lines into three groups: low, medium, and high. The bugs in the low group “survived” for less than 9 months (1st quartile); bugs in the medium group survived from 9 to 24 months (1st quartile to median), and the remaining bugs are in the high group. Figure 2(b) shows their entropy variation. The low group has significantly higher entropy than the medium and high groups, with Cohen's d effect sizes of 0.62 and 0.75 respectively (medium to large effect size). The difference is also confirmed by a Wilcoxon non-parametric test with statistical significance. The medium group is also slightly more entropic than the high group with statistical significance, although the effect size is very small (0.10). These results indicate that the bugs that are fixed more quickly are more “unnatural” than the longer-lived bugs. We hope to explore the reasons in future work: perhaps the highly entropic bugs are easier to locate, diagnose and fix, and thus get speedily resolved; or perhaps (more intriguingly) highly-entropic code is more strongly associated with failures that are more likely to be quickly encountered by users.

In summary, we have the overall result:

Result 1: Buggy lines, on average, have higher entropies, i.e., are “less natural”, than non-buggy lines.

A natural question is whether the entropy of the lines in a bug drops once the bug is fixed. This leads us to the following question:

RQ2. Are buggy lines less “natural” than bug-fix lines?

To answer RQ2, we collected the bug-fix commit patches of all the bugs that exist in any snapshot under study. In a bug-fix commit, the lines deleted from the original version are considered buggy lines and the lines added in the fixed version are considered fixed lines. We collected all such buggy and fixed lines for all the projects, as described in Section 3.2. Establishing a one-to-one correspondence between a buggy and a fixed line is hard because buggy lines are often fixed by a different number of new lines. Hence, we compare the mean entropies between buggy and fixed lines across all the patches. A Wilcoxon non-parametric test confirms that the entropy of buggy lines, in general, drops after the bug-fixes, with statistical significance (see Figure 2(a)).

Similar to RQ1, tangled changes may also impact the entropies of bugs and their fixes. To measure the impact, we further compare the entropy differences between buggy and fixed lines at various bug-fix thresholds. Table 3 shows the result. Both the entropy difference and the effect size decrease as the bug-fix threshold increases. For example, at a bug-fix threshold of one line, average entropy drops, upon fixing, by between 1.58 and 1.67 bits (95% confidence interval and with statistical significance). The Cohen's d effect size is medium to large (0.60).

Table 4: Examples of bug fix commits that NBF detected successfully. These bugs evinced a large entropy drop after the fix. Bugs with only one defective line are shown for simplicity. The buggy lines are marked with '-' and the fixes with '+'.

Example 1: Wrong Initialization Value
Facebook-Android-SDK (2012-11-20), File: Session.java
Entropy dropped after bugfix: 4.12028

    if (newState.isClosed()) {
        // Before (entropy = 6.07042):
    -   this.tokenInfo = null;
        // After (entropy = 1.95014):
    +   this.tokenInfo = AccessToken.createEmptyToken(Collections.<String>emptyList());
    }
    ...

Example 2: Wrong Method Call
Netty (2013-08-20), File: ThreadPerChannelEventLoopGroup.java
Entropy dropped after bugfix: 4.6257

    if (isTerminated()) {
        // Before (entropy = 5.96485):
    -   terminationFuture.setSuccess(null);
        // After (entropy = 1.33915):
    +   terminationFuture.trySuccess(null);
    }

Example 3: Unhandled Exception
Lucene (2002-03-15), File: FSDirectory.java
Entropy dropped after bugfix: 3.87426

    if (!directory.exists())
        // Before (entropy = 9.213675):
    -   directory.mkdir();
        // After (entropy = 5.33941):
    +   if (!directory.mkdir())
    +       throw new IOException("Cannot create directory: " + directory);
    ...

However, with a threshold size of 30, the mean entropy difference between the two is slightly more than half a bit, with a small effect size of 0.18. Such behavior suggests that tangled changes may be diluting the entropy of buggy code and their fixes.

Bug duration also impacts the drop of entropy after a bug fix. For bugs with low duration, the entropy drop is significant: 2.68 to 2.75 bits on average (effect size: 0.59). For medium duration bugs, entropy drops by 0.09 to 0.18 bits (effect size: 0.04), while entropy does not necessarily drop for high duration bugs.

Table 4 shows three examples of code where the entropy of buggy lines dropped significantly after bug-fixes. In the first example, a bug was introduced in Facebook-Android-SDK code due to a wrong initialization value: tokenInfo was incorrectly reset to null. This specific initialization rarely occurred elsewhere, so the buggy line had a rather high entropy of 6.07. Once the bug was fixed, the fixed line followed a repetitive pattern (indeed, with two prior instances in the same file). Hence, the entropy of the fixed line dropped to 1.95, an overall 4.12 bit reduction. The second example shows an incorrect method call in the Netty project. Instead of calling the method trySuccess (used three times earlier in the same file), the code incorrectly called the method setSuccess, which was never called in a similar context. After the fix, entropy drops by 4.63 bits. Finally, Example 3 shows an instance of a missing conditional check in Lucene. The developer should check whether directory creation is successful by checking the return value of the directory.mkdir() call, following the usual code pattern. The absence of this check raised the entropy of the buggy line to 9.21. The entropy value drops to 5.34 after the fix.

In certain cases these observations do not hold. For instance, in Example 4 of Table 5, the entropy increased by 5.75 bits after the bug fix.


Table 5: Examples of bug fix commits where NBF did not perform well. In Example 4, NBF could not detect the bug successfully, and after the bugfix the entropy increased. In Example 5, NBF incorrectly detected the line as buggy due to its high entropy value.

Example 4: Wrong Argument (NBF could not detect)
Netty (2010-08-26), File: HttpMessageDecoder.java
Entropy increased after bugfix: 5.75103

    if (maxHeaderSize <= 0) {
        throw new IllegalArgumentException(
            // Before (entropy = 2.696275):
    -       "maxHeaderSize must be a positive integer: " + maxChunkSize);
            // After (entropy = 8.447305):
    +       "maxHeaderSize must be a positive integer: " + maxHeaderSize);
    }

Example 5: (NBF detected incorrectly)
Facebook-Android-SDK (multiple snapshots), File: Request.java

    // Entropy = 9.892635
    Logger logger = new Logger(LoggingBehaviors.REQUESTS, "Request");
    ...

In this case, the developer copied maxChunkSize from a different context but forgot to update the variable name: a classic copy-paste error [45]. Since a statement involving maxChunkSize was already present in the existing corpus, the line was not surprising; hence its entropy was low although it was a bug. When the new, corrected statement with maxHeaderSize was introduced, the entropy increased. Similarly, in Example 5, the statement involving logger was newly introduced in the corpus; hence its entropy was high despite not being a bug.

However, for all bug-fix thresholds, a Wilcoxon non-parametric test confirms with statistical significance that the entropy of buggy lines is higher than the entropy of fixed lines. Overall, we see a 0.69 to 0.74 bit entropy drop after bug fixes, with a small effect size of 0.19. Thus, in summary:

Result 2: The entropy of buggy lines drops after bug-fixes, with statistical significance.

Having established that buggy lines are significantly less natural than non-buggy lines, we investigate whether entropy can be used to direct inspection effort towards buggy code. We start with the following research question:

RQ3. Is “naturalness” a good way to direct inspection effort?

Baseline: detecting buggy files. We first consider file-level defect prediction (DP), the de facto standard in the literature. Specifically, we evaluate whether ordering files by entropy will guide us to buggy files better than traditional logistic regression and random forest based DP. DP is typically used at release time to predict post-release bugs [35, 57, 42, 36, 11]; so, for this comparison we use the post-release bug data collected in Phase-II. DP is implemented using two classifiers, logistic regression (LR) [43, 42] and Random Forest (RF), where the response is a binary variable indicating whether a file is buggy or not. The predictor variables are the process metrics from [42, 11], such as #developers, #file-commits, code churn, and previous bug history; prior research shows that process metrics are better predictors of file-level defects [42]. For each project, we train our model on one release and evaluate on the next release: a defect-proneness score is assigned to every file under test. We repeat this procedure for all releases of all the projects under study.

A file's entropy is measured as the average entropy of all the lines in that file. We rank each file in each release based on entropy and on the logistic regression-based prediction score.
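For reference, such a DP baseline can be sketched with off-the-shelf classifiers; the feature matrix of process metrics and any tuning are assumptions not detailed here.

    # File-level defect prediction baseline sketch: fit on one release's process
    # metrics, score files of the next release by predicted defect-proneness.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def defect_proneness(train_X, train_y, test_X, use_random_forest=True):
        clf = RandomForestClassifier(n_estimators=100) if use_random_forest \
              else LogisticRegression(max_iter=1000)
        clf.fit(train_X, train_y)                  # y: 1 if the file had a post-release bug
        return clf.predict_proba(test_X)[:, 1]     # defect-proneness score per file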

Figure 3: Performance evaluation of NBF w.r.t. DP for identifying buggy files (normalized AUCEC vs. percentage of inspected lines; methods: average file entropy, logistic regression, random forest, and random).

Figure 3 shows the normalized AUCEC performance of all the classifiers, similar to [4]. Here, the y-axis shows the AUCEC scores as a fraction of the “perfect” score (files ranked using an oracle) for each model. At the higher inspection budget of 20% of SLOC, the logistic regression and Random Forest DP models perform 56.5% and 30.4% better than the entropy-based model, respectively. However, at the stricter inspection budget of 5% of SLOC, the entropy-based predictor performs 30% better than LR and only 4.2% worse than RF, both measured w.r.t. the entropy-based predictor.

Detecting buggy lines. Having shown that entropy can help detect bug-prone files, we now focus on a finer granularity: can the entropy of a line of code be used to direct inspection effort towards buggy lines? Specifically, will ordering lines by entropy guide inspection effort better than ordering lines at random? In all our experiments, the random baseline chooses lines at random from Non-Commented Source Lines (NCSL), picking just as many as NBF and SBF (in RQs 4 & 5). For the reasons outlined in Section 2.3, we evaluate the performance of entropy-ordering with the AUCEC scores at 5% and 20% of inspected lines (AUCEC5/20 in short) according to two types of credit: partial and full (in decreasing order of strictness). Since this is a line-level experiment, comparing AUCEC values here with the file-level optimum, as we did earlier in Figure 3, risks confusion arising from ecological inference [41]; so we just present the raw AUCEC scores, without normalization.

We further calculate AUCEC for different bug-fix thresholds, as entropy is better at predicting smaller bug-fixes. When measuring AUCEC at a threshold of, say, n lines, we ignore bug-fixes spanning more than n lines. Performance is evaluated in terms of the percentage gain of AUCEC over random AUCEC:

gain = \sum_{project} (AUCEC - random\_AUCEC) / \sum_{project} random\_AUCEC

Figure 4(a) shows AUCEC20 scores for partial credit, averaged over all projects at bug-fix threshold 7. Under partial credit, the default $gram model (without the syntax weighting described in Section 3.3) performs better than random, particularly at 10% or more of inspected lines. At 20% of inspected lines, $gram performs 41.95% better than random. Figure 4(b) focuses on the performance on the 10 studied projects, up to 5% of the inspected lines. At this level, $gram's performance varies. For the projects Facebook, Netty, and Qpid, $gram performs significantly better than random; but in other projects $gram performs similarly to, or worse than, random.

On closer examination, we found that some program constructs are intrinsically more entropic than others. For example, method declarations are often more entropic, because they are less frequent. This observation led us to consider syntactic line-type in bug prediction, as discussed in Section 3.3. Scaling the entropy scores by line type improves AUCEC5 performance in all but Facebook and Atmosphere, and significantly improves performance in all cases where $gram performed no better than random.


Figure 4: Performance evaluation of NBF with partial credit, comparing $gram, $gram+Type, $gram+wType, and random. (a) Overall AUCEC up to inspecting 20% of lines, for all the projects. (b) A closer look at low-order AUCEC, up to inspecting 5% of lines, for each individual project.

Including the bugginess history of line-types ($gram+wType) furthermore outperforms random and $gram in all but Lucene and Atmosphere, and achieves overall AUCEC5 scores 92.53% higher than random at bug-fix threshold 7. These results are similar under full credit (see Table 6). Since $gram+wType is the best-performing “naturalness” approach so far, we hereafter refer to it as NBF.

Table 6: Performance evaluation of NBF vs. random for different bug-fix thresholds

                  | Full Credit                     | Partial Credit
Bug-fix threshold | AUCEC | random AUCEC | gain (%) | AUCEC | random AUCEC | gain (%)

AUCEC_5
2   | 0.0035 | 0.0016 | 122.74 | 0.0027 | 0.0013 | 113.46
4   | 0.0037 | 0.0019 | 94.35  | 0.0025 | 0.0013 | 102.22
7   | 0.0038 | 0.0023 | 64.98  | 0.0024 | 0.0013 | 92.53
14  | 0.0041 | 0.0028 | 47.38  | 0.0023 | 0.0013 | 87.77
all | 0.0051 | 0.0051 | -0.21  | 0.0022 | 0.0013 | 76.60

AUCEC_20
2   | 0.0531 | 0.0252 | 110.62 | 0.0403 | 0.0200 | 101.60
4   | 0.0543 | 0.0300 | 81.15  | 0.0388 | 0.0200 | 93.93
7   | 0.0559 | 0.0345 | 59.02  | 0.0370 | 0.0200 | 85.15
14  | 0.0565 | 0.0401 | 40.89  | 0.0362 | 0.0200 | 81.03
all | 0.0619 | 0.0567 | 9.29   | 0.0345 | 0.0200 | 72.52

Table 6 further shows that NBF performance worsens with larger bug-fix thresholds, for both AUCEC5 and AUCEC20. For example, for AUCEC5 with bug-fix threshold 2, we see 122.74% and 113.46% performance gains over random AUCEC for full and partial credit respectively. These gains drop to 47.38% and 87.77% at a threshold of 14.

Notice that, in Table 6, under partial credit, the random selection approach yields constant AUCEC5 and AUCEC20 scores, independent of the bug-fix threshold. Partial credit scoring assigns credit to each line based on the size of the bug-fix that it is part of (if any); thus, selecting 5% of lines at random should, in expectation, yield 5% of the overall credit that is available (see also Section 2.3). Full credit, on the other hand, assigns the credit for detecting a bug as soon as a single line of the bug is found. Therefore, the AUCEC scores of a random selection method under full credit will depend on the underlying distribution of bugs: large bugs are detected with a high likelihood even when inspecting only a few lines at random, whereas small bugs are unlikely to be detected when inspecting 5% of lines without a good selection function. This is reflected in Table 6: as the bug-fix threshold increases, the random AUCEC scores increase as well. The NBF approach, on the other hand, excels at detecting small bugs under both full and partial credit, which we also found to be the most entropic (see Figure 2). Thus, although its performance increases slightly with an increasing bug-fix threshold, its gain over random decreases. Overall, we summarize that:

Result 3: Entropy can be used to guide bug-finding efforts at both the file-level and the line-level.
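As a concrete illustration of the two credit schemes described above, the following is a minimal sketch of an AUCEC computation under full and partial credit. The function name and the data layout (a ranked list of per-line bug ids plus a map from bug id to fix size) are assumptions for illustration, not the evaluation code used in this study.

    def aucec(ranked_bug_ids, bug_sizes, budget=0.05, credit='partial'):
        # Area under the cost-effectiveness curve when inspecting the top
        # `budget` fraction of a ranking (most suspicious lines first).
        # ranked_bug_ids[i] is the id of the bug that line i belongs to,
        # or None for a non-buggy line; bug_sizes maps a bug id to the
        # number of lines in its fix.
        n = len(ranked_bug_ids)
        total_credit = float(len(bug_sizes))  # both schemes sum to one unit of credit per bug
        inspected = max(1, int(round(budget * n)))
        earned, found, area = 0.0, set(), 0.0
        for bug in ranked_bug_ids[:inspected]:
            if bug is not None:
                if credit == 'partial':
                    earned += 1.0 / bug_sizes[bug]     # each line carries 1/size of its bug
                elif bug not in found:
                    earned += 1.0                      # full credit on the first line found
                    found.add(bug)
            area += (earned / total_credit) * (1.0 / n)  # step-wise integration over the x-axis
        return area

In this sketch, AUCEC5 corresponds to budget=0.05 and AUCEC20 to budget=0.20; a random ranking under partial credit then yields the constant baseline scores seen in Table 6.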

RQ4. How do SBF and NBF compare in terms of ability to direct inspection effort?

To compare NBF with SBF, we use the $gram+wType model on the Phase-II data set. To investigate the impact of tangled changes, we choose the overall data set and a bug-fix threshold of 14 (which roughly corresponds to the fourth quartile of bug-fix sizes on this data set). Further, we select PMD and FINDBUGS from the pool of available SBF tools, because they are popular and have been studied in previous research [43, 23, 26].

As discussed in Section 2.3, Rahman et al. developed a measure named AUCECL to compare SBF and DP methods on an equal footing [43]. In this method, the SBF under investigation sets the line budget based on the number of warnings it returns, and the DP method may choose a (roughly) equal number of lines. The models’ performance can then be compared by computing the AUCEC scores both approaches achieve on the same budget. We follow this approach to compare SBF with NBF.
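A minimal sketch of this budget-matched comparison, reusing the aucec() routine sketched above; treating the budget as the fraction of SBF-warned lines, and the per-line dictionary layout, are illustrative assumptions rather than the exact procedure of [43].

    def aucecl(sbf_lines, nbf_lines, bug_sizes):
        # Both arguments are rankings (most suspicious first) of the same
        # set of lines; each line is a dict with 'warned' (True if the SBF
        # flagged it) and 'bug' (bug id or None). The SBF's warning volume
        # fixes the shared inspection budget.
        n = len(sbf_lines)
        budget = sum(1 for line in sbf_lines if line['warned']) / float(n)
        sbf_score = aucec([line['bug'] for line in sbf_lines], bug_sizes, budget)
        nbf_score = aucec([line['bug'] for line in nbf_lines], bug_sizes, budget)
        return sbf_score, nbf_score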

Furthermore, we also compare the AUCEC5 scores of the algorithms. For the $gram+wType model, this is analogous to the results in RQ3. To acquire AUCEC5 scores for the SBF, we simulate them as follows: first, assign each line the value zero if it was not marked by the SBF and the value of the SBF priority otherwise ({1, 2} for FINDBUGS, {1–4} for PMD); then, add a small random amount (tie-breaker) from U[0, 1] to all line values and order the lines by descending value. This last step simulates the developer randomly choosing to investigate the lines returned by the SBF: first from those marked by the SBF in descending (native, SBF tool-based) priority, and within each priority level at random. We repeat the simulation multiple times and average the performance.
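The simulation amounts to a randomized sort key, as in the sketch below. The encoding in which larger priority values are inspected first is an assumption (the tools’ native priority scales would be mapped onto it), and the field names are illustrative.

    import random

    def simulate_sbf_inspection(lines, runs=100):
        # Each line carries a 'priority': 0 if the SBF did not mark it,
        # otherwise a positive integer such that larger values are
        # inspected first.
        rankings = []
        for _ in range(runs):
            # A U[0, 1) tie-breaker shuffles lines within a priority level
            # but can never reorder lines across integer priority levels.
            keyed = [(line['priority'] + random.random(), line) for line in lines]
            keyed.sort(key=lambda pair: pair[0], reverse=True)
            rankings.append([line for _, line in keyed])
        return rankings

Each simulated ranking would then be scored with the AUCEC routine above and the scores averaged over the runs.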

Figures 5(a) and 5(b) show the AUCEC5 and AUCECL scores for PMD, using partial credit and bug-fix threshold 14. The results for FINDBUGS were comparable, as were the results using full credit. As can be seen, performance varied substantially between projects and between releases of the same project. Across all releases, and under both AUCEC5 and AUCECL scoring, all models performed significantly better than random (paired t-test: p < 10^-3), with large effect (Cohen's D > 1). SBF and NBF performed comparably; NBF performed slightly better when using both partial credit and the specified bug-fix threshold, but when dropping the threshold, and/or with full credit, no significant difference remains between NBF and SBF. No significant difference in performance was found between FINDBUGS and PMD either.

Figure 5: Partial credit: performance evaluation of $gram+wType w.r.t. SBF and a mixed model which ranks the SBF lines by entropy. (a) AUCEC5 performance of $gram+wType vs. PMD and the combination model; (b) AUCECL performance of $gram+wType vs. PMD and the combination model.

In all comparisons, all approaches retrieved relatively bug-prone lines, performing substantially better than random. In fact, at a 5% inspection budget, both the line-level NBF and the two SBF performed substantially better than the earlier presented DP method and file-level NBF (compare Figure 5(a) and Figure 3).

Result 4: Entropy achieves comparable performance to commonly used Static Bug Finders in defect prediction.

Notably, NBF had both the highest mean and standard deviation of the tested models, whereas PMD's performance was most robust. This suggests a combination of the models: we can order the warnings of the SBF using the $gram+wType model. In particular, we found that the standard priority ordering of the SBF is already powerful, so we propose to re-order the lines within each priority category.

RQ5. Is “naturalness” a useful way to focus the inspection effort on warnings produced by SBF?

To answer this question, we again assigned values to each line based on the SBF priority, as in RQ4. However, rather than adding random tie-breakers, we rank the lines within each priority bin by the (deterministic) $gram+wType score. The results for PMD are shown in Figure 5, first using the AUCEC5 measure (5(a)) and then using the AUCECL measure (5(b)). PMD_Mix refers to the combination model as proposed.
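A minimal sketch of this combined ordering, assuming (as in the earlier sketch) that larger priority values are inspected first and that each warned line carries a naturalness score; both field names are illustrative.

    def mix_rank(warned_lines):
        # PMD_Mix-style ordering: keep the SBF's own priority bins, but
        # break ties within a bin by the deterministic naturalness score
        # (e.g. the $gram+wType entropy) rather than at random.
        return sorted(warned_lines,
                      key=lambda line: (line['priority'], line['score']),
                      reverse=True)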

Overall, the combined model produced the highest mean performance in both categories. It significantly outperformed the two SBFs in all cases (p < 0.01) and performed similarly to the NBF model (significantly better on Lucene and Qpid, significantly worse on Derby (p < 0.05), all with small effect). These results extended to the other evaluation methods, using full credit and/or removing the threshold on maximum bug-fix size. In all cases, the mixed model was either significantly better than or no worse than any of the other approaches when averaged over all the studied releases.

We further evaluated ranking all warnings produced by the SBF by entropy (ignoring the SBF priorities) and found comparable but slightly weaker results. These results suggest that both NBF and SBF contribute valuable information to the ordering of bug-prone lines, and that their combination yields superior results.

Result 5: Ordering SBF warnings by priority and entropy significantly improves SBF performance.

5. THREATS TO VALIDITY

Internal Validity. A number of threats to internal validity arise from the experimental setup. First, our identification of buggy lines could be wrong, as we used a simple keyword-based search to identify buggy commits (see Section 3.2). To minimize this threat, we manually evaluated our bug classification tool and reported an overall accuracy of 96%.

Another source of false negatives is the presence of (yet) undetected bugs that linger in the code base. Also, developers may “tangle” unrelated changes into one commit [20, 12]. However, given the high significance of the entropy difference between buggy and non-buggy lines, we consider it unlikely that these threats could invalidate our overall results. Furthermore, NBF's positive performance on the (higher quality, JIRA-based) Phase-II data set confirms our expectations regarding the validity of these results.

A threat regarding RQ2 is the identification of ‘fixed’ lines, which replaced ‘buggy’ lines during a bug-fix commit. The comparisons between these categories could be skewed, especially when bug-fix commits replace buggy lines with a larger number of fixed lines. In fact, small bug-fixes do indeed add more lines than they delete on average (the reverse holds for fixes spanning over 50 lines). However, a majority of bugs of any size were fixed with at most as many lines. In particular, more than two-thirds of one- and two-line bugs, which demonstrated the greatest decrease in entropy, were fixed with one and two lines respectively. These findings minimize the above threat. Other, non-bug-fixing changes (e.g., the introduction of a clone) may also show a drop in entropy w.r.t. the previous version.

Our comparison of SBF and NBF assumes that indicated lines are equally informative to the inspector, which is not entirely fair; NBF just marks a line as “surprising”, whereas SBF provides specific warnings. On the other hand, we award credit to SBF whether or not the bug has anything to do with the warning on the same lines; indeed, earlier work [52] suggests that warnings are not often related to the buggy lines which they overlap. So this may not be a major threat to our RQ4 results.

Finally, the use of AUCEC to evaluate defect prediction has been criticized for ignoring the cost of false negatives [56]; the development of better, widely accepted measures remains a topic of future research.

External Validity. External validity concerns the generalizability of our results. To minimize this threat, we use systems from different domains, drawn from GitHub and Apache, with substantial variation in age, size, and ratio of bugs to overall lines (see Table 1). We also confirmed our overall result by studying the entire evolutionary history of all the projects, analyzed at 6-month snapshot intervals.

Next, does this approach generalize to other languages? There is nothing language-specific about the implementation of the n-gram and $gram models (the $gram+wType model, however, does require parsing, which depends on the language grammar). Prior research showed that these models work well to capture regularities in Java, C, and Python [22, 53]. To investigate NBF's performance on other languages, we performed a quick sanity check on three C/C++ projects (Libuv, Bitcoin, and Libgit), studying their evolution from November 2008 to January 2014. The results are consistent with those presented in Table 3 and Figure 2: buggy lines are between 0.87 and 1.16 bits more entropic than non-buggy lines at bug-fix threshold 15 (slightly larger than in Table 3). Also, the entropy of buggy lines at this threshold drops by nearly one bit. These findings strongly suggest that our results generalize to C/C++; we are investigating the applicability to other languages.
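To illustrate why the token-level model is not tied to a particular language, here is a minimal sketch of per-line entropy under a plain n-gram model with add-one smoothing. It is a simplification of the cache-based $gram model, with illustrative function names, but it operates on any lexer's token stream, whether Java, C, C++, or Python.

    import math
    from collections import defaultdict

    def train_ngram(token_lines, n=3):
        # Count n-grams over token streams; nothing below depends on the
        # source language, only on a lexer producing token sequences.
        context_counts, ngram_counts, vocab = defaultdict(int), defaultdict(int), set()
        for tokens in token_lines:
            padded = ['<s>'] * (n - 1) + tokens
            vocab.update(tokens)
            for i in range(n - 1, len(padded)):
                ctx, tok = tuple(padded[i - n + 1:i]), padded[i]
                context_counts[ctx] += 1
                ngram_counts[ctx + (tok,)] += 1
        return context_counts, ngram_counts, len(vocab)

    def line_entropy(tokens, model, n=3):
        # Average per-token entropy (in bits) of one line under the model,
        # using add-one smoothing for unseen n-grams.
        context_counts, ngram_counts, vocab_size = model
        padded = ['<s>'] * (n - 1) + tokens
        bits = 0.0
        for i in range(n - 1, len(padded)):
            ctx, tok = tuple(padded[i - n + 1:i]), padded[i]
            p = (ngram_counts[ctx + (tok,)] + 1.0) / (context_counts[ctx] + vocab_size)
            bits -= math.log2(p)
        return bits / max(1, len(tokens))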

Finally, even though we showed good empirical results with our approach, this does not assure that it actually helps developers in their bug-finding efforts, as was shown in a similar scenario (with automated debugging techniques) by Parnin et al. [40]. To tackle this, the natural next step would be a controlled experiment with developers using our approach.

6. RELATED WORK

Statistical Defect Prediction. DP aims to predict defects yet to be detected by learning from historical data of reported bugs in issue databases (e.g., JIRA). This is a very active area (see [8] for a survey), with even a dedicated series of conferences (i.e., PROMISE [1]). D'Ambros et al. survey and provide a direct comparison of a number of representative DP approaches (including those using process metrics, such as previous changes [34, 17] and defects [27], and product metrics, such as code complexity metrics [6]). While earlier work evaluated models using IR measures such as precision, recall, and F-score, more recently non-parametric methods such as AUC and AUCEC have gained in popularity. D'Ambros et al. follow this trend and conduct an evaluation similar to ours.

A DP may work beyond file granularity; Giger et al. presented a DP at the level of individual methods [16]. We are the first to predict defects at the line level using only statistical models.

Static Bug Finders. SBF can work at line granularity, as opposed to DP, hence the comparison with our approach. The work closely related to ours is by Rahman et al. [43]; by comparing DP performance with SBF, they reported that popular SBF tools like FindBugs and PMD do not necessarily perform better than DP. We also find that NBF can be used to rank SBF warnings, but we are not the first to tackle this challenge. Kremenek et al. use z-ranking and a cluster-based approach to prioritizing warnings based on the warnings' previous success rate [29, 28]. Kim and Ernst mine information from code change history to estimate the importance of warnings [26]. Our approach ranks warnings based on properties of the source code rather than the output of the SBF or whether and how warnings have been fixed in history. Future research could evaluate how our approach can complement the work above. Ruthruff et al. propose a filtering approach to detecting accurate and actionable SBF warnings [49]. They use the priority of warnings, as defined by the SBF, the type of error detected, and features of the affected file (e.g., size and warning depth) to do the filtering. Our approach ranks warnings on a different aspect of the source code than those they consider and could be used to complement their model. Finally, Heckman et al. proposed Faultbench, a benchmark for comparison and evaluation of static analysis alert prioritization and classification techniques [18], and used it to validate the Aware [19] tool for prioritizing static analysis tool warnings. Since the results of our approach are promising, further research could investigate our approach against this additional benchmark.

The field of SBF has advanced rapidly, with many developments; researchers identify new categories of defects, and seek to invent methods to find these defects efficiently, either heuristically or through well-defined algorithms and abstractions. Since neither method is perfect, the actual effectiveness in practice is an empirical question. A comprehensive discussion of related work regarding SBF and their evaluation can be found in Rahman et al. [43].

Inferring rules and specifications. Statistical language models are employed to capture the repetitive properties of languages, in our case programming languages, thus inferring the style or even some latent specification of how the language is supposed to be used in a specific project and context. As such, our work is related to previous research that tries to automatically infer specifications and use them to identify outliers as probable defects. Kremenek et al. present a framework based on a probabilistic model (namely, a factor graph [31]) for automatically inferring specifications from programs and use it to find missing and incorrect properties in a specification used by a commercial static bug-finding tool [30]. In our case, we allow errors to be localized without language-specific tuning, without defining the set of annotations to infer, and without modeling domain-specific knowledge. Wasylkowski et al. mine usage models from code to detect anomalies and violations in methods invoked on objects, and demonstrated that these can be used to detect software defects [54]. Similarly, Thummalapenta and Xie develop Alattin, an approach to mine patterns from APIs and detect violations [51]. In contrast, our model is less specialized and tries to highlight unexpected patterns at the token level, without focusing on the specific case of method invocations.

7. CONCLUSION

The predictable nature (“naturalness”) of code suggests that code that is improbable (“unnatural”) might be wrong. We investigate this intuition by using entropy, as measured by statistical language models, as a way of measuring unnaturalness.

We find that unnatural code is more likely to be implicated in a bug-fix commit. We also find that buggy code tends to become more natural when repaired. We then turn to applying entropy scores to defect prediction and find that, when adjusted for syntactic variances in entropy and defect occurrence, our model is about as cost-effective as the commonly used static bug finders PMD and FindBugs. Applying the (deterministic) ordering of entropy scores to the warnings produced by these static bug finders yields the most cost-effective method. These findings suggest that entropy scores are a useful adjunct to defect prediction methods. The findings also suggest that certain kinds of automated search-based bug-repair methods might do well to have the search in some way influenced by language models.

In the near future, we plan to build extensions into PMD, FindBugs, and other static bug finders that order warnings based on our $gram+wType regime. Further ahead, we plan to study other applications of these approaches, including dynamic fault-isolation methods and automated bug-patching tools.

Acknowledgment. This material is based upon work supported by the National Science Foundation under Grant No. 1414172.


8. REFERENCES

[1] PROMISE '14: Proceedings of the 10th International Conference on Predictive Models in Software Engineering, New York, NY, USA, 2014. ACM.
[2] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton. Learning natural coding conventions. In Proceedings of the 22nd International Symposium on the Foundations of Software Engineering (FSE '14), 2014.
[3] M. Allamanis and C. Sutton. Mining idioms from source code. In SIGSOFT FSE, pages 472–483, 2014.
[4] E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. JSS, 83(1):2–17, 2010.
[5] N. Ayewah, D. Hovemeyer, J. D. Morgenthaler, J. Penix, and W. Pugh. Using static analysis to find bugs. IEEE Software, 25(5):22–29, 2008.
[6] V. R. Basili, L. C. Briand, and W. L. Melo. A validation of object-oriented design metrics as quality indicators. IEEE Trans. Software Eng., 22(10):751–761, 1996.
[7] J. C. Campbell, A. Hindle, and J. N. Amaral. Syntax errors just aren't natural: Improving error reporting with language models. In MSR, pages 252–261, 2014.
[8] C. Catal and B. Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346–7354, 2009.
[9] B. Chelf, D. Engler, and S. Hallem. How to write system-specific, static checkers in Metal. In Proceedings of the 2002 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE '02, pages 51–60, New York, NY, USA, 2002. ACM.
[10] T. Copeland. PMD Applied. Centennial Books, San Francisco, 2005.
[11] M. D'Ambros, M. Lanza, and R. Robbes. An extensive comparison of bug prediction approaches. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, pages 31–41. IEEE, 2010.
[12] M. Dias, A. Bacchelli, G. Gousios, D. Cassou, and S. Ducasse. Untangling fine-grained code changes. In Proceedings of the 22nd International Conference on Software Analysis, Evolution, and Reengineering, 2015.
[13] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: A general approach to inferring errors in systems code. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, SOSP '01, pages 57–72, New York, NY, USA, 2001. ACM.
[14] FindBugs. http://findbugs.sourceforge.net/. Accessed 2015/03/10.
[15] C. Franks, Z. Tu, P. Devanbu, and V. Hellendoorn. CACHECA: A cache language model based code suggestion tool. In ICSE Demonstration Track, 2015.
[16] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall. Method-level bug prediction. In Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '12, pages 171–180, 2012.
[17] A. E. Hassan. Predicting faults using the complexity of code changes. In Proceedings of ICSE 2009, pages 78–88, 2009.
[18] S. Heckman and L. Williams. On establishing a benchmark for evaluating static analysis alert prioritization and classification techniques. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pages 41–50. ACM, 2008.
[19] S. S. Heckman. Adaptively ranking alerts generated from automated static analysis. Crossroads, 14(1):7, 2007.
[20] K. Herzig and A. Zeller. Untangling changes. Unpublished manuscript, September 2011.
[21] K. Herzig and A. Zeller. The impact of tangled code changes. In Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on, pages 121–130. IEEE, 2013.
[22] A. Hindle, E. Barr, M. Gabel, Z. Su, and P. Devanbu. On the naturalness of software. In ICSE, pages 837–847, 2012.
[23] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge. Why don't software developers use static analysis tools to find bugs? In Software Engineering (ICSE), 2013 35th International Conference on, pages 672–681. IEEE, 2013.
[24] S. Karaivanov, V. Raychev, and M. Vechev. Phrase-based statistical translation of programming languages. In SPLASH, Onward!, pages 173–184, 2014.
[25] S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400–401, 1987.
[26] S. Kim and M. D. Ernst. Which warnings should I fix first? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 45–54. ACM, 2007.
[27] S. Kim, T. Zimmermann, J. Whitehead, and A. Zeller. Predicting faults from cached history. In Proceedings of ICSE 2007, pages 489–498. IEEE CS, 2007.
[28] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler. Correlation exploitation in error ranking. In ACM SIGSOFT Software Engineering Notes, volume 29, pages 83–93. ACM, 2004.
[29] T. Kremenek and D. Engler. Z-ranking: Using statistical analysis to counter the impact of static analysis approximations. In Static Analysis, pages 295–315. Springer, 2003.
[30] T. Kremenek, P. Twohey, G. Back, A. Ng, and D. Engler. From uncertainty to belief: Inferring the specification within. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 161–176. USENIX Association, 2006.
[31] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
[32] Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-13, pages 306–315, New York, NY, USA, 2005. ACM.
[33] A. Mockus and L. G. Votta. Identifying reasons for software changes using historic databases. In ICSM, pages 120–130, 2000.
[34] R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of ICSE 2008, pages 181–190, 2008.
[35] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In Software Engineering, 2005. ICSE 2005. Proceedings. 27th International Conference on, pages 284–292. IEEE, 2005.
[36] N. Nagappan, T. Ball, and A. Zeller. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering, pages 452–461. ACM, 2006.


[37] A. T. Nguyen, H. A. Nguyen, T. T. Nguyen, and T. N. Nguyen. Statistical learning of API mappings for language migration. In ICSE Companion, pages 618–619, 2014.
[38] A. T. Nguyen, T. T. Nguyen, and T. N. Nguyen. Lexical statistical machine translation for language migration. In SIGSOFT FSE, pages 651–654, 2013.
[39] A. T. Nguyen, T. T. Nguyen, and T. N. Nguyen. Migrating code with statistical machine translation. In ICSE Companion, pages 544–547, 2014.
[40] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, pages 199–209. ACM, 2011.
[41] D. Posnett, V. Filkov, and P. Devanbu. Ecological inference in empirical software engineering. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 362–371. IEEE Computer Society, 2011.
[42] F. Rahman and P. Devanbu. How, and why, process metrics are better. In Proceedings of ICSE, pages 432–441, 2013.
[43] F. Rahman, S. Khatri, E. T. Barr, and P. T. Devanbu. Comparing static bug finders and statistical prediction. In ICSE, pages 424–434, 2014.
[44] F. Rahman, D. Posnett, and P. Devanbu. Recalling the imprecision of cross-project defect prediction. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, page 61. ACM, 2012.
[45] B. Ray, M. Kim, S. Person, and N. Rungta. Detecting and characterizing semantic inconsistencies in ported code. In ASE, pages 367–377, 2013.
[46] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A large scale study of programming languages and code quality in GitHub. In SIGSOFT FSE, 2014.
[47] V. Raychev, M. Vechev, and A. Krause. Predicting program properties from “big code”. In POPL, pages 111–124, 2015.
[48] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In PLDI, pages 419–428, 2014.
[49] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel. Predicting accurate and actionable static analysis warnings: An experimental approach. In Proceedings of the 30th International Conference on Software Engineering, pages 341–350. ACM, 2008.
[50] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In MSR, pages 1–5, 2005.
[51] S. Thummalapenta and T. Xie. Alattin: Mining alternative patterns for detecting neglected conditions. In Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, pages 283–294. IEEE Computer Society, 2009.
[52] F. Thung, D. Lo, L. Jiang, F. Rahman, P. T. Devanbu, et al. To what extent could we detect field defects? An empirical study of false negatives in static bug finding tools. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 50–59. ACM, 2012.
[53] Z. Tu, Z. Su, and P. Devanbu. On the localness of software. In SIGSOFT FSE, pages 269–280, 2014.
[54] A. Wasylkowski, A. Zeller, and C. Lindig. Detecting object usage anomalies. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 35–44. ACM, 2007.
[55] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[56] H. Zhang and S. Cheung. A cost-effectiveness criterion for applying software defect prediction models. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 643–646. ACM, 2013.
[57] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In Predictor Models in Software Engineering (PROMISE '07: ICSE Workshops 2007), pages 9–9. IEEE, 2007.