
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING

Comparing the Defect Reduction Benefits of Code Inspection and Test-Driven Development

Jerod W. Wilkerson, Jay F. Nunamaker, Jr., and Rick Mercer

Abstract: This study is a quasi-experiment comparing the software defect rates and implementation costs of two methods of software defect reduction: code inspection and test-driven development. We divided participants, consisting of junior and senior computer science students at a large Southwestern university, into four groups using a two-by-two, between-subjects, factorial design and asked them to complete the same programming assignment using either test-driven development, code inspection, both, or neither. We compared resulting defect counts and implementation costs across groups. We found that code inspection is more effective than test-driven development at reducing defects, but that code inspection is also more expensive. We also found that test-driven development was no more effective at reducing defects than traditional programming methods.

Index Terms: Agile programming, code inspections and walkthroughs, reliability, test-driven development, testing strategies, empirical study.


    1 INTRODUCTION

SOFTWARE development organizations face the difficult problem of producing high-quality, low-defect software on-time and on-budget. A U.S. government study [1] estimated that software defects are costing the U.S. economy approximately $59.5 billion per year. Several high-profile software failures have helped to focus attention on this problem, including the failure of the Los Angeles airport's air traffic control system in 2004, the Northeast power blackout in 2003 (with an estimated cost of $6 to $10 billion), and two failed NASA Mars missions in 1999 and 2000 (with a combined cost of $320 million).

Various approaches to addressing this epidemic of budget and schedule overruns and software defects have been proposed in the academic literature and applied in software development practice. Two of these approaches are software inspection and test-driven development (TDD). Both have advantages and disadvantages, and both are capable of reducing software defects [2], [3], [4], [5].

Software inspection has been the focus of more than 400 academic research papers since its introduction by Michael Fagan in 1976 [6]. Software inspection is a formal method of inspecting software artifacts to identify defects. This method has been in use for more than 30 years and has been found to be very effective at reducing software defects. Fagan reported software defect reduction rates between 66 and 82% [6].

J. W. Wilkerson is with the Sam and Irene Black School of Business, Penn State Erie, 281 Burke Center, 5101 Jordan Road, Erie, PA 16563. E-mail: [email protected].

J. F. Nunamaker is with the Department of Management Information Systems, University of Arizona, 1130 E. Helen St., Tucson, AZ 85721. E-mail: [email protected].

R. Mercer is with the Department of Computer Science, University of Arizona, 1040 E. 4th St., Tucson, AZ 85721. E-mail: [email protected].

Manuscript received January 11, 2010; revised October 29, 2010.

While any software development artifact may be inspected, most of the software inspection literature deals with the inspection of program code. In this study, we limit inspections to program code, and we refer to these inspections as code inspections.

TDD is a relatively new software development practice in which unit tests are written before program code. New tests are written before features are added or changed, and new features or changes are generally considered complete only when the new tests and any previously written tests succeed. TDD usually involves the use of unit-testing frameworks (such as JUnit1 for Java development) to support the development of automated unit tests, and to allow tests to be executed frequently as new features or modifications are introduced. Although results have been mixed, some research has shown that TDD can reduce software defects by between 18 and 50% [2], [3], with one study showing a reduction of up to 91% [7], with the added benefit of eliminating defects at an earlier stage of development than code inspection.
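The following minimal sketch illustrates the test-first cycle described above using JUnit 4. It is illustrative only: the SpamRule class, its behavior, and the test names are hypothetical examples, not code from this study.

// Step 1: write a test that captures the next small piece of behavior.
// It fails until SpamRule is implemented. Step 2: write just enough code
// to make this test (and all previously written tests) pass.
import org.junit.Test;
import static org.junit.Assert.*;

public class SpamRuleTest {

    @Test
    public void subjectContainingBlockedWordIsFlagged() {
        SpamRule rule = new SpamRule("winner");
        assertTrue(rule.matches("You are a winner!"));
        assertFalse(rule.matches("Meeting agenda for Monday"));
    }
}

// Production code written after (and driven by) the test above.
class SpamRule {
    private final String blockedWord;

    SpamRule(String blockedWord) {
        this.blockedWord = blockedWord.toLowerCase();
    }

    boolean matches(String subject) {
        return subject.toLowerCase().contains(blockedWord);
    }
}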

TDD is normally described as a method of software design, and as such, has benefits that go beyond testing and defect reduction. However, in this study we limit our analysis to a comparison of the defect reduction benefits of the methods and do not consider other benefits of either approach.

Existing research does not sufficiently assess whether TDD is a useful supplement or a viable

    1. http://www.junit.org

Digital Object Identifier 10.1109/TSE.2011.46. 0098-5589/11/$26.00 © 2011 IEEE.


alternative to code inspection for purposes of reducing software defects. Previous research has compared the defect reduction benefits of code inspection and software testing, much of which is summarized by Runeson, Andersson, Thelin, Andrews, and Berling [8]. However, the current high adoption rates of TDD indicate the timeliness and value of specific comparisons of code inspection and TDD. The focus of this study is a comparison of the defect rates and relative costs of these two methods of software defect reduction. In the study, we seek to answer the following two main questions:

1) Which of these two methods is more effective at reducing software defects, and

2) What are the relative costs of these software defect reduction methods?

The software inspection literature typically uses the term defect to mean fault. This literature contains a well-established practice of categorizing defects as either major or minor [9], [10]. In this paper, we compare the effectiveness of code inspection and TDD in removing major defects.

The remainder of this paper is organized as follows: the next section discusses the relevant research related to this study. Section 3 describes the purpose and research questions. Section 4 describes the experiment and the procedures followed. Sections 5 and 6 describe the results and implications of the findings, and section 7 concludes with a discussion of the study's contributions and limitations.

    2 RELATED RESEARCH

Since software inspection and TDD have not previously been compared, we have divided the discussion of related work into two main subsections: Software Inspection, and Test-Driven Development. These subsections provide a discussion of the prior research on each method that is relevant to the current study's comparison of methods. The Software Inspection section also includes a summary of findings from prior comparisons of code inspection and traditional testing methods.

    2.1 Software Inspection

Michael Fagan introduced the concept of formal software inspection in 1976 while working at IBM. His original techniques are still in widespread use and are commonly called Fagan Inspections. Fagan Inspections can be used to inspect the software artifacts produced by all phases of a software development project. Inspection teams normally consist of three to five participants, including a moderator, the author of the artifact to be inspected, and one to three inspectors. The moderator may also participate in the inspection of the artifact. Fagan Inspections consist of the following phases: Overview (may be omitted for code inspections), Preparation, Inspection, Rework, and Follow-Up, as shown in fig. 1. The gray arrow between Follow-Up and Inspection indicates that a re-inspection is optional, at the moderator's discretion.

In the Overview phase, the author provides an overview of the area of the system being addressed by the inspection, followed by a detailed explanation of the artifact(s) to be inspected. Copies of the artifact(s) to be inspected and copies of other materials (such as requirements documents and design specifications) are distributed to inspection participants. During the Preparation phase, participants independently study the materials received during the Overview phase in preparation for the inspection meeting. During the Inspection phase, a reader explains the artifact being inspected, covering each piece of logic and every branch of code at least once. During the reading process, inspectors identify errors, which the moderator records. The author corrects the errors during the Rework phase and all corrections are verified during the Follow-Up phase. The Follow-Up phase may be either a re-inspection of the artifact or a verification performed only by the moderator.

Fagan [6] reported defect yield rates between 66 and 82%, where the total number of defects in the product prior to inspection (t) is

t = i + a + u    (1)

where i is the number of defects found by inspection, a is the number of defects found during acceptance testing, and u is the number of defects found during the first six months of use of the product. The defect detection (yield) rate (y) is

y = (i / t) × 100.    (2)
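As an illustration with hypothetical numbers (not taken from the paper): if inspection finds i = 70 defects, acceptance testing finds a = 20, and the first six months of use reveal u = 10, then t = 70 + 20 + 10 = 100 and y = (70 / 100) × 100 = 70%, a yield within the 66 to 82% range Fagan reported.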

Two papers, [11], [12], summarize much of the existing software inspection literature, including variations in how software inspections are performed. Software inspection variations differ mainly in the reading technique used in the Inspection phase of the review. Reading techniques include Ad Hoc Reading [13], Checklist-Based Reading [6], [9], [10], [14], Reading by Stepwise Refinement [15], Usage-Based Reading [16], [17], [18], and Scenario (or Perspective)-Based Reading [19], [20], [21], [22]. Several comparison studies of reading techniques have also been performed [23], [24], [25], [26], [27].

Porter and Votta [25] and Porter, Votta, and Basili [26] performed experiments comparing Ad Hoc Reading, Checklist-Based Reading, and Scenario-Based Reading for software requirements inspections using both student and professional inspectors. Inspectors using Ad Hoc Reading were not given specific guidelines to direct their search for defects. Inspectors using Checklist-Based Reading were given a checklist to guide their search for defects. Each Scenario-Based Reading inspector was given one of the following


    Fig. 1. Software Inspection Process

primary responsibilities to fulfill during the search for defects: 1) search for data type inconsistencies, 2) search for incorrect functionality, and 3) search for ambiguities or missing functionality. Porter et al. found that Scenario-Based Reading was the most effective reading method, producing improvements over both Ad Hoc Reading and Checklist-Based Reading from 21 to 38% for professional inspectors and from 35 to 51% for student inspectors. They attributed this improvement to efficiency gains resulting from a reduction in overlap between the types of defects for which each inspector was searching.

Another important advance in the state of software inspections was the application of group support systems (GSS) to the inspection process. Years of prior research have shown that the use of GSS can improve meeting efficiency. Efficiency improvements have been attributed to reductions in dominance of the meeting by one or a few participants, reductions in distractions associated with traditional meetings, improved group memory, and the ability to support distributed collaborative work [28], [29]. Johnson [30] notes that the application of GSS to software inspection can overcome obstacles encountered with paper-based inspections, thereby improving the efficiency of the inspection process. Van Genuchten, Dijk, Scholten, and Vogel [31] also found that the benefits of GSS can be realized in code inspection meetings. Other studies [32], [33], [34], [35], [36] have also found improvements in the software inspection process as a result of GSS.

Several studies have compared code inspection with more traditional forms of testing. Runeson et al. [8] summarize nine studies comparing the effectiveness and efficiency of code inspection and software testing in finding code defects. They concluded that the data doesn't support a scientific conclusion as to which technique is superior, but from a practical perspective it seems that testing is more effective than code inspections. Boehm [37] analyzed four studies comparing code inspection and unit testing, and found that code inspection is more effective and efficient at identifying up to 80% of code defects.

    2.2 Test-Driven Development

TDD is a software development practice that involves the writing of automated unit tests before program code. The subsequent coding is deemed complete only when the new tests and all previously written tests succeed. TDD, as defined by Beck [38], consists of five steps. These steps, which are illustrated in fig. 2, are completed iteratively until the software is complete. Successful execution of all unit tests is verified after both the Write Code and Refactor steps.

Prior studies have evaluated the effectiveness of TDD, and have obtained varied defect reduction results. Müller and Hagner [39] compared test-first programming to traditional programming in an experiment involving 19 university students. The researchers concluded that test-first programming did not increase program reliability or accelerate the development effort.

In a pair of studies by Maximilien and Williams [3] and George and Williams [2], the researchers found that TDD resulted in higher code quality when compared to traditional programming. Maximilien and Williams performed a case study at IBM on a software development team that developed a Java-based point-of-sale system. The team adopted TDD at the beginning of their project and produced 50% fewer defects than a more experienced IBM team that had previously developed a similar system using traditional development methods. Although the case study lacked the experimental control necessary to establish a causal relationship, the development team attributed their success to the use of the TDD approach. In another set of four case studies (one performed at IBM and three at Microsoft) the use of TDD resulted in between 39 and 91% fewer defects [7].

George and Williams [2] conducted a set of controlled experiments with 24 professional pair programmers. One group of pair programmers used a TDD approach while the other group used a traditional waterfall approach. The researchers found that the TDD group passed 18% more black-box tests and spent 16% more time developing the code than the


    Fig. 2. Test-Driven Development Process

traditional group. They also reported that the pairs who used a traditional waterfall approach often did not write the required automated test cases at the end of the development cycle.

Erdogmus, Morisio, and Torchiano [40] conducted an experiment designed to test a theory postulated in previous studies [2], [3]: that the cause of higher quality software associated with TDD is an increased number of automated unit tests written by programmers using TDD. They found that TDD does result in more tests, and that more tests result in higher productivity, but not higher quality.

In summary, some prior research has found TDD to be an effective defect reduction method [2], [3], [7], and some has not [39], [40]. Section 6 contains a potential explanation for this variability.

    2.3 Summary

Code inspection is almost exclusively a software defect reduction method, whereas TDD has several purported benefits, only one of which is software defect reduction. We are primarily interested in comparing software defect reduction methods, so our focus is on the software defect reduction capabilities of the two methods and how these capabilities compare on defect reduction effectiveness and cost. One of the main differences between code inspection and TDD is the point in the software development process in which defects are identified and eliminated. Code inspection identifies defects at the end of a development cycle, allowing programmers to fix defects previously introduced, whereas TDD identifies and removes defects during the development process at the point in the process where the defects are introduced. Earlier elimination of defects is a benefit of TDD that can have significant cost savings [37]. It should be noted, however, that software inspection can be performed on analysis and design documents in addition to code, thereby moving the benefits of inspection to an earlier stage of software development.

    3 PURPOSE AND RESEARCH QUESTIONS

The purpose of this study is to compare the defect rates and relative costs of code inspection and TDD. Specifically, we seek to answer the following research questions:

1) Which software defect reduction method is more effective at reducing software defects?

2) Are there interaction effects associated with the combined use of these methods?

3) What are the relative costs of these software defect reduction methods?

The previously cited literature indicates that both methods can be effective at reducing software defects. However, TDD is a relatively new method, whereas code inspection has been refined through more than 30 years of research. Prior research has clearly defined the key factors involved with successfully implementing code inspection, such as optimal software review rates [9], [10], [14], [41] and inspector training requirements [10], whereas TDD is not as clearly defined due to its lack of maturity.

Currently, the defect reduction results for TDD have been mixed, with most reported reductions being below 50% [3]. Defect reduction from code inspection has consistently been reported at above 50% since Fagan's introduction of the method in 1976. This leads to a hypothesis that code inspection is more effective than TDD at reducing defects.

H1: Code inspection is more effective than TDD at reducing software defects.

Code inspection and TDD have fundamental differences that likely result in each method finding defects that the other method misses. With TDD, the same programmer who writes the unit tests also writes the code. Therefore, any misconceptions held by the programmer about the requirements of the system will result in the programmer writing incorrect tests and incorrect code to pass the tests. These requirement misconception defects are less likely in code that undergoes inspection because it is unlikely that all of the inspectors will have the same misconceptions


about the requirements that the programmer has, especially if the requirements document has also been inspected for defects.

Although susceptible to requirement misconception defects, TDD encourages the writing of a large number of unit tests, some of which may test conditions inspectors overlook during the inspection process. This effect would likely be more noticeable when using inexperienced inspectors, but could occur with any inspectors. These differences between the methods indicate that each method will find defects that the other method misses. This leads to a hypothesis that the combined use of the methods is more effective than either method alone.

H2: The combined use of code inspection and TDD is more effective than either method alone.

The existing literature does not support a hypothesis as to which method has the lower implementation cost. However, the nature of the cost differs between the two methods. The cost from TDD results from programmers spending additional time writing tests. The cost from code inspection results from both the time spent by the inspectors, and the time spent by programmers correcting identified defects. These differences lead to a hypothesis that the methods differ in implementation cost, measured as the cost of developing software using that method of defect reduction.

H3: Code inspection and TDD differ in implementation cost.

    4 METHOD

We evaluated the research questions in a quasi-experiment using a two-by-two, between-subjects, factorial design. Participants in each research group independently completed a programming assignment according to the same specification using either inspection, TDD, both (Inspection+TDD), or neither. The two independent variables (factors) in the study are whether Inspection was used, and whether TDD was used as part of the development method. The Inspection and Inspection+TDD groups constituted the Inspection factor and the TDD and Inspection+TDD groups constituted the TDD factor. The group that used neither TDD nor Inspection was the control group.

The programming assignment involved the creation of part of a spam filter using the Java programming language. We gave the participants detailed specifications and some pre-written code and instructed them to use the Java API to read in an XML configuration file containing the rules, allowed-list, and blocked-list for a spam filter. This information was to be represented with a set of Java objects. We required participants to maintain a specific API in the code to enable the use of pre-written JUnit tests as part of the defect counting process described in section 4.3.1.

The final projects submitted by the participants contained an average of 554 non-commentary source statements (NCSS), including 261 NCSS that we provided to participants in a starter project. Table 1 provides descriptive statistics of the number of NCSS submitted, summarized by research group.

TABLE 1
Descriptive Statistics of Non-Commentary Source Statements Submitted by Group

Group            Mean     Std. Dev.   Min   Max
Control          581.00   59.42       491   689
TDD              519.67   54.78       463   619
Inspection       545.67   70.03       447   634
Inspection+TDD   580.14   87.31       451   735
Total            554.45   69.79       447   735
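As a concrete illustration of the kind of task described above, the following sketch reads an XML configuration into plain Java objects using the standard Java DOM API. The element names and the SpamFilterConfig structure are assumptions made for illustration; they are not the study's specification or its required API.

// Illustrative sketch only: one way to read an XML spam-filter configuration
// into Java objects. Element names ("rule", "allowed", "blocked") and the
// SpamFilterConfig fields are hypothetical.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class SpamFilterConfig {
    final List<String> rules = new ArrayList<>();
    final List<String> allowedList = new ArrayList<>();
    final List<String> blockedList = new ArrayList<>();
}

class ConfigReader {
    static SpamFilterConfig read(File xmlFile) throws Exception {
        // Parse the file with the standard Java API (DOM).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        SpamFilterConfig config = new SpamFilterConfig();
        copyText(doc.getElementsByTagName("rule"), config.rules);
        copyText(doc.getElementsByTagName("allowed"), config.allowedList);
        copyText(doc.getElementsByTagName("blocked"), config.blockedList);
        return config;
    }

    // Copy the text content of each matching element into the target list.
    private static void copyText(NodeList nodes, List<String> target) {
        for (int i = 0; i < nodes.getLength(); i++) {
            target.add(nodes.item(i).getTextContent().trim());
        }
    }
}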

    4.1 Participants

Participants were undergraduate (mostly junior or senior) computer science students from an object-oriented programming and design course at a large Southwestern U.S. university. We invited all students from the class to participate in the study. Participation included taking a pre-test to assess their Java programming and object-oriented design knowledge, and permitting the use of data generated from their completion of a class assignment. All 58 students in the class agreed to participate, but data was only collected for the 40 with the highest pre-test scores. We entered the students who agreed to participate into a $100 cash drawing, but we did not compensate them in any other way. Because the programming assignment was required of all students in the class (whether they agreed to be research participants or not), and the assignment was a graded part of the class we recruited them from, we believe that they had a reasonably high motivation to perform well on the assignment.

Each of the 40 participants was objectively assigned to one of the four research groups, with 10 participants assigned to each group, by a genetic algorithm that attempted to minimize the difference between the groups in both pre-test average scores and standard deviation. The algorithm was very successful in producing equalized groups without researcher intervention in the grouping; however, several participants had to be excluded from the study for reasons described in table 2. Table 3 shows the effect of these exclusions on group size. These exclusions resulted in unequal research groups, so we used pre-test score as a control variable during data analysis.
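The paper does not describe the genetic algorithm's internals, but the objective it minimizes can be pictured as follows. This hypothetical sketch scores a candidate assignment of pre-test scores to groups; a search procedure (genetic or otherwise) would keep the assignment with the lowest score.

// Illustrative sketch only: a hypothetical objective for balancing groups.
// Lower is better; zero means the groups have identical means and standard
// deviations of pre-test scores.
class GroupBalance {

    static double imbalance(double[][] groupScores) {
        int g = groupScores.length;
        double[] means = new double[g];
        double[] stdDevs = new double[g];
        for (int i = 0; i < g; i++) {
            means[i] = mean(groupScores[i]);
            stdDevs[i] = stdDev(groupScores[i], means[i]);
        }
        return spread(means) + spread(stdDevs);
    }

    private static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    private static double stdDev(double[] xs, double mean) {
        double ss = 0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // Range (max minus min) across groups.
    private static double spread(double[] values) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        return max - min;
    }
}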


TABLE 2
Reasons for Participant Exclusion

Reason for Exclusion                                                               Number Excluded
Dropped the class for which the programming assignment was a requirement.         2
Failed to complete the programming assignment.                                     4
Cheated on the assignment (submitted the same solution as another participant).    2
Submitted code that was not testable.                                              3

TABLE 3
Participants Excluded by Group

Assigned Group   Number Excluded   Number Remaining
Control          3                 7
TDD              1                 9
Inspection       4                 6
Both             3                 7
Total            11                29

    4.2 Experimental Procedures

Before the start of the experiment, all participants received training on TDD and the use of JUnit, through in-class lectures, a reading assignment, and a graded programming assignment. At the start of the experiment, we gave participants a detailed specification and instructed them to individually write Java code to satisfy the specification requirements. We gave participants two weeks to complete the project in two separate one-week iterations.

To maintain experimental control during this out-of-lab, multiple-week experiment, we analyzed the resulting code using MOSS2, an algorithm for detecting software plagiarism. We excluded two participants for suspected plagiarism as shown in table 2.

    4.2.1 Software Inspection

All participants in the Inspection and Inspection+TDD groups had their code inspected by a single team of three inspectors. We then gave these participants one week to resolve the major defects found by inspection. We refer to these inspections as Method Inspections. Inspectors were students but were not participants in the study. Inspections were performed according to Fagan's method [6], [9] with three exceptions. First, we did not invite authors to participate in the inspection process. The inspection process took two weeks to complete because of the large number of inspections performed, and inviting authors to the inspection meetings would have given authors whose code was inspected early in the process extra time to correct their defects. This would have been unfair to the students whose code was inspected later, since the assignment was graded and included as part of their course grade. Inviting authors to inspection meetings may also have increased the effect of inspection order on both dependent variables. As a result, the moderator also assumed the role of the authors in the inspection meetings.

2. http://theory.stanford.edu/~aiken/moss

Second, the inspectors used a collaborative inspection logging tool for both the inspection preparation and the meetings. Each inspector logged issues in the collaborative tool as the issues were found. This allowed the inspectors to see what had previously been logged and to spend their preparation time finding issues that had not already been found by another inspector. Use of the tool also allowed more time in the inspection meetings to find new issues rather than using the meeting time to report and log issues found during Preparation. The collaborative tool also reduced meeting distractions, allowing inspectors to remain focused on finding new issues [42].

Third, the inspectors used the Scenario-Based Reading approach described by Porter, Votta, and Basili [26]. We assigned each inspector a role of searching for one of the following defect types: missing functionality, incorrect functionality, or incorrect Java coding. We instructed the inspectors not to limit themselves to these roles, but to spend extra effort finding all of the defects that would fall within their assigned role.

We instructed the inspector who was assigned to search for incorrect Java coding to use the Java inspection checklist created by Christopher Fox [43] to guide the search for defects. Although we also made the checklist available to the other inspectors, searching for this type of defect was not their primary role. We annotated the checklist items that were most likely to uncover major defects with the words Major or Possible Major and instructed the inspectors to focus their efforts on these items in addition to their assigned role.

We gave inspectors four hours of training: approximately 1.5 hours on the inspection process and 2.5 hours on XML processing with Java (the subject matter under inspection). We instructed the inspectors to spend one hour preparing for each inspection, and we held inspection meetings to within a few minutes of one hour. We limited the number of inspection meetings to two per day to avoid reduced productivity due to fatigue as noted by Fagan [9] and Gilb and Graham [10]. We also controlled the maximum inspection rate, which has long been known to be a critical factor in inspection effectiveness [9], [10], [14], [41]. Fagan recommends a maximum inspection rate of 125 non-commentary source statements (NCSS) per hour [9], whereas Humphrey recommends a maximum rate of 300 lines of code (LOC) per hour [14]. The mean inspection rate in this study was 180 NCSS/hour with a maximum rate of 395 NCSS/hour. Although


the maximum rate was slightly above Humphrey's recommendation, this rate seems justified considering that the inspectors were inspecting multiple copies of code written to the same specification, and as a result, became very familiar with the subject matter of the inspections.

Inspectors categorized the issues they found as being either major or minor and we instructed them to focus their efforts on major issues as recommended by Gilb and Graham [10]. After all inspections were completed, we gave each author an issue report containing all of the issues logged by the inspectors. We then gave the authors one week to resolve all major defects and to return the issue report with each defect categorized by the author into one of the following categories: Resolved, Ignored, Not a Defect, or Other. We required authors to write an explanation for any issue categorized as either Not a Defect or Other.

4.2.2 Test-Driven Development

Prior to the start of this experiment, all participants were given classroom instruction on the use of JUnit and TDD. Formal instruction consisted of one 75-minute classroom lecture on JUnit and TDD. Participants were also shown in-class demos during other lectures that demonstrated test-first programming. All participants (and other students in the class) completed a one-week programming assignment prior to the start of the experiment in which they developed an interactive game using JUnit and TDD and had to submit both code and JUnit tests for grading.

We instructed participants in the TDD and Inspection+TDD groups to develop automated JUnit tests and program code iteratively while completing the programming assignment. We instructed them to write JUnit tests first when creating new functionality or modifying existing functionality, and to use the passage of the tests as an indication that the functionality was complete and correct. We also instructed participants in the Inspection+TDD group to use TDD during correction of the defects identified during inspection.

    4.3 Measurement

Most research on defect reduction methods has reported yield as a measure of method effectiveness, where yield is the ratio of the number of defects found by the method to the total number of defects in the software artifact prior to inspection [6], [14]. However, yield cannot be reliably calculated for TDD because the TDD method eliminates defects at the point of introduction into the code, making it impossible to reliably count the number of defects eliminated by the method. Therefore, we used the number of defects remaining after application of the method as a substitute for yield. We used the cost of development of the software using the assigned method as a second dependent variable.

    4.3.1 Defects Remaining

We defined the total number of defects remaining as the summation of the number of major defects found by code inspection after the application of the defect reduction method and the number of failed automated acceptance tests representing unique defects not found by inspection (out of 58 JUnit tests covering all requirements). This is consistent with the measure used by Fagan [6], with the exception of the exclusion of the number of defects identified during the first six months of actual use of the software.
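For illustration with hypothetical numbers (not taken from the study's data): if the measurement inspection logs 9 major defects in a submission and 6 of the 58 acceptance tests fail because of defects the inspectors had not already logged, the defects-remaining count for that submission is 9 + 6 = 15.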

We included code inspection as part of the defect counting procedure, even though code inspection was one of the two factors under investigation in the study, because a careful review of the literature indicates that code inspection has been the most heavily researched and widely accepted method of counting defects since Fagan's introduction of the method in 1976 [6]. To address a potential bias in favor of code inspection resulting from the use of code inspection as both a treatment condition and a defect counting method, we also report defects remaining results from acceptance testing (excluding counts from inspection) in a supplemental document included with the online version of this paper. The hypothesis testing results were the same in this case: hypothesis H1 is supported with both the combined method of defect counting and the acceptance testing only method.

Fig. 3 illustrates our defect counting procedure and how it relates to the treatment conditions tested in the study. The four boxes along the left edge of the diagram represent the four treatment conditions. These are the four software development methods under investigation in the study. For both the Inspection group and the Inspection+TDD group, a code inspection and subsequent rework to resolve the issues identified during the inspection was performed as part of the development method. Then each group, whether code inspection was part of the development method or not, was subjected to a separate code inspection as part of the defect counting procedure. We refer to the defect counting inspections as measurement inspections. A separate inspection team performed the measurement inspections to prevent bias in favor of the Inspection and Inspection+TDD groups. The same method was used for the measurement inspections as was used for the method inspections, except that only two inspectors in addition to the moderator were used, due to resource constraints.

We executed the automated JUnit tests after completion of the measurement inspections and added to the defect counts any test failures representing defects not already found by inspection. We wrote the automated tests before the start of the experiment and made minor adjustments and additions before the final test run.

One of the automated tests executed the code against a sample configuration file that we provided


    Fig. 3. Treatment Conditions and Defect Counting

to the participants at the start of the experiment. The passage of this test, which we will refer to as the baseline test, indicates that the code executes correctly against a standard configuration file. We used the passage of this test to indicate that the code is mostly correct. We wrote all other tests as modifications of this baseline test to identify specific defects. If the baseline test failed, we could not assume that the code was mostly correct, and therefore, we could not rule out the possibility that some unexpected condition (other than what the tests were intended to check) was causing failures within the test suite. For example, if the baseline test failed, it could mean that the configuration file was not being read into the program correctly. This would result in almost all of the tests failing because the configuration file was not properly read, and not because the code contained the defects the tests were intended to check. As a result, we only considered code to be testable if it passed the baseline test. We made minor changes to some projects to make the code testable, and in these cases, we logged each required change as a defect. Three participants submitted code that would have required extensive changes to make it testable according to the aforementioned definition. We could not be sure that these changes would not alter the authors' original intent, so we excluded these participants from the study as shown in table 2.
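The pattern can be sketched as follows with JUnit 4. This is illustrative only: it reuses the hypothetical ConfigReader and SpamFilterConfig classes from the earlier sketch, and the file names and expected counts are invented; they are not the study's actual baseline test or test suite.

import java.io.File;
import org.junit.Test;
import static org.junit.Assert.*;

public class AcceptanceTests {

    // Baseline test: the standard configuration file is read correctly.
    // Submissions that could not pass a test like this were treated as not testable.
    @Test
    public void baselineConfigurationIsReadCorrectly() throws Exception {
        SpamFilterConfig config = ConfigReader.read(new File("standard-config.xml"));
        assertEquals(10, config.rules.size());
        assertEquals(3, config.allowedList.size());
        assertEquals(5, config.blockedList.size());
    }

    // A variant of the baseline test that targets one specific defect:
    // a configuration with no allowed-list should yield an empty list,
    // not a null reference or a parsing failure.
    @Test
    public void missingAllowedListYieldsEmptyList() throws Exception {
        SpamFilterConfig config =
                ConfigReader.read(new File("config-without-allowed-list.xml"));
        assertTrue(config.allowedList.isEmpty());
    }
}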

Two adjustments to the resulting defect counts were necessary to arrive at the final number of defects remaining in the code. First, the inspection moderator performed an audit of defects identified by inspection and eliminated false positives. This would have resulted in an understatement of the effect of code inspection if the moderator inadvertently eliminated any real defects, making it less likely to find our reported result.

Second, in several cases, authors either ignored, or attempted but did not correct, defects that were identified by the method inspections. In an industrial setting, the inspection process would have included an iterative Follow-Up phase (see fig. 1) for the purpose of catching these uncorrected defects and ensuring that they were corrected before the inspection process completed. However, due to time and resource constraints, we could not allow iterative cycles of Follow-Up, Inspection, and Rework to ensure that all identified major defects were eventually corrected.

In a well-functioning inspection process with experienced inspectors and an experienced inspection moderator, most (if not all) of these identified defects would be corrected by this process. In this type of process, the only identified defects that are likely to be left uncorrected are those for which a fix was attempted, and the fix appears to the inspector(s) performing the follow-up inspection to be correct, when in fact it is not correct. We assume that these cases are rare, and that we can, therefore, increase the accuracy of the results by subtracting any uncorrected defects from the final defect counts. However, because of uncertainty about how rare these cases are, we report results for both the adjusted and unadjusted defect counts. The unadjusted defect counts still include the elimination of false positives by the moderator.

    4.3.2 Cost

We report the total cost of each method in total man-hours. However, we also report man-hours separately for inspectors and programmers in the code inspection case to allow for the application of these findings where programmer rates and inspector rates differ.

The total cost for the code inspection method is the sum of the original development hours spent by the author of the code, the hours spent by the software inspectors and moderator (preparation time plus meeting time), and the hours spent by the author correcting defects identified by inspection. The total cost for the TDD method is the sum of the development hours used to write both the automated tests and the code.
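As a hypothetical illustration of this accounting (the 7 hours of inspection effort matches the footnote to table 7; the other figures are invented): if an author spends 14 hours on original development, the inspection team spends 7 hours on preparation and meetings, and the author spends 2 hours correcting identified defects, the inspection-method cost is 14 + 7 + 2 = 23 man-hours. If a TDD participant spends 9 hours writing code and 3 hours writing tests, the TDD-method cost is 9 + 3 = 12 man-hours.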


Author hours were self-reported in a spreadsheet that was submitted with the code. Inspector preparation hours were also self-reported. The inspection moderator tracked and recorded inspection meeting time. All hours were recorded in 15-minute increments.

    4.4 Threats to Internal Validity

We considered the possibility that unexpected factors may have biased the results. We considered the following potential threats to internal validity:

1) Selection bias,
2) Mortality bias,
3) Maturation bias,
4) Order bias, and
5) Implementation bias.

Selection bias refers to the possibility that participants in the study were divided unequally into research groups, and as a result, the findings are at least partly due to these differences and not to the effects of the treatment conditions. Although many differences in the participants could potentially contribute to a selection bias, Java programming ability seems to be the most likely cause of selection bias in this study. We accounted for this possibility by using a quasi-experimental design with participants assigned to groups by pre-test score using the aforementioned genetic algorithm.

Mortality bias refers to the possibility that the groups became unequal after the start of the study as a result of participants either dropping out or being eliminated. We experienced a high mortality rate, starting with 40 participants and ending with 29. However, we used the pre-test score to measure and control for this effect. We also used a T-Test to compare the pre-test means of those who remained in the study (23.90) and those who did not (22.18) and found the difference not to be statistically significant at the 0.05 level of alpha.

Maturation bias is the result of participants learning at unequal rates within groups during the experiment. Due to the nature of the experiment, our results may include effects of a maturation bias. As a normal part of the inspection process, we gave participants in both the Inspection and Inspection+TDD groups an opportunity to correct defects identified in their code approximately two weeks after submitting the original code, but participants in the control and TDD groups did not have this opportunity. All participants were enrolled in an object-oriented programming and design course during the experiment and may have gained knowledge during any two-week period of the course that would have made them better programmers and less likely to produce defects. Only the participants whose code was inspected as part of the development method had an opportunity to use any knowledge gained to improve their code, and since this potential maturation effect was not measured, we are unable to eliminate or quantify the possible effects of a maturation bias.

Order bias is an effect resulting from the order in which treatments are applied to participants. This study is vulnerable to an order bias resulting from the order in which inspections were performed and whether inspections were the first or second inspection on the day of inspection. We controlled for order bias in two ways. First, we performed the measurement inspections in random order within blocks of four, and we performed the method inspections (which involved only two groups) on code from one randomly selected participant from each group each day, alternating each day which group's inspection was performed first. Second, we used inspection day, and whether the inspection was performed first or second on the day of inspection, as control variables during data analysis.

Implementation bias is an effect resulting from variability in the way a treatment condition is implemented or applied. Failure to write unit tests before program code may result in an implementation bias in the application of TDD. We do not have an objective measure to indicate whether the unit tests submitted by participants in this study were written before the program code, so we are unable to eliminate the possibility of an implementation bias affecting our results. However, we used the Eclipse Plug-in of the Clover3

test coverage tool to provide an objective measure of the effectiveness of the submitted tests. Code coverage results showed an average of 82.58% coverage (including both statement and branch coverage) with a standard deviation of 8.92. We also conducted a post-experiment survey in which participants were asked to rate their effectiveness in implementing TDD on a 5-point Likert scale with 5 being high and 1 being low. To encourage honest answers, we gave the survey when the classroom instructor was not present, and told the students that their instructor would not see the surveys. The mean response to this survey item was 3.63. Of the 16 students in the two research groups who performed TDD, two rated their effectiveness in implementing TDD as a 5, nine rated it as a 4, two rated it as a 3, and three rated it as a 2.

3. http://www.atlassian.com/software/clover/

    4.5 Threats to External Validity

We have identified the following four threats to external validity that limit our ability to generalize results to the software development community:

1) The participants in the study were undergraduate students rather than professional programmers, and therefore, did not have the same level of programming knowledge or experience as the


average professional programmer. We attempted to minimize this effect by choosing participants from a class consisting mostly of juniors and seniors, and by including only the students with the highest pre-test scores in the study. However, the participants still were not representative of the general population of programmers, and most likely represent novice programmers. This would have affected all of the research groups, but since the TDD method was most dependent on the ability of the programmers, it most likely biased the study in favor of code inspection.

2) Although they were not participants in the study, the inspectors were college students and did not have professional code inspection experience. Prior research has shown a positive correlation between inspector experience and the number of defects found [9], [10], [44]. However, our result, that inspection is more effective than TDD, is robust to this potential bias, which would have had the effect of reducing the likelihood of finding inspection to be more effective.

3) The nature of the experiment required changes to the code inspection process from what would normally be done in industry. First, authors were not invited to participate in the inspections as described in section 4.2.1, and second, we did not use an iterative cycle of Rework, Follow-Up, and Re-Inspection as described in section 4.3.1 to ensure that all identified defects were corrected. Not inviting authors to participate would have resulted in understating the effectiveness of code inspection. Section 4.3.1 discusses the potential effect of not including an iterative cycle of Rework, Follow-Up, and Re-Inspection.

4) The inspectors performed multiple inspections in a short period of time, of code that performs the same function and is written to the same specification. This was a necessary part of the experiment, but would rarely, if ever, occur in practice. This could have resulted in the inspectors finding more defects in later inspections as they became more familiar with the specifications and with Java-based XML processing code. However, if this affected the results, we should have detected an order bias during data analysis. Use of inspection order as a control variable did not indicate an order bias in the results.

    5 RESULTS

This study included two dependent variables: the total number of major defects (Total Majors) and the total number of hours spent applying the method (Total Method Hours). We used Bartlett's test of sphericity to determine whether the dependent variables were sufficiently correlated to justify use of a single MANOVA, instead of multiple ANOVAs, for hypothesis testing. A p-value less than 0.001 indicates correlation sufficiently strong to justify use of a single MANOVA [45]. Bartlett's test yielded a p-value of 0.602, indicating that a MANOVA was not justified, so we proceeded with a separate ANOVA for each dependent variable.

    5.1 Tests of Statistical Assumptions

Tests of normality (including visual inspection of a histogram and z-testing for skewness and kurtosis as described below) indicated that Total Method Hours was not normally distributed, and Levene's test indicated Total Method Hours did not meet the homogeneity of variance assumption. These assumption violations were corrected by performing a square-root transformation. Throughout the remainder of this document, Total Method Hours refers to the transformed variable unless otherwise noted. Although hypothesis testing results are only presented for the transformed variable, the original variable produced the same hypothesis testing results. After performing the transformation, we used (3) and (4) as recommended by Hair, Black, Babin, Anderson, and Tatham [46] to obtain Z values for testing the normality assumptions both within and across groups for both dependent variables, and found them to be normally distributed at the 0.05 level of alpha.

Z_skewness = skewness / sqrt(6 / N)    (3)

Z_kurtosis = kurtosis / sqrt(24 / N)    (4)

    5.2 Defects Remaining

The Inspection+TDD and Inspection groups had the lowest mean number of defects remaining as shown in table 4, whereas the TDD group had the highest.

Although the means reported in table 4 provide useful insight into the relative effectiveness of each method on reducing defects, the numbers are biased as a result of both an unequal number of observations in the research groups, and the differing effects of programmer ability and inspection order on the groups. Searle, Speed, and Millikin [47] introduced the concept of an Estimated Marginal Mean to correct for these biases by calculating a weighted average mean to account for different numbers of observations in the groups, and by using the ANOVA model to adjust for the effect of covariates. The covariate adjustment is performed by inserting the average values of the covariates into the model, thereby calculating a predicted value for the mean that would be expected if the covariates were equal at their mean values.


TABLE 4
Descriptive Statistics of Defects Remaining by Group

(a) Adjusted for defects not fixed after method inspection

Group            Mean    Std. Dev.   Min   Max
Inspection+TDD   9.71    5.99        3     19
Inspection       11.50   6.92        2     22
Control          14.43   8.52        2     23
TDD              15.11   5.13        7     25

(b) Not adjusted for defects not fixed after method inspection

Group            Mean    Std. Dev.   Min   Max
Inspection+TDD   12.29   7.70        4     24
Inspection       13.33   7.74        2     25
Control          14.43   8.52        2     23
TDD              15.11   5.13        7     25

TABLE 5
Estimated Marginal Means of Defects Remaining by Group

Group            Adjusted Estimated Marginal Mean   Unadjusted Estimated Marginal Mean
Inspection+TDD   8.21                               10.65
Inspection       12.37                              14.17
TDD              14.37                              14.45
Control          16.15                              16.20

We used the pre-test score as a covariate for programmer ability. We used two variables, inspection day order and an indication of whether the inspection was performed first or second on the day of inspection, as covariates for inspection order. Table 5 shows the estimated marginal means of the number of defects remaining, in order from smallest to largest. The Adjusted Estimated Marginal Mean column includes the adjustment for defects not fixed after the method inspections. The main difference between the estimated marginal means and the simple means of table 4 is that the estimated marginal mean of the TDD group is lower than the one for the control group. However, the ANOVA results described below indicate that the difference is not statistically significant.

We used ANOVA to test our hypotheses, and as with the calculation for estimated marginal means, we used the pre-test score to control for the effects of programmer ability, and measurement inspection order and whether the inspection was performed first or second on the day of inspection to control for the effects of inspection order. For hypothesis H1, which hypothesizes that code inspection is more effective than TDD at reducing software defects, we obtained different hypothesis testing results depending on whether we used the adjusted or the unadjusted defect counts as the dependent variable. The hypothesis is supported with the adjusted defect count variable, whereas it is not supported with the unadjusted variable.

For the adjusted defect count variable, we observed eta squared values of 0.190 for the effect of code inspection and 0.323 for the pre-test score, indicating that code inspection and pre-test score accounted for 19.0% and 32.3%, respectively, of the total variance in the number of defects remaining. Both the pre-test score and whether Inspection was used as a defect reduction method were significant at the 0.05 level of alpha and both of these variables were negatively correlated with the number of defects. The use of TDD did not result in a statistically significant difference in the number of defects. However, this lack of significance may be the result of low observed statistical power for the effect of TDD, which was only 0.238. Table 6 summarizes the results of the ANOVA analysis on the adjusted defect count variable, with variables listed in order of significance.

For the unadjusted defect count variable, only the pre-test score was found to be statistically significant, with a p-value of 0.003. The effects of inspection, TDD, and the interaction between inspection and TDD resulted in the following respective p-values: 0.119, 0.153, and 0.358. Therefore, based on the unadjusted defect counts, we would reject hypotheses H1 and H2.

To test hypothesis H2 (that the combined use of the two methods is more effective than either method alone) for the adjusted defect count variable, we performed one-tailed T-Tests with a Bonferroni adjustment for multiple comparisons to compare the following two sets of means: 1) TDD vs. Inspection+TDD, and 2) Inspection vs. Inspection+TDD. After the Bonferroni adjustment, a p-value of 0.025 was needed for both comparisons to support the hypothesis at the 0.05 level. The comparisons yielded p-values of 0.036 and 0.314, respectively, so we reject hypothesis H2.
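For reference, the Bonferroni adjustment divides the overall alpha level by the number of comparisons: with two planned comparisons, 0.05 / 2 = 0.025, which is the per-comparison threshold used above.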

A supplemental document, included with the online version of this paper, contains additional analysis of the number of defects remaining. This supplemental analysis includes descriptive statistics and ANOVA analysis of the number of defects found separately by the automated tests and by the measurement code inspection. The main result of the supplemental analysis is that hypothesis H1 is supported and H2 is not supported when using only automated testing to count defects. Neither hypothesis is supported when using only code inspection.

5.3 Implementation Cost

We performed an analysis of the implementation costs associated with TDD and code inspection but we did not explore the cost-benefits of reducing software defects.


TABLE 6
ANOVA Summary for Adjusted Number of Defects Remaining

Source                         Type III Sum of Squares   Degrees of Freedom   F        Significance (p-value)   Observed Power
Pre-Test Score                 341.791                   1                    10.481   0.002                    0.871
Inspection                     167.930                   1                    5.150    0.017                    0.583
TDD                            55.327                    1                    1.697    0.103                    0.238
TDD * Inspection               9.710                     1                    0.298    0.295                    0.082
Measure Inspection Day Order   17.807                    1                    0.546    0.468                    0.109
Measure Inspection Day         17.037                    1                    0.522    0.477                    0.106

TABLE 7
Descriptive Statistics of Implementation Cost by Group

Group            Mean      Std. Dev.   Min     Max
TDD              11.01     5.96        6.00    24.25
Control          12.79     4.87        4.00    18.00
Inspection       20.88 a   5.56        13.50   30.00
Inspection+TDD   31.71 a   11.71       13.00   46.75

a Includes 7 hours of inspection time: one hour of meeting preparation for each of the three inspectors (3 hours) plus one hour of inspection meeting time for each of the three inspectors and the inspection moderator (4 hours).

We measured cost in man-hours and found TDD to have the lowest mean cost and Inspection+TDD to have the highest, as shown in Table 7.

Fig. 4 presents a profile plot of the estimated marginal means. The solid line represents the two groups that did not use TDD, whereas the dashed line represents the two groups that did use TDD. The position of the points on the x-axis indicates whether the groups used code inspection. Following each line from left to right shows the cost effect of starting either with or without TDD and adding code inspection to the method used. Two important observations from the profile plot are that the use of TDD resulted in a cost savings and that there is an interaction effect between the methods, as indicated by the fact that the lines cross. The interaction effect is supported by the ANOVA analysis at the 0.05 level of alpha, as shown in Table 8, whereas the cost savings for TDD is not supported.

Fig. 4. Profile Plot of Square-Root Transformed Estimated Marginal Mean Implementation Cost
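A plot in the spirit of Fig. 4 could be sketched as follows; the plotted values are placeholders loosely based on the square roots of the group means in Table 7, not the study's estimated marginal means.

```python
import matplotlib.pyplot as plt

# Placeholder square-root-transformed mean costs (sqrt of man-hours),
# loosely derived from Table 7; not the adjusted estimated marginal means.
x_labels = ["No Inspection", "Inspection"]
no_tdd = [3.6, 4.6]   # Control, Inspection
tdd    = [3.3, 5.6]   # TDD, Inspection+TDD

plt.plot(x_labels, no_tdd, "o-", label="No TDD")
plt.plot(x_labels, tdd, "o--", label="TDD")
plt.ylabel("Sqrt-transformed implementation cost")
plt.title("Profile plot of mean implementation cost")
plt.legend()
plt.show()
```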

The finding of an interaction effect on cost between code inspection and TDD appears to be atheoretical. Further research is necessary to confirm this effect and, if it is confirmed, to explain it. The lack of a significant TDD cost savings is consistent with other TDD research; in our review of the TDD literature, we have not found reports of an initial development cost savings associated with TDD. We did, however, observe a low statistical power of 0.107 for this effect, so if a cost savings is a real effect of TDD, we would not expect to have observed it in this study.

We used ANOVA to test hypothesis H3 (that implementation cost differs between the two methods). The ANOVA results are taken from the square-root transformed cost variable, although the original variable yielded the same hypothesis testing results. As with the test for the number of defects remaining, we used the pre-test score as a control variable. We did not control for inspection order because the amount of time spent on inspections was held constant, leaving no opportunity for inspection order to affect cost. We obtained an eta squared of 0.517 for the effect of code inspection, indicating that whether code inspection was used accounted for 51.7% of the total variance in implementation cost. The pre-test score was not significant and had an eta squared of only 0.085. As stated above, we also found the use of TDD not to be significant. Table 8 presents a summary of these results, with variables listed in order of significance.
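As an illustrative sketch only, reusing the hypothetical defect_study.csv file and column names from the earlier example plus a hypothetical cost_hours column, the transformed-cost model could be specified like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("defect_study.csv")  # hypothetical file and columns

# Square-root transform of the cost response, with the pre-test score as
# the only control (inspection order is irrelevant to cost here).
cost_model = smf.ols(
    "np.sqrt(cost_hours) ~ pretest + C(inspection, Sum) * C(tdd, Sum)",
    data=data,
).fit()

print(anova_lm(cost_model, typ=3))  # Type III table, as in Table 8
```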


TABLE 8
ANOVA Summary for Implementation Cost

Source             Type III Sum of Squares   df   F        Significance (p-value)   Observed Power
Inspection         17.793                    1    25.698   0.000                    0.998
TDD * Inspection   2.988                     1    4.315    0.049                    0.514
Pre-Test Score     1.537                     1    2.219    0.149                    0.299
TDD                0.366                     1    0.529    0.474                    0.107

TABLE 9
Post Hoc Analysis for Implementation Cost by Group

Comparison                    Mean Difference   Std. Error   Significance (p-value)
TDD vs. Inspection+TDD        -2.3098           0.42945      0.000
Control vs. Inspection+TDD    -2.0355           0.45550      0.001
TDD vs. Inspection            -1.3077           0.44913      0.045
Control vs. Inspection        -1.0334           0.47410      0.233
TDD vs. Control               -0.2742           0.42945      1.000

TABLE 10
Summary of Hypothesis Testing Results

Hypothesis   Description                                                         Result
H1           Code inspection is more effective than TDD at reducing defects.     Accepted a, b
H2           The combined use of code inspection and TDD is more effective
             than either method alone.                                           Not Accepted a
H3           Code inspection and TDD differ in implementation cost.              Accepted a, c

a At the 0.05 level of alpha.
b When defect counts are adjusted for defects found during method inspections but not corrected.
c Code inspection is more expensive.

Post hoc analysis using Bonferroni's test indicates that the TDD group is significantly different from both the Inspection and Inspection+TDD groups, and that the control group is significantly different from the Inspection+TDD group, all at the 0.05 level of alpha. Table 9 summarizes these results, with comparisons listed in order of significance.

These results show that hypothesis H3 is supported, indicating that there is a cost difference between the use of code inspection and TDD, with the cost of code inspection being higher. Table 10 presents a summary of hypothesis testing results.
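A sketch of such a post hoc analysis, again with placeholder data rather than the study's measurements, could use pairwise t-tests with a Bonferroni correction across all six group comparisons:

```python
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Placeholder square-root-transformed costs per group; not the study's data.
groups = {
    "TDD":            [3.1, 3.5, 2.9, 3.4, 3.6],
    "Control":        [3.4, 3.8, 3.2, 3.9, 3.6],
    "Inspection":     [4.3, 4.8, 4.5, 4.9, 4.4],
    "Inspection+TDD": [5.2, 5.9, 5.5, 6.1, 5.4],
}

labels, raw_p = [], []
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    _, p = stats.ttest_ind(a, b)
    labels.append(f"{name_a} vs. {name_b}")
    raw_p.append(p)

# Bonferroni adjustment across the six pairwise comparisons.
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for label, p, sig in zip(labels, adj_p, reject):
    print(f"{label}: adjusted p = {p:.3f}, significant = {sig}")
```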

6 DISCUSSION

The main result of this study is support for the hypothesis that code inspection is more effective than TDD at reducing software defects, although it is also more expensive. Resource constraints prevented us from implementing an iterative re-inspection process, which resulted in some conflicting results in our finding that code inspection is more effective than TDD. As explained in section 4.3.1, we presented two sets of results: one based on a defect count variable that included an adjustment for uncorrected defects that should have been caught by an iterative re-inspection process, and one based on a defect count variable that did not include the adjustment. Results from the adjusted defect count variable support the hypothesis that code inspection is more effective, whereas results from the unadjusted variable do not.

We believe that the adjusted defect count variable is more representative of what would be experienced in an industrial setting, for two reasons. First, the programmers were students who, although motivated to do well on the programming assignment for a course grade, likely had a lower motivation to produce high-quality software than a professional programmer would have. We believe that few professional programmers would ignore defects found by inspection as some of our participants did, so the number of uncorrected defects would be lower than what we observed. Second, although it is impossible to be certain about which uncorrected defects would have been caught by an iterative re-inspection process, we believe that most (if not all) of them would have been caught by an experienced inspection team with an experienced inspection moderator, since verification of defect correction is the purpose of the re-inspection step, and software inspection has been found to be effective at reducing defects in numerous previously cited studies.

The supplemental analysis (included with the online version of this document) provides additional support for the hypothesis that code inspection is more effective than TDD at reducing defects. Here we performed hypothesis testing on the defect counts obtained only from the 58 JUnit tests described in section 4.3.1 and found support for the hypothesis at the 0.01 level of alpha for the defect counts adjusted for uncorrected defects, and at the 0.1 level for the unadjusted defect counts.

This result is important because automated tests are less subjective than inspection-based defect counts.


We performed the analysis in the main part of the study on the defect counts obtained from a combination of inspection and automated testing to be consistent with prior code inspection research and to avoid a potential bias in favor of TDD arising from the relationship between TDD and automated testing. Therefore, the finding that code inspection is more effective when using only automated acceptance testing, in spite of this potential bias, provides strong support for the hypothesis. This support, however, is tempered by the fact that we did not find support for the hypothesis when using only the measurement inspections to count defects.

Another implication of this research is the finding that TDD did not significantly reduce the number of defects. Several possible explanations for this result exist. Low observed statistical power is one explanation. Another explanation, and the one that we believe accounts for the varying results on the effectiveness of TDD as a defect reduction method summarized in section 2.2, is that TDD is currently too loosely defined to produce reliable results that can confidently be compared with other methods. A common definition of TDD is that it is a practice in which no program code is written until an automated unit test requires the code in order to succeed [38]. However, much variability is possible within this definition, and we believe it is this variability that accounts for the mixed results in the effectiveness of TDD as a defect reduction method. Additional research is necessary to add structure to TDD and to allow it to be reliably improved and compared with other methods.
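As a concrete and deliberately simple illustration of this definition, one TDD cycle might look like the following. The sketch is in Python rather than the Java/JUnit environment used in the study, and the leap-year example and names are ours, not the study's assignment.

```python
import unittest

# Step 1 (red): write failing tests before any production code exists.
class TestLeapYear(unittest.TestCase):
    def test_century_years_divisible_by_400_are_leap_years(self):
        self.assertTrue(is_leap_year(2000))

    def test_other_century_years_are_not_leap_years(self):
        self.assertFalse(is_leap_year(1900))

# Step 2 (green): write only enough production code to make the tests pass.
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Step 3 (refactor): clean up the code while keeping the tests green.
if __name__ == "__main__":
    unittest.main()
```

The variability noted above arises in how strictly steps are alternated, how small each increment is, and how much design work happens outside the test-code cycle.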

    7 SUMMARY AND CONCLUSIONS

We compared the software defect rates and implementation costs associated with two methods of software defect reduction: code inspection and test-driven development. Prior research has indicated that both methods are effective at reducing defects, but the methods had not previously been compared.

We found that code inspection is more effective than TDD at reducing defects, but that code inspection is also more expensive to implement. We also found some evidence to indicate that TDD may result in an implementation cost savings, although somewhat conflicting results require additional research to verify this finding. Previous research has not shown a cost savings from TDD. The results did not show a statistically significant reduction in defects associated with the use of TDD, but they did show an interaction effect on cost between code inspection and TDD, which we are currently unable to explain. See Table 10 for a summary of hypothesis testing results.

These findings have the potential to significantly impact software development practice for both software developers and managers, but additional research is needed to validate them both inside and outside of a laboratory environment. Additional research is also needed to define TDD more precisely and to compare a more precisely defined version of TDD with code inspection.

    ACKNOWLEDGMENTS

The authors are grateful to Cenqua Pty. Ltd. for the use of their Clover code coverage tool. We also thank the study participants and the inspectors for their time and effort, and Dr. Adam Porter for the use of materials from his software inspection experiments.

    REFERENCES

[1] G. Tassey, "The economic impact of inadequate infrastructure for software testing," National Institute of Standards and Technology, Tech. Rep., 2002.
[2] B. George and L. Williams, "A structured experiment of test-driven development," Information and Software Technology, vol. 46, no. 5, pp. 337-342, 2004.
[3] E. M. Maximilien and L. Williams, "Assessing test-driven development at IBM," in Proceedings of the 25th International Conference on Software Engineering, Portland, OR, 2003, pp. 564-569.
[4] D. L. Parnas and M. Lawford, "The role of inspections in software quality assurance," IEEE Transactions on Software Engineering, vol. 29, no. 8, pp. 674-676, 2003.
[5] F. Shull, V. R. Basili, B. W. Boehm, A. W. Brown, P. Costa, M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M. Zelkowitz, "What we have learned about fighting defects," in Proceedings of the 8th IEEE Symposium on Software Metrics, 2002, pp. 249-258.
[6] M. E. Fagan, "Design and code inspections to reduce errors in program development," IBM Systems Journal, vol. 15, no. 3, pp. 182-211, 1976.
[7] N. Nagappan, E. M. Maximilien, T. Bhat, and L. Williams, "Realizing quality improvement through test driven development: Results and experiences of four industrial teams," Empirical Software Engineering, vol. 13, no. 3, pp. 289-302, 2008.
[8] P. Runeson, C. Andersson, T. Thelin, A. Andrews, and T. Berling, "What do we know about defect detection methods?" IEEE Software, vol. 23, no. 3, pp. 82-90, 2006.
[9] M. E. Fagan, "Advances in software inspections," IEEE Transactions on Software Engineering, vol. SE-12, no. 7, pp. 744-751, 1986.
[10] T. Gilb and D. Graham, Software Inspection. Addison-Wesley, 1993.
[11] O. Laitenberger and J.-M. DeBaud, "An encompassing life cycle centric survey of software inspection," Journal of Systems and Software, vol. 50, no. 1, pp. 5-31, 2000.
[12] A. Aurum, H. Petersson, and C. Wohlin, "State-of-the-art: Software inspections after 25 years," Software Testing, Verification and Reliability, vol. 12, no. 3, pp. 133-154, 2002.
[13] A. F. Ackerman, L. S. Buchwald, and F. H. Lewski, "Software inspections: An effective verification process," IEEE Software, vol. 6, no. 3, pp. 31-36, 1989.
[14] W. S. Humphrey, A Discipline for Software Engineering, ser. The SEI Series in Software Engineering. Addison-Wesley Publishing Company, 1995.
[15] R. C. Linger, "Cleanroom software engineering for zero-defect software," in Proceedings of the 15th International Conference on Software Engineering, Baltimore, Maryland, 1993, pp. 2-13.
[16] T. Thelin, P. Runeson, and B. Regnell, "Usage-based reading - an experiment to guide reviewers with use cases," Information and Software Technology, vol. 43, no. 15, pp. 925-938, 2001.
[17] T. Thelin, P. Runeson, and C. Wohlin, "An experimental comparison of usage-based and checklist-based reading," IEEE Transactions on Software Engineering, vol. 29, no. 8, pp. 687-704, 2003.
[18] T. Thelin, P. Runeson, C. Wohlin, T. Olsson, and C. Andersson, "Evaluation of usage-based reading - conclusions after three experiments," Empirical Software Engineering, vol. 9, no. 1-2, pp. 77-110, 2004.
[19] V. R. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sørumgård, and M. V. Zelkowitz, "The empirical investigation of perspective-based reading," Empirical Software Engineering, vol. 1, no. 2, pp. 133-164, 1996.
[20] C. Denger, M. Ciolkowski, and F. Lanubile, "Investigating the active guidance factor in reading techniques for defect detection," in Proceedings of the 3rd International Symposium on Empirical Software Engineering, Kaiserslautern, Germany, 2004.
[21] O. Laitenberger and J.-M. DeBaud, "Perspective-based reading of code documents at Robert Bosch GmbH," Information and Software Technology, vol. 39, no. 11, pp. 781-791, 1997.
[22] J. Miller, M. Wood, and M. Roper, "Further experiences with scenarios and checklists," Empirical Software Engineering, vol. 3, no. 1, pp. 37-64, 1998.
[23] V. R. Basili, G. Caldiera, F. Lanubile, and F. Shull, "Studies on reading techniques," in Proceedings of the 21st Annual Software Engineering Workshop, Greenbelt, MD, USA, 1996, pp. 59-65.
[24] V. R. Basili and R. W. Selby, "Comparing the effectiveness of software testing strategies," IEEE Transactions on Software Engineering, vol. 13, no. 12, pp. 1278-1296, 1987.
[25] A. A. Porter and L. G. Votta, "An experiment to assess different defect detection methods for software requirements inspections," in Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, 1994, pp. 103-112.
[26] A. A. Porter, L. G. Votta, Jr., and V. R. Basili, "Comparing detection methods for software requirements inspections: A replicated experiment," IEEE Transactions on Software Engineering, vol. 21, no. 6, pp. 563-575, 1995.
[27] F. Shull, F. Lanubile, and V. R. Basili, "Investigating reading techniques for object-oriented framework learning," IEEE Transactions on Software Engineering, vol. 26, no. 11, pp. 1101-1118, 2000.
[28] J. F. Nunamaker, Jr., R. O. Briggs, D. D. Mittleman, D. R. Vogel, and P. A. Balthazard, "Lessons from a dozen years of group support systems research: A discussion of lab and field findings," Journal of Management Information Systems, vol. 13, no. 3, pp. 163-207, 1997.
[29] J. F. Nunamaker, Jr., A. R. Dennis, J. S. Valacich, D. R. Vogel, and J. F. George, "Electronic meeting systems to support group work," Communications of the ACM, vol. 34, no. 7, pp. 40-61, 1991.
[30] P. M. Johnson, "An instrumented approach to improving software quality through formal technical review," in Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy, 1994, pp. 113-122.
[31] M. van Genuchten, C. van Dijk, H. Scholten, and D. Vogel, "Using group support systems for software inspections," IEEE Software, vol. 18, no. 3, pp. 60-65, 2001.
[32] S. Biffl, P. Grünbacher, and M. Halling, "A family of experiments to investigate the effects of groupware for software inspection," Automated Software Engineering, vol. 13, no. 3, pp. 373-394, 2006.
[33] F. Lanubile, T. Mallardo, and F. Calefato, "Tool support for geographically dispersed inspection teams," Software Process: Improvement and Practice, vol. 8, no. 4, pp. 217-231, 2003.
[34] C. K. Tyran and J. F. George, "Improving software inspections with group process support," Communications of the ACM, vol. 45, no. 9, pp. 87-92, 2002.
[35] M. van Genuchten, W. Cornelissen, and C. van Dijk, "Supporting inspections with an electronic meeting system," Journal of Management Information Systems, vol. 14, no. 3, pp. 165-178, 1997.
[36] P. Vitharana and K. Ramamurthy, "Computer-mediated group support, anonymity, and the software inspection process: An empirical investigation," IEEE Transactions on Software Engineering, vol. 29, no. 2, pp. 167-180, 2003.
[37] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall, 1981.
[38] K. Beck, Test Driven Development: By Example. Addison-Wesley Professional, 2002.
[39] M. M. Müller and O. Hagner, "Experiment about test-first programming," IEE Proceedings - Software, vol. 149, no. 5, pp. 131-136, 2002.
[40] H. Erdogmus, M. Morisio, and M. Torchiano, "On the effectiveness of the test-first approach to programming," IEEE Transactions on Software Engineering, vol. 31, no. 3, pp. 226-237, 2005.
[41] W. S. Humphrey, Managing the Software Process, ser. The SEI Series in Software Engineering. Addison-Wesley Publishing Company, 1989.
[42] T. L. Rodgers, D. L. Dean, and J. F. Nunamaker, Jr., "Increasing inspection efficiency through group support systems," in Proceedings of the 37th Annual Hawaii International Conference on System Sciences, Waikoloa, Hawaii, 2004.
[43] C. Fox, "Java inspection checklist," 1999.
[44] R. G. Ebenau and S. H. Strauss, Software Inspection Process, ser. Systems Design and Implementation. McGraw Hill, 1994.
[45] L. S. Meyers, G. Gamst, and A. J. Guarino, Applied Multivariate Research: Design and Interpretation. Sage Publications, Inc., 2006.
[46] J. F. Hair, B. Black, B. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed. Prentice Hall, 2005.
[47] S. R. Searle, F. M. Speed, and G. A. Milliken, "Population marginal means in the linear model: An alternative to least squares means," The American Statistician, vol. 34, no. 4, pp. 216-221, 1980.
