

Neural Program Repair with Execution-based Backpropagation

He Ye∗, Matias Martinez†, Martin Monperrus∗
∗KTH Royal Institute of Technology, Sweden. [email protected], [email protected]
†University of Valenciennes, France. [email protected]

Abstract—Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens, without incorporating program-specific information during neural net weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network for producing patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair, showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. In total, RewardRepair correctly repairs 43 Defects4J bugs, including eight that are fixed for the first time.

I. INTRODUCTION

Automatic program repair aims to reduce manual work related to bug localization and bug fixing [1], [2]. With recent advances in deep learning, research has been proposed to use neural networks for program repair [3]. This line of work, put here under the umbrella term “neural program repair”, mostly uses neural machine translation (NMT) techniques [4]–[9].

Program repair systems based on neural machine translation treat the repair task as a translation from buggy code to correct code. Given sufficient training data, NMT-based repair has achieved promising performance [4]–[9]. All the prior works on program repair based on NMT use the same loss function: cross-entropy.

Despite early successes, NMT-based program repair techniques suffer from two major drawbacks. First, they often generate patches that do not compile [4]. The reason is that a lower cross-entropy value does not necessarily lead to a compilable patch. Second, the loss function under optimization forces the neural networks to learn to produce absolutely identical patches, thus missing the opportunity to explore semantically equivalent patches based on equivalent program transformations [10]. Both problems share the same root: there is a discrepancy between the core training objectives, learning to generate compilable and correct patches, and the loss function that is being optimized. The cross-entropy loss used in related work requires a strict pairwise matching between the generated patch and the human-written ground truth patch, and not more [11]. Nothing in the design of cross-entropy loss explicitly steers backpropagation to produce compilable and correct patches. This is the problem we address in this paper.

We introduce a novel neural repair model, called RewardRepair, based on a mixed learning objective. The key insight in the design of RewardRepair is the combination of a syntactic training objective at the token level and a semantic training objective based on program execution. RewardRepair defines a discriminator to separate good-quality patches from low-quality ones. The discriminator is based on executing the compiler and running the test cases, both providing high-quality feedback on the generated patches. This feedback is transformed into a quantified reward signal that modulates the cross-entropy loss. Then, the neural network's weights are updated based on the discriminator. In other words, in RewardRepair, backpropagation embeds essential compilation and execution information.

We conduct a large experiment to evaluate RewardRepair based on four state-of-the-art datasets from the literature, including Defects4J [12]. Our experiment comprehensively compares the effectiveness of purely syntactic training with cross-entropy loss, and of semantic training with compilation and execution information for optimizing the neural network. Our results show that execution-based semantic training improves RewardRepair towards generating more compilable (+30.73%) and more correct patches (+128%). Notably, RewardRepair is able to generate a patch for eight Defects4J bugs that were never successfully repaired before.

To sum up, our contributions are:

• We devise RewardRepair, a neural program repair approach with execution-based backpropagation. RewardRepair defines a novel training objective that employs compilation and test execution information.

• We perform an original series of experiments to show that RewardRepair's training objective outperforms the cross-entropy loss used in related work. Our experimental results demonstrate that embedding execution information in backpropagation clearly improves the quality of generated patches (more compilable and more correct patches).

• We provide evidence that RewardRepair can correctly fix 43 Defects4J bugs and that eight unique Defects4J bugs are repaired for the first time ever. Here, the correctness criteria are per the state-of-the-art [13], [14], based on both automatically generated tests and manual analysis. We provide in-depth case studies.



- return FastMath.pow(2 * FastMath.PI, -dim / 2 ) *

+ return FastMath.pow(2 * FastMath.PI, -0.5 * dim ) *

(a) The human-written patch

//loss value 0.1157

+ return FastMath.pow(2 * FastMath.PI, -d * dim ) *

(b) A generated non-compilable patch receives a small loss score

//loss value 0.4224

+ return FastMath.pow(2 * FastMath.PI, -dim / 2d ) *

(c) A generated semantically-equivalent patch receives a bigger loss score

Listing 1: Motivating example: NMT-based repair models based on cross-entropy loss may favor non-compilable patches.

II. BACKGROUND

A. Neural Program Repair

Neural Machine Translation (NMT) systems have recently achieved state-of-the-art performance on program repair tasks [4]–[9]. For example, DeepFix by Gupta et al. [9] leverages a neural network to fix compilation bugs. SEQUENCER by Chen et al. [4] fixes bugs based on sequence-to-sequence learning on Java source code. CoCoNut by Lutellier et al. [5] learns patch generation as a combination of convolutional neural networks (CNNs).

Despite the differences in their inputs and neural models, those approaches are similar in the sense that they are all based on a typical NMT model formulated as an encoder-decoder-attention architecture optimized with cross-entropy [11], [15], [16]. As a matter of fact, all the prior works on program repair are based on the NMT architecture with a cross-entropy loss [4]–[9].

The cross-entropy loss (a.k.a. log loss) is a measure from information theory, building upon entropy and calculating the difference between two probability distributions. In sequence generation, the cross-entropy loss calculates the difference between the generated tokens and the human-written patch tokens in a strict pairwise matching manner [11], [16], [17]. For program repair patches, a low cross-entropy value means that the generated patch is syntactically close to the ground truth patch at the token level.
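To make this concrete, consider a toy computation of the token-level cross-entropy (our illustration, not from the paper; the vocabulary and the decoder probabilities are invented):

import torch
import torch.nn.functional as F

# Hypothetical 4-token vocabulary and the 3-token ground-truth patch "0.5 * dim".
vocab = {"dim": 0, "0.5": 1, "/": 2, "*": 3}
ground_truth = torch.tensor([1, 3, 0])

# Invented per-position logits from a hypothetical decoder (3 positions x 4 tokens).
logits = torch.tensor([[0.2, 2.0, 0.1, 0.3],
                       [0.1, 0.2, 0.4, 2.2],
                       [2.1, 0.3, 0.2, 0.1]])

# Cross-entropy averages the negative log-probability assigned to each
# ground-truth token: a strict position-by-position comparison, blind to
# compilability or semantics.
loss = F.cross_entropy(logits, ground_truth)
print(loss.item())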

B. Motivating Example

The token-based cross-entropy loss optimization is effective at guiding the model to generate patches identical to the human-written patches, but it suffers from two major limitations. Firstly, a major goal in patch generation is to generate well-formed patches that compile, yet the cross-entropy loss does not favor patches that compile over non-compilable patches. Even worse, if a generated patch has a significantly lower loss per the token-based cross-entropy, although being non-compilable, it would be favored at training time. Secondly, this loss function fails to learn from semantically equivalent patches: a syntactically different but semantically equivalent patch could potentially have a high cross-entropy loss. This means that the cross-entropy loss discourages the network from exploring semantically equivalent solutions. Overall, the problem we address in this paper is that the cross-entropy loss used in previous research cannot learn from semantic information beyond tokens.

Listing 1 gives our motivating example to show the drawbacks of a neural repair model based on the cross-entropy loss. Listing 1a presents the buggy code and the human-written patch for Defects4J bug Math-11. Listing 1b gives one generated patch that is non-compilable because of the undefined variable d. The network does not avoid generating this patch because its cross-entropy loss is a low 0.1157. Listing 1c is a semantically equivalent patch (compared to the human-written patch) and its computed loss is 0.4224, which is higher than that of the incorrect non-compilable patch.

With cross-entropy loss, the NMT-based repair model penalizes the semantically equivalent patch and favors the non-compilable patch. This is because the cross-entropy loss is based on a strict token comparison. There is only one token difference in the generated non-compilable patch, where the wrong token d replaces the expected token 0.5 from the ground truth patch. However, the token d is an undefined variable in the buggy program. On the contrary, there exist three token differences in the semantically equivalent patch: (dim → 0.5), (/ → *), (2d → dim). Consequently, the NMT-based repair model considers that the generated non-compilable patch is closer to the human-written patch, and thus should be favored during backpropagation. However, in the theory of program repair, the semantically equivalent patch should have a loss close to zero, because it is a valid solution to the bug. The generated non-compilable patch has a lower loss, but still cannot satisfy the compiler, which is inconsistent with our goal. This example shows the fundamental limitation of optimizing the cross-entropy loss for the program repair task.

III. REWARDREPAIR

A. Overview

Figure 1 gives an overview of RewardRepair. The top, middle, and bottom parts represent the three stages of syntactic training, semantic training and inference. RewardRepair is trained with two datasets, respectively: a syntactic training dataset with pairs of buggy and patched code, and a semantic training dataset. The requirement for the semantic training dataset is full execution: each sample in the dataset comes with a compiler configuration and test cases.

Syntactic training is our first training phase. We train RewardRepair to optimize the cross-entropy loss based on a large bug-fix corpus per the related work [4]–[7], [9], [18]. Syntactic training is able to provide a good pre-trained model for semantic training.

Semantic training is our second phase, after syntactic training. Semantic training is novel and based on a discriminative model. The discriminative model of RewardRepair analyzes the candidate patches with four discriminators (difference discriminator, compilability discriminator, plausibility discriminator and regression discriminator), and modulates the cross-entropy loss before the start of backpropagation during training.


[Figure 1 depicts the three stages. Syntactic training: the patch generator takes buggy-code/human-patch pairs as training input and its weights are updated with backpropagation on the cross-entropy loss. Semantic training: the patch generator produces candidate patches that go through the discriminative model of RewardRepair (difference, compilability, plausibility and regression discriminators); their execution-based reward values (no-change penalty, difference reward, compilability reward, plausible reward, likely-correct reward) feed the RewardRepair loss; training data points are tuples of buggy code, human patch, and an executable application (compilable bug-fix patches + executable tests). Inference: for bugs under actual repair, the trained patch generator takes buggy code as input and outputs suggested patches.]

Fig. 1: Overview of RewardRepair: syntactic training is per the state-of-the-art of neural program repair, semantic training is our core contribution. RewardRepair produces a neural network that is used later at inference time to generate patches for new unseen bugs.

Inference is the final phase. Once RewardRepair has been trained, it can generate patches for new and unseen bugs.

B. Syntactic Training of RewardRepair

RewardRepair is initially trained with syntactic loss calculation per the state-of-the-art of NMT for program repair [4]–[7], [9], [18]. In this paper, we call this training “syntactic training”; it can be considered as the pre-training for the semantic training of RewardRepair.

As code representation, we follow Lutellier et al. [5] and represent the buggy code and context code as two separate token sequences. In addition, the context code is enriched with a summary of the buggy class as proposed by Chen et al. [4], as follows: we keep all the instance variables and their initializers, along with the signatures of the constructors and methods. In order to handle code vocabulary problems [19], tokenization is done after identifier splitting with WordPiece [20].

In RewardRepair, the patch generator is trained in a supervised manner based on an encoder-decoder architecture [21], [22]. Syntactic training takes as input buggy code tokens B = (b_0, b_1, ..., b_n), context code tokens C = (c_0, c_1, ..., c_m), and tokens from the ground truth fix patch F = (f_0, f_1, ..., f_t). RewardRepair transforms B and C into a predicted patch F' = (f'_0, f'_1, ..., f'_k). Note that the sizes of the buggy code, context code, ground truth patch and predicted patch, i.e., n, m, t and k, can be different. Then, RewardRepair uses the cross-entropy loss to compare the distributions between the tokens from the ground truth patch F and the predicted patch F'. For each training point, the optimization target is to minimize the cross-entropy between F and F'.
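As an illustration, one syntactic training step over a (buggy, context, fix) triple could look as follows. This is a minimal sketch assuming a HuggingFace-style encoder-decoder; the checkpoint name and the separator convention are our assumptions, not the paper's exact implementation:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")  # hypothetical checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

buggy = "return FastMath.pow(2 * FastMath.PI, -dim / 2) *"
context = "class MultivariateNormalDistribution { private final int dim; }"
fix = "return FastMath.pow(2 * FastMath.PI, -0.5 * dim) *"

# The buggy code B and context code C form the encoder input; the human patch F is the target.
inputs = tok(buggy + " " + tok.eos_token + " " + context,
             truncation=True, max_length=512, return_tensors="pt")
labels = tok(fix, truncation=True, max_length=100, return_tensors="pt").input_ids

# The returned loss is the token-level cross-entropy between F' and F.
loss = model(**inputs, labels=labels).loss
loss.backward()  # purely syntactic backpropagation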

Recall that a model optimized based on cross-entropy is purely syntactic: the loss only requires a pairwise matching at the token level between the predicted patch F' and the ground truth fix patch F. As discussed in subsection II-B and shown in Listing 1, because of this, there exists a discrepancy between the training objectives and the evaluated loss metric. In the field of neural machine translation (NMT), this problem is known as the overcorrection phenomenon of cross-entropy loss [11], [16]: a cross-entropy-based model tends to learn strictly identical translations and to overcorrect synonymous words which would be acceptable. In both cases, program repair and natural language translation, the problem is that the NMT-based model does not learn semantic relations. To mitigate this problem, the key contribution of RewardRepair is to further train the patch generator with a novel kind of training, called semantic training, presented next.

C. Semantic Training of RewardRepair

In RewardRepair, once the convergence of syntactic training has been achieved, we start the semantic training process. The goal of semantic training is to let the patch generator be aware of program-specific knowledge (compilation and execution) beyond the syntactic loss computation at the token level. For that, we propose a mixed learning objective [16], where “mixed” means that the core learning objective is combined with two or more sub-learning objectives. In this paper, our mixed learning objective combines the core cross-entropy objective with compilation and execution information.

During semantic training, RewardRepair employs a discriminative model to assess the quality of generated patches based on the compilation and test oracles. As a result, the discriminative model outputs a reward value to quantify the patch quality, which is used to adjust the weight of the token-level loss during training. A higher reward means a better quality of the generated patch, e.g., the patch is compilable and passes all test cases. On the contrary, a lower reward means that the quality of the generated patch is unsatisfactory, e.g., the patch is non-compilable.

To be specific, the patch-level reward value modulates the token-level loss value before the start of the backward pass [23]–[25], so that the novel loss properly represents program-specific knowledge. In such a way, the discriminator guides the neural network to generate high-quality patches¹. To our knowledge, we are the first to introduce semantic training based on a patch discriminative model for neural program repair.

1) Discriminative model: In machine learning, a discriminator is a model for identifying “good” and “bad” predictions [27]. The discriminator of RewardRepair is the key component of the semantic training phase. Its goal is to discriminate the quality of the generated patches, which is quantified as a reward signal. In RewardRepair, the discriminative model is composed of four discriminators, each of them specialized in one aspect of patch quality. These four discriminators are serially executed. Each discriminator does a binary classification: whether a patch fulfills the discriminator's criterion or not. When a discriminator is affirmative, it means the patch passes the quality criterion, and it is passed to the next discriminator. In the rest of this section, we present the four discriminators of RewardRepair.

a) Difference discriminator: The difference discriminator validates whether the generated patches are different from the given buggy code. It is required because we found that neural repair models sometimes generate a “no-change” patch that is identical to the given buggy code, i.e., the output of the model is identical to the input of the model. This happens when the neural nets discover that the buggy code achieves the maximum likelihood estimation of the given ground truth patch. This confirms previous research [28], which has shown that the majority of buggy code is similar to correct code, with only minor transformations and few changed tokens, resulting in a small cross-entropy loss at training time. Consequently, the generator tends to copy the buggy code because it is the maximum likelihood prediction, per the data seen at training time.

We design the difference discriminator to syntactically compare the generated patch code with the input buggy code, with a token sequence comparison. If the two token sequences are different, the difference discriminator assigns the generated patch a reward, called in this paper the difference reward (R_diff). Otherwise, a no-change penalty (R_no-change), i.e., a negative reward, is assigned. These reward signals modulate the RewardRepair loss to avoid the generation of no-change patches. Furthermore, with the difference discriminator reward signals, the patch generator is encouraged to generate diverse patches; a minimal sketch of the check follows.
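A sketch of this check (ours; the reward constants mirror the values configured in subsection III-G, where c = 0.2 and s_0 = −1, and are formally defined in subsection III-C2):

def difference_reward(buggy_tokens, patch_tokens, r_diff=0.2, r_no_change=-1.0):
    # Compare the generated token sequence with the buggy one and
    # reward any actual difference; penalize a no-change patch.
    if patch_tokens == buggy_tokens:
        return r_no_change  # no-change penalty
    return r_diff           # difference reward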

b) Compilability discriminator: The compilability discriminator validates whether the generated patches are compilable. As shown in previous research [4]–[7], [9], [18], neural repair models suffer from outputting uncompilable patches. For example, the golden SEQUENCER model produces 79.7% of patches that are not compilable. To force the generator towards producing compilable patches, we introduce a compilability discriminator in the semantic training phase. The compilability discriminator works as follows: it tries to compile all generated patches for a buggy project for which the compilation can be set up. If the patched program is compilable, RewardRepair assigns and returns a compilable reward (R_compilable) to the generated patch. This compilable reward is combined with the cross-entropy loss at the token level to update the weights of the generator so as to produce more compilable patches.

¹RewardRepair bypasses the generator differentiation problem by modulating the loss scale [25], [26], which avoids the differentiation difficulty for discrete data.
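The check itself can be as simple as invoking the Java compiler on the patched source; a sketch, assuming a plain javac invocation (real projects need the full build configuration mentioned above):

import pathlib
import subprocess
import tempfile

def compiles(patched_source: str, classname: str = "Patched") -> bool:
    # Return True if the patched Java source compiles (javac exits with 0).
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d) / (classname + ".java")
        src.write_text(patched_source)
        result = subprocess.run(["javac", str(src)], capture_output=True)
        return result.returncode == 0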

c) Plausibility discriminator: The plausibility discrim-inator aims at encouraging the generator to produce patchesthat pass the human-written test cases provided by developers.Recall that per the seminal work on program repair [29], testcases can be used as an executable program specification.In other words, if a patch passes the human-written tests, itis considered as a plausible patch. During semantic training,RewardRepair is trained on buggy projects with executablehuman-written tests. Each candidate patch is executed againstthe human-written tests. If a patch makes all human-writtentestes pass, a plausible reward (Rplausible) is assigned. Bydoing so, the plausibility discriminator leverages the human-written tests to drive the network to mostly produce plausiblepatches that pass all human-written tests.

d) Regression discriminator: The last discriminator in the RewardRepair discriminative model is the regression discriminator. The goal of this discriminator is to minimize the behavioral changes introduced by the patch. This discriminator complements the plausibility discriminator by specifying behavior outside the human-written tests, in order to avoid patch overfitting [30]–[33]. The RewardRepair regression discriminator employs automatically generated tests, per the RGT technique of Ye et al. [13]. Per previous work [13], [34], the RGT technique has shown that 72% and 96.7% of overfitting patches can be automatically detected for the Defects4J and QuixBugs benchmarks, respectively. The idea of the RGT technique is to automatically generate test cases that expose program execution behavior differences between a generated patched program and a ground truth patched program [13], [14], [32]. The human (developer)-written patches are considered as the ground truth. Consequently, the RGT test generation based on those ground truth programs encodes fully or partially the expected program behavior in the new test cases.

If a candidate patch makes all RGT tests pass, i.e., does not contradict the ground truth program behavior, it is considered as likely-correct².

During semantic training, all patches that are plausible per the plausibility discriminator are executed against the RGT tests. If a patch makes all automatically generated tests pass, then a likely-correct reward signal (R_l-correct) is assigned to this patch. Otherwise, it means the generated patch is behaviorally different from the ground truth patch, and should thus be avoided. This discriminator's reward is used to encourage RewardRepair to avoid regressions.

²RGT cannot ensure that a patch is fully correct, as the generated test cases may not cover all the behaviors of the class under test.
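Putting the four checks together, the discriminative model can be sketched as a serial cascade that stops at the first failed criterion and returns the reward earned so far (formalized in Equation 1 below). The helpers compiles and passes_tests are hypothetical, and the reward constants follow the c = 0.2, s_0 = −1 configuration of subsection III-G:

def discriminator(buggy, patch, human_tests, rgt_tests,
                  s=(-1.0, 0.2, 0.4, 0.6, 0.8)):
    # Serially apply the four discriminators; return the highest reward earned.
    if patch == buggy:
        return s[0]                          # no-change penalty
    reward = s[1]                            # difference reward
    if not compiles(patch):
        return reward
    reward = s[2]                            # compilability reward
    if not passes_tests(patch, human_tests):
        return reward
    reward = s[3]                            # plausible reward
    if passes_tests(patch, rgt_tests):
        reward = s[4]                        # likely-correct reward
    return reward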


2) Defining reward values from discriminators: As shown previously, RewardRepair defines five reward signals R that represent the patch quality assessment from the four discriminators, denoted as follows: a) the no-change penalty R_no-change is given when the generated code is identical to the given buggy code; b) the difference reward R_diff is given when the generated code is different from the given buggy code; c) the compilable reward R_compilable is given to a compilable patch; d) the plausible reward R_plausible is given to a patch that passes all human-written tests; and e) the likely-correct reward R_l-correct is given to a patch that passes all generated RGT tests.

The discriminators are executed serially; that is, if a patch fails at one discriminator, then the reward R obtained up to that moment is returned immediately and the remaining discriminators are not executed. Consequently, R is the maximum of the five reward values, as follows:

R = max(R_no-change, R_diff, R_compilable, R_plausible, R_l-correct), with R_no-change = s_0, R_diff = s_1, R_compilable = s_2, R_plausible = s_3, R_l-correct = s_4   (1)

where s_i (i ∈ {0, 1, 2, 3, 4}) are five scaling parameters that control the range of the reward values. The scaling parameters s_i define a configuration space that is controlled by the end-users of RewardRepair. Additionally, those reward values must fulfill the following constraint:

R_no-change < R_diff < R_compilable < R_plausible < R_l-correct   (2)

where a higher reward value represents a better quality of the generated patch.

Given the cross-entropy L, the loss function of RewardRepair, L_RewardRepair, dynamically reweighs L with respect to the discriminator's reward signal R, which encodes the quality of the generated patch. We follow [25], [26] to formulate the RewardRepair loss function as a scaling of the cross-entropy, as follows:

L_RewardRepair = (1 − R) × L   (where R < 1)   (3)

The reward modulator (1 − R) constrains the domain of R to (−∞, 1), as it is meaningless for the objective function to minimize a negative loss. In this formulation, the syntactic cross-entropy at the token level is combined with the semantic reward at the patch level, embedding compilation and execution knowledge deep into the neural model. It mitigates the limitations of only considering the cross-entropy loss in program repair tasks and solves the case discussed in subsection II-B.

For low-quality patches, RewardRepair assigns a negative reward value to increase the original cross-entropy loss L; thus the domain for R_no-change is (−∞, 0). On the contrary, for high-quality patches, RewardRepair scales down the original L by assigning positive reward values to R_diff, R_compilable, R_plausible, and R_l-correct to encourage the model; the domain for these four reward values is [0, 1). Based on Equation 3, the higher the reward, the smaller the computed loss that is back-propagated into the neural model. The extreme case is a reward value approaching 1, where the RewardRepair loss goes close to 0. This indicates that the generated patch is likely correct and the network should not be changed.

Algorithm 1 One step of semantic training
1: Input: buggy code b, context code c, ground-truth patch code p, human-written test cases t_h, RGT test cases t_m, learning rate α; G is the RewardRepair patch generator, D is the discriminator function
2: b ← φ(b)  {Encode buggy code}
3: c ← φ(c)  {Encode context code}
4: p ← φ(p)  {Encode ground-truth patch code}
5: for i in t_beam do
6:   q ← G(b, c)  {Generate candidate patch for the i-th beam}
7:   L = −Σ_{x∈X} p(x) log q(x)
8:   R ← D(b, c, q, t_h, t_m)
9:   L_RewardRepair = (1 − R) × L
10:  G ← G − α ∂L_RewardRepair/∂G  {Update patch generator with backpropagation}
11: end for

Combining the domain constraints discussed above, the scaling parameters s_i must meet the following criteria:

s_0 ∈ (−∞, 0);  s_i ∈ [0, 1), s_{i−1} ≤ s_i, i ∈ {1, 2, 3, 4}   (4)

The choice of the scaling parameters is open-ended. In this paper, we simplify the positive scaling parameters s_i (i ∈ {1, 2, 3, 4}) as an arithmetic progression: the difference between any two consecutive s_i is a constant value c, so s_i = i × c. In this way, the biggest reward, R_l-correct = s_4 = 4 × c, must stay within [0, 1). Based on Equation 4, the range of parameter c is [0, 0.25).
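For instance, with the constant c = 0.2 used in our implementation (subsection III-G), the rewards and the modulated loss of Equation 3 follow directly; a short sketch:

c = 0.2                                      # arithmetic-progression constant, c in [0, 0.25)
s = [-1.0] + [i * c for i in (1, 2, 3, 4)]   # [-1.0, 0.2, 0.4, 0.6, 0.8]

def reward_repair_loss(cross_entropy, reward):
    # Equation 3: scale the token-level loss by the patch-level reward (R < 1).
    assert reward < 1
    return (1 - reward) * cross_entropy

# A likely-correct patch (R = 0.8) keeps only 20% of its loss, while a
# no-change patch (R = -1.0) has its loss doubled.
print(reward_repair_loss(0.4224, s[4]), reward_repair_loss(0.4224, s[0]))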

3) Requirements for semantic training: Since the semantic training phase is based on the compilation and test oracles, we need a dataset with the ability to compile the code and run the test cases. In this sense, a patch-level reward signal for the generated patch is dynamically determined by executing the compiler and running the test scripts in the original buggy programs. On the contrary, recall that the syntactic training dataset only requires tokenization of buggy and patched files.

D. Algorithm

RewardRepair applies two phases to train the patch generator. First, using a training dataset of patches, it conducts syntactic training by minimizing the cross-entropy. Second, it conducts a semantic training process, which starts with the learned model from the previous phase.

Algorithm 1 presents one step of semantic training. Given the encoded buggy code b and context code c (lines 2 and 3), the patch generator G generates a candidate patch (line 6). Then, the cross-entropy loss L is computed by comparing the token distributions between the ground truth patch p and the generated patch q (line 7), where x indexes the tokens. The discriminator provides a reward R based on the generated patch q and the corresponding program compilability and test execution information (line 8). Lastly, RewardRepair combines the cross-entropy loss at the token level and the reward value at the patch level to form the novel RewardRepair loss (line 9), which encodes the program-specific knowledge. To obtain more semantic training samples, RewardRepair generates multiple candidate patches for a given bug. The number of patches is configured by the training beam t_beam (line 5). We repeat the optimization process until all candidate patches from the beam are generated and evaluated by the discriminator.
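The following sketch renders Algorithm 1 in PyTorch-style code (ours, not the paper's implementation; it reuses the hypothetical discriminator sketched in subsection III-C1 and assumes a HuggingFace-style generate API):

def semantic_training_step(model, tok, optimizer, discriminator,
                           buggy, context, gt_patch, human_tests, rgt_tests,
                           t_beam=5):
    # Lines 2-4: encode buggy code b, context code c and ground-truth patch p.
    inputs = tok(buggy + " " + context, truncation=True, max_length=512,
                 return_tensors="pt")
    labels = tok(gt_patch, truncation=True, max_length=100,
                 return_tensors="pt").input_ids
    # Lines 5-6: generate t_beam candidate patches q with beam search.
    candidates = model.generate(**inputs, num_beams=t_beam,
                                num_return_sequences=t_beam, max_length=100)
    for ids in candidates:
        q = tok.decode(ids, skip_special_tokens=True)
        # Line 7: token-level cross-entropy L against the ground-truth patch.
        L = model(**inputs, labels=labels).loss
        # Line 8: patch-level reward from compilation and test execution.
        R = discriminator(buggy, q, human_tests, rgt_tests)
        # Lines 9-10: modulate the loss and backpropagate into the generator G.
        loss = (1 - R) * L
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()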

E. Comparison of Syntactic and Semantic Training

While previous work only does syntactic training, our key novelty is the semantic training phase. We now compare both.

The first main difference is the sophistication of the training data. Large-scale training corpora are available for syntactic training (e.g., [35]) because they can be relatively easily created by collecting and analyzing commits (e.g., from GitHub). On the contrary, semantic training datasets must include, for each training point: 1) all the sources of the program under repair and its test cases; 2) the script to build the program; and 3) the script to run the test cases.

The requirements on the data to be used for semantic training are higher, because each training point requires compilation and test oracles. This means that it is much harder to create such a dataset. As a result, semantic training is done with fewer data points.

The second main difference is the computation required for training. Syntactic training is much faster than semantic training because of the static loss computation at the token level. On the contrary, semantic training needs to compile and execute the test cases to determine the reward value.

The third main difference we identify is the quality of the learning signal. There is a trade-off between the amount of data and the quality of the learning signal: for syntactic training, there is plenty of data, but with a low-quality signal, as the patched programs are not executed. For semantic training, the training data is scarce but comes with a high-quality learning signal: it includes information about the compilability and plausibility of candidate patches. The strength of RewardRepair is to combine both, where syntactic training can be considered as pre-training of the generator before semantic training.

F. Inference

In the inference phase, the trained RewardRepair model is used to predict patches for new and unseen bugs, with the same code representation as the training samples. Per the prior works [4], [5], [18], we assume that the perfect localization of the bug is known, as other related tools can run fault localization to localize the buggy source code [36]. For a fair comparison, RewardRepair is configured with a beam size n and outputs the n best patches for the given bug, per the related work [4], [5], [37], [38]. To evaluate the correctness of the suggested patches, we leverage the reference patches of the given bugs. If a suggested patch is identical or semantically equivalent to the corresponding reference patch, it is deemed a correct repair.
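At inference time, suggesting patches reduces to beam search over the trained generator. A sketch with the same HuggingFace-style API assumed above, using the beam size of 50 from our Defects4J experiments:

def suggest_patches(model, tok, buggy, context, beam_size=50):
    # Return the beam_size best candidate patches for an unseen bug.
    inputs = tok(buggy + " " + context, truncation=True, max_length=512,
                 return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=beam_size,
                             num_return_sequences=beam_size, max_length=100)
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]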

Phase              | Requirements                                       | Name & Source  | #Patches
Syntactic training | Tokenization                                       | CoCoNut [5]    | 3 241 966
Semantic training  | Tokenization, Compilation, Developer and RGT Tests | QuixBugs [39]  | 34
Validation, RQ1    | Tokenization                                       | CodRep-4 [35]  | 4 109
RQ2 & RQ3          | Tokenization, Compilation, Developer and RGT Tests | Defects4J [12] | 124

TABLE I: Datasets used for the different steps of our experiment.

G. Implementation

In this paper, we implement RewardRepair's patch generator with the state-of-the-art Transformer-based architecture [21]. After hyper-optimization, we use a vocabulary size of 32 128, a batch size of 20, a maximum of 512 tokens for the buggy and context code, a maximum generated patch length of 100 tokens, and a learning rate of 1e−4. We configure the constant reward value c = 0.2, s_0 = −1, and the training beam t_beam = 5.
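Collected as a configuration sketch (the values are those stated above; the key names are ours):

CONFIG = {
    "vocab_size": 32128,
    "batch_size": 20,
    "max_input_tokens": 512,    # buggy + context code
    "max_patch_tokens": 100,
    "learning_rate": 1e-4,
    "reward_constant_c": 0.2,   # hence s = [-1, 0.2, 0.4, 0.6, 0.8]
    "s0_no_change_penalty": -1,
    "t_beam": 5,                # training beam
}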

IV. EXPERIMENTAL METHODOLOGY

In this section, we describe our methodology for evaluating RewardRepair by defining three research questions and explaining how we propose to answer them.

A. Research Questions

• RQ1 (syntactic loss improvement): To what extent does semantic training improve the traditional syntactic training?

• RQ2 (patch quality): To what extent does semantic training improve the compilability and correctness of generated patches?

• RQ3 (comparison with other tools): To what extent is RewardRepair effective at repairing bugs compared with the state-of-the-art repair approaches?

B. Dataset

Table I shows the four datasets of patches that we consider to experimentally evaluate the performance of RewardRepair. The first column indicates the phase where each dataset is used; for example, we train RewardRepair on different datasets for syntactic and semantic training, respectively. The second column gives the requirements on each dataset. The third column gives the source of the dataset and the fourth column indicates the number of patches in the dataset. For example, as shown in the first row, RewardRepair is syntactically trained with more than 3 million patches (3 241 966) from the prior work CoCoNut [5], which all meet the requirement of syntactic training, being tokenizable.

As aforementioned in Section III-C3, the selection criteria for the semantic training dataset are as follows: we need a compiler to build the patched program and the scripts to run the test cases. Also, we need the automatically generated tests to specify the expected program behavior. Moreover, per previous work [4], [18], [40], we focus on single-hunk bugs, where the ground truth patch is confined to a single contiguous chunk of code, at a single location. All criteria are met for 34 bugs of the QuixBugs dataset [39], for which RGT tests have been generated by previous research [34]. Since we generate multiple candidate patches for a given bug in one epoch (shown in Algorithm 1), this amplifies the learning effect on the 34 bugs based on the number of generated patches in the beam. This allows RewardRepair to explore how diverse patches are handled by the discriminator. As a validation dataset for early stopping and hyper-parameter optimization during syntactic and semantic training, we use 4 109 patches from the CodRep-4 dataset [35], per Chen et al. [4]; the only constraint on this dataset is being tokenizable.

For the final testing of RewardRepair, we use a well-accepted dataset from program repair research [41]–[43]: Defects4J [12], a bug dataset for which all bugs can be compiled and tested, including with automatically generated tests [13], [44]. Also, using Defects4J as an evaluation dataset allows us to do a fair comparison between the effectiveness of RewardRepair and other approaches on a bug-per-bug basis. Per the prior work [4], [18], we select the subset of Defects4J bugs consisting of single-hunk bugs; this results in 124 Defects4J bugs for testing. With the compilation, human-written tests and RGT tests of Defects4J, we are able to evaluate the compilability, plausibility and regression (i.e., likely-correctness) of the predicted patches.

C. Methodology for RQ1

In RQ1, we measure the impact of semantic training on the traditional syntactic cross-entropy loss. For this, we compare the total cross-entropy loss of RewardRepair: a) only with syntactic training; and b) with both syntactic and semantic training.

For syntactic training, RewardRepair is trained on the CoCoNut dataset with 3 241 966 pairs of buggy and patched code (see Section IV-B). For semantic training, RewardRepair is trained on the QuixBugs dataset with 34 buggy programs, each of which is fully executable with a compiler configuration, human-written tests and RGT tests.

We measure the impact of semantic training as follows. The epochs are separated as either syntactic or semantic training. After each epoch, we compute the total cross-entropy loss of RewardRepair over the whole validation dataset. By using the same loss function and the same dataset, we fairly measure the impact of semantic training w.r.t. syntactic training. We use CodRep-4 [35] as a validation set, which is composed of 4 109 patches. CodRep-4 has been used by the neural program repair approach SEQUENCER [4]. The number of epochs for syntactic training and semantic training is determined by the loss convergence.

D. Methodology for RQ2

In RQ2, we compare the effectiveness of RewardRepair w.r.t. compilability and correctness. The overall methodology is per RQ1, with two major differences: 1) the considered metrics are no longer the cross-entropy loss, but metrics related to execution, defined below; and 2) the testing dataset is Defects4J. We use the 124 Defects4J bugs explained in subsection IV-B. Per the previous work in neural repair [4], [5], [18], we assume that the perfect localization of the bug is known. This enables a fair assessment of program repair techniques independently of the fault localization approach used [41]. In this experiment on Defects4J, RewardRepair is configured with a beam size, and we use the standard value from the literature [4], [37], [38]: a beam size of 50. As a result, RewardRepair generates 6 200 (124 × 50) patches for Defects4J.

To compare how semantic training improves repairability compared to syntactic training, we first classify the generated patches for the 124 bugs from Defects4J into four categories: 1) different patches (denoted |D|), as opposed to no-change patches; 2) compilable patches (denoted |C|), i.e., generated patches that are accepted by the Java compiler; 3) plausible patches (denoted |P|), i.e., generated patches that pass all human-written tests provided in Defects4J; and 4) correct human patches (denoted |H|), i.e., generated patches that are either identical or semantically equivalent to the given human-written patches. We use both manual analysis and the RGT tests from prior work [13] to automatically assess the correctness of those patches. Moreover, those categories are not exclusive: a single patch can be counted in several categories; for example, a correct patch would be classified as |C|, |P| and |H|. Those metrics are meant to be interpreted as follows: the higher |D|, |C|, |P| and |H|, the better the performance of RewardRepair, i.e., we maximize the number of different, compilable, plausible and correct patches that are generated by RewardRepair.
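Counting the four non-exclusive categories is then simple set logic; a sketch reusing the hypothetical compiles and passes_tests helpers from Section III, with invented patch/bug attributes:

def classify_patches(patches, bug):
    # Count the four non-exclusive categories for one bug's generated patches.
    D = sum(p.code != bug.buggy_code for p in patches)  # |D| different
    C = sum(compiles(p.code) for p in patches)          # |C| compilable
    P = sum(compiles(p.code) and passes_tests(p.code, bug.human_tests)
            for p in patches)                           # |P| plausible
    H = sum(p.is_correct for p in patches)  # |H| identical or semantically equivalent
    return D, C, P, H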

E. Methodology for RQ3

In RQ3, we compare the patches generated by RewardRepair against the ones of the state-of-the-art neural repair approaches SEQUENCER [4], CoCoNut [5] and DLFix [18]. Beyond comparing with neural repair approaches, for the sake of perspective, we also compare RewardRepair with the recent notable program repair approaches Hercules [45], TBar [40], SimFix [46] and CapGen [47], which are not based on deep learning and have been evaluated on Defects4J. We note that these four repair approaches have comparable performance, as reported in previous work [5], [41]. As in previous studies [5], [18], we take the results either from the original papers or from previous work [41], which reran the repair approaches with perfect localization.

We compute two key performance metrics for the generated Defects4J patches: 1) the number of bugs that are correctly repaired; in our paper, a bug is deemed correctly repaired if the patch meets the two following criteria: it passes all RGT tests [13] and it is considered correct by manual analysis; and 2) the number of bugs that can be uniquely repaired by individual repair models.

V. EXPERIMENTAL RESULTS

A. RQ1: Syntactic Loss Decrease

Fig. 2: The total cross-entropy loss over the validation dataset CodRep-4 further decreases with RewardRepair's semantic training.

Figure 2 compares RewardRepair's effectiveness on the validation set CodRep-4 as measured by the traditional cross-entropy loss. The X-axis indicates the training epochs, and is split in two: the left-hand side is made of syntactic training epochs (8 epochs) and the right-hand side is composed of semantic training epochs (7 epochs). In this experiment, RewardRepair is trained over 15 epochs in total. The Y-axis indicates the corresponding total cross-entropy loss over the 4 109 patches of the CodRep-4 dataset.

As shown in Figure 2, the 8 epochs of syntactic training can be considered sufficient because the cross-entropy loss on the validation set has converged to a stable plateau (see the loss between epochs 6 and 8). Notably, at the beginning of semantic training, the cross-entropy loss decreases again (see the loss between epochs 8 and 9). This clearly shows that semantic training improves the neural model of RewardRepair, and produces a better model than the best model found with cross-entropy optimization only. Overall, the results on the common validation dataset show that the semantic training strategy based on an advanced discriminator finds new opportunities to further optimize the patch generator.

Answer to RQ1: After convergence of the cross-entropy loss based on traditional syntactic training, RewardRepair's semantic training is able to further reduce the cross-entropy loss. This shows that execution-based backpropagation identifies new opportunities for better optimizing the neural network weights.

B. RQ2: Improvement of Compilability and Correctness

Figure 3 compares the evaluation of the syntactic and semantic training on the Defects4J dataset. The vertical dashed line indicates the end of syntactic training and the beginning of semantic training. The grey, green, yellow, and red lines respectively indicate the number of generated patches that are different from the given buggy code (|D|), compilable (|C|), plausible (|P|), and equivalent to human-written (|H|) patches. The X-axis gives the number of training epochs and the Y-axis shows the number of generated patches on a logarithmic scale. For example, at the end of the first epoch of syntactic training, RewardRepair generates 524 compilable patches and 5 plausible patches.

We make the following observations. First, all metrics show that semantic training helps: the number of compilable |C|, plausible |P|, and correct |H| patches all significantly increase when semantic training is introduced. For example, the number of correct patches increases from 32 at the end of the syntactic training stage (epoch 7) to 73 at the end of the semantic training stage. Overall, there are 201 more different patches (5 783 versus 5 984, +3.48%), 374 more compilable patches (1 217 versus 1 591, +30.73%), 90 more plausible patches (40 versus 130, +225%), and 41 more correct patches (32 versus 73, +128%), which is arguably a significant improvement over syntactic training.

Fig. 3: The four key repair metrics on the Defects4J dataset. RewardRepair's semantic training clearly increases the number of compilable and correct patches.

- } else if (offsetLocal > 0 ){ //buggy patch
+ } else if (offsetLocal >= 0 ){ //human patch

//RewardRepair Patch
+ } else if (offsetLocal > 0 || offsetLocal == 0){

Listing 2: The RewardRepair patch generated for Time-19 is syntactically different but semantically equivalent.

RewardRepair achieves the best performance at 13 epochs, where it generates 1 593 compilable patches, including 132 plausible patches (|P|) that pass the human-written test oracles, and 73 patches that are either identical or semantically equivalent to the human-written (|H|) patches. Overall, the convergence of semantic training is comparable to that of syntactic training in terms of the number of epochs. We now manually analyze some cases where the model is improved by semantic training.

Semantically equivalent patch. Listing 2 shows the human-written patch and the RewardRepair patch for Defects4J bug Time-19. The RewardRepair patch is semantically equivalent to the human-written patch. Notably, it is also quite different from the human-written patch, showing RewardRepair's capability to learn beyond syntactic identity. To our knowledge, this is the first time ever that a semantically equivalent patch is reported for this bug. Interestingly, in the beam, RewardRepair also generates the human-written patch, plus two other semantically equivalent patches.

Generating subtle one-character patches. Listing 3 shows bug Mockito-26, which is correctly repaired after semantic training while not being repaired by RewardRepair with purely syntactic training. This is a case of a one-character change (in Java, number format specifiers are case-insensitive). For similar examples seen during syntactic training, such a minor character difference resulted in a small cross-entropy loss. Consequently, the backpropagation was too weak and the neural net weights were not updated enough to generate the expected patch. However, during semantic training, a subtle one-character change makes a big difference in dynamic execution, and thus in the final reward. This example shows that RewardRepair does improve the generator beyond purely syntactic optimization. To our knowledge, this Defects4J bug has never been correctly repaired by previous neural repair models based on pure syntactic training [4], [5], [18].

- primitiveValues.put(double.class, 0); // buggy code
+ primitiveValues.put(double.class, 0D); // human patch
+ primitiveValues.put(double.class, 0d); // RewardRepair

Listing 3: The one-character bug of Mockito-26 is correctly repaired after semantic training.

Fig. 4: Overlap analysis of RewardRepair with the patches repaired in the literature: (a) neural repair approaches; (b) non-deep-learning repair approaches. Project names: C: Chart, Clo: Closure, L: Lang, M: Math, Moc: Mockito, T: Time.

Answer to RQ2: Semantic training significantly improves RewardRepair's performance compared to only doing syntactic training. The semantic training phase increases the number of patches that are compilable, not overfitting, and fully correct, per the strong ground truth of identity to the developer patch. Semantic training captures syntactically small changes (few characters) that are not well handled with traditional cross-entropy optimization.

C. RQ3: Comparative Study with other Repair Approaches

Figure 4a gives an overlap analysis of the four neural program repair approaches selected (see subsection IV-B). As shown in RQ2, RewardRepair generates 73 correct patches, which correctly fix 43/124 bugs. RewardRepair outperforms two state-of-the-art neural program repair models (29 more fixed bugs than SEQUENCER and 4 more than DLFix). CoCoNut generates one more correct patch than RewardRepair (44 patches versus 43). We have two potential explanations for this: 1) we only use around 13.25% of the training dataset compared with CoCoNut; and 2) we follow the most usual beam configuration [4], [38] of 50, while CoCoNut's beam is configured to 1000. As shown in previous research [6], a larger beam size leads to more correct patches. However, analyzing the impact of beam search is out of the scope of our paper.

private void removeUnreferencedFunctionArgs(...){
+ if (!removeGlobals) {
+   return;
+ }

Listing 4: Closure-1 patch generated by RewardRepair, identical to the developer patch. Few neural repair systems handle addition-only patches.

Now, we discuss the number of uniquely fixed bugs. As shown in Figure 4a, the considered approaches all fix 9 common bugs (in the center of the Venn diagram). Beyond that, we make the following observations: RewardRepair fixes 7 unique bugs which are not repaired by SEQUENCER, CoCoNut or DLFix. Five out of seven are identical to the ground truth patches. Interestingly, two of the 7 unique patches (Math-11 and Mockito-26) are semantically equivalent patches. This indicates that semantic training based on compilation and test cases provides a reward signal that encourages RewardRepair towards generating semantically equivalent patches.

Next, we also compare the effectiveness of RewardRepair with the state-of-the-art similarity-based program repair approaches Hercules [45] and CapGen [47], and the pattern-based approaches TBar [40] and SimFix [46]. Figure 4b is a Venn diagram showing the repaired Defects4J bugs for all of them. As shown in Figure 4b, RewardRepair outperforms SimFix (28 bugs repaired) and CapGen (22 bugs repaired) in repairing Defects4J bugs, while RewardRepair underperforms w.r.t. TBar (53 bugs repaired) and Hercules (49 bugs repaired). We explain the latter as follows: TBar is built on heavy manual work to define hand-crafted templates and fix patterns. Plus, TBar combines the repair knowledge from previous work [36], [46]–[53]. To that extent, TBar can be considered as an integrated collection of tools. On the contrary, RewardRepair does not require any manually-crafted pattern.

Notably, RewardRepair is able to fix 8 unique bugs that have never been repaired by other techniques. These bugs require fix patterns that have never been proposed by the similarity-based or pattern-based repair approaches. Let us now discuss one of them, Closure-1.

Listing 4 gives the patch of Closure-1, which can only be fixed by RewardRepair. This patch adds an if block, including the code of the then statement. RewardRepair learns the if condition from the given context code outside the buggy method, the condition having been repeatedly used in the buggy class. While pattern-based repair is able to synthesize conditions, no pattern-based repair system has this complete if/then/return pattern. With respect to neural repair models, SEQUENCER and DLFix cannot produce addition patches by design (while this patch is a pure code addition). CoCoNut does not generate this patch; we suspect that, with a strict token-based cross-entropy optimization, CoCoNut does not learn such a complex patch structure. RewardRepair with semantic training is the first to produce this addition-only patch based on a non-trivial if/then/return structure.

Answer to RQ3: RewardRepair correctly fixes 43 Defects4J bugs, including eight unique Defects4J bugs that are repaired for the first time ever. RewardRepair is fully data-driven, and thus does not require any manually crafted repair templates or repair strategies.

D. Threats to Validity

A threat to external validity relates to whether the performance of RewardRepair generalizes to arbitrary programming languages and bug benchmarks. Per the standards of the field, our approach has been tested on one language (Java) and the evaluation is carried out on the standard Defects4J dataset. In principle, our approach can be applied to other programming languages and datasets. A threat to internal validity relates to the hyper-parameter configuration we adopted, which may influence RewardRepair's performance. To ensure replicability, we explain in Section III-G how the hyper-parameters are configured.

VI. RELATED WORK

A. Automatic Program Repair

A decade of research has generated a rich body of work on automatic program repair [1], [2]. We have already discussed neural repair techniques [4]–[9], [18] in Section II. These approaches only use the syntactic cross-entropy loss objective, which creates a discrepancy between the training objective of generating compilable, correct patches and the loss criterion actually being optimized. The key novelty of RewardRepair is the discriminative model that captures compilation and execution knowledge during model training and backpropagation.
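To make this discrepancy explicit, the token-level cross-entropy objective used by these approaches can be written as follows, together with one schematic way to modulate it with an execution-based reward. The notation is ours for illustration and does not claim to be RewardRepair's exact formulation:

\[
\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
\qquad
\mathcal{L}_{\mathrm{sem}}(\theta) = r(\hat{y}) \cdot \mathcal{L}_{\mathrm{CE}}(\theta)
\]

where x is the buggy code, y the ground-truth patch, and r(\hat{y}) a reward computed by compiling and testing the candidate patch \hat{y}. Nothing in \mathcal{L}_{\mathrm{CE}} alone penalizes a non-compilable prediction; the modulation term is what injects that signal.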

We mentioned generate-and-validate (G&V) program repair techniques in the RQ3 subsection (Section V-C). Other notable G&V systems include [36], [54]–[56]. Moreover, a third line of research concerns synthesis-based repair [50], [57]–[62], which converts the search problem into a satisfiability problem. All these techniques require significant domain knowledge and manually crafted patterns that are language-dependent. In contrast, RewardRepair automatically learns such fix operators, as well as the language grammar and semantics, from the training corpus.

B. Discriminators for Machine Learning on Code

Several works propose deep learning on code based on a discriminator [63], [64], where the discriminator provides a loss that reduces the discrepancy between the generated and real distributions of the object under study, as pioneered by generative adversarial networks (GAN) [27]. Harer et al. [63] propose an adversarial learning approach to repair software vulnerabilities. Alhefdhi et al. [64] leverage a similar GAN architecture to suggest repairs that are as close as possible to human-written repairs. RewardRepair shares the concepts of employing a traditional NMT model as a generator and of replacing the cross-entropy loss with the feedback from a discriminator. The key difference between this related work and ours is that our discriminator uses execution information, through compilation and test execution.

C. Semantic Learning Based on Execution

Recently, semantic information has been used in program synthesis tasks. Chen et al. [65] and Gupta et al. [66] propose execution-guided synthesis leveraging the semantics of the language. These approaches execute a partial program to obtain intermediate states that guide program synthesis. Wang et al. [67] use dynamic information from execution to measure semantic redundancy between student programs. Mesbah et al. [68] extract compiler diagnostic information as an input source for repairing compilation errors. As in our work, these approaches use execution information as an additional input for the considered model. The key difference is that none of them employs the execution information as a reward signal to update the neural net weights through backpropagation.

Wang and colleagues [69], [70] leverage full execution traces to learn neural semantic program embeddings. These related works improve the code representation based on semantic information. Our novelty is not in the representation, but in the improvement of the training objective, which is not addressed in [69], [70].

D. Improved Backpropagation

Past research has improved the cross-entropy loss based on domain-specific knowledge. In neural machine translation, Zhang et al. [11] show the limitation of considering cross-entropy loss and its tendency to overcorrect synonymous words and phrases. To mitigate this problem, further research [11], [16] proposed to combine the cross-entropy loss with translation evaluation at the sentence level. In object detection, Ryou et al. [25] proposed Anchor Loss to dynamically rescale the cross-entropy based on prediction difficulty. Loss scaling is a technique used in floating-point optimization, consisting of scaling up the loss value before the start of backpropagation [23], [24].
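For completeness, the arithmetic behind loss scaling is straightforward; the following generic formulation is our own notation and is not specific to the systems of [23], [24]:

\[
\mathcal{L}_{\mathrm{scaled}} = s \cdot \mathcal{L},\; s > 1
\quad\Rightarrow\quad
\nabla_\theta \mathcal{L}_{\mathrm{scaled}} = s \cdot \nabla_\theta \mathcal{L},
\qquad
\theta \leftarrow \theta - \eta \cdot \tfrac{1}{s}\,\nabla_\theta \mathcal{L}_{\mathrm{scaled}}
\]

Scaling up the loss keeps small low-precision (e.g., fp16) gradients representable during backpropagation, and dividing by s before the weight update leaves the effective optimization step unchanged.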

VII. CONCLUSION

We have presented a novel neural program repair model, RewardRepair, based on compilation and test execution. The key idea is to employ a discriminative model to provide a reward signal on the generated patches according to the actual execution outcome. This signal modulates the purely syntactic cross-entropy loss function in what we call semantic training. We have conducted an extensive empirical evaluation, including a comprehensive experiment on the widely used benchmark Defects4J. Our results clearly show that it is possible to embed execution information in the backpropagation process of neural program repair.

REFERENCES

[1] M. Monperrus, Automatic software repair: a bibliography, ACM Computing Surveys 51 (2017) 1–24. doi:10.1145/3105906. URL https://hal.archives-ouvertes.fr/hal-01206501/file/survey-automatic-repair.pdf

[2] L. Gazzola, D. Micucci, L. Mariani, Automatic software repair: A survey, IEEE Transactions on Software Engineering (2017).

[3] Z. Chen, M. Monperrus, A literature study of embeddings on source code (2019). arXiv:1904.03061.

[4] Z. Chen, S. J. Kommrusch, M. Tufano, L. Pouchet, D. Poshyvanyk, M. Monperrus, Sequencer: Sequence-to-sequence learning for end-to-end program repair, IEEE Transactions on Software Engineering (2019) 1–1.

[5] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, L. Tan, Coconut: Combining context-aware neural translation models using ensemble for program repair, ISSTA 2020, 2020.

[6] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, D. Poshyvanyk, On learning meaningful code changes via neural machine translation, in: Proceedings of the 41st International Conference on Software Engineering, ICSE '19, IEEE Press, 2019, pp. 25–36. doi:10.1109/ICSE.2019.00021. URL https://doi.org/10.1109/ICSE.2019.00021

[7] S. Chakraborty, Y. Ding, M. Allamanis, B. Ray, Codit: Code editing with tree-based neural models, IEEE Transactions on Software Engineering (2020) 1–1. doi:10.1109/TSE.2020.3020502.

[8] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, D. Poshyvanyk, An empirical study on learning bug-fixing patches in the wild via neural machine translation, ACM Trans. Softw. Eng. Methodol. 28 (4) (Sep. 2019). doi:10.1145/3340544. URL https://doi.org/10.1145/3340544

[9] R. Gupta, S. Pal, A. Kanade, S. Shevade, Deepfix: Fixing common C language errors by deep learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, AAAI Press, 2017, pp. 1345–1351.

[10] M. R. I. Rabin, N. D. Q. Bui, K. Wang, Y. Yu, L. Jiang, M. A. Alipour, On the generalizability of neural program models with respect to semantic-preserving program transformations (2021). arXiv:2008.01566.

[11] W. Zhang, Y. Feng, F. Meng, D. You, Q. Liu, Bridging the gap between training and inference for neural machine translation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4334–4343. doi:10.18653/v1/P19-1426. URL https://www.aclweb.org/anthology/P19-1426

[12] R. Just, D. Jalali, M. D. Ernst, Defects4j: A database of existing faults to enable controlled testing studies for Java programs, in: Proceedings of the 2014 International Symposium on Software Testing and Analysis, ACM, 2014, pp. 437–440.

[13] H. Ye, M. Martinez, M. Monperrus, Automated patch assessment for program repair at scale, Empirical Software Engineering 26 (2) (2021) 20. doi:10.1007/s10664-020-09920-w. URL https://doi.org/10.1007/s10664-020-09920-w

[14] S. Wang, M. Wen, B. Lin, H. Wu, Y. Qin, D. Zou, X. Mao, H. Jin, Automated patch correctness assessment: How far are we?, in: Proceedings of the 35th International Conference on Automated Software Engineering (ASE), ACM, 2020.

[15] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, ArXiv 1409 (09 2014).

[16] W. Lu, L. Zhou, G. Liu, Q. Zhang, A mixed learning objective for neural machine translation, in: M. Sun, S. Li, Y. Zhang, Y. Liu, S. He, G. Rao (Eds.), Chinese Computational Linguistics, Springer International Publishing, Cham, 2020, pp. 201–213.

[17] M. Ghazvininejad, V. Karpukhin, L. Zettlemoyer, O. Levy, Aligned cross entropy for non-autoregressive machine translation (2020). arXiv:2004.01655.

[18] Y. Li, S. Wang, T. N. Nguyen, Dlfix: Context-based code transformation learning for automated program repair, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 602–614. doi:10.1145/3377811.3380345. URL https://doi.org/10.1145/3377811.3380345

[19] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, A. Janes, Big code != big vocabulary: Open-vocabulary models for source code, in: 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), IEEE, 2020, pp. 1073–1085.

[20] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162. URL https://www.aclweb.org/anthology/P16-1162

[21] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (140) (2020) 1–67. URL http://jmlr.org/papers/v21/20-074.html

[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.

[23] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, H. Wu, Mixed precision training, in: International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ

[24] R. Zhao, B. Vogel, T. Ahmed, Adaptive loss scaling for mixed precision training (2020). URL https://openreview.net/forum?id=rJlnfaNYvB

[25] S. Ryou, S.-G. Jeong, P. Perona, Anchor loss: Modulating loss scale based on prediction difficulty, 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 5991–6000.

[26] L. Yu, W. Zhang, J. Wang, Y. Yu, Seqgan: Sequence generative adversarial nets with policy gradient, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, AAAI Press, 2017, pp. 2852–2858.

[27] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, NIPS'14, MIT Press, Cambridge, MA, USA, 2014, pp. 2672–2680.

[28] H. Tian, K. Liu, A. K. Kabore, A. Koyuncu, L. Li, J. Klein, T. F. Bissyande, Evaluating representation learning of code changes for predicting patch correctness in program repair, in: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, IEEE, 2020, pp. 981–992. doi:10.1145/3324884.3416532. URL https://doi.org/10.1145/3324884.3416532

[29] C. Le Goues, T. Nguyen, S. Forrest, W. Weimer, Genprog: A generic method for automatic software repair, IEEE Transactions on Software Engineering 38 (1) (2012) 54–72. doi:10.1109/TSE.2011.104.

[30] Z. Yu, M. Martinez, B. Danglot, T. Durieux, M. Monperrus, Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system, Empirical Software Engineering (2018).

[31] X.-B. D. Le, F. Thung, D. Lo, C. L. Goues, Overfitting in semantics-based automated program repair, in: Proceedings of the 40th International Conference on Software Engineering, ICSE '18, ACM, New York, NY, USA, 2018.

[32] X. D. Le, L. Bao, D. Lo, X. Xia, S. Li, On reliability of patch correctness assessment, in: Proceedings of the 41st ACM/IEEE International Conference on Software Engineering, 2019.

[33] E. K. Smith, E. T. Barr, C. Le Goues, Y. Brun, Is the cure worse than the disease? Overfitting in automated program repair, in: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, 2015.

[34] H. Ye, M. Martinez, T. Durieux, M. Monperrus, A comprehensive study of automatic program repair on the QuixBugs benchmark, Journal of Systems and Software 171 (2021) 110825. doi:10.1016/j.jss.2020.110825.

[35] Z. Chen, M. Monperrus, The CodRep machine learning on source code competition, Tech. Rep. 1807.03200, arXiv (2018). URL http://arxiv.org/pdf/1807.03200

[36] M. Martinez, M. Monperrus, Astor: A program repair library for Java, in: Proceedings of ISSTA, 2016.

[37] U. Z. Ahmed, P. Kumar, A. Karkare, P. Kar, S. Gulwani, Compilation error repair: For the student programs, from the student programs, in: Proceedings of the 40th International Conference on Software Engineering: Software Engineering Education and Training, ICSE-SEET '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 78–87. doi:10.1145/3183377.3183383. URL https://doi.org/10.1145/3183377.3183383

[38] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, D. Poshyvanyk, An empirical study on learning bug-fixing patches in the wild via neural machine translation, ACM Trans. Softw. Eng. Methodol. 28 (4) (Sep. 2019). doi:10.1145/3340544. URL https://doi.org/10.1145/3340544

[39] D. Lin, J. Koppel, A. Chen, A. Solar-Lezama, Quixbugs: A multi-lingual program repair benchmark set based on the Quixey challenge, in: Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH Companion 2017, Association for Computing Machinery, New York, NY, USA, 2017, pp. 55–56. doi:10.1145/3135932.3135941. URL https://doi.org/10.1145/3135932.3135941

[40] K. Liu, A. Koyuncu, D. Kim, T. F. Bissyande, TBar: Revisiting template-based automated program repair, in: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ACM, 2019, pp. 31–42. doi:10.1145/3293882.3330577.

[41] K. Liu, S. Wang, A. Koyuncu, K. Kim, T. F. Bissyande, D. Kim, P. Wu, J. Klein, X. Mao, Y. L. Traon, On the efficiency of test suite based program repair: A systematic assessment of 16 automated repair systems for Java programs, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 615–627. doi:10.1145/3377811.3380338. URL https://doi.org/10.1145/3377811.3380338

[42] M. Martinez, T. Durieux, R. Sommerard, J. Xuan, M. Monperrus, Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset, Empirical Software Engineering 22 (4) (2017) 1936–1964. doi:10.1007/s10664-016-9470-4.

[43] T. Durieux, F. Madeiral, M. Martinez, R. Abreu, Empirical review of Java program repair tools: A large-scale experiment on 2,141 bugs and 23,551 repair attempts, in: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, 2019, pp. 302–313. doi:10.1145/3338906.3338911.

[44] S. Shamshiri, R. Just, J. M. Rojas, G. Fraser, P. McMinn, A. Arcuri, Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (T), in: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015, pp. 201–211. doi:10.1109/ASE.2015.86.

[45] S. Saha, R. K. Saha, M. R. Prasad, Harnessing evolution for multi-hunk program repair, in: Proceedings of the 41st International Conference on Software Engineering, ICSE '19, IEEE Press, 2019, pp. 13–24. doi:10.1109/ICSE.2019.00020. URL https://doi.org/10.1109/ICSE.2019.00020

[46] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, X. Chen, Shaping program repair space with existing patches and similar code, ISSTA, 2018.

[47] M. Wen, J. Chen, R. Wu, D. Hao, S.-C. Cheung, Context-aware patch generation for better automated program repair, in: Proceedings of the 40th International Conference on Software Engineering, ICSE '18, 2018.

[48] X. B. D. Le, D. Lo, C. Le Goues, History driven program repair, in: Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1, IEEE, 2016, pp. 213–224.

[49] R. K. Saha, Y. Lyu, H. Yoshida, M. R. Prasad, Elixir: Effective object oriented program repair, in: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 648–659.

[50] X.-B. D. Le, D.-H. Chu, D. Lo, C. Le Goues, W. Visser, S3: Syntax- and semantic-guided repair synthesis via programming by examples, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, 2017.

[51] X. Liu, H. Zhong, Mining stackoverflow for program repair, in: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018.

[52] J. Hua, M. Zhang, K. Wang, S. Khurshid, Towards practical program repair with on-demand candidate generation, in: Proceedings of the 40th International Conference on Software Engineering, 2018.

[53] K. Liu, A. Koyuncu, D. Kim, T. F. Bissyande, AVATAR: Fixing semantic bugs with fix patterns of static analysis violations, in: 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, 2019, IEEE, 2019, pp. 456–467. doi:10.1109/SANER.2019.8667970.

[54] L. Chen, Y. Pei, C. A. Furia, Contract-based program repair without the contracts, in: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017.

[55] Y. Yuan, W. Banzhaf, Arja: Automated repair of Java programs via multi-objective genetic programming, in: IEEE Transactions on Software Engineering, 2018.

[56] A. Ghanbari, S. Benton, L. Zhang, Practical program repair via bytecode mutation, in: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, Association for Computing Machinery, New York, NY, USA, 2019, pp. 19–30. doi:10.1145/3293882.3330559. URL https://doi.org/10.1145/3293882.3330559

[57] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, L. Zhang, Precise condition synthesis for program repair, in: Proceedings of the 39th International Conference on Software Engineering, ICSE '17, IEEE Press, 2017, pp. 416–426. doi:10.1109/ICSE.2017.45. URL https://doi.org/10.1109/ICSE.2017.45

[58] J. Xuan, M. Martinez, F. Demarco, M. Clement, S. Lamelas, T. Durieux, D. Le Berre, M. Monperrus, Nopol: Automatic repair of conditional statement bugs in Java programs, IEEE Transactions on Software Engineering (2016).

[59] S. Mechtaev, J. Yi, A. Roychoudhury, Angelix: Scalable multiline program patch synthesis via symbolic analysis, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016.

[60] S. Mechtaev, J. Yi, A. Roychoudhury, Directfix: Looking for simple program repairs, in: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1, 2015, pp. 448–458. doi:10.1109/ICSE.2015.63.

[61] X. Gao, S. Mechtaev, A. Roychoudhury, Crash-avoiding program repair, in: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, Association for Computing Machinery, New York, NY, USA, 2019, pp. 8–18. doi:10.1145/3293882.3330558. URL https://doi.org/10.1145/3293882.3330558

[62] R. Shariffdeen, Y. Noller, L. Grunske, A. Roychoudhury, Concolic program repair, in: 42nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2021.

[63] J. A. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y. Kim, P. Chin, Learning to repair software vulnerabilities with generative adversarial networks, NIPS'18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 7944–7954.

[64] A. Alhefdhi, H. K. Dam, X.-B. D. Le, A. Ghose, Adversarial patch generation for automatic program repair (2020). arXiv:2012.11060.

[65] X. Chen, C. Liu, D. Song, Execution-guided neural program synthesis, in: International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gfOiAqYm

[66] K. Gupta, P. E. Christensen, X. Chen, D. Song, Synthesize, execute and debug: Learning to repair for neural program synthesis, NIPS'20, 2020.

[67] K. Wang, R. Singh, Z. Su, Search, align, and repair: Data-driven feedback generation for introductory programming exercises, SIGPLAN Not. 53 (4) (2018) 481–495. doi:10.1145/3296979.3192384. URL https://doi.org/10.1145/3296979.3192384

[68] A. Mesbah, A. Rice, E. Johnston, N. Glorioso, E. Aftandilian, Deepdelta: Learning to repair compilation errors, in: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, Association for Computing Machinery, New York, NY, USA, 2019, pp. 925–936. doi:10.1145/3338906.3340455. URL https://doi.org/10.1145/3338906.3340455

[69] K. Wang, R. Singh, Z. Su, Dynamic neural program embedding for program repair, ICLR, 2018.

[70] K. Wang, Z. Su, Blended, precise semantic program embeddings, in: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 121–134. doi:10.1145/3385412.3385999. URL https://doi.org/10.1145/3385412.3385999
