

Automated Clustering and Program Repair for Introductory Programming Assignments∗

Sumit Gulwani
Microsoft Corporation, USA
[email protected]

Ivan Radiček
TU Wien, Austria
[email protected]

Florian Zuleger
TU Wien, Austria
[email protected]

Abstract
Providing feedback on programming assignments is a tedious task for the instructor, and even impossible in large Massive Open Online Courses with thousands of students. Previous research has suggested that program repair techniques can be used to generate feedback in programming education. In this paper, we present a novel fully automated program repair algorithm for introductory programming assignments. The key idea of the technique, which enables automation and scalability, is to use the existing correct student solutions to repair the incorrect attempts. We evaluate the approach in two experiments: (I) We evaluate the number, size and quality of the generated repairs on 4,293 incorrect student attempts from an existing MOOC. We find that our approach can repair 97% of student attempts, while 81% of those are small repairs of good quality. (II) We conduct a preliminary user study on performance and repair usefulness in an interactive teaching setting. We obtain promising initial results (an average usefulness grade of 3.4 on a scale from 1 to 5), and conclude that our approach can be used in an interactive setting.

CCS Concepts • Applied computing → Computer-assisted instruction; • Software and its engineering → Software testing and debugging;

Keywords programming education, MOOC, dynamic analysis, program repair, clustering

ACM Reference Format:
Sumit Gulwani, Ivan Radiček, and Florian Zuleger. 2018. Automated Clustering and Program Repair for Introductory Programming Assignments. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’18).

∗Supported by the Austrian National Research Network S11403-N23 (RiSE) of the Austrian Science Fund (FWF).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
PLDI’18, June 18–22, 2018, Philadelphia, PA, USA
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-5698-5/18/06.
https://doi.org/10.1145/3192366.3192387

Figure 1. High-level overview of our approach. [Figure: correct solutions (.py A–F) are grouped by the clustering step; the repair algorithm is run on an incorrect attempt (.py G) against the clusters, yielding repair candidates R1–R3, from which a minimal repair is selected.]

ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3192366.3192387

1 Introduction
Providing feedback on programming assignments is an integral part of a class on introductory programming and requires substantial effort by the teaching personnel. This problem has become even more pressing with the increasing demand for programming education, which universities are unable to meet (it is predicted that in the US by 2020 there will be one million more programming jobs than students [1]). This has given rise to several Massive Open Online Courses (MOOCs) that teach introductory programming [28]; the biggest challenge in such a setting is scaling personalized feedback to a large number of students.

The most common approach to feedback generation is to present the student with a failing test case, either generated automatically using test input generation tools [40] or selected from a comprehensive collection of representative test inputs provided by the instructor. This is useful feedback, especially since it mimics the setting of how programmers debug their code. However, this is not sufficient, especially for students in an introductory programming class, who are looking for more guided feedback to make progress towards a correct solution.

More guided feedback can be generated from modifications that make a student’s program correct, using a program repair technique as pioneered by the AutoGrader tool [33]. Generating feedback from program repair is an active area of research: one line of work focuses on improving the technical capabilities of program repair in introductory education [9, 31, 32], while another line of research focuses



on pedagogical questions such as how to best provide repair-based feedback to the students [20, 37]. In this paper we propose a new, completely automated approach for repairing introductory programming assignments, while sidelining the pedagogical questions for future work.

Our approach The key idea of our approach is to use the wisdom of the crowd: we use the existing correct student solutions to repair the new incorrect student attempts.¹ We exploit the fact that MOOC courses already have tens of thousands of existing student attempts; this was already noticed by Drummond et al. [13].

Fig. 1 gives a high-level overview of our approach: (A) For a given programming assignment, we automatically cluster the correct student solutions (A–F in the figure), based on a notion of dynamic equivalence (see §2.1 for an overview and §4 for details). (B) Given an incorrect student attempt (G in the figure) we run the repair algorithm against all clusters, and then select a minimal repair (R2 in the figure) from the generated repair candidates (R1–R3 in the figure). The repair algorithm uses expressions from multiple correct solutions to generate a repair (see §2.2 for an overview and §5 for details).
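The two-stage pipeline of Fig. 1 can be sketched as follows. This is an illustrative stand-in, not code from the tool: `repair_against` abstracts the per-cluster repair algorithm of §5 and `cost` the syntactic-distance metric; both names are ours.

```python
def select_minimal_repair(attempt, clusters, repair_against, cost):
    # Run the repair algorithm against every cluster of correct
    # solutions; a run may fail (None) or yield a repair candidate.
    candidates = [repair_against(attempt, cluster) for cluster in clusters]
    candidates = [r for r in candidates if r is not None]
    # Return the candidate with the minimal cost (R2 in Fig. 1).
    return min(candidates, key=cost, default=None)
```

The per-cluster runs are independent, so they can also be carried out in parallel.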

Intuitively, our clustering algorithm groups together similar correct solutions. Our repair algorithm can be seen as a generalization of the clustering approach from correct solutions to incorrect attempts. The key motivation behind this approach is as follows: to help the student with an incorrect attempt, our approach finds the set of most similar correct solutions, written by other students, and generates the smallest modifications that get the student to a correct solution.

We have implemented the proposed approach in a tool called Clara and evaluated it in two experiments:

(I) On 12,973 correct and 4,293 incorrect (in total 17,266) student attempts from an MITx MOOC, written in Python, we evaluate the number, size and quality of the generated repairs. Clara is able to repair 97% of student attempts, in 3.2s on average; we study the quality of the generated repairs by manual inspection and find that 81% of the generated repairs are of good quality and that the size of the generated repair matches the size of the required changes to the student’s program. Additionally, we compare AutoGrader and Clara on the same MOOC data.

(II) We performed a preliminary user study on the performance and usefulness of Clara’s repairs in an interactive teaching setting. The study consisted of 52 participants who were asked to solve 6 programming assignments in C. The participants judged the usefulness of the generated repair-based feedback at 3.4 on average on a scale from 1 to 5.

Our experimental results demonstrate that Clara can, completely automatically, generate repairs of high quality for a large number of incorrect student attempts in an interactive teaching setting for introductory programming problems.

¹We distinguish correct and incorrect attempts by running them on a set of inputs and comparing their output to the expected output. This is the standard way of assessing correctness of student attempts in most introductory programming classes.
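The correctness check described in footnote 1 amounts to running an attempt on a set of inputs and comparing outputs. A minimal sketch (the helper name and the two sample tests for the derivatives assignment of §2 are our own; attempt I1 is ported from Python 2 to Python 3, with range for xrange):

```python
def is_correct(attempt, tests):
    # An attempt is correct iff it produces the expected output on
    # every test input; a runtime error also counts as incorrect.
    for args, expected in tests:
        try:
            if attempt(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Two tests for the derivatives assignment of Section 2.
tests = [(([6.3, 7.6, 12.14],), [7.6, 24.28]),
         (([1.0],), [0.0])]

def i1(poly):  # incorrect attempt I1 from Fig. 2 (e)
    new = []
    for i in range(1, len(poly)):
        new.append(float(i * poly[i]))
    if new == []:
        return 0.0   # bug: returns a float instead of [0.0]
    return new
```

Here `is_correct(i1, tests)` is False: I1 passes the first test but returns 0.0 rather than [0.0] on the constant polynomial.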

This paper makes the following contributions:

• We propose an algorithm to automatically cluster correct student solutions based on a dynamic program analysis.
• We propose a completely automated algorithm for program repair of incorrect student attempts that leverages the wisdom of the crowd in a novel way.
• We evaluate our approach on a large MOOC dataset and show that Clara can repair almost all the programs while generating repairs of high quality.
• We find in a real-time user study that Clara is sufficiently fast to be used in an interactive teaching setting, and obtain promising preliminary results from the participants of the user study on the usefulness of the generated feedback.

Differences to our earlier Technical Report. Our approach first appeared as a technical report², together with a publicly available implementation³. While our core ideas are the same as in the technical report, we have since made several improvements, of which we state here the two most important: (1) The repair algorithm uses expressions from different correct solutions in a cluster, as opposed to using only a single correct solution from a cluster. (2) We conducted a user study and added an extensive manual investigation of the generated repairs to the experimental evaluation.

Structure of the paper. In §2 we present an overview of our approach; in §3 we describe a simple imperative language, used for formalizing the notions of matching and clustering in §4 and the repair procedure in §5. We discuss our implementation and the experimental evaluation in §6, overview the related work in §7, discuss limitations and directions for future work in §8, and conclude in §9.

2 Overview
We discuss the high-level ideas of our approach on the student attempts to the assignment derivatives: “Compute and return the derivative of a polynomial function (represented as a list of floating point coefficients). If the derivative is 0, return [0.0].” Fig. 2 (a), (b), (e) and (f) show four student attempts to the above programming assignment: C1 and C2 are functionally correct, while I1 and I2 are incorrect.

2.1 Clustering of correct student solutions
The goal of clustering is two-fold. (1) Scalability: elimination of dynamically equivalent correct solutions that the repair algorithm would otherwise consider separately. (2) Diversity of repairs: mining of dynamically equivalent, but syntactically different, expressions from the same cluster, which are used

²https://arxiv.org/abs/1603.03165v1
³https://github.com/iradicek/clara



(a) Correct attempt C1:

    1 def computeDeriv(poly):
    2     result = []
    3     for e in range(1, len(poly)):
    4         result.append(float(poly[e]*e))
    5     if result == []:
    6         return [0.0]
    7     else:
    8         return result

(b) Correct attempt C2:

    1 def computeDeriv(poly):
    2     deriv = []
    3     for i in xrange(1,len(poly)):
    4         deriv+=[float(i)*poly[i]]
    5
    6     if len(deriv)==0:
    7         return [0.0]
    8     return deriv

(c) Different expressions for result:

    1. result += [float(poly[e]*e),]
    2. if(e==0): result.append(0.0)
       else: result.append(float(poly[e]*e))
    3. result.append(1.0*poly[e]*e)
    4. result.append(float(e*poly[e]))
    5. result += [e*poly[e]]

(d) Different expressions for the return statement:

    1. if(len(result)==0): return [0.0]
       else: return result
    2. if(len(result)>0): return result
       else: return [0.0]
    3. return result or [0.0]

(e) Incorrect attempt I1:

    1 def computeDeriv(poly):
    2     new = []
    3     for i in xrange(1,len(poly)):
    4         new.append(
    5             float(i*poly[i]))
    6     if new==[]:
    7         return 0.0
    8     return new

(f) Incorrect attempt I2:

    1 def computeDeriv(poly):
    2     result = []
    3     for i in range(len(poly)):
    4         result[i]=float((i)*poly[i])
    5     return result

(g) Repair for I1:

    1. In return statement at line 7, change 0.0 to [0.0].

(h) Repair for I2:

    1. In iterator expression at line 3, change range(len(poly)) to range(1, len(poly)).
    2. In assignment at line 4, change result[i]=float(i*poly[i]) to result.append(float(i*poly[i])).
    3. In return statement at line 5, change result to result or [0.0].

Figure 2. Motivating examples of real student attempts on the programming assignment derivatives.

to repair incorrect student attempts. We discuss the notion of dynamic equivalence next.

Matching The clusters in our approach are the equivalence classes of a matching relation. We say that two programs P and Q match, written P ∼ Q, when: (1) they have the same control-flow (see the discussion below), and (2) there is a total bijective relation between the variables of P and Q, such that related variables take the same values, in the same order, during the program execution on the same inputs. This is inspired by the notion of a simulation relation [29], adapted for a dynamic program analysis: whereas a simulation relation establishes that a program P produces exactly the same values as program Q at corresponding program locations for all inputs, we are interested only in a fixed finite set of inputs. Therefore, we also call this notion dynamic equivalence, to stress that we use dynamic program analysis.

Control-flow Our algorithms consider two control-flows the same if they have the same looping structure. That is, any loop-free sequence of code is treated as a single block; in particular, blocks can include (nested) if-then-else statements without loops (similar to Beyer et al. [7]). We point out that we could have picked a different granularity of control-flow, e.g., to treat only a straight line of code (without branching) as a block. We have picked this granularity because it enables matching of programs that have a different branching structure, and as a result our algorithm is able to generate repairs that involve missing if-then-else statements.

Example. Programs C1 and C2, from Fig. 2 (a) and (b), match, because: (1) they have the same control-flow, since there is only a single loop in both programs; (2) there is the bijective variable relation τ = {poly ↦ poly, deriv ↦ result, ? ↦ ?, i ↦ e, return ↦ return}, where the variables ? and return are special variables denoting the loop condition and the return value, which we need to make the control-flow and the return values explicit. For example, on the input poly = [6.3, 7.6, 12.14], the variable result takes the value [] before the loop, the sequence of values [7.6], [7.6, 24.28] inside the loop, and the value [7.6, 24.28] after the loop. Exactly the same values are taken by the variable deriv; and similarly for all the other variables. Therefore, C1 and C2 belong to the same cluster, which we denote by C. For the further discussion, we need to fix one correct solution as a cluster representative; we pick C1, although it is irrelevant which program from the cluster we pick.
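This matching of C1 and C2 can be checked concretely by recording the value sequences of the related variables at corresponding points. The tracing functions below are hand-instrumented versions of the two attempts (an illustration of the idea, not how the tool instruments programs; C2 is ported from Python 2 to Python 3, with range for xrange):

```python
def trace_c1(poly):
    # Values of C1's variable `result`: before, inside, and after the loop.
    vals, result = [], []
    vals.append(list(result))
    for e in range(1, len(poly)):
        result.append(float(poly[e] * e))
        vals.append(list(result))
    vals.append(list(result))
    return vals

def trace_c2(poly):
    # Values of C2's variable `deriv` at the corresponding points.
    vals, deriv = [], []
    vals.append(list(deriv))
    for i in range(1, len(poly)):
        deriv += [float(i) * poly[i]]
        vals.append(list(deriv))
    vals.append(list(deriv))
    return vals
```

On poly = [6.3, 7.6, 12.14] both traces are [[], [7.6], [7.6, 24.28], [7.6, 24.28]], witnessing the relation deriv ↦ result.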

Although the expressions in the assignments to the variables result and deriv generate the same values (i.e., they match, or they are dynamically equivalent), the assignment expressions inside the loops are syntactically quite different; we state the expressions in terms of the variables of the cluster representative C1:

• append(result, float(poly[e]*e)) in C1, and
• result + [float(e)*poly[e]] in C2,

where the second expression has been obtained by replacing the variables from C2 with variables from C1 using the variable relation τ (e.g., we have replaced deriv with result, because τ(deriv) = result). In our benchmark we found 15 syntactically different, but dynamically equivalent, ways to write expressions for the assignment to result, and 6 different ways to write the return expression (observing only different ASTs, and ignoring formatting differences). Some of these examples are shown in Fig. 2 (c) and (d), respectively. As we discuss in the next section, these different expressions are used by our repair algorithm.
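For instance, the return-expression variants of Fig. 2 (d) are syntactically distinct ASTs but dynamically equivalent on the values arising in this assignment; a quick check (written with lambdas for compactness, an encoding of ours):

```python
# The three return-expression variants from Fig. 2 (d).
variants = [
    lambda result: [0.0] if len(result) == 0 else result,
    lambda result: result if len(result) > 0 else [0.0],
    lambda result: result or [0.0],
]

# On both an empty and a non-empty derivative they return the same value.
for result in ([], [7.6, 24.28]):
    outputs = [v(result) for v in variants]
    assert all(out == outputs[0] for out in outputs)
```

The third variant relies on Python treating an empty list as falsy, which is exactly why it coincides with the explicit length checks here.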

2.2 Repair of incorrect student attempts
Our repair algorithm takes an incorrect student attempt (which we call an implementation) and a cluster (of correct



Figure 3. High-level overview of the repair algorithm. [Figure: local repairs are generated for the implementation, and a consistent subset of them is selected by constraint optimization.]

programs), and returns a repair; a repair modifies the implementation such that the repaired implementation and the cluster representative match. The top-level repair procedure takes an implementation and runs the repair algorithm on each cluster separately. Each repair has a certain cost w.r.t. some cost metric; in this paper we use syntactic distance. Finally, the repair with the minimal cost is chosen.

Fig. 2 (e) and (f) show two incorrect programs, I1 and I2, and the generated repairs in (g) and (h), respectively. These repairs were generated using cluster C, with its representative C1. Our algorithm also considered other clusters besides C (which are not discussed here), but found the smallest repair using C. In the rest of this section we discuss the repair algorithm on a single cluster.

Our algorithm generates repairs in two steps, which we discuss in more detail next: (1) The algorithm generates a set of local repairs for each implementation variable (its assigned expression). (2) Using constraint-optimization techniques, the algorithm selects a consistent subset of the local repairs with the smallest cost. A high-level overview of the algorithm is given in Fig. 3.

Local Repairs A local repair ensures that an implementation expression e_impl, either modified or unmodified, matches a corresponding cluster representative expression e_C. To establish that the expressions match, we need a variable relation that translates implementation variables to cluster variables, so that the expressions range over the same variables. We point out that for expression matching it is sufficient to consider partial variable relations that relate the variables of the expressions, as opposed to total variable relations that relate all program variables, which we considered above for program matching.

Let v be an implementation variable, with expression e_impl assigned to it at some program location. We say that (ω, e_repaired) is a local repair for v if the expressions e_C and e_repaired match, where ω is a partial variable relation that establishes the matching. In this case the implementation expression e_impl is modified to the repaired expression e_repaired. We discuss below how the algorithm generates the repaired expression e_repaired. We say that (ω, •) is a local repair for v if the expressions e_C and e_impl match, where ω is a partial variable relation that establishes the matching. In this case the implementation expression e_impl remains unmodified.

We illustrate the notion of local repairs on I1 with regard to the cluster representative C1:

1. (ω1, •) is a local repair for new before the loop, where ω1 = {new ↦ result}, since the expressions of new and result match (i.e., they take the same values).

2. (ω2, if new==[]: return [0.0] else: return new) is a local repair for return after the loop, where ω2 = {new ↦ result, return ↦ return}, since then the return expressions match.

3. (ω3, if poly==[]: return [0.0] else: return poly) is a local repair for return after the loop, where ω3 = {poly ↦ result, return ↦ return}, since then the return expressions match.

The algorithm generates a set of such local repairs for each variable and each program location in the implementation.

Finding a repair A (whole program) repair is a consistent subset of the generated local repairs, such that there is exactly one local repair for each variable and each program location in the implementation. A set of local repairs is consistent when all partial variable relations in the local repairs are subsets of some total variable relation. For example, local repairs (1) and (3) from above are inconsistent, since we have ω1(new) = result and ω3(poly) = result, and hence there is no total variable relation that is consistent with both ω1 and ω3. On the other hand, local repairs (1) and (2) are consistent, since there is a total variable relation consistent with both ω1 and ω2: {poly ↦ poly, new ↦ result, i ↦ e, ? ↦ ?, return ↦ return}.
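Consistency of a set of partial relations can be decided by checking that their union is a well-defined and injective map, and hence extends to a total bijective relation. A sketch on the relations ω1–ω3 from above (the dictionary encoding is ours):

```python
def consistent(omegas):
    forward, backward = {}, {}
    for omega in omegas:
        for v, w in omega.items():
            # The union must map each variable to a single target ...
            if forward.setdefault(v, w) != w:
                return False
            # ... and no two variables to the same target (injectivity).
            if backward.setdefault(w, v) != v:
                return False
    return True

w1 = {'new': 'result'}
w2 = {'new': 'result', 'return': 'return'}
w3 = {'poly': 'result', 'return': 'return'}
assert consistent([w1, w2])        # extendable to a total bijection
assert not consistent([w1, w3])    # new and poly both map to result
```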

There are many choices for a whole program repair, that is, choices for a consistent subset of the generated local repairs; however, we are interested in one that has the smallest cost. Our algorithm finds such a repair using constraint-optimization techniques.
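A brute-force stand-in for this optimization step (the actual algorithm uses constraint-optimization techniques rather than enumeration): pick one local repair per slot, discard inconsistent combinations, and keep the cheapest, costing 0 for an unmodified expression (•, encoded here as None) and 1 per modification. The encoding and names are ours.

```python
from itertools import product

def compatible(omegas):
    # Consistency check: the union of the partial relations must be a
    # well-defined, injective map.
    fwd, bwd = {}, {}
    for omega in omegas:
        for v, w in omega.items():
            if fwd.setdefault(v, w) != w or bwd.setdefault(w, v) != v:
                return False
    return True

def cheapest_repair(slots):
    # slots: one candidate list per (location, variable); a candidate is
    # (omega, new_expr), with new_expr None meaning "keep unmodified".
    best, best_cost = None, float('inf')
    for choice in product(*slots):
        if not compatible([omega for omega, _ in choice]):
            continue
        cost = sum(expr is not None for _, expr in choice)
        if cost < best_cost:
            best, best_cost = choice, cost
    return best

# The local repairs for I1 from items (1)-(3) above:
slot_new = [({'new': 'result'}, None)]
slot_ret = [({'new': 'result', 'return': 'return'},
             'if new==[]: return [0.0] else: return new'),
            ({'poly': 'result', 'return': 'return'},
             'if poly==[]: return [0.0] else: return poly')]
repair = cheapest_repair([slot_new, slot_ret])
```

Only the first return candidate is compatible with ω1, so the selected repair rewrites the return statement in terms of new, as in Fig. 2 (g).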

Generating the set of potential local repairs The repair algorithm takes expressions from the different correct solutions in order to generate the set of potential local repairs (the expressions e_repaired discussed above). The algorithm translates these expressions to range over the implementation variables, using a partial variable relation. Since each cluster expression matches a corresponding cluster representative expression (recall the discussion in the previous sub-section), this partial variable relation ensures that e_repaired matches a corresponding cluster representative expression (i.e., that it is a correct repair).

For example, the generated repair for I2, shown in Fig. 2 (h), combines the following expressions from the cluster:



• The first modification is generated using the expression range(1, len(poly)) from C1, at line 3, using the partial variable relation {poly ↦ poly}.
• The second modification is generated using expression (4.) from Fig. 2 (c), using the partial variable relation {result ↦ result, poly ↦ poly, i ↦ e}.
• The third modification is generated using expression (3.) from Fig. 2 (d), using the partial variable relation {return ↦ return, result ↦ result}.

3 Program Model
In this section we define a program model that captures key aspects of imperative languages (e.g., C, Python). This model allows us to formalize our notions of program matching and program repair.

Definition 3.1 (Expressions). Let V be a set of variables, K a set of constants, and O a set of operations. The set of expressions E is built from variables, constants and operations in the usual way. We fix a set of special variables V♯ ⊆ V. We assume that V♯ includes at least the variable ?, which we will use to model conditions, and the variable return, which we will use to model return values.

Def. 3.1 can be instantiated by a concrete programming language. For example, for the C language, K can be chosen to be the set of all C constants (e.g., integer, float), and O can be chosen to be the set of unary and binary C operations as well as library functions. The special variables are assumed to not appear in the original program text, and are only used for modelling purposes.

Definition 3.2 (Program). A program P = (L_P, ℓ_init, V_P, U_P, S_P) is a tuple, where L_P is a (finite) set of locations, ℓ_init ∈ L_P is the initial location, V_P is a (finite) set of program variables, U_P : (L_P × V_P) → E is the variable update function that assigns an expression to every location-variable pair, and S_P : (L_P × {true, false}) → (L_P ∪ {end}) is the successor function, which either returns a successor location in L_P or the special value end (we assume end ∉ L_P). We drop the subscript P when it is clear from the context.

We point the reader to the discussion around the semantics below for an intuitive explanation of the program model.

Definition 3.3 (Computation domain, Memory). We assume some (possibly infinite) set D of values, which we call the computation domain, containing at least the following values: (1) true (bool true); (2) false (bool false); and (3) ⊥ (undefined).

Let V be a set of variables. A memory σ : (V ∪ V′) → D is a mapping from program variables to values, where the set V′ = {v′ | v ∈ V} denotes the primed version of the variables in V; let Σ_V denote the set of all memories over the variables V ∪ V′.

Intuitively, the primed variables are used to denote the variable values after a statement has been executed (see the discussion around the semantics below).

Definition 3.4 (Evaluation). A function ⟦·⟧ : E → Σ → D is an expression evaluation function, where ⟦e⟧(σ) = d denotes that e, on a memory σ, evaluates to a value d.

The function ⟦·⟧ is defined by a concrete programming language. The function returns the undefined value (⊥) when an error occurs during the execution of an actual program.
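When expressions are drawn from a real language, such an evaluation function can be realized by guarded evaluation in the host language. A sketch of ours, where an expression is represented as a Python callable over the memory and any runtime error yields the undefined value (modelled here as None):

```python
BOTTOM = None  # stand-in for the undefined value

def evaluate(expr, sigma):
    # expr: a callable over the memory sigma, standing in for an
    # expression; any error during evaluation yields the undefined value.
    try:
        return expr(sigma)
    except Exception:
        return BOTTOM

assert evaluate(lambda s: s['poly'][1] * 1, {'poly': [6.3, 7.6]}) == 7.6
assert evaluate(lambda s: s['poly'][5], {'poly': [6.3, 7.6]}) is BOTTOM
```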

Definition 3.5 (Program Semantics). Let P = (L_P, ℓ_init, V_P, U_P, S_P) be a program. A sequence of location-memory pairs γ ∈ (L_P × Σ_{V_P})* is called a trace. Given some (input) memory ρ, we write ⟦P⟧(ρ) = (ℓ_1, σ_1) · · · (ℓ_n, σ_n) if: (1) ℓ_1 = ℓ_init; (2) σ_1(v) = ρ(v) for all v ∈ V_P; (3) (a) σ_j(v′) = ⟦U_P(ℓ_j, v)⟧(σ_j), and (b) σ_{j+1}(v) = σ_j(v′), and (c) ℓ_{j+1} = S_P(ℓ_j, σ_j(?′)), for all v ∈ V_P and 0 ≤ j < n; and (4) S_P(ℓ_n, b) = end, for any b ∈ {true, false}.

For some trace element (ℓ, σ) ∈ γ and a variable v, σ(v) denotes the value of v before the location ℓ is evaluated (the old value of v at ℓ), and σ(v′) denotes the value of v after the location ℓ is evaluated (the new value of v at ℓ). The definition of ⟦P⟧(ρ) then says:

(1) The first location of the trace is the initial location ℓ_init.
(2) The old values of the variables at the initial location ℓ_init are defined by the input memory ρ.
(3a) The new value of variable v at location ℓ_j is determined by the semantic function ⟦·⟧ evaluated on the expression U_P(ℓ_j, v).
(3b) The old value of variable v at location ℓ_{j+1} is equal to the new value at location ℓ_j.
(3c) The next location ℓ_{j+1} in a trace is determined by the successor function S_P for the current location ℓ_j and the new value of ? at ℓ_j (i.e., σ_j(?′)).
(4) The successor of the last location ℓ_n, for any Boolean b ∈ {true, false}, is the end location end.

Modelling of if-then-else statements In our implementation we model if-then-else statements differently, according to whether they contain loops or are loop-free (as mentioned in §2.1). In the former case, the branching is modelled, as usual, directly in the control-flow. In the latter case, (loop-free) statements are (recursively) converted to ite expressions that behave like the C ternary operator (as in the example that follows).

Example.We now show how a concrete program (C1 fromFig. 2 (a)) is represented in our model. The set of locations isL = {ℓbefore, ℓcond , ℓloop, ℓafter }, where ℓinit = ℓbefore is the loca-tion before the loop, and the initial location, ℓcond is the loca-tion of the loop condition, ℓloop is the loop body, and ℓafter isthe location after the loop. The successor function is given byS (ℓbefore, true) = S (ℓbefore, false) = ℓcond , S (ℓcond , true) =ℓloop ,S (ℓcond , false) = ℓafter ,S (ℓloop, true) = S (ℓloop, false)

Page 6: Automated Clustering and Program Repair for Introductory ......exploit the fact that MOOC courses already have tens of thousands of existing student attempts; this was already noticed

PLDI’18, June 18–22, 2018, Philadelphia, PA, USA Sumit Gulwani, Ivan Radiček, and Florian Zuleger

= ℓcond, and S(ℓafter, true) = S(ℓafter, false) = end. Note that for non-branching locations the successor function points to the same location for both true and false.
The set of variables is V = {poly, result, e, return, ?, it},

where it is used to model Python's for-loop iterator. An iterator is a sequence whose elements are assigned, one by one, to some variable (e in this example) in each loop iteration. The expression labeling function is given by:

• U(ℓbefore, result) = [],
• U(ℓbefore, it) = range(1, len(poly)),
• U(ℓcond, ?) = len(it) > 0,
• U(ℓloop, e) = ListHead(it),
• U(ℓloop, it) = ListTail(it),
• U(ℓloop, result) = append(result, float(poly[e]*e)),
• U(ℓafter, return) = ite(result == [], [0.0], result).

For any variable v that is unassigned at some location ℓ we set U(ℓ, v) = v, i.e., the variable remains unchanged.
Finally, we state the trace when C1 is executed on ρ = {poly ↦ [6.3, 7.6, 12.14]}. We state only defined variables that change from one trace element to the next; otherwise, we assume the values remain the same or are undefined (⊥) (if no previous value existed). JC1K(ρ) = (ℓbefore, {poly ↦ [6.3, 7.6, 12.14], result′ ↦ [], i′ ↦ 0, it′ ↦ [1, 2]}), (ℓcond, {?′ ↦ true}), (ℓloop, {e′ ↦ 1, it′ ↦ [2], i′ ↦ 1, result′ ↦ [7.6]}), (ℓcond, {?′ ↦ true}), (ℓloop, {e′ ↦ 2, it′ ↦ [], i′ ↦ 3, result′ ↦ [7.6, 24.28]}), (ℓcond, {?′ ↦ false}), (ℓafter, {return′ ↦ [7.6, 24.28]}).
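To make the model concrete, the following sketch executes C1's update expressions location by location and records the resulting trace. This is an illustrative transcription, not Clara's evaluator: location and variable names follow the example, the auxiliary counter i from the printed trace is omitted, and ListHead/ListTail are inlined as plain Python.

```python
def run_c1(poly):
    """Execute the model of C1: each step evaluates the update expressions
    of the current location and records the newly written values."""
    trace = []
    result, it = [], list(range(1, len(poly)))      # l_before
    trace.append(("l_before", {"result": list(result), "it": list(it)}))
    while True:
        cond = len(it) > 0                          # l_cond: ? := len(it) > 0
        trace.append(("l_cond", {"?": cond}))
        if not cond:
            break
        e, it = it[0], it[1:]                       # l_loop: ListHead / ListTail
        result = result + [float(poly[e] * e)]
        trace.append(("l_loop", {"e": e, "it": list(it), "result": list(result)}))
    ret = [0.0] if result == [] else result         # l_after: ite(...)
    trace.append(("l_after", {"return": ret}))
    return ret, trace

ret, trace = run_c1([6.3, 7.6, 12.14])
print(ret)  # [7.6, 24.28]
```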

4 Matching and Clustering

In this section we formally define our notion of matching.

Informally, two programs match if (1) the programs have the same control-flow (i.e., the same looping structure), and (2) the corresponding variables in the programs take the same values in the same order. For the following discussion we fix two programs, P = (LP, ℓinitP, VP, UP, SP) and Q = (LQ, ℓinitQ, VQ, UQ, SQ).

Definition 4.1 (Program Structure). Programs P and Q have the same control-flow if there exists a bijective function, called a structural matching, π : LQ → LP, s.t. for all ℓ ∈ LQ and b ∈ {true, false}, SP(π(ℓ), b) = π(SQ(ℓ, b)).

We remind the reader, as discussed in §2 and §3, that we encode any loop-free program part as a single control-flow location; as a result we compare only the looping structure of two programs.

We note that both our matching and repair algorithms require the existence of a structural matching π between programs. Therefore, in the rest of the paper we assume that such a π exists between any two programs that we discuss, and assume that L = LP = LQ and S = SP = SQ, since they can always be converted back and forth using π. Next, we state two definitions that will be useful later on.

1  fun Match(Programs P, Q, Inputs I):
2    π = structural matching or abort
3    γP,ρ = JPK(ρ) for all ρ ∈ I
4    γQ,ρ = JQK(ρ) for all ρ ∈ I
5    M = VQ × VP
6    for (v2, v1) ∈ VQ × VP:
7      for ρ ∈ I:
8        if γQ,ρ|v2 ≠ γP,ρ|v1:
9          M = M \ {(v2, v1)}
10         break
11   return BijectiveMapping(M)

Figure 4. The Matching Algorithm.

Definition 4.2 (Variables of expression). Let e be some expression; by V(e) = {v | v ∈ e} we denote the set of variables used in the expression e. We also say that e ranges over V(e).

Definition 4.3 (Variable substitution). Let τ : V1 → V2 be some function that maps variables V1 to variables V2. Given an expression e over variables V1, i.e., V(e) ⊆ V1, by τ(e) we denote the expression which we obtain from e by substituting v with τ(v) for all v ∈ V1. Note that V(τ(e)) ⊆ V2.

Given a memory σ ∈ ΣV1, over variables V1, we define τ(σ) = {τ(v) ↦ σ(v), τ(v)′ ↦ σ(v′) | v ∈ V1}, that is, the memory where v1 is substituted with v2, for all v2 = τ(v1). We lift substitution to a trace γ = (ℓ1, σ1) ··· (ℓn, σn) ∈ (L × ΣV1)∗ by applying it to each element: τ(γ) = (ℓ1, τ(σ1)) ··· (ℓn, τ(σn)) ∈ (L × ΣV2)∗.

In the rest of the paper we call a bijective function τ : V1 → V2, between two sets of variables V1 and V2, a total variable relation between V1 and V2. Next we give a formal definition of matching between two programs; afterwards, we give a formal definition of matching between two expressions. The definitions involve execution of the programs on a set of inputs.
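Definition 4.3 can be transcribed almost directly; a sketch, with an assumed encoding of memories as dicts whose keys pair a variable name with an "old"/"new" tag:

```python
def substitute_memory(tau, sigma):
    # Rename every variable occurrence (both its old and new value) via tau.
    return {(tau[v], age): val for (v, age), val in sigma.items()}

def substitute_trace(tau, gamma):
    # Apply the substitution to each trace element; locations are unchanged.
    return [(loc, substitute_memory(tau, sigma)) for loc, sigma in gamma]

# Toy example: rename Q's variables {new, i} to P's variables {result, e}.
tau = {"new": "result", "i": "e"}
gamma = [("l_loop", {("new", "old"): [], ("new", "new"): [7.6], ("i", "new"): 1})]
print(substitute_trace(tau, gamma))
```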

Definition 4.4 (Program Matching). Let I be a set of inputs, and let γP,ρ = JPK(ρ) and γQ,ρ = JQK(ρ) be the traces obtained by executing P and Q on ρ ∈ I, respectively.

We say that P and Q match over a set of inputs I, denoted by P ∼I Q, if there exists a total variable relation τ : VQ → VP such that γP,ρ = τ(γQ,ρ) for all inputs ρ ∈ I. We call τ a matching witness.

Intuitively, a matching witness τ defines a way of translating Q to range over the variables VP, such that P and Q translated with τ produce the same traces.

Given a set of inputs I, ∼I is an equivalence relation on a set of programs P: the identity relation on program variables gives a matching witness for reflexivity, the inverse τ⁻¹ of some total variable relation τ gives a matching witness for symmetry, and the composition τ1 ∘ τ2 of some total variable relations τ1, τ2 gives a matching witness for transitivity.

The algorithm for finding τ is given in Fig. 4: given two programs P and Q and a set of inputs I, it returns a matching


Clustering and Program Repair for Programming Assignments PLDI’18, June 18–22, 2018, Philadelphia, PA, USA

witness τ, if one exists. The algorithm first executes both programs on the inputs (lines 3 and 4), and then for each variable v2 ∈ VQ finds the set of variables from VP that take the same values during execution, thus defining a set of potential matches M ⊆ VQ × VP (lines 5-10); here γ|v denotes the projection of the values of variable v from a trace γ, that is: ((ℓ1, σ1) ··· (ℓn, σn))|v = σ1(v′) ··· σn(v′). The matching witness is then a bijective mapping τ ⊆ M, if one exists (line 11); τ can be found in M by constructing a maximum bipartite matching in the bipartite graph defined by M. We note that this problem can be solved in polynomial time, w.r.t. the number of edges and vertices in M (e.g., Uno [41]).
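The algorithm of Fig. 4 can be sketched in Python as follows. This is an illustrative transcription with an assumed data layout (traces[rho][v] is the sequence of new values variable v takes on input rho); for brevity the bijection is found by brute force over permutations rather than by maximum bipartite matching:

```python
from itertools import permutations

def match(vars_p, vars_q, traces_p, traces_q):
    # Potential matches M: variable pairs that agree on every input (lines 5-10).
    m = {(v2, v1) for v2 in vars_q for v1 in vars_p
         if all(traces_q[rho][v2] == traces_p[rho][v1] for rho in traces_p)}
    # A bijective mapping tau contained in M, if one exists (line 11).
    for image in permutations(vars_p):
        tau = dict(zip(vars_q, image))
        if all((v2, v1) in m for v2, v1 in tau.items()):
            return tau
    return None

# Toy data: on input 0, Q's variable new behaves like P's result, and i like e.
traces_p = {0: {"result": ([], [7.6]), "e": (1,)}}
traces_q = {0: {"new": ([], [7.6]), "i": (1,)}}
print(match(["result", "e"], ["new", "i"], traces_p, traces_q))
# {'new': 'result', 'i': 'e'}
```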

Definition 4.5 (Expression matching). Let Γ ⊆ (L × ΣVP)∗ be a set of traces over variables VP, and let e1 and e2 be two expressions over variables VP, at some location ℓ ∈ L.

We say that e1 and e2 match over a set of traces Γ, denoted e1 ≃Γ,ℓ e2, if Je1K(σ) = Je2K(σ) for all (ℓi, σ) ∈ γ where ℓi = ℓ, and all γ ∈ Γ.

Expression matching says that two expressions produce the same values when considering the memories at location ℓ in a set of traces Γ. In the following lemma we state that expression matching is equivalent to program matching; this lemma will be useful for our repair algorithm, which we state in the next section.

Lemma 4.6 (Matching Equivalence). Let I be a set of inputs, and let ΓI = {JPK(ρ) | ρ ∈ I} be the set of traces obtained by executing P on I. We have the following equivalence: P ∼I Q witnessed by τ : VQ → VP, if and only if, eP ≃ΓI,ℓ τ(eQ) for all (ℓ, v1) ∈ L × VP, where v1 = τ(v2), eP = UP(ℓ, v1), and eQ = UQ(ℓ, v2).

Proof. “⇒”: Directly from the definitions. “⇐”: By induction on the length of the trace γ = JPK(ρ) for some ρ ∈ I. □

Clustering We define clusters as the equivalence classes of ∼I. For the purpose of matching and repair we pick an arbitrary class representative from each class and collect expressions from all programs in the same cluster:

Definition 4.7 (Cluster). Let P be a set of (correct) programs. A cluster C ⊆ P is an equivalence class of ∼I. Given some cluster C, we fix some arbitrary class representative PC ∈ C.

We define the set EC(ℓ, v1) of cluster expressions for a pair (ℓ, v1) ∈ L × VPC: e1 ∈ EC(ℓ, v1) iff there is some Q ∈ C, with the matching witnessed by τ : VQ → VPC, such that v1 = τ(v2), e2 = UQ(ℓ, v2) and e1 = τ(e2).

Note that it is irrelevant which program from C is chosen as the cluster representative PC; we just need to fix some program in order to be able to define the expressions EC over a common set of variables VPC. We note that by definition the sets of expressions EC have the following property: for all e1, e2 ∈ EC(ℓ, v) we have e1 ≃ΓI,ℓ e2, that is, expressions e1 and e2 match.

Example. In §2.1 we discussed why the solutions C1 and C2 match; therefore these two solutions belong to the same cluster, which we denote here by C, and we choose C1 as its representative PC. Also, in Fig. 2 (c) and (d) we gave examples of different equivalent expressions for the assignment to the variable result inside the loop body and for the return statement after the loop, respectively. To be more precise, these were examples of the sets EC(ℓloop, result) and EC(ℓafter, return), respectively.
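Because ∼I is an equivalence relation, clustering reduces to partitioning by a pairwise test against one representative per class. A minimal sketch, where matches(p, q) is an assumed predicate deciding whether two programs match:

```python
def cluster(programs, matches):
    """Partition programs into equivalence classes of ~I. Since ~I is an
    equivalence relation, comparing against one representative per class
    suffices."""
    clusters = []
    for p in programs:
        for c in clusters:
            if matches(c[0], p):   # c[0] is the class representative
                c.append(p)
                break
        else:
            clusters.append([p])   # p opens a new cluster
    return clusters

# Toy example: "programs" are integers, "matching" means equal parity.
print(cluster([1, 2, 3, 4], lambda a, b: a % 2 == b % 2))
# [[1, 3], [2, 4]]
```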

5 Repair Algorithm

In the previous section we defined the notion of a matching between two programs. In this section we consider an implementation Pimpl and a cluster C (with its representative PC) between which there is no matching. We assume that Pimpl and PC have the same control-flow. The goal is to repair Pimpl; that is, to modify Pimpl minimally, w.r.t. some notion of cost, such that the repaired program matches the cluster. More precisely, the repair algorithm searches for a program Prepaired such that PC ∼I Prepaired, and Prepaired should be syntactically close to Pimpl. Therefore, our repair algorithm can be seen as a generalization of clustering to incorrect programs.

We first define a version of the repair algorithm that does not change the set of variables, i.e., Vimpl = Vrepaired. Below we extend this algorithm to include changes of variables, i.e., we allow Vimpl ≠ Vrepaired. In both cases the control-flow of Pimpl remains the same.

For the following discussion we fix some set of inputs I.

Let Γ = {JPCK(ρ) | ρ ∈ I} be the set of traces of the cluster representative PC for the inputs I.

As we discussed in the previous section, two programs match if all of their corresponding expressions match (see Lemma 4.6). Therefore the key idea of our repair algorithm is to consider a set of local repairs that modify individual implementation expressions. We first discuss local repairs; later on we discuss how to combine local repairs into a full program repair.

Local repairs are defined with regard to partial variable

relations. It is enough to consider partial variable relations (as opposed to the total variable relations needed for matchings) because these relations only need to be defined for the expressions that need to be repaired.

Definition 5.1 (Local Repair). Let (ℓ, v2) ∈ L × Vimpl be a location-variable pair from Pimpl, and let eimpl = Uimpl(ℓ, v2) be the corresponding expression. Further, let ω : Vimpl ⇀ VPC be a partial variable relation such that v2 ∈ dom(ω), let v1 = ω(v2) be the related cluster representative variable, and let eC = UPC(ℓ, v1) be the corresponding expression.

A pair r = (ω, erepaired), where erepaired is an expression over implementation variables Vimpl, is a local repair for (ℓ, v2) when eC ≃Γ,ℓ ω(erepaired) and V(erepaired) ⊆ dom(ω). A pair r = (ω, •) is a local repair for (ℓ, v2) when eC ≃Γ,ℓ ω(eimpl) and V(eimpl) ⊆ dom(ω).



We define the cost of a local repair r = (ω, erepaired) as cost(r) = diff(eimpl, erepaired), and the cost of a local repair r = (ω, •) as cost(r) = 0.

We comment on the definition of a local repair. Let (ℓ, v2) ∈ L × Vimpl be a location-variable pair from Pimpl, let eimpl = Uimpl(ℓ, v2) be the corresponding expression, and let r be a local repair for (ℓ, v2). In case r = (ω, •), the expression eimpl matches the corresponding expression of the cluster representative under the partial variable mapping ω; this repair has cost zero because the expression eimpl is not modified. In case r = (ω, erepaired), the expression erepaired constitutes a modification of eimpl that matches the corresponding expression of the cluster representative under the partial variable mapping ω; this repair has some cost diff(eimpl, erepaired). In our implementation we define diff(eimpl, erepaired) to be the tree edit distance [38, 43] between the abstract syntax trees (ASTs) of the expressions eimpl and erepaired.
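For intuition, here is a simplified cost function on Python expression ASTs. It is a crude stand-in for the Zhang-Shasha tree edit distance the implementation actually uses: it aligns children positionally and charges a whole subtree for unmatched children, so it over-approximates the true edit distance.

```python
import ast

def size(node):
    # Number of AST nodes in the subtree rooted at node.
    return sum(1 for _ in ast.walk(node))

def label(node):
    # Node label: type plus identifying leaf attributes.
    if isinstance(node, ast.Name):
        return ("Name", node.id)
    if isinstance(node, ast.Constant):
        return ("Const", node.value)
    return (type(node).__name__,)

def diff(e1, e2):
    cost = 0 if label(e1) == label(e2) else 1
    c1 = list(ast.iter_child_nodes(e1))
    c2 = list(ast.iter_child_nodes(e2))
    for a, b in zip(c1, c2):
        cost += diff(a, b)
    for extra in c1[len(c2):] + c2[len(c1):]:
        cost += size(extra)   # unmatched children count as inserted/deleted wholesale
    return cost

e1 = ast.parse("x + 1", mode="eval").body
e2 = ast.parse("x + 2", mode="eval").body
print(diff(e1, e1))  # 0
print(diff(e1, e2))  # 1
```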

Example. We remind the reader that in §2.2 we discussed three examples of local repairs for I1 (Fig. 2). Formally, example (1) is a local repair for (ℓ1, new), while (2) and (3) are local repairs for (ℓ4, return). Next we state how to combine local repairs into a full program repair.

Definition 5.2 (Repair). Let R be a function that assigns to each pair (ℓ, v) ∈ L × Vimpl a local repair for (ℓ, v). We say that R is consistent if there exists a total variable relation τ : Vimpl → VPC such that ω ⊆ τ for all R(ℓ, v) = (ω, −). A consistent R is called a repair. We define the cost of R as the sum of the costs of all local repairs: cost(R) = ∑(ℓ,v)∈L×Vimpl cost(R(ℓ, v)).

A repair R defines a repaired implementation Prepaired = (L, ℓinit, Vimpl, Urepaired, S), where Urepaired(ℓ, v) = erepaired if R(ℓ, v) = (−, erepaired), and Urepaired(ℓ, v) = Uimpl(ℓ, v) if R(ℓ, v) = (−, •), for all (ℓ, v) ∈ L × Vimpl.

Theorem 5.3 (Soundness of Repairs). PC ∼I Prepaired .

Proof. (sketch) From the definition of R(ℓ, v2), we have eC ≃Γ,ℓ τ(erepaired) for all (ℓ, v2) ∈ L × Vimpl, where v1 = τ(v2), eC = UPC(ℓ, v1) and erepaired = Urepaired(ℓ, v2). Then it follows from Lemma 4.6 that PC ∼I Prepaired. □

In the above definition we use the notation r = (ω, −) when erepaired or • is not important in r; similarly we use r = (−, erepaired) and r = (−, •) when ω is not important in r.

Example. The repair for example I1 (Fig. 2) corresponds to the total variable relation τ = {poly ↦ poly, new ↦ result, e ↦ i, return ↦ return, ? ↦ ?}. The repair R includes local repairs (1) and (2) from the previous examples, where only (2) has cost > 0 (see the repair in Fig. 2 (g)). Next we discuss the algorithm for finding a repair.

The repair algorithm The algorithm is given in Fig. 5: given a cluster C, an implementation Pimpl, and a set of inputs I, it returns a repair R. The algorithm has two main parts: First, the algorithm constructs a set of possible local

1  fun Repair(Cluster C, Implementation Pimpl, Inputs I):
2    π = structural matching or abort
3    Γ = {JPCK(ρ) | ρ ∈ I}
4    for (ℓ, v2) ∈ L × Vimpl:
5      LR(ℓ, v2) = ∅
6      eimpl = Uimpl(ℓ, v2)
7      for v1 ∈ VPC:
8        eC = UPC(ℓ, v1)
9        for ω : (V(eimpl) ∪ {v2}) → VPC s.t. ω(v2) = v1:
10         if eC ≃Γ,ℓ ω(eimpl):
11           LR(ℓ, v2) = LR(ℓ, v2) ∪ {(ω, •)}
12       for e ∈ EC(ℓ, v1):
13         for ω : (V(e) ∪ {v1}) → Vimpl s.t. ω(v1) = v2:
14           LR(ℓ, v2) = LR(ℓ, v2) ∪ {(ω⁻¹, ω(e))}
15   return FindRepair(LR)

Figure 5. The Repair Algorithm.

repairs; we define and discuss the possible local repairs below. Second, the algorithm searches for a consistent subset of the possible local repairs which has minimal cost; this search corresponds to solving a constraint-optimization system.

Definition 5.4 (Set of possible local repairs). For all (ℓ, v) ∈ L × Vimpl, we define the set of possible local repairs LR(ℓ, v) as: (1) (ω, e) ∈ LR(ℓ, v), if ω(e) ∈ EC(ℓ, ω(v)); and (2) (ω, •) ∈ LR(ℓ, v), if eC ≃Γ,ℓ ω(eimpl), where eC = UPC(ℓ, ω(v)) and eimpl = Uimpl(ℓ, v).

The set of possible local repairs LR(ℓ, v) includes all expressions from the cluster EC(ℓ, ω(v)), translated by some partial variable relation ω in order to range over implementation variables. It also includes (ω, •) if eimpl matches eC under the partial variable mapping ω. Next we describe how the algorithm constructs the set LR(ℓ, v) (lines 4-14).

For the following discussion we fix a pair (ℓ, v2) ∈ L × Vimpl (corresponding to line 4) and some v1 ∈ VPC (corresponding to line 7); we set eimpl = Uimpl(ℓ, v2) and eC = UPC(ℓ, v1). Possible local repairs for (ℓ, v2) are constructed in two steps: In the first step, the algorithm checks if there are partial variable relations ω : Vimpl ⇀ VPC s.t. eC ≃Γ,ℓ ω(eimpl) (at line 9), and in that case adds a pair (ω, •) to LR(ℓ, v2) (at line 11). In the second step, the algorithm iterates over all cluster expressions e ∈ EC(ℓ, v1) (at line 12) and all partial variable relations ω : VPC ⇀ Vimpl (at line 13), and then adds a pair (ω⁻¹, ω(e)) to LR(ℓ, v2) (at line 14). We note that ω⁻¹(ω(e)) = e ∈ EC(ℓ, v1) = EC(ℓ, ω⁻¹(v2)), and thus (ω⁻¹, ω(e)) is a possible local repair as in Def. 5.4.

We remark that in both steps the algorithm iterates over all possible variable relations ω. However, since ω relates only the variables of a single expression (usually a small subset of all program variables), this iteration is feasible.
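Enumerating the candidate relations ω (as at lines 9 and 13 of Fig. 5) amounts to injectively mapping the few variables of one expression into the other program's variables, with the pair under repair fixed. A sketch (function and parameter names are ours):

```python
from itertools import permutations

def partial_relations(expr_vars, v_src, target_vars, v_tgt):
    """Yield all injective maps omega over expr_vars ∪ {v_src} into
    target_vars, with omega(v_src) = v_tgt fixed."""
    rest = sorted((set(expr_vars) | {v_src}) - {v_src})
    pool = [u for u in target_vars if u != v_tgt]
    for image in permutations(pool, len(rest)):
        omega = dict(zip(rest, image))
        omega[v_src] = v_tgt
        yield omega

# Variables of e_impl: {new, i}; repairing the pair for v2 = new against the
# candidate v1 = result in a representative with variables {result, e, it}:
for omega in partial_relations({"new", "i"}, "new", ["result", "e", "it"], "result"):
    print(omega)
```

Each printed omega maps i to one of the remaining representative variables while keeping new bound to result, which is exactly why the enumeration stays small.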

Finding a repair with the smallest cost Finally, the algorithm uses the sub-routine FindRepair (at line 15) that, given a set of possible local repairs LR, finds a repair with smallest cost. FindRepair encodes this problem as a Zero-One Integer Linear Program (ILP), and then hands it to an off-the-shelf



ILP solver. Next, we define the ILP problem, describe howwe encode the problem of finding a repair as an ILP problem,and how we decode the ILP solution to a repair.

Definition 5.5 ((Zero-One) ILP). An ILP problem, over variables I = {x1, ..., xn}, is defined by an objective function O = opt ∑1≤i≤n wi · xi, where opt ∈ {min, max}, and a set of linear (in)equalities C of the form ∑1≤i≤n ai · xi ⊵ b, where ⊵ ∈ {≥, =}. A solution to the ILP problem is a variable assignment A : I → {0, 1} such that all (in)equalities hold and the value of the objective function is minimal (resp. maximal) for A.

We encode the problem of finding a consistent subset of possible local repairs as an ILP problem with variables I = {xv1v2 | v1 ∈ VPC and v2 ∈ Vimpl} ∪ {xr | r ∈ LR(ℓ, v), (ℓ, v) ∈ L × Vimpl}; that is, one variable for each pair of variables (v1, v2), and one variable for each possible local repair r. The set of constraints C is defined as follows:

(∑v2∈Vimpl xv1v2) = 1   for each v1 ∈ VPC                                  (1)
(∑v1∈VPC xv1v2) = 1    for each v2 ∈ Vimpl                                 (2)
(∑r∈LR(ℓ,v) xr) = 1    for each (ℓ, v) ∈ L × Vimpl                         (3)
−xr + xu1u2 ≥ 0        for each r = (ω, −) ∈ LR and each ω(u2) = u1        (4)

Intuitively, the constraints encode:
1. Each v1 ∈ VPC is related to exactly one v2 ∈ Vimpl.
2. Each v2 ∈ Vimpl is related to exactly one v1 ∈ VPC.
Together, (1) and (2) encode that there is a total variable relation τ : Vimpl → VPC.
3. For each (ℓ, v) ∈ L × Vimpl exactly one local repair is selected.
4. Each selected local repair r = (ω, −) ∈ LR is consistent with τ, i.e., ω ⊆ τ.

The objective function O = min (∑r∈LR cost(r) · xr) ensures that the sum of the costs of the selected local repairs is minimal.

Let A : I → {0, 1} be a solution of the ILP problem. We obtain the following total variable relation from A: τ(v2) = v1 iff A(xv1v2) = 1. For A(xr) = 1, where r ∈ LR(ℓ, v), we set R(ℓ, v) = r.
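For small instances, the same constraint structure can be made concrete with explicit search instead of an ILP solver. The following illustrative stand-in (not the solver Clara uses; the data layout is assumed) enumerates bijections τ and, per location-variable slot, keeps only local repairs whose ω is contained in τ:

```python
from itertools import permutations

def find_repair(vars_p, vars_impl, LR, cost):
    """LR maps each (loc, var) slot to a list of local repairs (omega, expr),
    where expr is None for the 'keep expression' repair (written • in the text)."""
    best = None
    for image in permutations(vars_p):
        tau = dict(zip(vars_impl, image))               # constraints (1), (2)
        chosen, total, ok = {}, 0, True
        for slot, repairs in LR.items():                # constraint (3)
            feasible = [r for r in repairs
                        if all(tau.get(v2) == v1 for v2, v1 in r[0].items())]  # (4)
            if not feasible:
                ok = False
                break
            r = min(feasible, key=cost)                 # objective: minimal cost
            chosen[slot] = r
            total += cost(r)
        if ok and (best is None or total < best[0]):
            best = (total, tau, chosen)
    return best

# Toy instance with two variables on each side and two slots.
LR = {("l1", "new"): [({"new": "result"}, None)],
      ("l4", "ret"): [({"ret": "r"}, "e1"), ({"ret": "r", "new": "result"}, None)]}
cost = lambda r: 0 if r[1] is None else 1
best = find_repair(["result", "r"], ["new", "ret"], LR, cost)
print(best[0])  # 0
```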

Adding and Deleting Variables The repair algorithm described so far does not change the set of variables, i.e., Vrepaired = Vimpl. However, since the repair algorithm constructs a bijective variable relation, this only works when the implementation and cluster representative have the same number of variables, i.e., |Vimpl| = |VPC|. Hence, we extend the algorithm to also allow the addition and deletion of variables.

We extend total variable relations τ : Vimpl → VPC to relations τ ⊆ (Vimpl ∪ {⋆}) × (VPC ∪ {−}). We relax the condition about τ being total and bijective: ⋆ and − can be related to multiple variables or none. When some variable v ∈ VPC is related to ⋆, that is τ(⋆) = v, it denotes that a fresh variable is added to Pimpl in order to match v. Conversely, when some variable v ∈ Vimpl is related to −, that is τ(v) = −, variable v is deleted from Pimpl, together with all its assignments.

An example of a repair where a fresh variable is added isgiven in the extended version [19].

Completeness of the algorithm We point out that with this extension the repair algorithm is complete (assuming Pimpl and PC have the same control-flow). This is because the repair algorithm can always generate a trivial repair: all variables v2 ∈ Vimpl are deleted, that is τ(v2) = − for all v2 ∈ Vimpl; and a fresh variable is introduced for every variable v1 ∈ VPC, that is τ(⋆) = v1 for all v1 ∈ VPC. Clearly, this trivial repair has high cost, and in practice it is very rarely generated, as witnessed by our experimental evaluation in the next section.

6 Implementation and Experiments

We now describe our implementation (§6.1) and an experimental evaluation, which consists of two parts: (I) an evaluation on MOOC data (§6.2), and (II) a user study about the usefulness of the generated repairs (§6.3). The evaluation was performed on a server with an AMD Opteron 6272 2.1GHz processor and 224 GB RAM.

6.1 Implementation

We implemented the proposed approach in the publicly available tool Clara⁴ (for CLuster And RepAir). The tool currently supports programs in the programming languages C and Python, and consists of: (1) parsers for C and Python that convert programs to our internal program representation; (2) program and expression evaluation functions for C and Python, used in the matching and repair algorithms; (3) the matching algorithm; (4) the repair algorithm; (5) a simple feedback generation system. We use the lpsolve [2] ILP solver and the zhang-shasha [4] tree-edit-distance algorithm.

Feedback generation We have implemented a simple feedback generation system that generates the location and a textual description of the required modifications (similar to AutoGrader). Other types of feedback can be generated from the repair as well; we briefly discuss this in §9.

6.2 MOOC evaluation

Setup In the first experiment we evaluate Clara on data from the MITx introductory programming MOOC [3], which is similar to the data used in the evaluation of AutoGrader [33]. This data is stripped of all information about student identity, i.e., there are not even anonymous identifiers. To avoid the threat that a student's attempt is repaired by her own future correct solution, we split the data into two sets. From the first (chronologically earlier) set we take only the

⁴ https://github.com/iradicek/clara



Table 1. List of the problems with evaluation results for the MOOC data (with AutoGrader comparison).

Problem      LOC     AST size  # correct  # clusters        # incorrect  # repaired (% of # incorrect)    avg. (median) time in s
name         median  median               (% of # correct)               Clara           AutoGrader       Clara        AutoGrader
derivatives  14      33        1472       532 (36.14%)      481          472 (98.13%)    235 (48.86%)     4.9s (4.4s)  6.6s (5.2s)
oddTuples    10      25        9001       454 (5.04%)       3584         3514 (98.05%)   576 (16.07%)     3.0s (2.6s)  25.5s (13.3s)
polynomials  13      25        2500       234 (9.36%)       228          197 (86.40%)    17 (7.46%)       1.9s (1.6s)  4.3s (4.0s)
Total        11      25        12973      1220 (9.40%)      4293         4183 (97.44%)   828 (19.29%)     3.2s (2.7s)  19.7s (6.3s)

correct solutions: these solutions are then clustered, and the obtained clusters are used during the repair of the incorrect attempts. From the second (chronologically later) set we take only the incorrect attempts: on these attempts we perform repair. We split the data in an 80:20 ratio because this yields a large enough number (12973; see the discussion below) of correct solutions, which our approach requires, while still leaving quite a large number (4293) of incorrect attempts for the repair evaluation. We point out that this is precisely the setting in which we envision our approach being used: a large number of existing correct solutions (e.g., from a previous offering of a course) are used to repair new incorrect student submissions.

Results The evaluation summary is in Table 1; the descriptions of the problems are in the extended version of the paper [19]. Clara automatically generates a repair for 97.44% of attempts. As expected, Clara can generate repairs in almost all cases, since there is always the trivial repair of completely replacing the student implementation with some correct solution of the same control-flow. Hence, it is mandatory to study the quality and size of the generated repairs. We evaluate the following questions in more detail: (1) What are the reasons when Clara fails? (2) Does Clara generate non-trivial repairs? (3) What is the quality and size of the generated repairs?

We discuss the results of this evaluation below; further examples can be found in the extended version [19].

(1) Clara fails Clara fails to generate a repair in 110 cases: in 69 cases there are unsupported Python features (e.g., nested function definitions), in 35 cases there is no correct attempt with the same control-flow, and in 6 cases a numeric precision error occurs in the ILP solver. The only fundamental problem of our approach is the inability to generate repairs without matching control-flow; however, since this occurs very rarely, we leave the extension for future work. Hence, we conclude that Clara can repair almost all programs.

(2) Non-trivial repairs Since a correct repair is also a trivial one that completely replaces a student's attempt with a correct program, we measure how much a repair changes the student's program. To measure this we examine the relative repair size: the tree-edit-distance of the repair divided by the size of the AST of the program. Intuitively, the tree-edit-distance tells us how many changes were made in a program,


Figure 6. Histogram of relative repair sizes.

and normalization with the total number of AST nodes gives us the ratio of how much of the whole program changed. Note, however, that this ratio can be > 1.0, or even ∞ if the program is empty. Fig. 6 shows a histogram of relative repair sizes. We note that 68% of all repairs have relative size < 0.3, 53% have < 0.2 and 25% have < 0.1; the last column (∞) is caused by 436 completely empty student attempts. As an example, the two repairs in Fig. 2 (g) and (h) have relative sizes of 0.03 and 0.24. We conclude that Clara in almost all cases generates a non-trivial repair that is not a replacement of the whole student's program.

Figure 7. Comparison of the generated repair sizes between AutoGrader and Clara. (a) The number of modified expressions per repair, when both tools generate a repair (Equal number: 580, Less AG: 164, Less Clara: 83 attempts). (b) Distribution of the number of modified expressions per repair (for 1/2/3/4/≥5 modified expressions; AutoGrader: 68.0%, 23.3%, 6.6%, 1.7%, 0.4%; Clara: 37.2%, 35.5%, 12.6%, 7.3%, 7.4%).

(3) Repair quality and repair size We inspected 100 randomly selected generated repairs with the goal of evaluating their quality and size. Our approach to judging repair quality and size mirrors a human teacher helping a student: the teacher has to guess the student's idea and use subjective judgment on what feedback to provide. We obtained the following results: (a) In 72 cases Clara generates the smallest,



most natural repair; (b) in 9 cases the repair is almost the smallest, but involves an additional modification that is not required; (c) in 11 cases we determined the repair, although correct, to be different from the student's idea; (d) in 8 cases it is not possible to determine the idea of the student's attempt, although Clara generates some correct repair.

For the cases in (d), we found that program repair is not adequate, and further research is needed to determine what kind of feedback is suitable when the student is far from any correct solution. For the cases in (c), the set of correct solutions does not contain any solution which is syntactically close to the student's idea; we conjecture that Clara's results in these cases can partially be improved by considering different cost functions which do not only take syntactic differences into account but also make use of semantic information (see the discussion in §9). However, in 81 cases (the sum of (a) and (b)), Clara generates good quality repairs. We conclude that Clara mostly produces good quality repairs.

Summary Our large-scale experiment on the MOOC dataset shows that Clara can fully automatically repair almost all programs and that the generated repairs are of high quality.

Clusters Finally, we briefly discuss the correct solutions, since our approach depends on their existence. The quality of the generated repairs should increase with the number of clusters, since then the algorithm can generate more diverse repairs. It is interesting to note that we experienced no performance issues with a large number of clusters; e.g., on derivatives, with 532 clusters, a repair is generated on average in 4.9s. This is because the repair algorithm processes multiple clusters in parallel. Nonetheless, clustering is important for repair quality, since it enables repairs that combine expressions taken from different correct solutions from the same cluster, which would be impossible without clustering. We found that 2093 (around 50%) repairs were generated using at least two different correct solutions, and 110 (around 3%) were generated using at least three different correct solutions, in the same cluster.

6.2.1 Comparison with AutoGrader

While the setting of AutoGrader is different (a teacher has to provide an error model, while our approach is fully automated), the same high-level goal (finding a minimal repair for a student attempt to provide feedback) warrants an experimental comparison between the approaches.

Setup and Data We were not able to obtain the data used in AutoGrader's evaluation, which stems from an internal MIT course, because of privacy concerns regarding student data. Hence, we compare the tools on the same MITx introductory programming MOOC data which we used for the Clara evaluation in this paper. This dataset is similar to the dataset used in AutoGrader's evaluation. AutoGrader's authors provided us with an AutoGrader version that is optimized to scale to a MOOC, that is, it has a weaker error model than in the original AutoGrader publication [33]. According to the authors, some error rewrite rules were intentionally omitted, since they are too slow for interactive online feedback generation.

Results The evaluation summary is in Table 1. AutoGrader is able to generate a repair for 19.29% of attempts, using manually specified rewrite rules, compared to 97.44% automatically generated repairs by Clara. (We note that AutoGrader is able to repair fewer attempts than reported in the original publication [33] due to the differences discussed in the previous paragraph.) As Clara can generate repairs in almost all cases, these numbers are not meaningful on their own; the numbers are, however, meaningful in conjunction with our evaluation of the following questions: (1) How many repairs can one tool generate that the other cannot, and what are the reasons when AutoGrader fails? (2) What are the sizes of the repairs? (3) What is the quality of the generated repairs, in case both tools generate a repair?

We summarize the results of this evaluation below; a more detailed discussion can be found in the extended version [19].

(1) Repair numbers In all but one case, when Clara fails to generate a repair, AutoGrader also fails. Further, we manually inspected 100 randomly selected cases where AutoGrader fails, and determined that in 77 cases there is a fundamental problem with AutoGrader’s approach: the modifications require fresh variables, new statements, or larger modifications, which are beyond AutoGrader’s capabilities. This shows that Clara can generate more complicated repairs than AutoGrader.

In the 100 cases we manually inspected, we also determined that in 74 cases Clara generates good quality repairs when AutoGrader fails.

(2) Repair sizes We do not report the relative repair size metric for AutoGrader, because we were not able to extract the repair size from its (textual) output. However, Fig. 7 (a) compares the number of modified expressions when both tools generate a repair. We note that the number of modified expressions is a weaker metric than the tree-edit-distance; however, we were only able to extract this metric for the repairs generated by AutoGrader. We conclude that AutoGrader produces a smaller repair in around 10% of the cases. Fig. 7 (b) also compares the overall (not just when both tools generate a repair) distribution of the number of changed expressions per repair. We notice that most of AutoGrader’s repairs modify a single expression, and the percentage falls faster than in Clara’s case.

(3) Repair quality Finally, we manually inspected 100 randomly selected cases where both tools generate a repair. In 61 cases we found both tools to produce the same repair; in


PLDI’18, June 18–22, 2018, Philadelphia, PA, USA Sumit Gulwani, Ivan Radiček, and Florian Zuleger

Table 2. List of the problems with evaluation details for the user study.

Problem | LOC | # correct (exist.+study) | # clusters (exist.+study) (% of # correct) | # incorr. | # feedback (% of # incorr.) | # repair feedback (% of # feedback) | time avg. (s) | time median (s) | # grades 1/2/3/4/5
Fibonacci sequence | 12 | 512+84 | 70+17 (14.60%) | 572 | 539 (94.23%) | 440 (81.63%) | 10.44 | 8.51 | 1/7/9/16/13
Special number | 15 | 358+59 | 39+3 (10.07%) | 121 | 109 (90.08%) | 94 (86.24%) | 3.77 | 2.38 | 2/3/8/9/13
Reverse Difference | 17 | 342+46 | 48+8 (14.43%) | 103 | 77 (74.76%) | 68 (88.31%) | 4.39 | 3.07 | 4/4/5/3/5
Factorial interval | 14 | 391+44 | 56+8 (14.71%) | 234 | 232 (99.15%) | 185 (79.74%) | 3.33 | 3.17 | 2/5/4/5/13
Trapezoid | 14 | 281+41 | 36+15 (15.84%) | 143 | 129 (90.21%) | 121 (93.80%) | 7.55 | 4.82 | 7/5/7/7/5
Rhombus | 21 | 264+38 | 73+22 (31.46%) | 525 | 417 (79.43%) | 192 (46.04%) | 9.16 | 5.35 | 6/9/6/5/3

19 cases different, although of the same quality; in 9 cases we consider AutoGrader to be better; in 5 cases we consider Clara to be better; and in 6 cases we found that AutoGrader generates an incorrect repair. We conclude that there is no notable difference between the tools when both generate a repair.

6.3 User study on usefulness
In the second experiment we performed a user study, evaluating Clara in real time. We were interested in the following questions: (1) How often and how fast is feedback generated (performance)? (2) How useful is the generated repair-based feedback?

Setup To answer these questions we developed a web interface for Clara and conducted a user study, which we advertised on programming forums, mailing lists, and social networks. Each participant was asked to solve six introductory C programming problems, for which the participants received feedback generated by Clara. There was one additional problem, not discussed here, that was almost solved, and whose purpose was to familiarize the participants with the interface. After solving a problem, each participant was presented with the question: “How useful was the feedback provided on this problem?”, and could select a grade on the scale from 1 (“Not useful at all”) to 5 (“Very useful”). Additionally, each participant could enter a textual comment for each generated feedback individually and at the end of solving a problem.

We also asked the participants to assess their programming experience with the question: “Your overall programming experience (your own, subjective, assessment)”, with choices on the scale from 1 (“Beginner”) to 5 (“Expert”).

The initial correct attempts were taken from an introductory programming course at IIT Kanpur, India. The course is taken by around 400 students, several of whom have never written a program before. We selected problems from two weeks in which students start solving more complicated problems using loops. Of the 16 problems assigned in these two weeks, we picked the 6 that were sufficiently different.

Results Table 2 shows the summary of the results; detailed descriptions of all problems are available in the extended paper version [19]. The columns # correct and # clusters show the number of correct attempts and clusters obtained from: (a) the existing ESC 101 data (exist. in the table), and (b) during the case study from participants’ correct attempts (study in the table). We plan to make the complete data, with all attempts, grades, and textual comments, publicly available.

Performance of Clara Feedback was generated for 1503 (88.52%) of the incorrect attempts. In the following we discuss the three reasons why feedback could not be generated: (1) In 57 cases there was a bug in Clara, which we fixed after the experiment finished; we then confirmed that in all 57 cases the program is correctly repaired and feedback is generated (note that this bug was only present in this real-time experiment, i.e., it did not impact the experiment described in the previous section). (2) In 43 cases a timeout occurred (set to 60s). (3) In 95 cases the program contained an unsupported C construct, or there was a syntactic compilation error that was not handled by the web interface (Clara currently provides no feedback on programs that cannot even be parsed). Further, the average time to generate feedback was 8 seconds. These results show that Clara provides feedback on a large percentage of attempts in real time.

Feedback usefulness The results are based on 191 grades given by 52 participants. Note that problems have a different number of grades. This is because we asked for a grade only when feedback was successfully generated (as noted above, in 88.52% of cases), and because some of the participants did not complete the study. The average grade over all problems is 3.4. This is a very promising preliminary result on the usefulness of Clara. However, we believe that these results can be further improved (see §9).

The participants declared their experience as follows: 22 as 5, 19 as 4, 9 as 3, 0 as 2, and 2 as 1. While these are useful preliminary results, a study with beginner programmers is important future work.

Note. In the case of a very large repair (cost > 100 in our study), we decided to show generic feedback explaining a general strategy on how to solve the problem. This is because the feedback generated by such a large repair is usually not useful. We showed such a general strategy in 403 cases.

6.4 Threats to Validity
Program size We have evaluated our approach on small to medium size programs typically found in introductory programming problems. The extension of our approach to larger programming problems, as found in more advanced courses,



is left for future work. Focusing on small to medium size programs is in line with related work on automated feedback generation for introductory programming (e.g., D’Antoni et al. [9], Singh et al. [33], Head et al. [20]). We stress that the state-of-the-art in teaching is manual feedback (as well as failing test cases); thus, automation, even for small to medium size programs, promises huge benefits. We also mention that our dataset contains larger and challenging attempts by students, which use multiple functions and multiple and nested loops, and our approach is able to handle them.

The focus of our work differs from the related work on

program repair (see §7 for a more detailed discussion). Our approach is specifically designed to perform well on small to medium size programs typically found in introductory programming problems, rather than the larger programs targeted in the literature on automated program repair. In particular, our approach addresses the challenges of (1) a high number of errors (education programs are expected to have a higher error density [33]), (2) complex repairs, and (3) a runtime fast enough for use in an interactive teaching environment. These goals are often out of reach for program repair techniques. For example, the general purpose program repair techniques discussed in Goues et al. [17] on the IntroClass benchmark either repair a small number of defects (usually <50%) or take a long time (i.e., over one minute).

Unsoundness Our approach guarantees only that repairs are correct on a given set of test cases. This is in accordance with the state-of-the-art in teaching, where testing is routinely used by course instructors to grade programming assignments and provide feedback (e.g., for ESC101 at IIT Kanpur, India [10]). When we manually inspected the repairs for their correctness, we did not find any problems with soundness. We believe that this is due to the fact that programming problems are small, human-designed problems that have comprehensive sets of test cases.

In contrast to our dynamic approach, one might think

about a sound static approach based on symbolic execution and SMT solving. We opted for a dynamic analysis because symbolic execution can sometimes take a long time or even fail when constructs are not supported by an SMT solver. For example, reasoning about floating points and lists is difficult for SMT solvers. On the other hand, our method only executes given expressions on a set of inputs, so we can handle any expression, and our method is fast. Further, our evaluation showed our dynamic approach to be precise enough for the domain of introductory programming assignments. The investigation of a static verification of the results generated by our repair approach is an interesting direction for future work: one could take the generated repair expressions and verify that they indeed establish a simulation with the cluster against which the program was repaired.
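To illustrate the trade-off, the dynamic check we rely on amounts to executing candidate expressions on the test inputs and comparing their values. The following is a simplified Python sketch (Clara itself interprets expressions of the analyzed language; the use of `eval` here is only for illustration):

```python
def dynamically_equivalent(expr_a, expr_b, test_envs):
    """Treat two expressions as interchangeable if they evaluate to the
    same value in every test environment. Unlike an SMT-based check,
    this handles floats, lists, or any other executable construct, but
    is only as strong as the test inputs."""
    for env in test_envs:
        try:
            if eval(expr_a, {}, dict(env)) != eval(expr_b, {}, dict(env)):
                return False
        except Exception:
            # A crashing expression is never considered equivalent.
            return False
    return True
```

For example, `dynamically_equivalent("n * (n + 1) // 2", "sum(range(n + 1))", [{"n": k} for k in range(10)])` holds, even though proving this symbolically would require nonlinear arithmetic.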

7 Related Work
Automated Feedback Generation Ihantola et al. [21] present a survey of tools for the automatic assessment of programming exercises. Pex4Fun [40] and its successor CodeHunt [39] are browser-based, interactive platforms where students solve programming assignments with hidden specifications and are presented with a list of automatically generated test cases. LAURA [5] heuristically applies program transformations to a student’s program and compares it to a reference solution, reporting mismatches as potential errors (they could also be correct variations). Apex [25] is a system that automatically generates error explanations for bugs in assignments, while our work automatically clusters solutions and generates repairs for incorrect attempts.

Trace analysis Striewe and Goedicke [35] have proposed presenting full program traces to the students, but the interpretation of the traces is left to the students. They have also suggested automatically comparing the student’s trace to that of a sample solution [36]. However, the approach misses a discussion of the situation when the student’s code enters an infinite loop, or has an error early in the program that influences the rest of the trace. The approach of Gulwani et al. [18] uses dynamic analysis to find the strategy used by the student and to generate feedback on performance aspects. However, the approach requires specifications manually provided by the teacher, written in a specially designed specification language, and it only matches specifications to correct attempts, i.e., it cannot provide feedback on incorrect attempts.

Program Classification in Education CodeWebs [30] classifies different AST sub-trees into equivalence classes based on probabilistic reasoning and program execution on a set of inputs. The classification is used to build a search engine over ASTs, to enable the instructor to search for similar attempts, and to provide feedback on some classes of ASTs. OverCode [15] is a visualization system that uses lightweight static and dynamic analysis, together with manually provided rewrite rules, to group student attempts. Drummond et al. [13] propose a statistical approach to classify interactive programs into two categories (good and bad). Head et al. [20] cluster incorrect programs by the type of the required modifications. CoderAssist [23] provides feedback on student implementations of dynamic programming algorithms: the approach first clusters both correct and incorrect programs based on their syntactic features; feedback for an incorrect program is generated from a counterexample obtained from an equivalence check (using SMT) against a correct solution in the same cluster.

Program Repair The research on program repair is vast; we mention some work, with emphasis on introductory education. The non-education program repair approaches are based on SAT [16], symbolic execution [26], games [22, 34],



program mutation [11], and genetic programming [6, 14]. In contrast, our approach uses dynamic analysis for scalability. These approaches aim at repairing large programs and are therefore not able to generate complex repairs. Our approach repairs small programs in education and uses multiple correct solutions to find the best repair suggestions, and is therefore able to suggest more complex repairs.

Prophet [27] mines a database of successful patches and

uses these patches to repair defects in large, real-world applications. However, it is unclear how this approach would be applicable to our educational setting. SearchRepair [24] mines a body of code for short snippets that it uses for repair. However, SearchRepair has different goals than our work and has not been used or evaluated in introductory education. Angelic Debugging [8] is an approach that identifies at most one faulty expression in the program and tries to replace it with a correct value (instead of a replacement expression).

Yi et al. [42] explore different automated program repair (APR) systems in the context of generating feedback in intelligent tutoring systems. They show that using APR out-of-the-box seems infeasible due to the low repair rate, but discuss how these systems can be used to generate partial repairs. In contrast, our approach is designed to provide complete repairs. They also conclude that further research is required to understand how to generate the most effective feedback for students from these repairs.

AutoGrader [33] takes as input an incorrect student program, along with a reference solution and a set of potential corrections in the form of expression rewrite rules (both provided by the course instructor), and searches for a set of minimal corrections using program synthesis. In contrast, our approach is completely automatic and can generate more complicated repairs. Refazer [32] learns program transformations from example code edits made by students, and then uses these transformations to repair incorrect student submissions. In comparison to our approach, Refazer does not have a cost model, and hence the generated repair is the first one found (instead of the smallest one). Rivers and Koedinger [31] transform programs to a canonical form using semantics-preserving syntax transformations, and then report the syntactic difference between an incorrect program and the closest correct solution; the paper reports an evaluation on loop-free programs. In contrast, our approach uses dynamic equivalence, instead of (canonical) syntactic equivalence, for better robustness under syntactic variations of semantically equivalent code. Qlose [9] automatically repairs programs in education based on different program distances. The idea to consider different semantic distances is very interesting; however, the paper reports only a very small initial evaluation (on 11 programs), and Qlose is only able to generate small, template-based repairs.

8 Future Work
In this section we briefly discuss the limitations of our approach and possible directions for future work.

Cost function The cost function in our approach compares only the syntactic difference between the original and the replacement expressions (specifically, we use the tree-edit-distance in our implementation). We believe that the cost function could take into account more information, e.g., variable roles [12] or semantic distance [9].
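A toy version of such a syntactic cost function over Python ASTs might look as follows. This is a crude simplification for illustration only, not the Zhang-Shasha tree-edit-distance [43] that the implementation uses (via the zhang-shasha library [4]): it charges 1 for a changed leaf and the full subtree sizes when node kinds differ.

```python
import ast

def size(node):
    """Number of AST nodes in a subtree."""
    return 1 + sum(size(c) for c in ast.iter_child_nodes(node))

def repair_cost(a, b):
    """Crude syntactic distance between two expression ASTs: replace
    whole subtrees when node kinds differ, otherwise recurse pairwise
    over children and charge 1 per changed name or constant."""
    if type(a) is not type(b):
        return size(a) + size(b)
    if isinstance(a, ast.Name):
        return 0 if a.id == b.id else 1
    if isinstance(a, ast.Constant):
        return 0 if a.value == b.value else 1
    ca, cb = list(ast.iter_child_nodes(a)), list(ast.iter_child_nodes(b))
    cost = sum(repair_cost(x, y) for x, y in zip(ca, cb))
    cost += sum(size(x) for x in ca[len(cb):])  # deleted children
    cost += sum(size(x) for x in cb[len(ca):])  # inserted children
    return cost

def expr_cost(src_a, src_b):
    return repair_cost(ast.parse(src_a, mode="eval").body,
                       ast.parse(src_b, mode="eval").body)
```

Under such a cost function, the repair `i + 1` → `i - 1` is cheaper than replacing the whole expression, which is what biases the search toward minimal repairs.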

Control-flow The clustering and repair algorithms are restricted to the analysis of programs with the same control-flow. As the case of “no matching control-flow to generate a repair” rarely occurs in our experiments (only 35 cases in the MOOC experiment), we have left the extension of our algorithm to programs with different control-flow for future work. We conjecture that our algorithm could be extended to programs with similar control-flow (e.g., a different looping structure).
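The same-control-flow restriction can be illustrated by abstracting a program to the nesting structure of its loops and branches, ignoring all expressions; two programs are then candidates for expression-level repair only when their shapes match. The sketch below is a simplified illustration over Python ASTs, not the matching used in the paper:

```python
import ast

def control_flow_shape(src):
    """Abstract a program to a nested tuple recording only how loops
    and conditionals are nested; all expressions and straight-line
    statements are erased."""
    def shape(node):
        kids = [shape(c) for c in ast.iter_child_nodes(node)]
        kids = tuple(k for k in kids if k is not None)
        if isinstance(node, (ast.For, ast.While)):
            return ("loop",) + kids
        if isinstance(node, ast.If):
            return ("if",) + kids
        return kids or None
    return shape(ast.parse(src))
```

For instance, a summation loop and a product loop share the same shape despite using different variables and operators, while adding a conditional inside the loop changes the shape.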

Feedback Our tool currently outputs a textual description of the generated repair, very similar to the feedback generated by AutoGrader. We believe that the generated repairs could be used to derive other types of feedback as well, for example, more abstract feedback with the help of a course instructor: the instructor could annotate variables in the correct solutions with their descriptions, and when a repair for some variable is required, the matching feedback is shown to the student.

While this paper is focused on the technical problem of finding possible repairs, an interesting orthogonal direction for future work is to consider pedagogical research questions, for example: (1) How much information should be revealed to the student (the line number, an incorrect expression, the whole repair)? (2) Should the use of automated help be penalized? (3) How much do students learn from the automated help?

9 Conclusion
We present novel algorithms for clustering and program repair in introductory programming education. The key idea behind our approach is to use the existing correct student solutions, which are available in the tens of thousands in large MOOC courses, to repair incorrect student attempts. Our evaluation shows that Clara can generate a large number of repairs without any manual intervention, can perform complicated repairs, can be used in an interactive teaching setting, and generates good quality repairs in a large percentage of cases.

References
[1] [n. d.]. Analysis: The exploding demand for computer science education, and why America needs to keep up. http://www.geekwire.com/2014/analysis-examining-computer-science-education-explosion/.

[2] [n. d.]. lpsolve - ILP solver. http://sourceforge.net/projects/lpsolve/.



[3] [n. d.]. MITx MOOC. https://www.edx.org/school/mitx.

[4] [n. d.]. zhang-shasha - tree-edit-distance algorithm implemented in Python. https://github.com/timtadh/zhang-shasha.

[5] Anne Adam and Jean-Pierre Laurent. 1980. LAURA, a system to debug student programs. Artificial Intelligence 15, 1–2 (1980), 75–122. https://doi.org/10.1016/0004-3702(80)90023-5

[6] Andrea Arcuri. 2008. On the Automation of Fixing Software Bugs. In Companion of the 30th International Conference on Software Engineering (ICSE Companion ’08). ACM, New York, NY, USA, 1003–1006. https://doi.org/10.1145/1370175.1370223

[7] D. Beyer, A. Cimatti, A. Griggio, M. E. Keremoglu, and R. Sebastiani. 2009. Software model checking via large-block encoding. In 2009 Formal Methods in Computer-Aided Design. 25–32. https://doi.org/10.1109/FMCAD.2009.5351147

[8] Satish Chandra, Emina Torlak, Shaon Barman, and Rastislav Bodik. 2011. Angelic Debugging. In Proceedings of the 33rd International Conference on Software Engineering (ICSE ’11). ACM, New York, NY, USA, 121–130. https://doi.org/10.1145/1985793.1985811

[9] Loris D’Antoni, Roopsha Samanta, and Rishabh Singh. 2016. Qlose: Program Repair with Quantitative Objectives. In Computer Aided Verification - 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part II. 383–401. http://dx.doi.org/10.1007/978-3-319-41540-6_21

[10] Rajdeep Das, Umair Z. Ahmed, Amey Karkare, and Sumit Gulwani. 2016. Prutor: A System for Tutoring CS1 and Collecting Student Programs for Analysis. CoRR abs/1608.03828 (2016). http://arxiv.org/abs/1608.03828

[11] V. Debroy and W.E. Wong. 2010. Using Mutation to Automatically Suggest Fixes for Faulty Programs. In Software Testing, Verification and Validation (ICST), 2010 Third International Conference on. 65–74. https://doi.org/10.1109/ICST.2010.66

[12] Yulia Demyanova, Helmut Veith, and Florian Zuleger. 2013. On the concept of variable roles and its use in software analysis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013. 226–230. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6679414

[13] A. Drummond, Y. Lu, S. Chaudhuri, C. Jermaine, J. Warren, and S. Rixner. 2014. Learning to Grade Student Programs in a Massive Open Online Course. In Data Mining (ICDM), 2014 IEEE International Conference on. 785–790. https://doi.org/10.1109/ICDM.2014.142

[14] Stephanie Forrest, ThanhVu Nguyen, Westley Weimer, and Claire Le Goues. 2009. A Genetic Programming Approach to Automated Software Repair. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO ’09). ACM, New York, NY, USA, 947–954. https://doi.org/10.1145/1569901.1570031

[15] Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip Guo, and Robert Miller. 2014. OverCode: Visualizing Variation in Student Solutions to Programming Problems at Scale. In Proceedings of the Adjunct Publication of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST’14 Adjunct). ACM, New York, NY, USA, 129–130. https://doi.org/10.1145/2658779.2658809

[16] Divya Gopinath, Muhammad Zubair Malik, and Sarfraz Khurshid. 2011. Specification-based Program Repair Using SAT. In Proceedings of the 17th International Conference on Tools and Algorithms for the Construction and Analysis of Systems: Part of the Joint European Conferences on Theory and Practice of Software (TACAS’11/ETAPS’11). Springer-Verlag, Berlin, Heidelberg, 173–188. http://dl.acm.org/citation.cfm?id=1987389.1987408

[17] C. Le Goues, N. Holtschulte, E. K. Smith, Y. Brun, P. Devanbu, S. Forrest, and W. Weimer. 2015. The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Transactions on Software Engineering 41, 12 (Dec 2015), 1236–1256. https://doi.org/10.1109/TSE.2015.2454513

[18] Sumit Gulwani, Ivan Radiček, and Florian Zuleger. 2014. Feedback Generation for Performance Problems in Introductory Programming Assignments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 41–51. https://doi.org/10.1145/2635868.2635912

[19] Sumit Gulwani, Ivan Radiček, and Florian Zuleger. 2018. Automated Clustering and Program Repair for Introductory Programming Assignments. CoRR abs/1603.03165 (2018). arXiv:1603.03165 http://arxiv.org/abs/1603.03165

[20] Andrew Head, Elena Glassman, Gustavo Soares, Ryo Suzuki, Lucas Figueredo, Loris D’Antoni, and Björn Hartmann. 2017. Writing Reusable Code Feedback at Scale with Mixed-Initiative Program Synthesis. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (L@S ’17). ACM, New York, NY, USA, 89–98. https://doi.org/10.1145/3051457.3051467

[21] Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta, and Otto Seppälä. 2010. Review of Recent Systems for Automatic Assessment of Programming Assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research (Koli Calling ’10). ACM, New York, NY, USA, 86–93. https://doi.org/10.1145/1930464.1930480

[22] Barbara Jobstmann, Andreas Griesmayer, and Roderick Bloem. 2005. Program Repair As a Game. In Proceedings of the 17th International Conference on Computer Aided Verification (CAV’05). Springer-Verlag, Berlin, Heidelberg, 226–238. https://doi.org/10.1007/11513988_23

[23] Shalini Kaleeswaran, Anirudh Santhiar, Aditya Kanade, and Sumit Gulwani. 2016. Semi-supervised Verified Feedback Generation. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 739–750. https://doi.org/10.1145/2950290.2950363

[24] Yalin Ke, Kathryn T. Stolee, Claire Le Goues, and Yuriy Brun. 2015. Repairing Programs with Semantic Code Search (T). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE ’15). IEEE Computer Society, Washington, DC, USA, 295–306. http://dx.doi.org/10.1109/ASE.2015.60

[25] Dohyeong Kim, Yonghwi Kwon, Peng Liu, I. Luk Kim, David Mitchel Perry, Xiangyu Zhang, and Gustavo Rodriguez-Rivera. 2016. Apex: Automatic Programming Assignment Error Explanation. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). ACM, New York, NY, USA, 311–327. https://doi.org/10.1145/2983990.2984031

[26] Robert Könighofer and Roderick Bloem. 2011. Automated Error Localization and Correction for Imperative Programs. In Proceedings of the International Conference on Formal Methods in Computer-Aided Design (FMCAD ’11). FMCAD Inc, Austin, TX, 91–100. http://dl.acm.org/citation.cfm?id=2157654.2157671

[27] Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. SIGPLAN Not. 51, 1 (Jan. 2016), 298–312. http://doi.acm.org/10.1145/2914770.2837617

[28] Ken Masters. 2011. A Brief Guide To Understanding MOOCs. The Internet Journal of Medical Education 1, 2 (2011). https://doi.org/10.5580/1f21

[29] Robin Milner. 1971. An Algebraic Definition of Simulation Between Programs. Technical Report. Stanford, CA, USA.

[30] Andy Nguyen, Christopher Piech, Jonathan Huang, and Leonidas Guibas. 2014. Codewebs: Scalable Homework Search for Massive Open Online Programming Courses. In Proceedings of the 23rd International Conference on World Wide Web (WWW ’14). ACM, New York, NY, USA, 491–502. https://doi.org/10.1145/2566486.2568023

[31] Kelly Rivers and Kenneth R. Koedinger. 2017. Data-Driven Hint Generation in Vast Solution Spaces: a Self-Improving Python Programming Tutor. International Journal of Artificial Intelligence in Education 27, 1 (01 Mar 2017), 37–64. https://doi.org/10.1007/s40593-015-0070-z



[32] Reudismam Rolim, Gustavo Soares, Loris D’Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann. 2017. Learning Syntactic Program Transformations from Examples. In Proceedings of the 39th International Conference on Software Engineering (ICSE ’17). IEEE Press, Piscataway, NJ, USA, 404–415. https://doi.org/10.1109/ICSE.2017.44

[33] Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. 2013. Automated Feedback Generation for Introductory Programming Assignments. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, USA, 15–26. https://doi.org/10.1145/2491956.2462195

[34] Stefan Staber, Barbara Jobstmann, and Roderick Bloem. 2005. Finding and Fixing Faults. In Correct Hardware Design and Verification Methods, Dominique Borrione and Wolfgang Paul (Eds.). Lecture Notes in Computer Science, Vol. 3725. Springer Berlin Heidelberg, 35–49. https://doi.org/10.1007/11560548_6

[35] Michael Striewe and Michael Goedicke. 2011. Using run time traces in automated programming tutoring. In ITiCSE. 303–307.

[36] Michael Striewe and Michael Goedicke. 2013. Trace Alignment for Automated Tutoring. In CAA.

[37] Ryo Suzuki, Gustavo Soares, Elena Glassman, Andrew Head, Loris D’Antoni, and Björn Hartmann. 2017. Exploring the Design Space of Automatically Synthesized Hints for Introductory Programming Assignments. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’17). ACM, New York, NY, USA, 2951–2958. https://doi.org/10.1145/3027063.3053187

[38] Kuo-Chung Tai. 1979. The Tree-to-Tree Correction Problem. J. ACM 26, 3 (July 1979), 422–433. https://doi.org/10.1145/322139.322143

[39] Nikolai Tillmann, Judith Bishop, R. Nigel Horspool, Daniel Perelman, and Tao Xie. 2014. Code Hunt: Searching for Secret Code for Fun. Proceedings of the International Conference on Software Engineering (Workshops) (June 2014). http://research.microsoft.com/apps/pubs/default.aspx?id=210651

[40] Nikolai Tillmann, Jonathan De Halleux, Tao Xie, Sumit Gulwani, and Judith Bishop. 2013. Teaching and Learning Programming and Software Engineering via Interactive Gaming. In Proc. 35th International Conference on Software Engineering (ICSE 2013), Software Engineering Education (SEE). http://www.cs.illinois.edu/homes/taoxie/publications/icse13see-pex4fun.pdf

[41] Takeaki Uno. 1997. Algorithms for Enumerating All Perfect, Maximum and Maximal Matchings in Bipartite Graphs. In ISAAC. 92–101.

[42] Jooyong Yi, Umair Z. Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury. 2017. A Feasibility Study of Using Automated Program Repair for Introductory Programming Assignments. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 740–751. https://doi.org/10.1145/3106237.3106262

[43] K. Zhang and D. Shasha. 1989. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput. 18, 6 (Dec. 1989), 1245–1262. https://doi.org/10.1137/0218082