

Clustering LaTeX Solutions to Machine Learning Assignments for Rapid Assessment

Sindy Tan and Finale Doshi-Velez
Harvard University
[email protected]
[email protected]

Juan Quiroz
Sunway University
[email protected]

Elena Glassman
UC Berkeley
[email protected]

ABSTRACT

Math assignments in machine learning (ML) classes allow students to gain competency in performing derivations. These derivations are graded by hand, which is a time-consuming process. We present our on-going work on clustering derivations so that they can be graded by cluster, rather than one at a time. This paper describes our current algorithms and interface for merging mathematically equivalent expressions, a first step toward the goal of clustering solutions.

1. INTRODUCTION

Demand for machine learning (ML) education is rising rapidly. While writing code to analyze data and train models is an essential pedagogical component of ML courses, equally essential are math assignments that allow students to gain competency in performing derivations. Unlike programming assignments that can be empirically verified using a battery of teacher-chosen test cases and unit tests, these student-written ML derivations contain sequences of non-trivial mathematical expressions, often written in LaTeX. There is little prior work on automatically checking LaTeX expressions and derivations, let alone those of the length and complexity commonly seen in even basic undergraduate ML homework assignments.

As a result, grading can be a time-consuming manual effort. In large classes and Massive Open Online Courses (MOOCs), this grading process can require building a large staff of sufficiently qualified teaching assistants to each perform many hours of work grading solutions one at a time. Not only is time spent grading time spent away from students, but human graders are subject to fatigue and may make mistakes or drift in their interpretation of a rubric over time.

One way to reduce time spent grading is to create tools for powergrading, i.e., clustering solutions so that teaching assistants can grade solutions one cluster at a time rather than one solution at a time [1]. In the powergrading paradigm, the teaching assistant only needs to check whether all the solutions in the same cluster should receive the same grade, and then assign that grade. Checking clusters is faster than grading, thus reducing the overall time spent grading [5; 3]. By assigning all solutions in a cluster the same grade, we hope to also achieve greater grading consistency. Importantly, even if not all solutions can be clustered, being able to cluster a large fraction of the solutions can still result in efficiency and consistency gains.

Figure 1: The interface for reviewing LaTeX expressions and solutions for a machine learning (ML) homework problem.

Clustering LaTeX derivations is a large and challenging task. In this paper, we share on-going work focused on an intermediate problem: clustering expressions, or single lines of derivations. There are many different correct ways of writing mathematically equivalent expressions, and students may use multiple ways of denoting mathematical operations, e.g., differentiation. Similarly, each student's choices of symbols, like variable names in programs, suggest a certain semantic meaning, but they may still vary. For example, for problem 2 in our dataset, the grader can safely assume that c0 and C are the same constant when they appear in an expression that adds a constant to the term −½wᵀwα. In other words, −½wᵀwα + c0 and −½wᵀwα + C can be considered equivalent in answers to problem 2, though that may not be true in general. The clustering is itself driven by machine learning; we design a back-end that re-clusters inputs given corrections from the teaching assistant via the front-end (Fig. 1). From our initial exploration of the intermediate problem, we find that even clustering expressions from raw source can be challenging. We also note that while expression clustering is a necessary sub-task for clustering solutions, many additional steps will be required to cluster solutions. Still, our initial results suggest that this direction may be useful for clustering derivations in the future.


2. RELATED WORK

Our work is similar in spirit to prior work on providing feedback at scale for programming assignments [4; 3; 8]. These systems group together identical code segments so that the teaching staff can understand the different ways in which students completed assignments; correctness is checked automatically by running the code against various unit tests.

Mathematical assignments that require non-trivial symbolic answers or derivations present a similar but distinct set of challenges. Lan et al. [5] cluster Matlab answers to open-response math questions using a bag-of-expressions model. Responses are collected in Matlab, a symbolic format that is more structured than raw LaTeX.

Given that there are multiple ways to express the same mathematical idea, human judgment can be a powerful tool for making sense of and clustering student-written answers based on their semantic, not superficial, commonalities. Interactive clustering (e.g., [7]) is popular in many domains, ranging from organizing personal travels [9] to document collections [6]. Our current work uses a fairly simple version of interactive clustering, but there are many opportunities for intelligent approaches using active learning.

3. DATA EXTRACTION AND PROCESSING

In order to explore the feasibility of clustering LaTeX solutions, we acquired a dataset of solutions to machine learning homework assignments, parsed these solutions, and extracted a set of hand-chosen features.

3.1 Data set

To develop the grading tool, we used student submissions collected over one semester of Harvard's CS 181 Machine Learning course. We received 341 and 219 LaTeX submissions for the first two homework assignments, respectively. Of these 341 submissions, we train on a random subset of 110 submissions that were not empty and could be parsed by our pipeline, described in further detail in the next section. We only analyze these non-empty, parsable submissions so that, after preprocessing, the extracted math expressions are a comprehensive representation of the original student submissions.

3.2 Pre-processing

Students submitted solutions for each homework assignment as a single LaTeX file. Each homework assignment consisted of three problems, the first two of which required a mathematical derivation. The first problem had three parts, which we refer to as 1a, 1b, and 1c, while the second problem had just one part, which we refer to as 2.

These files were parsed and split by problem. Within each problem, the contents were further split into LaTeX "expressions," i.e., any text entered in LaTeX's math mode. Only standard commands such as "\begin{itemize} ... \end{itemize}" and "$ ... $" can be parsed for detecting separate solution parts and math mode. Our parser can handle standard LaTeX commands but not commands imported from special packages or user-defined commands or symbols. We ignored any non-math text, e.g., explanations of derivation steps in words, that may also be included in the solutions, inline or otherwise. Any expression that did not include an equals sign (=) was also ignored, under the assumption that such expressions are most likely part of the explanations of derivation steps. The text that surrounds the extracted mathematical statements is informative and may be included in our future clustering work, but for now, we only consider the mathematical statements themselves.

Figure 2: A visual representation of how the expression "2 + ∑_{i=0}^{∞} x_i" is broken down into a math syntax tree (MST).

Each LaTeX expression was then translated, using the latex2sympy library, into a canonicalized expression that could be recognized by SymPy, a Python library for symbolic math. As illustrated in Figure 2, this canonicalized expression can be interpreted as a tree, where each node is a mathematical operation or subexpression and each leaf is a number or variable, which we call an "atom." We refer to this as a math syntax tree (MST), because it is similar to the abstract syntax tree (AST) used to describe the structure of source code in various programming languages. Every node of the MST, including the root node and leaf nodes, is considered to be the root of an MST subtree. These MST subtrees represent meaningful groups of atoms and operators that also preserve mathematical relationships, e.g., order of operations.
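To make the MST concrete, the following sketch enumerates the subtrees of the Figure 2 expression with SymPy. This is illustrative rather than our actual pipeline: the expression is built directly in SymPy instead of being parsed from raw LaTeX through latex2sympy, and `mst_subtrees` is a hypothetical helper name.

```python
import sympy as sp

def mst_subtrees(expr):
    """Every node of the math syntax tree (MST), including the root and
    the leaf atoms, is the root of its own subtree; a preorder traversal
    enumerates exactly these subtrees."""
    return list(sp.preorder_traversal(expr))

# The expression "2 + sum_{i=0}^{inf} x_i" from Figure 2, built directly
# in SymPy (latex2sympy would produce an equivalent object from LaTeX).
i = sp.symbols('i')
x = sp.IndexedBase('x')
expr = 2 + sp.Sum(x[i], (i, 0, sp.oo))

subtrees = mst_subtrees(expr)
```

The root expression, the summation, the indexed atom x_i, and the bare number 2 all appear as separate subtrees, which is what lets the feature matrices of Section 5 treat them independently.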

3.3 Feature Extraction

We manually selected a set of operations and atoms that capture aspects we expect to be important in correct solutions to the problems in our dataset. These features are listed in Table 1. After extracting all subtrees from the expression MSTs, we check for these features within each subtree.

We use these features to later isolate the relative importance of specific operators and atoms within each problem. For example, if, for one problem, graders are focused on whether each student used the correct signs while, for another problem, graders are focused on summation limits, these features will capture that difference.
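A minimal sketch of this per-subtree feature check, using a handful of illustrative stand-ins for the Table 1 features (the predicate definitions here are assumptions for demonstration, not our exact feature set):

```python
import sympy as sp

w, x = sp.symbols('w x')

# Illustrative stand-ins for a few Table 1 features: each feature is a
# predicate that can fire on any node of a subtree.
FEATURES = {
    "Add":        lambda t: isinstance(t, sp.Add),
    "Multiply":   lambda t: isinstance(t, sp.Mul),
    "Derivative": lambda t: isinstance(t, sp.Derivative),
    "atom_w":     lambda t: t == w,
}

def feature_vector(subtree):
    """Binary presence/absence of each feature anywhere within one subtree."""
    nodes = list(sp.preorder_traversal(subtree))
    return {name: int(any(pred(n) for n in nodes))
            for name, pred in FEATURES.items()}

# The multiplicative term from the problem-2 example in the introduction.
fv = feature_vector(-sp.Rational(1, 2) * w * x)
```

Stacking one such vector per subtree yields the binary feature matrix used in Section 5.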

Table 1: Hand-picked feature set

Operators: Add, Multiply, Power, Derivative, Function, Sum, ∑i, ∑j, ∇, ∇w0, Tuple, Integer, Symbol, iszero

Atoms: −2, −1, 0, 1, 2, 5, A, B, C, J, N, T, X, Xij, Y, a, i, j, m, n, w, w0, wi, wj, x, xi, y, yi, yj, z, α, λ, ∂

Our current selection of features is a limitation, sincethe features are specific to the problems we were clustering.


We leave additional feature extraction that can generalizeacross a wider range of problems for future work.

3.4 Data handling

We used MongoDB to store all preprocessed solutions, expressions, subtrees, features, and their mappings to each other. User interface (UI) attributes such as frequency and circulation status were stored in the same database. All derived math objects, such as matrices and vectors, are stored as separate files. The interaction between the UI, database, and backend is as follows: the UI prompts backend computation, which updates the database and math objects; changes in the database then update the UI. This architecture simplifies debugging and race-condition handling.

4. CURRENT INTERFACE

As part of their grading responsibilities, graders must distinguish between meaningfully distinct variations, e.g., incorrect solutions, and superficially distinct variations, e.g., correct solutions that follow the same approach but use distinct notation and/or symbols. The interface, shown in Figure 1, allows graders to inform the system of equivalences they recognize at both the expression and solution level.

The current interface renders all the solutions to one problem, and all the unique expressions in those solutions, in a two-column layout. In the left column, the expressions are ordered from most to least frequently used in student solutions. In the right column, entire student solutions, i.e., mathematical expressions and surrounding textual explanations, are listed. When the grader spots two or more items in a column that they consider semantically equivalent in the problem context, they can select these items and click the Merge button. This interface does not yet render the results of the clustering process described in the sections that follow.

5. MACHINE-ASSISTED EXPRESSION CLUSTERING

While graders could manually merge all equivalent expressions and/or solutions to form clusters of equivalent solutions that can receive the same grade, this process could benefit from machine assistance. We now describe a process for assisting this clustering via interactive clustering: as the teaching assistant assigns expressions to clusters, our system learns parameters to cluster the remaining expressions. The teaching assistant can keep this clustering or continue to make corrections, which will again propagate information to expressions that have not been manually labeled. Knowledge from clustering previous expressions is also retained to reduce the number of labels required for future expressions.

The process of machine-assisted clustering has three main components. First, we must define a distance between two expressions. This distance is parameterized by a weight vector w. Given the weights, we use hierarchical agglomerative clustering to define the clusters. Finally, we define a process for learning the weight vector w that parameterizes the distance function, given human labels indicating which expressions are semantically equivalent.

5.1 Computing Distances between Expressions

To compute distances between LaTeX expressions, we first define the following terms:

• X: the set of unique expressions in all student solutions to a single problem

• Y: the set of unique subtrees over X

• S: an |X| × |Y| binary matrix encoding the presence or absence of each subtree in each expression

• Z: the set of features

• S′: a |Y| × |Z| binary matrix encoding the presence or absence of each feature in each subtree

• w: the vector of feature weights

Next we compute D, an |X| × |X| matrix of inter-expression Euclidean distances, via Algorithm 1.

Algorithm 1 Distance between expressions

1: function Distance(X, w)
2:   S ← |X| × |Y| zero matrix
3:   S′ ← |Y| × |Z| zero matrix
4:   for x ∈ X do
5:     for node ∈ MST(x) do
6:       S[x, node] ← 1
7:   for y ∈ Y do
8:     for z ∈ Z do
9:       if z is a feature of y then
10:        S′[y, z] ← 1
11:  for x1, x2 ∈ X do        ▷ only upper triangle needed
12:    M ← (SS′)[x1] − (SS′)[x2]
13:    D[x1, x2] ← (M ⊙ w)Mᵀ
14:  return D
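Lines 11–13 of Algorithm 1 vectorize naturally in NumPy. The sketch below assumes S and S′ have already been filled in as in lines 2–10; `expression_distances` is an illustrative name, and the value computed is the squared weighted Euclidean distance that the (M ⊙ w)Mᵀ product yields.

```python
import numpy as np

def expression_distances(S, Sp, w):
    """Vectorized lines 11-13 of Algorithm 1.

    S  : |X| x |Y| binary subtree-presence matrix
    Sp : |Y| x |Z| binary feature-presence matrix (S' in the text)
    w  : length-|Z| feature weight vector
    Returns the |X| x |X| matrix D with
    D[i, j] = (M * w) . M  where  M = (S Sp)[i] - (S Sp)[j].
    """
    F = S @ Sp                                   # |X| x |Z| features per expression
    M = F[:, None, :] - F[None, :, :]            # pairwise row differences
    return np.einsum('ijz,z,ijz->ij', M, w, M)   # weighted squared distances
```

Computing the full matrix this way trades the upper-triangle shortcut of line 11 for a single vectorized pass, which is usually faster in practice for a few hundred expressions.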

5.2 Clustering

Given the distances between solutions, clustering them is straightforward. We apply hierarchical agglomerative clustering from sklearn to cluster solutions.
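A sketch of this step, feeding a precomputed expression-distance matrix to scikit-learn's agglomerative clustering. The toy matrix below is made up, and note that older scikit-learn releases spell the precomputed option `affinity=` rather than `metric=`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy precomputed distance matrix: expressions 0-1 and 2-3 form two
# tight groups that are far from each other.
D = np.array([
    [0.0, 0.1, 5.0, 5.0],
    [0.1, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 0.1],
    [5.0, 5.0, 0.1, 0.0],
])

try:
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                    linkage="average")
except TypeError:  # scikit-learn < 1.2 uses `affinity=` instead of `metric=`
    model = AgglomerativeClustering(n_clusters=2, affinity="precomputed",
                                    linkage="average")

labels = model.fit_predict(D)
```

Because the distances are precomputed, re-clustering after a weight update only requires recomputing D and calling `fit_predict` again.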

5.3 Learning Distances: Interactive Clustering

As mentioned before, the process of clustering depends very much on our notion of distance. In our case, we have parameterized the distance between two expressions, which governs the rest of the inter-solution distance and solution clustering process, by a set of weights w. We start with a uniform set of feature weights w, which is used to generate an initial clustering. Given a clustering, the teaching assistant can make two kinds of requests: "split" or "merge". A "split" request moves two or more clustered expressions apart; a "merge" request is the opposite action, moving two or more separated expressions into the same cluster. In our results, we compare three ways of addressing these requests, described below.

Simple update. Clusters are presented in order of size. During a merge request, the expression with the highest frequency of occurrence persists while all others stop circulating. The frequency of the persisting expression is updated with the sum of the frequencies of all expressions in the request. Ties in frequency are settled by random selection. During a split request, a new cluster is made for each expression in the request.

Simple update with similarity ranking. Clusters are presented in order of similarity, as determined by our distance function. The update method itself is the same as the simple update.
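The simple update's merge bookkeeping can be sketched as follows. The `Cluster` class and `simple_merge` are illustrative names, and for brevity this sketch breaks frequency ties by first occurrence rather than at random as described above.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    representative: str   # the expression shown to the grader
    frequency: int        # how many student solutions use it
    circulating: bool = True

def simple_merge(clusters, request_ids):
    """Merge the requested clusters: the most frequent expression persists
    and absorbs the others' frequencies; the rest stop circulating."""
    requested = [clusters[i] for i in request_ids]
    winner = max(requested, key=lambda c: c.frequency)
    for c in requested:
        if c is not winner:
            winner.frequency += c.frequency
            c.circulating = False
    return winner
```

A split request is the inverse bookkeeping: each expression in the request gets a fresh circulating `Cluster` with its own frequency.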

Weights-optimized update. When a user makes a merge request between expressions x_i and x_j, the system sets the distance between these two expressions to a target value of 0. The updated target matrix is then sent to a gradient descent algorithm, which learns the values of the weights that best match the new target matrix.

We use the L1 norm as the loss function for gradient descent, as shown in Equation 1:

loss(w) = (1/|X|) ∑_{(i,j)∈X} |D_target(x_i, x_j) − D_predicted(x_i, x_j)|    (1)
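Because each entry of D is linear in w (D[i, j] = ∑_z w_z M_z²), Equation 1 can be minimized with plain gradient descent. The sketch below assumes the whole target matrix is given at once; `learn_weights` is an illustrative name, and the nonnegativity clip is our assumption to keep distances valid.

```python
import numpy as np

def learn_weights(F, D_target, lr=0.01, steps=100):
    """Gradient descent on the feature weights for the L1 loss of Eq. (1).

    F        : |X| x |Z| expression-feature matrix (F = S S')
    D_target : |X| x |X| target distances (merged pairs set to 0)
    """
    n, num_feats = F.shape
    w = np.ones(num_feats)                          # uniform initialization
    sq = (F[:, None, :] - F[None, :, :]) ** 2       # M^2 for every pair
    for _ in range(steps):
        D_pred = sq @ w                             # D[i, j] = sum_z w_z M_z^2
        grad_D = np.sign(D_pred - D_target) / n     # d loss / d D[i, j]
        grad_w = np.einsum('ij,ijz->z', grad_D, sq) # chain rule back to w
        w -= lr * grad_w
        w = np.maximum(w, 0.0)                      # keep distances nonnegative
    return w
```

Zeroing the target distance of a merged pair pushes down exactly the weights of the features on which the pair differs, which is the behavior the merge request asks for.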

6. RESULTS

6.1 Cluster fingerprints

The first important hypothesis to check is whether comparing expressions based on the subtrees of their math syntax trees allows us to distinguish between equivalent and distinct expressions. Figure 4 shows the subtree fingerprints for each of a small set of expressions. The expressions are manually grouped by our own labeling of similarity.

As seen in all but the fourth cluster of Fig. 3 and Fig. 4, the subtree fingerprints are in most cases sufficient for defining a cluster. The fourth cluster is problematic because there are more variations across expressions than similarities. This is a common issue for clusters with very short expressions, i.e., expressions with few subtrees and thus sparse features. This exploration is informative because it demonstrates that while subtrees may provide an effective means for distinguishing certain human-defined clusters (e.g., clusters 1, 2, 3, and 5), they may not be effective for identifying other clusters (e.g., cluster 4). Knowing that expressions may be challenging to distinguish at the subtree level is important because this challenge will persist for any choice of weights w and features (not just those in Table 1). One solution to this issue may be to improve canonicalization of atoms in the preprocessing steps.

6.2 Weights optimization

Fig. 5 shows a target distance matrix (343 × 343) that was generated by calculating the Euclidean distances between 343 expressions extracted from all solutions to one problem. We set the first 80 rows of the target distance matrix to a value of 0 to simulate merge requests; because of symmetry, we also set the first 80 columns to 0. The dark areas represent distances equal to or close to 0. The gray and light areas represent expressions that are farther apart. The learned distance matrix shows that the optimization learns appropriate weights to get close to the target distance matrix (note how the values in the left-most columns of the "Learned" matrix are lighter than the columns of the "Initial").

Figure 3: A sample of five expression clusters and their features. The red lines separate expressions into their respective clusters.

7. DISCUSSION AND CONCLUSIONS

Our results above demonstrate that math syntax trees, perhaps with some additional canonicalization, may be sufficient for distinguishing clusters of expressions in many cases. We also showed that our approach of parameterizing distances as weights on features found in subtrees of math syntax trees can be used to make inter-expression distances come closer to matching annotations (and once distance functions are close, clusterings will also be close). That said, while our "Learned" matrix is clearly closer to the "Target" than the "Initial," it is still far from a perfect match. It remains an open direction for future work to determine the best way to featurize expressions such that it is possible to learn distance functions that recover human-desired clusterings.

Specifically, we believe that improving features will likely require moving beyond a simple application of the latex2sympy library. The latex2sympy parser is unable to convert a large number of expressions to SymPy, resulting in loss of important information for determining similarity between solutions. One solution is to improve the lexical analyzer and parser of the latex2sympy library. In addition, while there is a set of standard commands in LaTeX, the availability of specialized packages and custom commands limits the scope of what can be parsed by our math grading tool. Recent advancements in decompiling visual markup suggest a promising solution [2]. In particular, since there is less variation in rendered math than in the markup used to generate it, it would be more manageable to parse decompiled LaTeX than raw LaTeX submissions from students.

Figure 4: A sample of five expression clusters and their subtrees. The red lines separate expressions into their respective clusters.

Figure 5: Target distance matrix, initial distance matrix with weights initialized to a uniform value of 1, and learned distance matrix after 100 steps of gradient descent.

Finally, future work on comparing expressions should determine when distances are "good enough." While it may be challenging to create feature sets such that expressions in the same cluster have perfect similarity, it may be sufficient that they are all similar enough to be clustered together, and that other points are dissimilar enough to be excluded.

More broadly, there remains the question of how to use expression clusterings to help cluster solutions. Again, we will need to define a distance metric, now between solutions. A simple approach would be to treat expression distances as costs, and then compute the cost of the best bipartite match between two solutions as the cluster similarity. However, from an initial exploration of the solution clustering problem, we believe that more sophistication will be needed, and clustering expressions is only a piece of the broader problem. While there are cases, such as those we showed, where students write equivalent expressions in different ways, it is also true that the nature of the problem set-up ensures that many expressions are already equivalent. To effectively cluster solutions, we will have to identify patterns in how expressions are ordered and skipped (rather than use a generic bipartite match), as well as integrate information from the surrounding text (which may correspond to a skipped line). Not being able to cluster certain expressions, such as the short ones in Figure 4, may not be a practical bottleneck if there are ways to keep such unclustered expressions from contributing overmuch to solution clusters. Once we tackle this objective, we will be closer to our overall goal: creating a tool to reduce the burden of grading and providing feedback for a large number of solutions by clustering the solutions and allowing the teaching assistant to grade only one representative solution from each cluster.
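The simple bipartite-match baseline mentioned above can be sketched with SciPy's assignment solver (`solution_distance` is an illustrative name, and the 2 × 2 cost matrix is made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def solution_distance(expr_dist):
    """Cost of the best bipartite match between two solutions' expressions.

    expr_dist[i, j] is the distance between expression i of solution A
    and expression j of solution B.
    """
    rows, cols = linear_sum_assignment(expr_dist)  # minimum-cost matching
    return expr_dist[rows, cols].sum()

# Two solutions whose expressions line up pairwise (low diagonal cost).
cost = np.array([[1.0, 10.0],
                 [10.0, 1.0]])
```

As argued above, this baseline ignores expression order and skipped lines, which is part of why we expect a more sophisticated solution-level metric to be needed.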

8. ACKNOWLEDGEMENTS

We thank Ryan Bellmore for extracting student submissions for us to use as data.

9. REFERENCES

[1] S. Basu, C. Jacobs, and L. Vanderwende. Powergrading: a clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, October 2013.

[2] Y. Deng, A. Kanervisto, and A. M. Rush. What you get is what you see: A visual markup decompiler. Association for the Advancement of Artificial Intelligence, pre-print on arXiv, 2017.

[3] E. L. Glassman, J. Scott, R. Singh, P. J. Guo, and R. C. Miller. OverCode: Visualizing variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction (TOCHI), 22(2):7, 2015.

[4] S. Gulwani, I. Radiček, and F. Zuleger. Feedback generation for performance problems in introductory programming assignments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 41–51, New York, NY, USA, 2014. ACM.

[5] A. S. Lan, D. Vats, A. E. Waters, and R. G. Baraniuk. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, pages 167–176. ACM, 2015.

[6] H. Lee, J. Kihm, J. Choo, J. Stasko, and H. Park. iVisClustering: An interactive visual document clustering via topic modeling. In Computer Graphics Forum, volume 31, pages 1155–1164. Wiley Online Library, 2012.

[7] M. Rasmussen and G. Karypis. gCLUTO: An interactive clustering, visualization, and analysis system. UMN-CS TR-04, 21(7), 2004.

[8] S. Srikant and V. Aggarwal. A system to grade computer programming skills using machine learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1887–1896, New York, NY, USA, 2014. ACM.

[9] C. Zhou, D. Frankowski, P. Ludford, S. Shekhar, and L. Terveen. Discovering personal gazetteers: an interactive clustering approach. In Proceedings of the 12th Annual ACM International Workshop on Geographic Information Systems, pages 266–273. ACM, 2004.