
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 440–450, Vancouver, Canada, July 30 – August 4, 2017. © 2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1041

A Syntactic Neural Model for General-Purpose Code Generation

Pengcheng Yin, Language Technologies Institute, Carnegie Mellon University, [email protected]

Graham Neubig, Language Technologies Institute, Carnegie Mellon University, [email protected]

Abstract

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.

1 Introduction

Every programmer has experienced the situation where they know what they want to do, but do not have the ability to turn it into a concrete implementation. For example, a Python programmer may want to “sort my list in descending order,” but not be able to come up with the proper syntax sorted(my_list, reverse=True) to realize their intention. To resolve this impasse, it is common for programmers to search the web in natural language (NL), find an answer, and modify it into the desired form (Brandt et al., 2009, 2010). However, this is time-consuming, and thus the software engineering literature is ripe with methods to directly generate code from NL descriptions, mostly with hand-engineered methods highly tailored to specific programming languages (Balzer, 1985; Little and Miller, 2009; Gvero and Kuncak, 2015).

In parallel, the NLP community has developed methods for data-driven semantic parsing, which attempt to map NL to structured logical forms executable by computers. These logical forms can be general-purpose meaning representations (Clark and Curran, 2007; Banarescu et al., 2013), formalisms for querying knowledge bases (Tang and Mooney, 2001; Zettlemoyer and Collins, 2005; Berant et al., 2013), and instructions for robots or personal assistants (Artzi and Zettlemoyer, 2013; Quirk et al., 2015; Misra et al., 2015), among others. While these methods have the advantage of being learnable from data, the domain-specific languages targeted by these works have a schema and syntax that is relatively simple compared to the programming languages (PLs) in use by programmers.

Recently, Ling et al. (2016) proposed a data-driven code generation method for high-level, general-purpose PLs like Python and Java. This work treats code generation as a sequence-to-sequence modeling problem, and introduces methods to generate words from character-level models and to copy variable names from input descriptions. However, unlike most work in semantic parsing, it does not consider the fact that code has to form a well-defined program in the target syntax.

In this work, we propose a data-driven, syntax-based neural network model tailored for generation of general-purpose PLs like Python. In order to capture the strong underlying syntax of the PL, we define a model that transduces an NL statement into an Abstract Syntax Tree (AST; Fig. 1(a), § 2) for the target PL. ASTs can be deterministically generated for all well-formed programs using standard parsers provided by the PL, and thus give us a way to obtain syntax information with minimal engineering. Once we generate an AST, we can use deterministic generation tools to convert the AST into surface code.


Production Rule | Role | Explanation
Call ↦ expr[func] expr*[args] keyword*[keywords] | Function Call | func: the function to be invoked; args: arguments list; keywords: keyword arguments list
If ↦ expr[test] stmt*[body] stmt*[orelse] | If Statement | test: condition expression; body: statements inside the If clause; orelse: elif or else statements
For ↦ expr[target] expr*[iter] stmt*[body] stmt*[orelse] | For Loop | target: iteration variable; iter: enumerable to iterate over; body: loop body; orelse: else statements
FunctionDef ↦ identifier[name] arguments*[args] stmt*[body] | Function Def. | name: function name; args: function arguments; body: function body

Table 1: Example production rules for common Python statements (Python Software Foundation, 2016)

We hypothesize that such a structured approach has two benefits.

First, we hypothesize that structure can be used to constrain our search space, ensuring generation of well-formed code. To this end, we propose a syntax-driven neural code generation model. The backbone of our approach is a grammar model (§ 3) which formalizes the generation story of a derivation AST as the sequential application of actions that either apply production rules (§ 3.1) or emit terminal tokens (§ 3.2). The underlying syntax of the PL is therefore encoded in the grammar model a priori as the set of possible actions. Our approach frees the model from recovering the underlying grammar from limited training data, and instead enables the system to focus on learning the compositionality among existing grammar rules. Xiao et al. (2016) have noted that this imposition of structure on neural models is useful for semantic parsing, and we expect this to be even more important for general-purpose PLs where the syntax trees are larger and more complex.

Second, we hypothesize that structural information helps to model information flow within the neural network, which naturally reflects the recursive structure of PLs. To test this, we extend a standard recurrent neural network (RNN) decoder to allow for additional neural connections which reflect the recursive structure of an AST (§ 4.2). As an example, when expanding the node ? in Fig. 1(a), we make use of the information from both its parent and left sibling (the dashed rectangle). This enables us to locally pass information of relevant code segments via neural network connections, resulting in more confident predictions.

Experiments (§ 5) on two Python code generation tasks show 11.7% and 9.3% absolute improvements in accuracy against the state-of-the-art system (Ling et al., 2016). Our model also gives competitive performance on a standard semantic parsing benchmark.1

1 Implementation available at https://github.com/neulab/NL2code

2 The Code Generation Problem

Given an NL description x, our task is to generate the code snippet c in a modern PL based on the intent of x. We attack this problem by first generating the underlying AST. We define a probabilistic grammar model of generating an AST y given x: p(y|x). The best-possible AST ŷ is then given by

\hat{y} = \arg\max_{y} p(y \mid x).   (1)

ŷ is then deterministically converted to the corresponding surface code c.2 While this paper uses examples from Python code, our method is PL-agnostic.

Before detailing our approach, we first present a brief introduction of the Python AST and its underlying grammar. The Python abstract grammar contains a set of production rules, and an AST is generated by applying several production rules composed of a head node and multiple child nodes. For instance, the first rule in Tab. 1 is used to generate the function call sorted(·) in Fig. 1(a). It consists of a head node of type Call, and three child nodes of type expr, expr* and keyword*, respectively. Labels of each node are noted within brackets. In an AST, non-terminal nodes sketch the general structure of the target code, while terminal nodes can be categorized into two types: operation terminals and variable terminals. Operation terminals correspond to basic arithmetic operations like AddOp. Variable terminal nodes store values for variables and constants of built-in data types.3 For instance, all terminal nodes in Fig. 1(a) are variable terminal nodes.
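To make the AST machinery concrete, the following sketch (ours, not part of the paper's system) uses the standard ast module and the astor library mentioned in footnote 2 to parse the running example into its AST and to regenerate surface code from it; the printed node types correspond to the Call production in Tab. 1.

```python
import ast
import astor  # third-party package; footnote 2 notes it is used to turn ASTs back into code

# Parse the running example into its AST; the expression statement wraps a Call node,
# mirroring Fig. 1(a).
tree = ast.parse("sorted(my_list, reverse=True)")
call = tree.body[0].value

print(type(call).__name__)               # Call
print(type(call.func).__name__)          # Name   (the expr[func] child)
print(len(call.args))                    # 1      (the expr*[args] child list)
print([kw.arg for kw in call.keywords])  # ['reverse']  (the keyword*[keywords] child list)

# Deterministically convert the AST back to surface code.
print(astor.to_source(tree).strip())     # sorted(my_list, reverse=True)
```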

3 Grammar Model

Before detailing our neural code generation method, we first introduce the grammar model at its core. Our probabilistic grammar model defines the generative story of a derivation AST.

2 We use the astor library to convert ASTs into Python code.
3 bool, float, int, str.


[Figure 1: (a) the Abstract Syntax Tree (AST) for the given example code. Dashed nodes denote terminals. Nodes are labeled with time steps during which they are generated. (b) the action sequence (up to t14) used to generate the AST in (a). The legend distinguishes action flow, parent feeding, APPLYRULE actions, GENTOKEN actions, and GENTOKEN with copy.]

We factorize the generation process of an AST into the sequential application of actions of two types:

• APPLYRULE[r] applies a production rule r to the current derivation tree;

• GENTOKEN[v] populates a variable terminal node by appending a terminal token v.

Fig. 1(b) shows the generation process of the target AST in Fig. 1(a). Each node in Fig. 1(b) indicates an action. Action nodes are connected by solid arrows which depict the chronological order of the action flow. The generation proceeds in depth-first, left-to-right order (dotted arrows represent parent feeding, explained in § 4.2.1).

Formally, under our grammar model, the probability of generating an AST y is factorized as:

p(y \mid x) = \prod_{t=1}^{T} p(a_t \mid x, a_{<t}),   (2)

where a_t is the action taken at time step t, and a_{<t} is the sequence of actions before t. We will explain how to compute the action probabilities p(a_t | ·) in Eq. (2) in § 4. Put simply, the generation process begins from a root node at t_0 and proceeds by the model choosing APPLYRULE actions to generate the overall program structure from a closed set of grammar rules; then, at leaves of the tree corresponding to variable terminals, the model switches to GENTOKEN actions to generate variables or constants from the open set. We describe this process in detail below.
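As an illustration of this factorization (a sketch with our own class and rule names, not the released implementation), the derivation of the running example can be written down as an explicit action sequence:

```python
from dataclasses import dataclass

@dataclass
class ApplyRule:
    rule: str   # a production from the closed set of grammar rules (strings are illustrative)

@dataclass
class GenToken:
    token: str  # a terminal token from the open set; "</n>" closes the current node

# Abbreviated oracle action sequence for sorted(my_list, reverse=True),
# in the depth-first, left-to-right order of Fig. 1(b).
actions = [
    ApplyRule("root -> Expr"),
    ApplyRule("Expr -> expr[value]"),
    ApplyRule("expr -> Call"),
    ApplyRule("Call -> expr[func] expr*[args] keyword*[keywords]"),
    ApplyRule("expr -> Name"),
    GenToken("sorted"), GenToken("</n>"),
    # ... the args and keywords subtrees are expanded analogously,
    # emitting my_list, reverse and True as terminal tokens.
]
```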

3.1 APPLYRULE Actions

APPLYRULE actions generate program structure, expanding the current node (the frontier node at time step t, n_{f_t}) in a depth-first, left-to-right traversal of the tree. Given a fixed set of production rules, APPLYRULE chooses a rule r from the subset whose head matches the type of n_{f_t}, and uses r to expand n_{f_t} by appending all child nodes specified by the selected production. As an example, in Fig. 1(b), the rule Call ↦ expr . . . expands the frontier node Call at time step t_4, and its three child nodes expr, expr* and keyword* are added to the derivation.

APPLYRULE actions grow the derivation AST by appending nodes. When a variable terminal node (e.g., str) is added to the derivation and becomes the frontier node, the grammar model then switches to GENTOKEN actions to populate the variable terminal with tokens.

Unary Closure  Sometimes, generating an AST requires applying a chain of unary productions. For instance, it takes three time steps (t_9–t_11) to generate the sub-structure expr* ↦ expr ↦ Name ↦ str in Fig. 1(a). This can be effectively reduced to one step of APPLYRULE action by taking the closure of the chain of unary productions and merging them into a single rule: expr* ↦* str. Unary closures reduce the number of actions needed, but potentially increase the size of the grammar. In our experiments we tested our model both with and without unary closures (§ 5).
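A minimal sketch of this unary-closure preprocessing (our own simplification; it assumes the unary chain appears as consecutive rules in an action-ordered list):

```python
def unary_closure(rules):
    """Collapse consecutive unary productions, e.g.
    expr* -> expr, expr -> Name, Name -> str  becomes the single rule  expr* ->* str."""
    merged, i = [], 0
    while i < len(rules):
        head, children = rules[i]
        j = i
        # extend the chain while the current rule is unary and feeds the next rule's head
        while len(rules[j][1]) == 1 and j + 1 < len(rules) and rules[j + 1][0] == rules[j][1][0]:
            j += 1
        merged.append((head, rules[j][1]))
        i = j + 1
    return merged

chain = [("expr*", ["expr"]), ("expr", ["Name"]), ("Name", ["str"])]
print(unary_closure(chain))  # [('expr*', ['str'])]
```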

3.2 GENTOKEN Actions

Once we reach a frontier node n_{f_t} that corresponds to a variable type (e.g., str), GENTOKEN actions are used to fill this node with values. For general-purpose PLs like Python, variables and constants have values with one or multiple tokens.


For instance, a node that stores the name of a function (e.g., sorted) has a single token, while a node that denotes a string constant (e.g., a='hello world') could have multiple tokens. Our model copes with both scenarios by firing GENTOKEN actions at one or more time steps. At each time step, GENTOKEN appends one terminal token to the current frontier variable node. A special </n> token is used to “close” the node. The grammar model then proceeds to the new frontier node.

Terminal tokens can be generated from a predefined vocabulary, or be directly copied from the input NL. This is motivated by the observation that the input description often contains out-of-vocabulary (OOV) variable names or literal values that are directly used in the target code. For instance, in our running example the variable name my_list can be directly copied from the input at t_12. We give implementation details in § 4.2.2.
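For illustration only (the </n> marker follows the text above; everything else is our own sketch), a string-valued terminal would be emitted as a short GENTOKEN sequence ending in the closing token:

```python
def gentoken_sequence(value):
    """Split a terminal value into tokens and append the special </n> token
    that closes the frontier variable node."""
    return value.split() + ["</n>"]

print(gentoken_sequence("hello world"))  # ['hello', 'world', '</n>']
print(gentoken_sequence("sorted"))       # ['sorted', '</n>']
```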

4 Estimating Action Probabilities

We estimate action probabilities in Eq. (2) using attentional neural encoder-decoder models with an information flow structured by the syntax trees.

4.1 Encoder

For an NL description x consisting of n words {w_i}_{i=1}^{n}, the encoder computes a context-sensitive embedding h_i for each w_i using a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), similar to the setting in Bahdanau et al. (2014). See supplementary materials for detailed equations.
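Since the exact equations are deferred to the supplementary materials, here is a minimal PyTorch sketch of such an encoder (ours; the dimensions follow the configuration in § 5.2 only loosely, and the released implementation is not PyTorch):

```python
import torch
import torch.nn as nn

class NLEncoder(nn.Module):
    """Bidirectional LSTM that maps each word w_i to a context-sensitive embedding h_i."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # half the hidden size per direction so the concatenated h_i has hidden_dim units
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids):          # word_ids: (batch, n)
        h, _ = self.bilstm(self.embed(word_ids))
        return h                          # (batch, n, hidden_dim)

encoder = NLEncoder(vocab_size=5000)
h = encoder(torch.randint(0, 5000, (1, 6)))  # a 6-word NL description
print(h.shape)                               # torch.Size([1, 6, 256])
```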

4.2 Decoder

The decoder uses an RNN to model the sequential generation process of an AST defined in Eq. (2). Each action step in the grammar model naturally grounds to a time step in the decoder RNN. Therefore, the action sequence in Fig. 1(b) can be interpreted as unrolling RNN time steps, with solid arrows indicating RNN connections. The RNN maintains an internal state to track the generation process (§ 4.2.1), which will then be used to compute action probabilities p(a_t | x, a_{<t}) (§ 4.2.2).

4.2.1 Tracking Generation States

Our implementation of the decoder resembles a vanilla LSTM, with additional neural connections (parent feeding, Fig. 1(b)) to reflect the topological structure of an AST. The decoder's internal hidden state at time step t, s_t, is given by:

s_t = f_{LSTM}([a_{t-1} : c_t : p_t : n_{f_t}],\ s_{t-1}),   (3)

[Figure 2: Illustration of a decoder time step (t = 9), predicting an APPLYRULE or GENTOKEN action for the frontier node given the input “sort my_list in descending order”, the parent state, and the embedding of the frontier node type.]

where f_{LSTM}(·) is the LSTM update function and [:] denotes vector concatenation. s_t will then be used to compute action probabilities p(a_t | x, a_{<t}) in Eq. (2). Here, a_{t-1} is the embedding of the previous action, c_t is a context vector retrieved from the input encodings {h_i} via soft attention, p_t is a vector that encodes the information of the parent action, and n_{f_t} denotes the node type embedding of the current frontier node n_{f_t}.4 Intuitively, feeding the decoder the information of n_{f_t} helps the model keep track of the frontier node to expand.

Action Embedding a_t  We maintain two action embedding matrices, W_R and W_G. Each row in W_R (W_G) corresponds to an embedding vector for an action APPLYRULE[r] (GENTOKEN[v]).

Context Vector c_t  The decoder RNN uses soft attention to retrieve a context vector c_t from the input encodings {h_i} pertaining to the prediction of the current action. We follow Bahdanau et al. (2014) and use a Deep Neural Network (DNN) with a single hidden layer to compute attention weights.

Parent Feeding p_t  Our decoder RNN uses additional neural connections to directly pass information from parent actions. For instance, when computing s_9, the information from its parent action step t_4 will be used. Formally, we define the parent action step p_t as the time step at which the frontier node n_{f_t} is generated. As an example, for t_9, its parent action step p_9 is t_4, since n_{f_9} is the node ?, which is generated at t_4 by the APPLYRULE[Call ↦ . . .] action.

We model parent information p_t from two sources: (1) the hidden state of the parent action, s_{p_t}, and (2) the embedding of the parent action, a_{p_t}; p_t is their concatenation.

4 We maintain an embedding for each node type.


This parent feeding scheme enables the model to utilize the information of parent code segments to make more confident predictions. Similar approaches of injecting parent information were also explored in the SEQ2TREE model of Dong and Lapata (2016).5
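A sketch of one decoder update implementing Eq. (3) (ours; sizes are illustrative, with p_t taken as the concatenation of the parent state and the parent action embedding as described above):

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One time step of Eq. (3): the LSTM cell consumes [a_{t-1} : c_t : p_t : n_{f_t}]."""
    def __init__(self, action_dim=128, ctx_dim=256, state_dim=256, type_dim=64):
        super().__init__()
        parent_dim = state_dim + action_dim   # p_t = [s_{p_t} : a_{p_t}]
        self.cell = nn.LSTMCell(action_dim + ctx_dim + parent_dim + type_dim, state_dim)

    def forward(self, prev_action, ctx, parent, node_type, state):
        x = torch.cat([prev_action, ctx, parent, node_type], dim=-1)
        return self.cell(x, state)            # (s_t, memory cell)

step = DecoderStep()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
state = step(torch.zeros(1, 128), torch.zeros(1, 256),
             torch.zeros(1, 384), torch.zeros(1, 64), state)
print(state[0].shape)                          # torch.Size([1, 256])
```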

4.2.2 Calculating Action Probabilities

In this section we explain how action probabilities p(a_t | x, a_{<t}) are computed based on s_t.

APPLYRULE  The probability of applying rule r as the current action a_t is given by a softmax:6

p(a_t = \text{APPLYRULE}[r] \mid x, a_{<t}) = \mathrm{softmax}(W_R \cdot g(s_t))^{\top} \cdot e(r),   (4)

where g(·) is a non-linearity tanh(W · s_t + b), and e(r) is the one-hot vector for rule r.

GENTOKEN  As in § 3.2, a token v can be generated from a predefined vocabulary or copied from the input, defined as the marginal probability:

p(a_t = \text{GENTOKEN}[v] \mid x, a_{<t}) = p(\text{gen} \mid x, a_{<t})\, p(v \mid \text{gen}, x, a_{<t}) + p(\text{copy} \mid x, a_{<t})\, p(v \mid \text{copy}, x, a_{<t}).

The selection probabilities p(gen | ·) and p(copy | ·) are given by softmax(W_S · s_t). The probability of generating v from the vocabulary, p(v | gen, x, a_{<t}), is defined similarly to Eq. (4), except that we use the GENTOKEN embedding matrix W_G, and we concatenate the context vector c_t with s_t as input. To model the copy probability, we follow recent advances in modeling the copying mechanism in neural networks (Gu et al., 2016; Jia and Liang, 2016; Ling et al., 2016), and use a pointer network (Vinyals et al., 2015) to compute the probability of copying the i-th word from the input by attending to the input representations {h_i}:

p(w_i \mid \text{copy}, x, a_{<t}) = \frac{\exp(\omega(h_i, s_t, c_t))}{\sum_{i'=1}^{n} \exp(\omega(h_{i'}, s_t, c_t))},

where ω(·) is a DNN with a single hidden layer. Specifically, if w_i is an OOV word (e.g., the variable name my_list), which is represented by a special <unk> token during encoding, we directly copy the actual word w_i from the input description to the derivation.
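The mixture above can be sketched as follows (our own illustration, not the released code; OOV handling and batching are omitted):

```python
import torch
import torch.nn.functional as F

def gentoken_distribution(mode_logits, vocab_logits, copy_logits, src_word_ids):
    """Marginal GENTOKEN probability over the vocabulary: the generation softmax
    weighted by p(gen), plus the pointer-network copy distribution weighted by
    p(copy) and scattered onto the vocabulary ids of the input words."""
    p_mode = F.softmax(mode_logits, dim=-1)   # (2,): [p(gen), p(copy)]
    p_gen = F.softmax(vocab_logits, dim=-1)   # (V,)
    p_copy = F.softmax(copy_logits, dim=-1)   # (n,) over input positions
    p = p_mode[0] * p_gen
    p.index_add_(0, src_word_ids, p_mode[1] * p_copy)
    return p

p = gentoken_distribution(torch.randn(2), torch.randn(100), torch.randn(5),
                          torch.tensor([3, 7, 7, 12, 40]))
print(float(p.sum()))  # ~1.0
```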

4.3 Training and Inference

Given a dataset of pairs of NL descriptions x_i and code snippets c_i, we parse c_i into its AST y_i and decompose y_i into a sequence of oracle actions, which explains the generation story of y_i under the grammar model.

5 SEQ2TREE generates tree-structured outputs by conditioning on the hidden states of parent non-terminals, while our parent feeding uses the states of parent actions.

6 We do not show bias terms for all softmax equations.

Dataset | HS | DJANGO | IFTTT
Train | 533 | 16,000 | 77,495
Development | 66 | 1,000 | 5,171
Test | 66 | 1,805 | 758
Avg. tokens in description | 39.1 | 14.3 | 7.4
Avg. characters in code | 360.3 | 41.1 | 62.2
Avg. size of AST (# nodes) | 136.6 | 17.2 | 7.0
Statistics of grammar, w/o unary closure:
# productions | 100 | 222 | 1009
# node types | 61 | 96 | 828
Terminal vocabulary size | 1361 | 6733 | 0
Avg. # actions per example | 173.4 | 20.3 | 5.0
Statistics of grammar, w/ unary closure:
# productions | 100 | 237 | –
# node types | 57 | 92 | –
Avg. # actions per example | 141.7 | 16.4 | –

Table 2: Statistics of datasets and associated grammars

The model is then optimized by maximizing the log-likelihood of the oracle action sequence. At inference time, given an NL description, we use beam search to approximate the best AST ŷ in Eq. (1). See supplementary materials for the pseudo-code of the inference algorithm.
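As a sketch of the training objective (ours; the oracle action extraction and the beam search themselves are in the paper's supplementary materials):

```python
import torch

def action_sequence_nll(step_log_probs):
    """Negative log-likelihood of one oracle action sequence,
    i.e. -sum_t log p(a_t | x, a_<t), which is what training minimizes."""
    return -torch.stack(step_log_probs).sum()

# e.g. a derivation whose three oracle actions received probabilities 0.9, 0.7, 0.8
loss = action_sequence_nll([torch.log(torch.tensor(p)) for p in (0.9, 0.7, 0.8)])
print(round(loss.item(), 3))  # 0.685
```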

5 Experimental Evaluation

5.1 Datasets and Metrics

HEARTHSTONE (HS) dataset (Ling et al., 2016) is a collection of Python classes that implement cards for the card game HearthStone. Each card comes with a set of fields (e.g., name, cost, and description), which we concatenate to create the input sequence. This dataset is relatively difficult: input descriptions are short, while the target code is in complex class structures, with each AST having 137 nodes on average.

DJANGO dataset (Oda et al., 2015) is a collection of lines of code from the Django web framework, each with a manually annotated NL description. Compared with the HS dataset, where card implementations are somewhat homogenous, examples in DJANGO are more diverse, spanning a wide variety of real-world use cases like string manipulation, I/O operations, and exception handling.

IFTTT dataset (Quirk et al., 2015) is a domain-specific benchmark that provides an interesting side comparison. Different from HS and DJANGO, which are in a general-purpose PL, programs in IFTTT are written in a domain-specific language used by the IFTTT task automation App.


Users of the App write simple instructions (e.g., If Instagram.AnyNewPhotoByYou Then Dropbox.AddFileFromURL) with NL descriptions (e.g., “Autosave your Instagram photos to Dropbox”). Each statement inside the If or Then clause consists of a channel (e.g., Dropbox) and a function (e.g., AddFileFromURL).7 This simple structure results in much more concise ASTs (7 nodes on average). Because all examples are created by ordinary App users, the dataset is highly noisy, with the input NL very loosely connected to target ASTs. The authors thus provide a high-quality filtered test set, where each example is verified by at least three annotators. We use this set for evaluation. Also note that IFTTT's grammar has more productions (Tab. 2), but this does not imply that its grammar is more complex. This is because for HS and DJANGO terminal tokens are generated by GENTOKEN actions, but for IFTTT, all the code is generated directly by APPLYRULE actions.

Metrics  As is standard in semantic parsing, we measure accuracy, the fraction of correctly generated examples. However, because generating an exact match for complex code structures is non-trivial, we follow Ling et al. (2016) and use token-level BLEU-4 as a secondary metric, defined as the averaged BLEU score over all examples.8
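A sketch of how the two metrics can be computed (ours; the smoothing choice is an assumption, and the paper simply follows the BLEU setup of Ling et al. (2016)):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate(references, predictions):
    """Exact-match accuracy plus token-level BLEU-4 averaged over examples."""
    smooth = SmoothingFunction().method3
    acc = sum(r == p for r, p in zip(references, predictions)) / len(references)
    bleu = sum(sentence_bleu([r], p, smoothing_function=smooth)
               for r, p in zip(references, predictions)) / len(references)
    return acc, bleu

refs = [["sorted", "(", "my_list", ",", "reverse", "=", "True", ")"]]
preds = [["sorted", "(", "my_list", ")"]]
print(evaluate(refs, preds))
```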

5.2 Setup

Preprocessing  All input descriptions are tokenized using NLTK. We perform simple canonicalization for DJANGO, such as replacing quoted strings in the inputs with placeholders. See supplementary materials for details. We extract unary closures whose frequency is larger than a threshold k (k = 30 for HS and 50 for DJANGO).

Configuration  The size of all embeddings is 128, except for node type embeddings, which is 64. The dimensions of RNN states and hidden layers are 256 and 50, respectively. Since our datasets are relatively small for a data-hungry neural model, we impose strong regularization using recurrent dropouts (Gal and Ghahramani, 2016) for all recurrent networks, together with standard dropout layers added to the inputs and outputs of the decoder RNN. We validate the dropout probability from {0, 0.2, 0.3, 0.4}. For decoding, we use a beam size of 15.

7 Like Beltagy and Quirk (2016), we strip function parameters since they are mostly specific to users.

8 These two metrics are not ideal: accuracy only measures exact match and thus lacks the ability to give credit to semantically correct code that is different from the reference, while it is not clear whether BLEU provides an appropriate proxy for measuring semantics in the code generation task. A more intriguing metric would be directly measuring semantic/functional code equivalence, for which we present a pilot study at the end of this section (cf. Error Analysis). We leave exploring more sophisticated metrics (e.g., based on static code analysis) as future work.

System | HS ACC | HS BLEU | DJANGO ACC | DJANGO BLEU
Retrieval System† | 0.0 | 62.5 | 14.7 | 18.6
Phrasal Statistical MT† | 0.0 | 34.1 | 31.5 | 47.6
Hierarchical Statistical MT† | 0.0 | 43.2 | 9.5 | 35.9
NMT | 1.5 | 60.4 | 45.1 | 63.4
SEQ2TREE | 1.5 | 53.4 | 28.9 | 44.6
SEQ2TREE–UNK | 13.6 | 62.8 | 39.4 | 58.2
LPN† | 4.5 | 65.6 | 62.3 | 77.6
Our system | 16.2 | 75.8 | 71.6 | 84.5
Ablation Study
– frontier embed. | 16.7 | 75.8 | 70.7 | 83.8
– parent feed. | 10.6 | 75.7 | 71.5 | 84.3
– copy terminals | 3.0 | 65.7 | 32.3 | 61.7
+ unary closure | – | – | 70.3 | 83.3
– unary closure | 10.1 | 74.8 | – | –

Table 3: Results on two Python code generation tasks. † Results previously reported in Ling et al. (2016).


5.3 Results

Evaluation results for the Python code generation tasks are listed in Tab. 3. Numbers for our systems are averaged over three runs. We compare primarily with two approaches: (1) the Latent Predictor Network (LPN), a state-of-the-art sequence-to-sequence code generation model (Ling et al., 2016), and (2) SEQ2TREE, a neural semantic parsing model (Dong and Lapata, 2016). SEQ2TREE generates trees one node at a time, and the target grammar is not explicitly modeled a priori, but implicitly learned from data. We test both the original SEQ2TREE model released by the authors and our revised one (SEQ2TREE–UNK) that uses unknown word replacement to handle rare words (Luong et al., 2015). For completeness, we also compare with a strong neural machine translation (NMT) system (Neubig, 2015) using a standard encoder-decoder architecture with attention and unknown word replacement,9 and include numbers from other baselines used in Ling et al. (2016). On the HS dataset, which has relatively large ASTs, we use unary closure for our model and SEQ2TREE; for DJANGO we do not.

9 For NMT, we also attempted to find the best-scoring syntactically correct predictions in the size-5 beam, but this did not yield a significant improvement over the NMT results in Tab. 3.


Figure 3: Performance w.r.t reference AST size on DJANGO

Figure 4: Performance w.r.t reference AST size on HS

System Comparison  As shown in Tab. 3, our model registers 11.7% and 9.3% absolute improvements over LPN in accuracy on HS and DJANGO. This boost in performance strongly indicates the importance of modeling grammar in code generation. Among the baselines, we find that LPN outperforms NMT and SEQ2TREE in most cases. We also note that SEQ2TREE achieves a decent accuracy of 13.6% on HS, which is due to the effect of unknown word replacement, since we only achieved 1.5% without it. A closer comparison with SEQ2TREE is insightful for understanding the advantage of our syntax-driven approach, since both SEQ2TREE and our system output ASTs: (1) SEQ2TREE predicts one node at each time step, and requires additional “dummy” nodes to mark the boundary of a subtree. The sheer number of nodes in target ASTs makes the prediction process error-prone. In contrast, the APPLYRULE actions of our grammar model allow for generating multiple nodes at a single time step. Empirically, we found that on HS, SEQ2TREE takes more than 300 time steps on average to generate a target AST, while our model takes only 170 steps. (2) SEQ2TREE does not directly use productions in the grammar, which possibly leads to grammatically incorrect ASTs and thus empty code outputs. We observe that the ratio of grammatically incorrect ASTs predicted by SEQ2TREE on HS and DJANGO is 21.2% and 10.9%, respectively, while our system guarantees grammaticality.

Ablation Study  We also ablated our best-performing models to analyze the contribution of each component. “– frontier embed.” removes the frontier node embedding n_{f_t} from the decoder RNN inputs (Eq. (3)). This yields worse results on DJANGO while giving slight improvements in accuracy on HS.

System | CHANNEL | FULL TREE
Classical Methods
posclass (Quirk et al., 2015) | 81.4 | 71.0
LR (Beltagy and Quirk, 2016) | 88.8 | 82.5
Neural Network Methods
NMT | 87.7 | 77.7
NN (Beltagy and Quirk, 2016) | 88.0 | 74.3
SEQ2TREE (Dong and Lapata, 2016) | 89.7 | 78.4
Doubly-Recurrent NN (Alvarez-Melis and Jaakkola, 2017) | 90.1 | 78.2
Our system | 90.0 | 82.0
– parent feed. | 89.9 | 81.1
– frontier embed. | 90.1 | 78.7

Table 4: Results on the noise-filtered IFTTT test set of “>3 agree with gold annotations” (averaged over three runs); our model performs competitively among neural models.

This is probably because the grammar of HS has fewer node types, and thus the RNN is able to keep track of n_{f_t} without depending on its embedding. Next, “– parent feed.” removes the parent feeding mechanism. The accuracy drops significantly on HS, with a marginal deterioration on DJANGO. This result is interesting because it suggests that parent feeding is more important when the ASTs are larger, which will be the case when handling more complicated code generation tasks like HS. Finally, removing the pointer network (“– copy terminals”) in GENTOKEN actions gives poor results, indicating that it is important to directly copy variable names and values from the input.

The results with and without unary closure demonstrate that, interestingly, it is effective on HS but not on DJANGO. We conjecture that this is because on HS it significantly reduces the number of actions from 173 to 142 (cf. Tab. 2), with the number of productions in the grammar remaining unchanged. In contrast, DJANGO has a broader domain, and thus unary closure results in more productions in the grammar (237 for DJANGO vs. 100 for HS), increasing sparsity.

Performance by the size of AST  We further investigate our model's performance w.r.t. the size of the gold-standard ASTs in Figs. 3 and 4. Not surprisingly, the performance drops when the size of the reference ASTs increases. Additionally, on the HS dataset, the BLEU score still remains at around 50 even when the size of ASTs grows to 200, indicating that our proposed syntax-driven approach is robust for long code segments.

Domain-Specific Code Generation  Although this is not the focus of our work, evaluation on IFTTT brings us closer to a standard semantic parsing setting, which helps to investigate similarities and differences between generation of more complicated general-purpose code and more limited-domain, simpler code.



input:  <name> Brawl </name> <cost> 5 </cost> <desc> Destroy all minions except one (chosen randomly) </desc> <rarity> Epic </rarity> ...
pred. (block A):
    class Brawl(SpellCard):
        def __init__(self):
            super().__init__('Brawl', 5, CHARACTER_CLASS.WARRIOR, CARD_RARITY.EPIC)
        def use(self, player, game):
            super().use(player, game)
            targets = copy.copy(game.other_player.minions)
            targets.extend(player.minions)
            for minion in targets:
                minion.die(self)
ref. (block B):
            minions = copy.copy(player.minions)
            minions.extend(game.other_player.minions)
            if len(minions) > 1:
                survivor = game.random_choice(minions)
                for minion in minions:
                    if minion is not survivor: minion.die(self)

input:  join app_config.path and string 'locale' into a file path, substitute it for localedir.
pred.:  localedir = os.path.join(app_config.path, 'locale')   ✓

input:  self.plural is an lambda function with an argument n, which returns result of boolean expression n not equal to integer 1
pred.:  self.plural = lambda n: len(n)   ✗
ref.:   self.plural = lambda n: int(n != 1)

Table 5: Predicted examples from HS (1st) and DJANGO. Copied contents (copy probability > 0.9) are highlighted.

Tab. 4 shows the results, following the evaluation protocol of Beltagy and Quirk (2016) for accuracies at both the channel and full parse tree (channel + function) levels. Our full model performs on par with existing neural network-based methods, while outperforming other neural models in full tree accuracy (82.0%). This score is close to the best classical method (LR), which is based on a logistic regression model with rich hand-engineered features (e.g., brown clusters and paraphrase). Also note that the performance gap between NMT and the other neural models is much smaller compared with the results in Tab. 3. This suggests that general-purpose code generation is more challenging than the simpler IFTTT setting, and therefore modeling structural information is more helpful.

Case Studies  We present output examples in Tab. 5. On HS, we observe that most of the time our model gives correct predictions by filling learned code templates from training data with arguments (e.g., cost) copied from the input. This is in line with the findings in Ling et al. (2016). However, we do find interesting examples indicating that the model learns to generalize beyond trivial copying. For instance, the first example is one that our model predicted wrong: it generated code block A instead of the gold B (it also missed a function definition not shown here). However, we find that block A actually conveys part of the input intent by destroying all, not some, of the minions. Since we are unable to find code block A in the training data, it is clear that the model has learned to generalize to some extent from multiple training card examples with similar semantics or structure.

The next two examples are from DJANGO. The first one shows that the model learns the usage of common API calls (e.g., os.path.join), and how to populate the arguments by copying from inputs. The second example illustrates the difficulty of generating code with complex nested structures like lambda functions, a scenario worth further investigation in future studies. More examples are attached in the supplementary materials.

Error Analysis  To understand the sources of errors and how good our evaluation metric (exact match) is, we randomly sampled and labeled 100 and 50 failed examples (with accuracy = 0) from DJANGO and HS, respectively. We found that around 2% of these examples in the two datasets are actually semantically equivalent. These examples include: (1) using different parameter names when defining a function; (2) omitting (or adding) default values of parameters in function calls. While the rarity of such examples suggests that our exact match metric is reasonable, more advanced evaluation metrics based on statistical code analysis are definitely intriguing future work.

For DJANGO, we found that 30% of failed cases were due to errors where the pointer network failed to appropriately copy a variable name into the correct position. 25% were because the generated code only partially implemented the required functionality. 10% and 5% of errors were due to malformed English inputs and pre-processing errors, respectively. The remaining 30% of examples were errors stemming from multiple sources, or errors that could not be easily categorized into the above. For HS, we found that all failed card examples were due to partial implementation errors, such as the one shown in Tab. 5.

6 Related Work

Code Generation and Analysis  Most works on code generation focus on generating code for domain-specific languages (DSLs) (Kushman and Barzilay, 2013; Raza et al., 2015; Manshadi et al., 2013), with neural network-based approaches recently explored (Liu et al., 2016; Parisotto et al., 2016; Balog et al., 2016).


For general-purpose code generation, besides the general framework of Ling et al. (2016), existing methods often use language- and task-specific rules and strategies (Lei et al., 2013; Raghothaman et al., 2016). A similar line is to use NL queries for code retrieval (Wei et al., 2015; Allamanis et al., 2015). The reverse task of generating NL summaries from source code has also been explored (Oda et al., 2015; Iyer et al., 2016). Finally, our work falls into the broad field of probabilistic modeling of source code (Maddison and Tarlow, 2014; Nguyen et al., 2013). Our approach of factoring an AST using probabilistic models is closely related to Allamanis et al. (2015), which uses a factorized model to measure the semantic relatedness between NL and ASTs for code retrieval, while our model tackles the more challenging generation task.

Semantic Parsing  Our work is related to the general topic of semantic parsing, which aims to transform NL descriptions into executable logical forms. The target logical forms can be viewed as DSLs. The parsing process is often guided by grammatical formalisms like combinatory categorial grammars (Kwiatkowski et al., 2013; Artzi et al., 2015), dependency-based syntax (Liang et al., 2011; Pasupat and Liang, 2015), or task-specific formalisms (Clarke et al., 2010; Yih et al., 2015; Krishnamurthy et al., 2016; Mei et al., 2016). Recently, there have been efforts in designing neural network-based semantic parsers (Misra and Artzi, 2016; Dong and Lapata, 2016; Neelakantan et al., 2016; Yin et al., 2016). Several approaches have been proposed to utilize grammar knowledge in a neural parser, such as augmenting the training data by generating examples guided by the grammar (Kocisky et al., 2016; Jia and Liang, 2016). Liang et al. (2016) used a neural decoder which constrains the space of next valid tokens in the query language for question answering. Finally, the structured prediction approach proposed by Xiao et al. (2016) is closely related to our model in using the underlying grammar as prior knowledge to constrain the generation process of derivation trees, while our method is based on a unified grammar model which jointly captures production rule application and terminal symbol generation, and scales to general-purpose code generation tasks.

7 Conclusion

This paper proposes a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model. Experiments on both code generation and semantic parsing tasks demonstrate the effectiveness of our proposed approach.

Acknowledgment

We are grateful to Wang Ling for his generous help with LPN and setting up the benchmark. We thank I. Beltagy for providing the IFTTT dataset. We also thank Li Dong for helping with SEQ2TREE and for insightful discussions.

References

Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In Proceedings of ICML, volume 37.
David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly recurrent neural networks. In Proceedings of ICLR.
Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of EMNLP.
Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the ACL 1(1).
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. DeepCoder: Learning to write programs. CoRR abs/1611.01989.
Robert Balzer. 1985. A 15 year perspective on automatic programming. IEEE Trans. Software Eng. 11(11).
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of LAW-ID@ACL.
I. Beltagy and Chris Quirk. 2016. Improved semantic parsers for if-then statements. In Proceedings of ACL.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP.
Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: integrating web search into the development environment. In Proceedings of CHI.
Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In Proceedings of CHI.
Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics 33(4).
James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of CoNLL.
Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of ACL.
Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of ACL.
Tihomir Gvero and Viktor Kuncak. 2015. Interactive synthesis using free-form queries. In Proceedings of ICSE.
Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of ACL.
Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of ACL.
Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of EMNLP.
Jayant Krishnamurthy, Oyvind Tafjord, and Aniruddha Kembhavi. 2016. Semantic parsing to probabilistic programs for situated question answering. In Proceedings of EMNLP.
Nate Kushman and Regina Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In Proceedings of NAACL.
Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of EMNLP.
Tao Lei, Fan Long, Regina Barzilay, and Martin C. Rinard. 2013. From natural language specifications to program input parsers. In Proceedings of ACL.
Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. CoRR abs/1611.00020.
Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of ACL.
Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of ACL.
Greg Little and Robert C. Miller. 2009. Keyword programming in Java. Autom. Softw. Eng. 16(1).
Chang Liu, Xinyun Chen, Eui Chul Richard Shin, Mingcheng Chen, and Dawn Xiaodong Song. 2016. Latent attention for if-then program synthesis. In Proceedings of NIPS.
Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
Chris J. Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In Proceedings of ICML.
Mehdi Hafezi Manshadi, Daniel Gildea, and James F. Allen. 2013. Integrating programming by example and natural language programming. In Proceedings of AAAI.
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of AAAI.
Dipendra K. Misra and Yoav Artzi. 2016. Neural shift-reduce CCG semantic parsing. In Proceedings of EMNLP.
Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of ACL.
Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In Proceedings of ICLR.
Graham Neubig. 2015. lamtram: A toolkit for language and translation modeling using neural networks. http://www.github.com/neubig/lamtram.
Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2013. A statistical semantic language model for source code. In Proceedings of ACM SIGSOFT.
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (T). In Proceedings of ASE.
Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. 2016. Neuro-symbolic program synthesis. CoRR abs/1611.01855.
Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of ACL.
Python Software Foundation. 2016. Python abstract grammar. https://docs.python.org/2/library/ast.html.
Chris Quirk, Raymond J. Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In Proceedings of ACL.
Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: synthesizing what I mean: code search and idiomatic snippet synthesis. In Proceedings of ICSE.
Mohammad Raza, Sumit Gulwani, and Natasa Milic-Frayling. 2015. Compositional program synthesis from natural language and examples. In Proceedings of IJCAI.
Lappoon R. Tang and Raymond J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of ECML.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of NIPS.
Yi Wei, Nirupama Chandrasekaran, Sumit Gulwani, and Youssef Hamadi. 2015. Building Bing Developer Assistant. Technical report. https://www.microsoft.com/en-us/research/publication/building-bing-developer-assistant/.
Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of ACL.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of ACL.
Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2016. Neural enquirer: Learning to query tables in natural language. In Proceedings of IJCAI.
Luke Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of UAI.
