

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3774–3785, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Learning Semantic Parsers from Denotations with Latent Structured Alignments and Abstract Programs

Bailin Wang, Ivan Titov and Mirella Lapata
Institute for Language, Cognition and Computation

School of Informatics, University of Edinburgh
[email protected], {ititov,mlap}@inf.ed.ac.uk

Abstract

Semantic parsing aims to map natural language utterances onto machine interpretable meaning representations, aka programs whose execution against a real-world environment produces a denotation. Weakly-supervised semantic parsers are trained on utterance-denotation pairs treating programs as latent. The task is challenging due to the large search space and spuriousness of programs which may execute to the correct answer but do not generalize to unseen examples. Our goal is to instill an inductive bias in the parser to help it distinguish between spurious and correct programs. We capitalize on the intuition that correct programs would likely respect certain structural constraints were they to be aligned to the question (e.g., program fragments are unlikely to align to overlapping text spans) and propose to model alignments as structured latent variables. In order to make the latent-alignment framework tractable, we decompose the parsing task into (1) predicting a partial "abstract program" and (2) refining it while modeling structured alignments with differentiable dynamic programming. We obtain state-of-the-art performance on the WIKITABLEQUESTIONS and WIKISQL datasets. When compared to a standard attention baseline, we observe that the proposed structured-alignment mechanism is highly beneficial.

1 Introduction

Semantic parsing is the task of translating natural language to machine interpretable meaning representations. Typically, it requires mapping a natural language utterance onto a program, which is executed against a knowledge base to obtain an answer or a denotation. Most previous work (Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Lu et al., 2008; Jia and Liang, 2016) has focused on the supervised setting where a model is learned from question-program pairs.

[Figure 1 graphic: a Question is mapped to an Abstract Program, which is instantiated into a Correct Program and executed against a table to produce the Denotation 0; Spurious Programs and an Inconsistent Program are also shown. The program and question text is not recoverable from this extraction.]

Figure 1: After generating an abstract program for a question, our parser finds alignments between slots (with prefix #) and question spans. Based on the alignment, it instantiates each slot and a complete program is executed against a table to obtain a denotation.

Weakly-supervised semantic parsing (Berant et al., 2013; Liang et al., 2011) reduces the burden of annotating programs by learning from questions paired with their answers (or denotations).

Two major challenges arise when learning from denotations: 1) training of the semantic parser requires exploring a large search space of possible programs to find those which are consistent, i.e., execute to correct denotations; 2) the parser should be robust to spurious programs which accidentally execute to correct denotations, but do not reflect the semantics of the question. In this paper, we propose a weakly-supervised neural semantic parser that features structured latent alignments to bias learning towards correct programs which are consistent but not spurious.

Our intuition is that correct programs should respect certain constraints were they to be aligned to the question text, while spurious and inconsistent programs do not. For instance, in Figure 1, the answer to the question ("0") can be obtained by executing the correct program which selects the number of Turkey's silver medals. However, the same answer can be also obtained by the spurious programs shown in the figure.1 The spurious programs differ from the correct one in that they repeatedly use the column "silver". In the question, the word "silver" only refers to the target column containing the answer; in the spurious programs it also mistakenly triggers the appearance of the column "silver" in the row selection condition. This constraint, i.e., that a text span within a question cannot trigger two semantically distinct operations (e.g., selecting target rows and target columns), can provide a useful inductive bias. We propose to capture structural constraints by modeling the alignments between programs and questions explicitly as structured latent variables.

Considering the large search space of possible programs, an alignment model that takes into account the full range of correspondences between program operations and question spans would be very expensive. To make the process tractable, we introduce a two-stage approach that features abstract programs. Specifically, we decompose semantic parsing into two steps: 1) a natural language utterance is first mapped to an abstract program which is a composition of high-level operations; and 2) the abstract program is then instantiated with low-level operations that usually involve relations and entities specific to the knowledge base at hand. This decomposition is motivated by the observation that only a small number of sensible abstract programs can be instantiated into consistent programs. Similar ideas of using abstract meaning representations have been explored with fully-supervised semantic parsers (Dong and Lapata, 2018; Finegan-Dollak et al., 2018) and in other related tasks (Goldman et al., 2018; Herzig and Berant, 2018; Nye et al., 2019).

For a knowledge base in tabular format, we abstract two basic operations of row selection and column selection from programs: these are handled in the second (instantiation) stage. As shown in Figure 1, the question is first mapped to the abstract program "select(#row_slot, #column_slot)" whose two slots are subsequently instantiated with filter conditions (row slot) and a column name (column slot). During the instantiation of abstract programs, each slot should refer to the question to obtain its specific semantics. In Figure 1, the row slot should attend to "nation of Turkey" while the column slot needs to attend to "silver medals". The structural constraint discussed above now corresponds to assuming that each span in a question can be aligned to a unique row or column slot. Under this assumption, the instantiation of spurious programs will be discouraged. The uniqueness constraint would be violated by both spurious programs in Figure 1, since "column:silver" appears in the program twice but can be only aligned to the span "silver medals" once.

1 The first program can be paraphrased as: find the row with the largest number of silver medals and then select the number of silver medals from the previous row.

The first stage (i.e., mapping a question onto an abstract program) is handled with a sequence-to-sequence model. The second stage (i.e., program instantiation) is approached with local classifiers: one per slot in the abstract program. The classifiers are conditionally independent given the abstract program and a latent alignment. Instead of marginalizing out alignments, which would be intractable, we use structured attention (Kim et al., 2017), i.e., we compute the marginal probabilities for individual span-slot alignment edges and use them to weight the input to the classifiers. As we discuss below, the marginals in our constrained model are computed with dynamic programming.

We perform experiments on two open-domain question answering datasets in the setting of learning from denotations. Our model achieves an execution accuracy of 44.5% in WIKITABLEQUESTIONS and 79.3% in WIKISQL, which both surpass previous state-of-the-art methods in the same weakly-supervised setting. In WIKISQL, our parser is better than recent supervised parsers that are trained on question-program pairs. Our contributions can be summarized as follows:

• we introduce an alignment model as a means of differentiating between correct and spurious programs;

• we propose a neural semantic parser that performs tractable alignments by first mapping questions to abstract programs;

• we achieve state-of-the-art performance on two semantic parsing benchmarks.2

2 Our code is available at https://github.com/berlino/weaksp_em19.


Although we use structured alignments to mostly enforce the uniqueness constraint described above, other types of inductive biases can be useful and could be encoded in our two-stage framework. For example, we could replace the uniqueness constraint with modeling the number of slots aligned to a span, or favor sparse alignment distributions. Crucially, the two-stage framework makes it easier to inject prior knowledge about datasets and formalisms while maintaining efficiency.

2 Background

Given a knowledge base t, our task is to map a natural utterance x to a program z, which is then executed against the knowledge base to obtain a denotation [[z]]_t = d. We train our parser only based on d without access to correct programs z*. Our experiments focus on two benchmarks, namely WIKITABLEQUESTIONS (Pasupat and Liang, 2015) and WIKISQL (Zhong et al., 2017), where each question is paired with a Wikipedia table and a denotation. Figure 1 shows a simplified example taken from WIKITABLEQUESTIONS.

2.1 Grammars

Executable programs z that can query tables are defined according to a language. Specifically, the search space of programs is constrained by grammar rules so that it can be explored efficiently. We adopt the variable-free language of Liang et al. (2018) and define an abstract grammar and an instantiation grammar which decompose the generation of a program in two stages.3

The first stage involves the generation of an abstract version of a program which, in the second stage, gets instantiated. Abstract programs only consider compositions of high-level functions, such as superlatives and aggregation, while low-level functions and arguments, such as filter conditions and entities, are taken into account in the next step. In our table-based datasets, abstract programs do not include two basic operations of querying tables: row selection and column selection. These operations are handled at the instantiation stage. In Figure 1 the abstract program has two slots for row and column selection, which are filled with the conditions "column:nation = Turkey" and "column:silver" at the instantiation stage.

3 We also extend their grammar to additionally support operations of conjunction and disjunction. More details are provided in the Appendix.

[Figure 2 graphic.]
Derivation Tree: STRING → select(ROW, COLUMN); ROW → first(LIST[ROW]); LIST[ROW] → #rows_slot; COLUMN → #column_slot.
Abstract Program: select(first(#rows_slot), #column_slot)

Figure 2: An abstract program and its derivation tree. Capitalized words indicate types of function arguments and their return value.

The two stages can be easily merged into one step when conducting symbolic combinatorial search. The motivation for the decomposition is to facilitate the learning of our neural semantic parser and the handling of structured alignments.

Abstract Grammar  Our abstract grammar has five basic types: ROW, COLUMN, STRING, NUMBER, and DATE; COLUMN is further sub-typed into STRING COLUMN, NUMBER COLUMN, and DATE COLUMN; other basic types are augmented with LIST to represent a list of elements like LIST[ROW]. Arguments and return values of functions are typed using these basic types.

Function composition can be defined recursively based on a set of production rules, each corresponding to a function type. For instance, the function ROW → first(LIST[ROW]) selects the first row from a list of rows and corresponds to the production rule "ROW → first".

The abstract grammar has two additional types for slots (aka terminal rules) which correspond to row and column selection:

LIST[ROW] → #row_slot
COLUMN → #column_slot

An example of an abstract program and its derivation tree is shown in Figure 2. We linearize the derivation by traversing it in a left-to-right depth-first manner. We represent the tree in Figure 2 as a sequence of production rules: "ROOT → STRING, STRING → select, ROW → first, LIST[ROW] → #row_slot, COLUMN → #column_slot". The first action is always to select the return type for the root node.
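To make this linearization concrete, here is a minimal sketch (in Python, with hypothetical helper names; not the released code) that turns the derivation tree of Figure 2 into the production-rule sequence the decoder is trained to produce.

```python
# A minimal sketch (not the released implementation) of linearizing a
# derivation tree into a left-to-right depth-first sequence of production rules.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A derivation-tree node: a non-terminal type expanded by one rule."""
    lhs: str                      # non-terminal type, e.g. "STRING"
    rhs: str                      # function or slot it expands to, e.g. "select"
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> List[str]:
    """Depth-first, left-to-right traversal emitting "LHS -> RHS" rules."""
    rules = [f"{node.lhs} -> {node.rhs}"]
    for child in node.children:
        rules.extend(linearize(child))
    return rules


# The derivation tree of Figure 2: select(first(#rows_slot), #column_slot).
tree = Node("STRING", "select", [
    Node("ROW", "first", [Node("LIST[ROW]", "#rows_slot")]),
    Node("COLUMN", "#column_slot"),
])

# The decoder target starts with choosing the return type for the root node.
target = ["ROOT -> STRING"] + linearize(tree)
print(target)
# ['ROOT -> STRING', 'STRING -> select', 'ROW -> first',
#  'LIST[ROW] -> #rows_slot', 'COLUMN -> #column_slot']
```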

Given a specific table t, the abstract grammar H_t will depend on its column types. For example, if the table does not have number cells, "max/min" operations will not be executable.

Instantiation Grammar  A column slot is directly instantiated by selecting a column; a row slot is filled with one or multiple conditions (COND) which are joined together with conjunction (AND) or disjunction (OR) operators:

COND → COLUMN OPERATOR VALUE

#row_slot → COND (AND COND)*
#row_slot → COND (OR COND)*

where OPERATOR ∈ [>, <, =, ≥, ≤] and VALUE is a string, a number, or a date. A special condition #row_slot → all rows is defined to signify that a program queries all rows.
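As a rough illustration of what the instantiation grammar licenses, the sketch below enumerates row-slot candidates for a toy table; the value-extraction step is simplified and all names are illustrative assumptions rather than the released implementation.

```python
# A simplified sketch of enumerating row-slot candidates under the
# instantiation grammar: 'all rows', single conditions, and pairs of
# conditions joined by AND / OR. Column types, values, and names below
# are illustrative; the real search also handles dates and normalization.

from itertools import combinations

OPERATORS = [">", "<", "=", ">=", "<="]


def single_conditions(columns, values):
    """COND -> COLUMN OPERATOR VALUE; string columns only use '='."""
    conds = []
    for col, col_type in columns.items():
        for val in values.get(col, []):
            ops = OPERATORS if col_type == "number" else ["="]
            conds.extend((col, op, val) for op in ops)
    return conds


def row_slot_candidates(columns, values):
    """#row_slot -> all rows | COND | COND AND COND | COND OR COND."""
    conds = single_conditions(columns, values)
    candidates = [("all rows",)] + [("cond", c) for c in conds]
    for c1, c2 in combinations(conds, 2):
        candidates.append(("AND", c1, c2))
        candidates.append(("OR", c1, c2))
    return candidates


columns = {"nation": "string", "silver": "number"}
values = {"nation": ["Turkey"], "silver": [0]}
for cand in row_slot_candidates(columns, values)[:8]:
    print(cand)
```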

2.2 Search for Consistent Programs

A problematic aspect of learning from denotations is that, since annotated programs are not available (e.g., for WIKITABLEQUESTIONS), we have no means to directly evaluate a proposed grammar. As an evaluation proxy, we measure the coverage of our grammar in terms of consistent programs. Specifically, we exhaustively search for all consistent programs for each question in the training set. While the space of programs is exponential, we observed that abstract programs which are instantiated into correct programs are not very complex in terms of the number of production rules used to generate them. As a result, we impose restrictions on the number of production rules used to generate abstract programs, and in this way the search process becomes tractable.4
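The following toy sketch illustrates this kind of rule-bounded enumeration over a tiny, made-up fragment of the abstract grammar; the real search uses the full grammar and table-specific constraints.

```python
# A toy sketch of the first search stage: enumerate abstract programs by
# expanding typed production rules, capping the number of rules per program.
# The grammar below is a tiny illustrative fragment, not the full one.

GRAMMAR = {
    "ROOT": [["STRING"]],
    "STRING": [["select", "ROW", "COLUMN"]],
    "ROW": [["first", "LIST[ROW]"], ["argmax", "LIST[ROW]", "COLUMN"]],
    "LIST[ROW]": [["#row_slot"]],
    "COLUMN": [["#column_slot"]],
}
NONTERMINALS = set(GRAMMAR)


def expand(symbols, used, max_rules):
    """Expand the left-most non-terminal; yield fully expanded programs."""
    if used > max_rules:
        return
    for i, sym in enumerate(symbols):
        if sym in NONTERMINALS:
            for rhs in GRAMMAR[sym]:
                yield from expand(symbols[:i] + rhs + symbols[i + 1:],
                                  used + 1, max_rules)
            return
    yield " ".join(symbols)          # no non-terminal left: a complete program


for program in expand(["ROOT"], 0, max_rules=6):
    print(program)
```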

We find that 83.6% of questions in WIKITABLEQUESTIONS are covered by at least one consistent program. However, each question has on average 200 consistent programs and most of them are spurious. Treating them as ground truth poses a great challenge for learning a semantic parser. The coverage for WIKISQL is 96.6% and each question generates 84 consistent programs.

Another important observation is that there is only a limited number of abstract programs that can be instantiated into consistent programs. The number of such abstract programs is 23 for WIKITABLEQUESTIONS and 6 for WIKISQL, suggesting that a few patterns underlie many utterances. This motivates us to design a semantic parser that first maps utterances to abstract programs. For the sake of generality, we do not restrict our parser to abstract programs in the training set. We elaborate on this below.

4 Details are provided in the Appendix.

3 Model

After obtaining consistent programs z for each question via offline search, we next show how to learn a parser that can generalize to unseen questions and tables.

3.1 Training and Inference

Our learning objective J is to maximize the log-likelihood of the marginal probability of all consistent programs, which are generated by mapping an utterance x to an interim abstract program h:

J = log { ∑_{h ∈ H_t} p(h | x, t) ∑_{[[z]] = d} p(z | x, t, h) }.   (1)

During training, our model only needs to focus on abstract programs that have successful instantiations of consistent programs and it does not have to explore the whole space of possible programs.
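Assuming per-program scores are already available, the objective of Equation (1) can be assembled with logsumexp, as in the following sketch (hypothetical tensor names; not the released code):

```python
import torch

def marginal_log_likelihood(log_p_h, log_p_z_given_h):
    """
    A sketch of Equation (1), assuming the scores below are already computed:
      log_p_h:         (H,)  log p(h | x, t) for abstract programs that have
                             at least one consistent instantiation
      log_p_z_given_h: list of length H; entry i is a (K_i,) tensor of
                             log p(z | x, t, h_i) for consistent programs z
    Returns log sum_h p(h | x, t) sum_{z: [[z]] = d} p(z | x, t, h).
    """
    per_h = torch.stack([
        log_p_h[i] + torch.logsumexp(log_pz, dim=0)
        for i, log_pz in enumerate(log_p_z_given_h)
    ])
    return torch.logsumexp(per_h, dim=0)

# Toy usage with made-up scores: two abstract programs, with 2 and 1
# consistent instantiations respectively.
log_p_h = torch.log(torch.tensor([0.7, 0.3]))
log_p_z = [torch.log(torch.tensor([0.5, 0.1])), torch.log(torch.tensor([0.4]))]
loss = -marginal_log_likelihood(log_p_h, log_p_z)   # minimized during training
print(loss.item())
```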

At test time, the parser chooses the program z with the highest probability:

h, z = argmax_{h ∈ H_t, z} p(h | x, t) p(z | x, t, h).   (2)

For efficiency, we only choose the top-k abstract programs to instantiate through beam search. z is then executed to obtain its denotation as the final prediction.

Next, we will explain the basic components of our neural parser. Basically, our model first encodes a question and a table with an input encoder; it then generates abstract programs with a seq2seq model; and finally, these abstract programs are instantiated based on a structured alignment model.

3.2 Input Encoder

Each word in an utterance is mapped to a distributed representation through an embedding layer. Following previous work (Neelakantan et al., 2017; Liang et al., 2018), we also add an indicator feature specifying whether the word appears in the table. This feature is mapped to a learnable vector. Additionally, in WIKITABLEQUESTIONS, we use POS tags from the CoreNLP annotations released with the dataset and map them to vector representations. The final representation for a word is the concatenation of the vectors above. A bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is then used to obtain a contextual representation l_i for the ith word.

A table is represented by a set of columns. Each column is encoded by averaging the embeddings of words under its column name. We also have a column type feature (i.e., number, date, or string) and an indicator feature signaling whether at least one entity in the column appears in the utterance.

3.3 Generating Abstract Programs

Instead of extracting abstract programs as templates, similarly to Xu et al. (2017) and Finegan-Dollak et al. (2018), we generate them with a seq2seq model. Although template-based approaches would be more efficient in practice, a seq2seq model is more general since it could generate unseen abstract programs which fixed templates could not otherwise handle.

Our goal here is to generate a sequence of production rules that lead to abstract programs. During decoding, the hidden state g_j of the j-th timestep is computed based on the previous production rule, which is mapped to an embedding a_{j−1}. We also incorporate an attention mechanism (Luong et al., 2015) to compute a contextual vector b_j. Finally, a score vector s_j is computed by feeding the concatenation of the hidden state and context vector to a multilayer perceptron (MLP):

g_j = LSTM(g_{j−1}, a_{j−1})
b_j = Attention(g_j, l)
s_j = MLP_1([g_j; b_j])
p(a_j | x, t, a_{<j}) = softmax_{a_j}(s_j)   (3)

where the probability of production rule a_j is computed by the softmax function. According to our abstract grammar, only a subset of production rules will be valid at the j-th time step. For instance, in Figure 2, production rule "STRING → select" will only expand to rules whose left-hand side is ROW, which is the type of the first argument of select. In this case, the next production rule is "ROW → first". We thus restrict the normalization of the softmax to only focus on these valid production rules. The probability of generating an abstract program, p(h | x, t), is simply the product of the probabilities of predicting each production rule, ∏_j p(a_j | x, t, a_{<j}).

After an abstract program is generated, we need to instantiate the slots in abstract programs. Our model first encodes the abstract program using a bi-directional LSTM. As a result, the representation of a slot is contextually aware of the entire abstract program (Dong and Lapata, 2018).

3.4 Instantiating Abstract Programs

To instantiate an abstract program, each slot must obtain its specific semantics from the question. We model this process by an alignment model which learns the correspondence between slots and question spans. Formally, we use a binary alignment matrix A with size m × n × n, where m is the number of slots and n is the number of tokens. In Figure 1, the alignment matrix will only have A_{0,6,8} = 1 and A_{1,2,3} = 1, which indicates that the first slot is aligned with "nation of Turkey", and the second slot is aligned with "silver medals". The second and third dimensions of the matrix represent the start and end position of a span.

We model alignments as discrete latent variables and condition the instantiation process on the alignments as follows:

∑_{[[z]] = d} p(z | x, t, h) = ∑_A p(A | x, t, h) ∑_{[[z]] = d} p(z | x, t, h, A).   (4)

We will first discuss the instantiation model p(z | x, t, h, A) and then elaborate on how to avoid marginalization in the next section. Each slot in an abstract program can be instantiated by a set of candidates following the instantiation grammar. For efficiency, we use local classifiers to model the instantiation of each slot independently:

p(z | x, t, h, A) = ∏_{s ∈ S} p(s → c | x, t, h, A),   (5)

where S is the set of slots and c is a candidate following our instantiation grammar. "s → c" represents the instantiation of slot s into candidate c.

Recall that there are two types of slots, one for rows and one for columns. All column names in the table are potential instantiations of column slots. We represent each column slot candidate by the average of the embeddings of words in the column name. Based on our instantiation grammar in Section 2.1, candidates for row slots are represented as follows: 1) each condition is represented with the concatenation of the representations of a column, an operator, and a value. For instance, condition "string column:nation = Turkey" in Figure 1 is represented by vector representations of the column 'nation', the operator '=', and the entity 'Turkey'; 2) multiple conditions are encoded by averaging the representations of all conditions and adding a vector representation of AND/OR to indicate the relation between them.


For each slot, the probability of generating a candidate is computed with softmax normalization on a score function:

p(s → c | x, t, h, A) ∝ exp{MLP([s; c])},   (6)

where s is the representation of the span that slot s is aligned with, and c is the representation of candidate c. The representations s and c are concatenated and fed to an MLP. We use the same MLP architecture but different parameters for column and row slots.
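Equations (5) and (6) amount to an independent softmax classifier per slot; a minimal sketch, with illustrative names and dimensions, is shown below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotInstantiator(nn.Module):
    """Sketch of the local classifier of Equations (5)-(6): p(s -> c) is a
    softmax over MLP([s; c]) scores, one candidate representation per c."""

    def __init__(self, dim=256, hidden=436):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, slot_rep, cand_reps):
        # slot_rep:  (dim,)   span-aligned representation of one slot
        # cand_reps: (C, dim) representations of the C candidates for this slot
        pairs = torch.cat([slot_rep.expand_as(cand_reps), cand_reps], dim=-1)
        return F.log_softmax(self.mlp(pairs).squeeze(-1), dim=-1)

# Toy usage: one column slot with 4 candidate columns.
inst = SlotInstantiator()
log_p = inst(torch.randn(256), torch.randn(4, 256))
print(log_p.exp())   # a distribution over the 4 candidates
```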

3.5 Structured Attention

We first formally define a few structural constraints over alignments and then explain how to incorporate them efficiently into our parser.

The intuition behind our alignment model is that row and column selection operations represent distinct semantics, and should therefore be expressed by distinct natural language expressions. Hence, we propose the following constraints:

Unique Span  In most cases, the semantics of a row selection or a column selection is expressed uniquely with a single contiguous span:

∀k ∈ [1, |S|],  ∑_{i,j} A_{k,i,j} = 1,   (7)

where |S| is the number of slots.

No Overlap  Spans aligned to different slots should not overlap. Formally, at most one span that contains word i can be aligned to a slot:

∀i ∈ [1, n],  ∑_{k,j} A_{k,i,j} ≤ 1.   (8)

As an example, the alignments in Figure 1 follow the above constraints. Intuitively, the one-to-one mapping constraint aims to assign distinct and non-overlapping spans to slots of abstract programs. To further bias the alignments and improve efficiency, we impose additional restrictions: (1) a row slot must be aligned to a span that contains an entity since conditions that instantiate the slot would require entities for filtering; (2) a column slot must be aligned to a span with length 1 since most column names only have one word.

Marginalizing out all A in Equation (4) would be very expensive considering the exponential number of possible alignments. We approximate the marginalization by moving the outside expectation over A directly inside. As a result, we instead optimize the following objective:

J ≈ log { ∑_{h ∈ H_t} p(h | x, t) ∑_{[[z]] = d} p(z | x, t, h, E[A]) },   (9)

where E[A] are the marginals of A with respect to p(A | x, t, h).

The idea of using differentiable surrogates for discrete latent variables has been used in many other works like differentiable data structures (Grefenstette et al., 2015; Graves et al., 2014) and attention-based networks (Bahdanau et al., 2015; Kim et al., 2017). Using the marginals E[A] can be viewed as structured attention between slots and question spans.

The marginal probability of the alignment matrix A can be computed efficiently using dynamic programming (see Täckström et al. 2015 for details). An alignment is encoded into a path in a weighted lattice where each vertex has 2^{|S|} states to keep track of the set of covered slots. The marginal probability of edges in this lattice can be computed by the forward-backward algorithm (Wainwright et al., 2008). The lattice weights, represented by a scoring matrix M ∈ R^{m×n×n} for all possible slot-span pairs, are computed using the following scoring function:

M_{k,i,j} = MLP_2([r(k); span[i : j]]),   (10)

where r(k) represents the kth slot and span[i : j] represents the span from word i to j. Recall that we obtain r(k) by encoding a generated abstract program. A span is represented by averaging the representations of the words therein. These two representations are concatenated and fed to an MLP to obtain a score. Since E[A] is not discrete anymore, the aligned representation of slot s in Equation (6) becomes the weighted average of the representations of all spans in the set.

4 Experiments

We evaluated our model on two semantic parsing benchmarks, WIKITABLEQUESTIONS and WIKISQL. We compare against two common baselines to demonstrate the effectiveness of using abstract programs and alignment. We also conduct detailed analysis which shows that structured attention is highly beneficial, enabling our parser to differentiate between correct and spurious programs. Finally, we break down the errors of our parser so as to examine whether structured attention is better at instantiating abstract programs.

4.1 Experimental Setup

Datasets  WIKITABLEQUESTIONS contains 2,018 tables and 18,496 utterance-denotation pairs. The dataset is challenging as 1) the tables cover a wide range of domains and unseen tables appear at test time; and 2) the questions involve a variety of operations such as superlatives, comparisons, and aggregation (Pasupat and Liang, 2015). WIKISQL has 24,241 tables and 80,654 utterance-denotation pairs. The questions are logically simpler and only involve aggregation, column selection, and conditions. The original dataset is annotated with SQL queries, but we only use the execution result for training. In both datasets, tables are extracted from Wikipedia and cover a wide range of domains.

Entity extraction is important during parsing since entities are used as values in filter conditions during instantiation. String entities are extracted by string matching between utterance spans and table cells. In WIKITABLEQUESTIONS, numbers and dates are extracted from the CoreNLP annotations released with the dataset. WIKISQL does not have entities for dates, and we use string-based normalization to deal with numbers.

Implementation  We obtained word embeddings by a linear projection of GloVe pre-trained embeddings (Pennington et al., 2014) which were fixed during training. Attention scores were computed based on the dot product between two vectors. Each MLP is a one-hidden-layer perceptron with ReLU as the activation function. Dropout (Srivastava et al., 2014) was applied to prevent overfitting. All models were trained with Adam (Kingma and Ba, 2015). Implementations of the abstract and instantiation grammars were based on AllenNLP (Gardner et al., 2017).5

4.2 Baselines

Aside from comparing our model against previously published approaches, we also implemented the following baselines:

Typed Seq2Seq  Programs were generated using a sequence-to-sequence model with attention (Dong and Lapata, 2016). Similarly to Krishnamurthy et al. (2017), we constrained the decoding process so that only well-formed programs are predicted. This baseline can be viewed as merging the two stages of our model into one stage where generation of abstract programs and their instantiations are performed with a shared decoder.

5 Please refer to the Appendix for the full list of hyperparameters used in our experiments.

Supervised by Denotations        Dev.   Test
Pasupat and Liang (2015)         37.0   37.1
Neelakantan et al. (2017)        34.1   34.2
Haug et al. (2018)                —     34.8
Zhang et al. (2017)              40.4   43.7
Liang et al. (2018)              42.3   43.1
Dasigi et al. (2019)             42.1   43.9
Agarwal et al. (2019)            43.2   44.1
Typed Seq2Seq                    37.3   38.3
Abstract Programs
  f.w. standard attention        39.4   41.4
  f.w. structured attention      43.7   44.5

Table 1: Results on WIKITABLEQUESTIONS. f.w. stands for slots filled with.

Standard Attention  The aligned representation of slot s in Equation (6) is computed by a standard attention mechanism: s = Attention(r(s), l), where r(s) is the representation of slot s from abstract programs. Each slot is aligned independently with attention, and there are no global structural constraints on alignments.

4.3 Main Results

For all experiments, we report the mean accuracy of 5 runs. Results on WIKITABLEQUESTIONS are shown in Table 1. The structured-attention model achieves the best performance, compared against the two baselines and previous approaches. The standard attention baseline with abstract programs is superior to the typed Seq2Seq model, demonstrating the effectiveness of decomposing semantic parsing into two stages. Results on WIKISQL are shown in Table 2. The structured-attention model is again superior to our two baseline models. Interestingly, its performance surpasses previously reported weakly-supervised models (Liang et al., 2018; Agarwal et al., 2019) and is on par even with fully supervised ones (Dong and Lapata, 2018).

Supervised by Programs           Dev.   Test
Zhong et al. (2017)              60.8   59.4
Wang et al. (2017)               67.1   66.8
Xu et al. (2017)                 69.8   68.0
Huang et al. (2018)              68.3   68.0
Yu et al. (2018)                 74.5   73.5
Sun et al. (2018)                75.1   74.6
Dong and Lapata (2018)           79.0   78.5
Shi et al. (2018)                84.0   83.7

Supervised by Denotations        Dev.   Test
Liang et al. (2018)              72.2   72.1
Agarwal et al. (2019)            74.9   74.8
Typed Seq2Seq                    74.5   74.7
Abstract Programs
  f.w. standard attention        75.2   75.3
  f.w. structured attention      79.4   79.3

Table 2: Results on WIKISQL. f.w.: slots filled with.

The gap between the standard attention baseline and the typed Seq2Seq model is not very large on WIKISQL, compared to WIKITABLEQUESTIONS. Recall from Section 2.2 that WIKISQL only has 6 abstract programs that can be successfully instantiated. For this reason, our decomposition alone may not be very beneficial if coupled with standard attention. In contrast, our structured-attention model consistently performs much better than both baselines.

We report scores of ensemble systems in Table 3. We use the best model which relies on abstract programs and structured attention as a base model in our ensemble. Our ensemble system achieves better performance than Liang et al. (2018) and Agarwal et al. (2019), while using the same ensemble size.

4.4 Analysis of Spuriousness

To understand how well structured attention can help a parser differentiate between correct and spurious programs, we analyzed the posterior distribution of consistent programs given a denotation: p(z | x, t, d) where [[z]] = d.

WIKISQL includes gold-standard SQL annotations, which we do not use in our experiments but exploit here for analysis. Specifically, we converted the annotations released with WIKISQL to programs licensed by our grammar. We then computed the log-probability of these programs according to the posterior distribution, log ∑_{z*} p(z* | x, t, d), where z* denotes correct programs, as a measure of how well a parser can identify them amongst all consistent programs. The average log-probability assigned to correct programs by structured and standard attention is -0.37 and -0.85, respectively. This gap confirms that structured attention can bias our parser towards correct programs during learning.

Models                   WTQ        WIKISQL
Liang et al. (2018)      46.3 (10)  74.2 (5)
Agarwal et al. (2019)    46.9 (10)  76.9 (5)
Our model                47.3 (10)  81.7 (5)

Table 3: Results of ensembled models on the test set; ensemble sizes are shown within parentheses.

Error Types              standard   structured
Abstraction Error        19.2       20.0
Instantiation Error      41.5       36.2
Coverage Error           39.2       43.8

Table 4: Proportion of errors on the development set in WIKITABLEQUESTIONS.

4.5 Error Analysis

We further manually inspected the output of our structured-attention model and the standard attention baseline in WIKITABLEQUESTIONS. Specifically, we randomly sampled 130 error cases independently from both models and classified them into three categories.

Abstraction Errors  If a parser fails to generate an abstract program, then it is impossible for it to instantiate a consistent complete program.

Instantiation Errors  These errors arise when abstract programs are correctly generated, but are mistakenly instantiated either by incorrect column names or filter conditions.

Coverage Errors  These errors arise from implicit assumptions made by our parser: a) there is a long tail of unsupported operations that are not covered by our abstract programs; b) if entities are not correctly identified and linked, abstract programs cannot be correctly instantiated.

Table 4 shows the proportion of errors attested by the two attention models. We observe that structured attention suffers less from instantiation errors compared against the standard attention baseline, which points to the benefits of the structured alignment model.


5 Related Work

Neural Semantic Parsing  We follow the line of work that applies sequence-to-sequence models (Sutskever et al., 2014) to semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016). Our work also relates to models which enforce type constraints (Yin and Neubig, 2017; Rabinovich et al., 2017; Krishnamurthy et al., 2017) so as to restrict the vast search space of potential programs. We use both methods as baselines to show that the structured bias introduced by our model can help our parser handle spurious programs in the setting of learning from denotations. Note that our alignment model can also be applied in the supervised case in order to help the parser rule out incorrect programs.

Earlier work has used lexicon mappings (Zettlemoyer and Collins, 2007; Wong and Mooney, 2007; Lu et al., 2008; Kwiatkowski et al., 2010) to model correspondences between programs and natural language. However, these methods cannot generalize to unseen tables where new relations and entities appear. To address this issue, Pasupat and Liang (2015) propose a floating parser which allows partial programs to be generated without being anchored to question tokens. In the same spirit, we use a sequence-to-sequence model to generate abstract programs while relying on explicit alignments to instantiate them. Besides semantic parsing, treating alignments as discrete latent variables has proved effective in other tasks like sequence transduction (Yu et al., 2016) and AMR parsing (Lyu and Titov, 2018).

Learning from Denotations  To improve the efficiency of searching for consistent programs, Zhang et al. (2017) use a macro grammar induced from cached consistent programs. Unlike Zhang et al. (2017) who abstract entities and relations from logical forms, we take a step further and abstract the computation of row and column selection. Our work also differs from Pasupat and Liang (2016) who resort to manual annotations to alleviate spuriousness. Instead, we equip our parser with an inductive bias to rule out spurious programs during training. Recently, reinforcement learning based methods address the computational challenge by using a memory buffer (Liang et al., 2018) which stores consistent programs and an auxiliary reward function (Agarwal et al., 2019) which provides feedback to deal with spurious programs. Guu et al. (2017) employ various strategies to encourage even distributions over consistent programs in cases where the parser has been misled by spurious programs. Dasigi et al. (2019) use coverage of lexicon-like rules to guide the search of consistent programs.

6 Conclusions

In this paper, we proposed a neural semantic parser that learns from denotations using abstract programs and latent structured alignments. Our parser achieves state-of-the-art performance on two benchmarks, WIKITABLEQUESTIONS and WIKISQL. Empirical analysis shows that the inductive bias introduced by the alignment model helps our parser differentiate between correct and spurious programs. Alignments can exhibit different properties (e.g., monotonicity or bijectivity), depending on the meaning representation language (e.g., logical forms or SQL), the definition of abstract programs, and the domain at hand. We believe that these properties can often be captured within a probabilistic alignment model and hence provide a useful inductive bias to the parser.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments. We gratefully acknowledge the support of the European Research Council (Titov: ERC StG BroadSem 678254; Lapata: ERC CoG TransModal 681760) and the Dutch National Science Foundation (NWO VIDI 639.022.518).

References

Rishabh Agarwal, Chen Liang, Dale Schuurmans, and Mohammad Norouzi. 2019. Learning to generalize from sparse and underspecified rewards. In Proc. of ICML.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proc. of EMNLP.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proc. of ACL.

Pradeep Dasigi, Matt Gardner, Shikhar Murty, Luke Zettlemoyer, and Eduard Hovy. 2019. Iterative search for weakly supervised semantic parsing. In Proc. of NAACL.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proc. of ACL.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proc. of ACL.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Omer Goldman, Veronica Latcinnik, Ehud Nave, Amir Globerson, and Jonathan Berant. 2018. Weakly supervised semantic parsing with abstract examples. In Proc. of ACL.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Proc. of NeurIPS, pages 1828–1836.

Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Proc. of ACL.

Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova. 2018. Neural multi-step reasoning for question answering on semi-structured tables. In Proc. of ECIR.

Jonathan Herzig and Jonathan Berant. 2018. Decoupling structure and lexicon for zero-shot semantic parsing. In Proc. of EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proc. of NAACL.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proc. of ACL.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proc. of ICLR.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proc. of EMNLP.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proc. of EMNLP.

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, and Ni Lao. 2018. Memory augmented policy optimization for program synthesis and semantic parsing. In Proc. of NeurIPS.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Proc. of ACL.

Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proc. of EMNLP.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP.

Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proc. of ACL.

Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. In Proc. of ICLR.

Maxwell Nye, Luke Hewitt, Joshua Tenenbaum, and Armando Solar-Lezama. 2019. Learning to infer program sketches. In Proc. of ICML.

P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proc. of ACL.

Panupong Pasupat and Percy Liang. 2016. Inferring logical forms from denotations. In Proc. of ACL.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proc. of ACL.

Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training incremental text-to-SQL parsers with non-deterministic oracles. arXiv preprint arXiv:1809.05054.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.

Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou. 2018. Semantic parsing with syntax- and table-aware SQL generation. In Proc. of ACL.


Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. of NeurIPS.

Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. TACL.

Martin J. Wainwright, Michael I. Jordan, et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Chenglong Wang, Marc Brockschmidt, and Rishabh Singh. 2017. Pointing out SQL queries from text. Technical report, MSR.

Yuk Wah Wong and Raymond Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proc. of ACL.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proc. of ACL.

Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. In Proc. of EMNLP.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proc. of NAACL.

Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proc. of EMNLP/CoNLL.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proc. of UAI.

Yuchen Zhang, Panupong Pasupat, and Percy Liang. 2017. Macro grammars and holistic triggering for efficient semantic parsing. In Proc. of EMNLP.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

A Grammars

We created our grammars following Zhang et al. (2017) and Liang et al. (2018). Compared with Liang et al. (2018), we additionally support disjunction (OR) and conjunction (AND). Some functions are pruned based on their effect on coverage, which is the proportion of questions that obtain at least one consistent program. The "same as" function (Liang et al., 2018) is excluded since it introduces too many spurious programs while contributing little to the coverage. For the same reason, conjunction (AND) is not used in WIKITABLEQUESTIONS and disjunction (OR) is not used in WIKISQL.

We also include non-terminals of function types in production rules (Krishnamurthy et al., 2017). For instance, the function "ROW → first(LIST[ROW])" selects the first row from a list of rows and will lead to the production rule "ROW → first". In the paper, we eliminate the function type for simplicity. Practically, we use two production rules to represent the function: ROW → <ROW:LIST[ROW]> and <ROW:LIST[ROW]> → first, where <ROW:LIST[ROW]> is an abstract function type.

B Search for Consistent Programs

We enumerate all possible programs in two stages using the abstract and instantiation grammars. To constrain the space and make the search process tractable, we restrict the maximal number of production rules for generating abstract programs during the first stage. This is based on the observation that abstract programs which can be successfully instantiated into correct programs are usually not very complex. In other words, the consistent programs that are instantiated from long abstract programs are very likely to be spurious. For instance, programs like "select (previous (next (previous (argmax (all rows, column:silver)))), column:bronze)" are unlikely to have a corresponding question. Specifically, we set the maximal number of production rules for generating abstract programs to 6 and 9, which leads to search times of around 7 and 10 hours for WIKISQL and WIKITABLEQUESTIONS respectively, using a single CPU. Note that this needs to be done only once.

C Hyperparameters

Models used in WIKITABLEQUESTIONS and WIKISQL share similar hyperparameters, which are listed in Table 5. Our input embeddings are obtained by a linear projection from the fixed pre-trained embeddings (Pennington et al., 2014). Word Indicator refers to the indicator feature of whether a word appears in the table; Column Indicator refers to the indicator feature of whether at least one entity in a column appears in the question. All MLPs mentioned in the paper have the same hidden size and dropout rate. During decoding, we choose the top-6 abstract programs to instantiate via beam search.

Hyperparameters                     WTQ    WIKISQL
Input Fixed Embedding Size          300    300
Input Linear Projection Size        256    256
POS Embedding Size                  64     -
Production Rule Embedding Size      436    328
Column Type Embedding Size          16     16
Word Indicator Embedding Size       16     16
Column Indicator Embedding Size     16     16
Operator Embedding Size             128    128
Encoder Hidden Size                 256    256
Encoder Dropout                     0.45   0.35
Decoder Hidden Size                 218    164
AP Encoder Hidden Size              218    164
AP Encoder Dropout                  0.25   0.25
MLP Hidden Size                     436    328
MLP Dropout                         0.25   0.2

Table 5: Hyperparameters for WTQ (WIKITABLEQUESTIONS) and WIKISQL. AP Encoder is the encoder representing the abstract programs we generate.

D Alignments

If a row slot is instantiated with the special condition 'all rows', then it is possible that the semantics of this slot is implicit. For instance, the question "which driver completed the least number of laps?" should first be mapped to the abstract program "select (argmin (#row_slot, #column_slot), #column_slot)" which is then instantiated to the correct program "select (argmin (all rows, column:laps), column:driver)". The row slot in the abstract program is instantiated with 'all rows', but this is not explicitly expressed in the question.

To make this compatible with our constraints in Equations (7) and (8), we add a special token "ALL_ROW" at the end of each question. If a row slot is aligned with this token, then it is expected to be instantiated with the special condition 'all rows'. Specifically, this special token is mapped to a learnable vector during instantiation. Our alignment model needs to learn to align this special token with a row slot if this row slot should be instantiated with the condition 'all rows'.