compositional generalization and natural language

Proceedings of the 59th Annual Meeting of the Association for Computational Linguisticsand the 11th International Joint Conference on Natural Language Processing, pages 922–938

August 1–6, 2021. ©2021 Association for Computational Linguistics

922

Compositional Generalization and Natural Language Variation:Can a Semantic Parsing Approach Handle Both?

Peter Shaw Ming-Wei Chang Panupong Pasupat Kristina Toutanova

Google Research{petershaw,mingweichang,ppasupat,kristout}@google.com

AbstractSequence-to-sequence models excel at han-dling natural language variation, but have beenshown to struggle with out-of-distributioncompositional generalization. This has mo-tivated new specialized architectures withstronger compositional biases, but most ofthese approaches have only been evaluated onsynthetically-generated datasets, which are notrepresentative of natural language variation. Inthis work we ask: can we develop a semanticparsing approach that handles both natural lan-guage variation and compositional generaliza-tion? To better assess this capability, we pro-pose new train and test splits of non-syntheticdatasets. We demonstrate that strong exist-ing approaches do not perform well across abroad set of evaluations. We also proposeNQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model. It outper-forms existing approaches across several com-positional generalization challenges on non-synthetic data, while also being competitivewith the state-of-the-art on standard evalua-tions. While still far from solving this problem,our study highlights the importance of diverseevaluations and the open challenge of handlingboth compositional generalization and naturallanguage variation in semantic parsing.

1 Introduction

Sequence-to-sequence (seq2seq) models have beenwidely used in semantic parsing (Dong and Lap-ata, 2016; Jia and Liang, 2016) and excel at han-dling the natural language variation1 of human-generated queries. However, evaluations on syn-thetic2 tasks such as SCAN (Lake and Baroni,

1We use the term natural language variation in a broadsense to refer to the many different ways humans can expressthe same meaning in natural language, including differencesin word choice and syntactic constructions.

2We make a coarse distinction between synthetic datasets,where natural language utterances are generated by a program,

Specializedarchitectureswith strong

compositional bias

Under-explored

General-purposepre-trained models

(e.g. seq2seq)

PREDOMINANT APPROACHES

SYNTHETIC NON-SYNTHETIC

NATURAL LANGUAGE VARIATION

CO

MP

OS

ITIO

NA

L

GE

NE

RA

LIZ

AT

ION

Figure 1: We study whether a semantic parsing ap-proach can handle both out-of-distribution composi-tional generalization and natural language variation.Existing approaches are commonly evaluated acrossonly one dimension.

2018) have shown that seq2seq models often gener-alize poorly to out-of-distribution compositional ut-terances, such as “jump twice” when only “jump”,“walk”, and “walk twice” are seen during training.This ability to generalize to novel combinations ofthe elements observed during training is referred toas compositional generalization.

This has motivated many specialized architec-tures that improve peformance on SCAN (Li et al.,2019; Russin et al., 2019; Gordon et al., 2019; Lake,2019; Liu et al., 2020; Nye et al., 2020; Chen et al.,2020). However, most approaches have only beenevaluated on synthetic datasets. While syntheticdatasets enable precise, interpretable evaluation ofspecific phenomena, they are less representativeof the natural language variation that a real-worldsemantic parsing system must handle.

In this paper, we ask: can we develop a semanticparsing approach that handles both natural lan-guage variation and compositional generalization?

and non-synthetic datasets, where natural language utterancesare collected from humans.

923

Surprisingly, this question is understudied. As visu-alized in Figure 1, most prior work evaluates eitherout-of-distribution compositional generalization onsynthetic datasets, or in-distribution performanceon non-synthetic datasets. Notably, designing ap-proaches that can handle both compositional gen-eralization and the natural language variation ofnon-synthetic datasets is difficult. For example,large pre-trained seq2seq models that perform wellon in-distribution evaluations do not address mostof the compositional generalization challenges pro-posed in SCAN (Furrer et al., 2020).

Our research question has two important motiva-tions. First, humans have been shown to be adeptcompositional learners (Lake et al., 2019). Severalauthors have argued that a greater focus on com-positional generalization is an important path tomore human-like generalization and NLU (Lakeet al., 2017; Battaglia et al., 2018). Second, itis practically important to assess performance onnon-synthetic data and out-of-distribution exam-ples, as random train and test splits can overesti-mate real-world performance and miss importanterror cases (Ribeiro et al., 2020). Therefore, we areinterested in approaches that do well not only oncontrolled synthetic challenges of compositionalityor in-distribution natural utterances, but across allof the diverse set of evaluations shown in Figure 2.

Our contributions are two-fold. First, on theevaluation front, we show that performance onSCAN is not well-correlated with performanceon non-synthetic tasks. In addition, strong existingapproaches do not perform well across all evalu-ations in Figure 2. We also propose new TargetMaximum Compound Divergence (TMCD) trainand test splits, extending the methodology of Key-sers et al. (2020) to create challenging evaluationsof compositional generalization for non-syntheticdatasets. We show that TMCD splits complementexisting evaluations by focusing on different as-pects of the problem.

Second, on the modeling front, we propose NQG,a simple and general grammar-based approach thatsolves SCAN and also scales to natural utterances,obtaining high precision for non-synthetic data. Inaddition, we introduce and evaluate NQG-T5, ahybrid model that combines NQG with T5 (Raf-fel et al., 2020), leading to improvements acrossseveral compositional generalization evaluationswhile also being competitive on the standard splitsof GEOQUERY (Zelle and Mooney, 1996) and SPI-

TRAIN AND TEST SPLITS

SYNTHETIC NON-SYNTHETIC

NATURAL LANGUAGE VARIATION

CO

MP

OS

ITIO

NA

L

GE

NE

RA

LIZ

AT

ION

MCD(Keysers et al., 2020)

Add Primitive(Lake and Baroni, 2018)

Length

TMCD

Template(Finegan-Dollak et al., 2018)

Length

Random

Figure 2: We evaluate semantic parsing approachesacross a diverse set of evaluations focused on natu-ral language variation, compositional generalization, orboth. We add TMCD splits to complement existingevaluations. Ordering within each cell is arbitrary.

DER (Yu et al., 2018). Our results indicate thatNQG-T5 is a strong baseline for our challenge ofdeveloping approaches that perform well across adiverse set of evaluations focusing on either naturallanguage variation, compositional generalization,or both. Comparing five approaches across eightevaluations on SCAN and GEOQUERY, its aver-age rank is 1, with the rank of the best previousapproach (T5) being 2.9; performance is also com-petitive across several evaluations on SPIDER.

While still far from affirmatively answering ourresearch question, our study highlights the impor-tance of a diverse set of evaluations and the openchallenge of handling both compositional general-ization and natural language variation.3

2 Background and Related Work

In this section, we survey recent work related tocompositional generalization in semantic parsing.

Evaluations To evaluate a model’s ability to gen-eralize to novel compositions, previous work hasproposed several methods for generating train andtest splits, as well as several synthetic datasets.

A widely used synthetic dataset for assessingcompositional generalization is SCAN (Lake andBaroni, 2018), which consists of natural languagecommands (e.g., “jump twice”) mapping to actionsequences (e.g., “I_JUMP I_JUMP”). One split forSCAN is the length split, where examples are sepa-rated by length such that the test set contains longer

3Our code and data splits are available athttps://github.com/google-research/language/tree/master/language/nqg.

https://github.com/google-research/language/tree/master/language/nqg

https://github.com/google-research/language/tree/master/language/nqg

924

examples than the training set. Another is the prim-itive split, where a given primitive (e.g., “jump”)is seen by itself during training, but the test setconsists of the primitive recombined with otherelements observed during training (e.g., “jumptwice”). Other synthetic datasets have been de-veloped to evaluate aspects of compositional gen-eralization beyond SCAN, including NACS (Bast-ings et al., 2018), CFQ (Keysers et al., 2020), andCOGS (Kim and Linzen, 2020).

In addition to introducing the CFQ dataset, Key-sers et al. (2020) propose Maximum CompoundDivergence (MCD) splits based on the notion of acompound distribution. Their algorithm generatestrain and test splits that maximize the divergenceof their respective compound distributions whilebounding the divergence of their respective atomdistributions. We extend their methodology to cre-ate new TMCD splits for non-synthetic datasets.

Another method for generating train and testsplits is the template4 split (Finegan-Dollak et al.,2018). Unlike the aforementioned evaluations,template splits have been applied to non-syntheticdatasets, primarily for text-to-SQL. In templatesplits, any parse template (defined as the targetSQL query with entities anonymized) appearing inthe training set cannot appear in the test set. Weanalyze and discuss template splits in § 6.1.

Finally, Herzig and Berant (2019) studies biasesresulting from methods for efficiently collectinghuman-labeled data, providing further motivationfor out-of-distribution evaluations.

Approaches Many specialized architectureshave been developed to address the compositionalgeneralization challenges of SCAN. Several ofthem have recently reached 100% accuracy acrossmultiple SCAN challenges (Liu et al., 2020; Nyeet al., 2020; Chen et al., 2020). Similarly to theNQG-T5 approach we propose in § 4, all of thesemodels incorporate discrete structure. However,unlike NQG-T5, they have only been evaluated onsynthetic parsing tasks.

Recently, Herzig and Berant (2020) also beginsto address our research question, proposing an ap-proach that not only solves several SCAN chal-lenges but also achieves strong performance on thestandard and template splits of the non-syntheticdataset GEOQUERY. However, their approach re-quires some manual task-specific engineering. Wecompare NQG-T5 with this approach and other

4Also referred to as a query split.

SCAN-inspired architectures. Oren et al. (2020)and Zheng and Lapata (2020) also explored compo-sitional generalization on non-synthetic datasetsby focusing on the template splits proposed byFinegan-Dollak et al. (2018), demonstrating im-provements over standard seq2seq models.

The effect of large-scale pre-training on compo-sitional generalization ability has also been studied.Furrer et al. (2020) finds that pre-training alone can-not solve several compositional generalization chal-lenges, despite its effectiveness across NLP taskssuch as question answering (Raffel et al., 2020).

While our work focuses on modeling approaches,compositional data augmentation techniques havealso been proposed (Jia and Liang, 2016; Andreas,2020). NQG-T5 outperforms previously reportedresults for these methods, but more in-depth analy-sis is needed.

3 Target Maximum CompoundDivergence (TMCD) Splits

The existing evaluations targeting compositionalgeneralization for non-synthetic tasks are templatesplits and length splits. Here we propose an addi-tional method which expands the set of availableevaluations by generating data splits that maximizecompound divergence over non-synthetic datasets,termed Target Maximum Compound Divergence(TMCD) splits. As we show in § 6, it results in ageneralization problem with different characteris-tics that can be much more challenging than tem-plate splits, and contributes to the comprehensive-ness of evaluation.

In standard MCD splits (Keysers et al., 2020),the notion of compounds requires that both sourceand target are generated by a rule-based proce-dure, and therefore cannot be applied to existingnon-synthetic datasets where natural language ut-terances are collected from humans. For TMCD,we propose a new notion of compounds based onlyon the target representations. We leverage theirknown syntactic structure to define atoms and com-pounds. For instance, example atoms in FunQL arelongest and river, and an example compoundis longest(river). Detailed definitions of atomsand compounds for each dataset we study can befound in Appendix B.3.

Given this definition of compounds, our defini-tion of compound divergence, DC , is the same asthat of Keysers et al. (2020). Specifically,

DC = 1 − C0.1(FTRAIN ‖FTEST),

925

whereFTRAIN andFTEST are the weighted frequencydistributions of compounds in the training andtest sets, respectively. The Chernoff coefficientCα(P‖Q) =

∑k p

αk q

1−αk (Chung et al., 1989) is

used with α = 0.1.For TMCD, we constrain atom divergence by re-

quiring that every atom appear at least once in thetraining set. An atom constraint is desirable so thatthe model knows the possible target atoms to gener-ate. A greedy algorithm similar to the one of Key-sers et al. (2020) is used to generate splits that ap-proximately maximize compound divergence. First,we randomly split the dataset. Then, we swap ex-amples until the atom constraint is satisfied. Finally,we sequentially identify example pairs that can beswapped between the train and test sets to increasecompound divergence without violating the atomconstraint, breaking when a swap can no longer beidentified.

4 Proposed Approach: NQG-T5

We propose NQG-T5, a hybrid semantic parserthat combines a grammar-based approach with aseq2seq model. The two components are motivatedby prior work focusing on compositional general-ization and natural language variation, respectively,and we show in § 5 that their combination sets astrong baseline for our challenge.

The grammar-based component, NQG, consistsof a discriminative Neural parsing model and aflexible Quasi-synchronous Grammar inductionalgorithm which can operate over arbitrary pairsof strings. Like other grammar-based approaches,NQG can fail to produce an output for certain in-puts. As visualized in Figure 3, in cases whereNQG fails to produce an output, we return the out-put from T5 (Raffel et al., 2020), a pre-trainedseq2seq model. This simple combination can workwell because NQG often has higher precision thanT5 for cases where it produces an output, especiallyin out-of-distribution settings.

We train NQG and T5 separately. Training datafor both components consists of pairs of source andtarget strings, referred to as x and y, respectively.

4.1 NQG Component

NQG is inspired by more traditional approachesto semantic parsing based on grammar formalismssuch as CCG (Zettlemoyer and Collins, 2005, 2007;Kwiatkowski et al., 2010, 2013) and SCFG (Wongand Mooney, 2006, 2007; Andreas et al., 2013; Li

StartRunNQG

NQG hasoutput?

RunT5

Return NQG output

Return T5 output

no

yes

Figure 3: Overview of how predictions are generatedby NQG-T5, a simple yet effective combination ofT5 (Raffel et al., 2020) with a high-precision grammar-based approach, NQG.

et al., 2015). NQG combines a QCFG induction al-gorithm with a neural parsing model. Training is atwo-stage process. First, we employ a compression-based grammar induction technique to constructour grammar. Second, based on the induced gram-mar, we build the NQG semantic parsing modelvia a discriminative latent variable model, usinga powerful neural encoder to score grammar ruleapplications anchored in the source string x.

4.1.1 NQG Grammar InductionGrammar Formalism Synchronous context-free grammars (SCFGs) synchronously generatestrings in both a source and target language. Com-pared to related work based on SCFGs for machinetranslation (Chiang, 2007) and semantic parsing,NQG uses a slightly more general grammar formal-ism that allows repetition of a non-terminal with thesame index on the target side. Therefore, we adoptthe terminology of quasi-synchronous context-freegrammars (Smith and Eisner, 2006), or QCFGs,to refer to our induced grammar G.5 Our gram-mar G contains a single non-terminal symbol, NT .We restrict source rules to ones containing at most2 non-terminal symbols, and do not allow unaryproductions as source rules. This enables efficientparsing using an algorithm similar to CKY (Cocke,1969; Kasami, 1965; Younger, 1967) that does notrequire binarization of the grammar.

Induction Procedure To induce G from thetraining data, we propose a QCFG induction algo-rithm that does not rely on task-specific heuristicsor pre-computed word alignments. Notably, ourapproach makes no explicit assumptions about thesource or target languages, beyond those implicitin the QCFG formalism. Table 1 shows examplesof induced rules.

Our grammar induction algorithm is guided bythe principle of Occam’s razor, which leads us to

5See Appendix A.1 for additional background on QCFGs.

926

SCANNT → 〈turn right, I_TURN_RIGHT〉NT → 〈NT[1] after NT[2], NT[2] NT[1]〉NT → 〈NT[1] thrice, NT[1] NT[1] NT[1]〉

GEOQUERY

NT → 〈names of NT[1], NT[1]〉NT → 〈towns, cities〉NT → 〈NT[1] have NT[2] running through them,

intersection ( NT[1] , traverse_1 ( NT[2] ) )〉

SPIDER-SSPNT → 〈reviewer, reviewer〉NT → 〈what is the id of the NT[1] named NT[2] ?,

select rid from NT[1] where name = " NT[2] "〉

Table 1: Examples of induced QCFG rules. The sub-script 1 in NT[1] indicates the correspondence betweensource and target non-terminals.

seek the smallest, simplest grammar that explainsthe data well. We follow the Minimum DescriptionLength (MDL) principle (Rissanen, 1978; Grun-wald, 2004) as a way to formalize this intuition.Specifically, we use standard two-part codes tocompute description length, where we are inter-ested in an encoding of targets y given the inputsx, across a dataset D consisting of these pairs. Atwo-part code encodes the model and the targetsencoded using the model; the two parts measurethe simplicity of the model and the extent to whichit can explain the data, respectively.

For grammar induction, our model is simply ourgrammar, G. The codelength can therefore be ex-pressed as H(G) −

∑x,y∈D log2 PG(y|x) where

H(G) corresponds to the codelength of some en-coding of G. We approximate H(G) by countingterminal (CT ) and non-terminal (CN ) symbols inthe grammar’s rules, R. For PG , we assume auniform distribution over the set of possible deriva-tions.6 As the only mutable aspect of the grammarduring induction is the set of rules R, we abusenotation slightly and write our approximate code-length objective as a function ofR only:

L(R) = lNCN (R) + lTCT (R)−∑(x,y)∈D

log2|ZGx,y||ZGx,∗|

,

where ZGx,y is the set of all derivations in G thatyield the pair of strings x and y, whileZGx,∗ ⊃ ZGx,y

6This can be viewed as a conservative choice, as in practicewe expect our neural parser to learn a better model for P (y|x)than a naive uniform distribution over derivations.

is the set of derivations that yield source string xand any target string. The constants lN and lT canbe interpreted as the average bitlength for encodingnon-terminal and terminal symbols, respectively.In practice, these are treated as hyperparameters.

We use a greedy search algorithm to find agrammar that approximately minimizes this code-length objective. We initialize G by creating a ruleNT → 〈x,y〉 for every training example (x,y).By construction, the initial grammar perfectly fitsthe training data, but is also very large. Our algo-rithm iteratively identifies a rule that can be addedto G that decreases our codelength objective byenabling ≥ 1 rule(s) to be removed, under the in-variant constraint that G can still derive all trainingexamples. The search completes when no rule thatdecreases the objective can be identified. In prac-tice, we use several approximations to efficientlyselect a rule at each iteration. Additional detailsregarding the grammar induction algorithm are de-scribed in Appendix A.2.

4.1.2 NQG Semantic Parsing ModelBased on the induced grammar G, we train a dis-criminative latent variable parsing model, using amethod similar to that of Blunsom et al. (2008).We define p(y | x) as:

p(y | x) =∑

z∈ZGx,y

p(z | x),

where ZGx,y is the set of derivations of x and y inG. We define p(z | x) as:

p(z | x) =exp(s(z,x))∑

z′∈ZGx,∗exp(s(z′,x))

,

where s(z,x) is a derivation score and the denom-inator is a global partition function. Similarly tothe Neural CRF model of Durrett and Klein (2015),the scores decompose over anchored rules. UnlikeDurrett and Klein (2015), we compute these scoresbased on contextualized representations from aBERT (Devlin et al., 2019) encoder. Additional de-tails regarding the model architecture can be foundin Appendix A.3.

At training time, we use a Maximum MarginalLikelihood (MML) objective. We preprocess eachexample to produce parse forest representationsfor both ZGx,y and ZGx,∗, which correspond to thenumerator and denominator of our MML objective,respectively. By using dynamic programming to

927

efficiently sum derivation scores inside the trainingloop, we can efficiently compute the exact MMLobjective without requiring approximations such asbeam search.

At inference time, we select the highest scoringderivation using an algorithm similar to CKY thatconsiders anchored rule scores generated by theneural parsing model. We output the correspondingtarget if it can be derived by a CFG defining validtarget constructions for the given task.

4.1.3 NQG DiscussionWe note that NQG is closely related to work thatuses synchronous grammars for hierarchical statis-tical machine translation, such as Hiero (Chiang,2007). Unlike Hiero, NQG does not rely on anadditional word alignment component. Moreover,Hiero simply uses relative frequency to learn ruleweights. Additionally, in contract with traditionalSCFG models for machine translation applied tosemantic parsing (Wong and Mooney, 2006; An-dreas et al., 2013), our neural model conditions onglobal context from the source x via contextualword embeddings, and our grammar’s rules do notneed to carry source context to aid disambiguation.

4.2 T5 Component

T5 (Raffel et al., 2020) is a pre-trained sequence-to-sequence Transformer model (Vaswani et al., 2017).We fine-tune T5 for each task.

5 Experiments

We evaluate existing approaches and the newlyproposed NQG-T5 across a diverse set of evalu-ations to assess compositional generalization andhandling of natural language variation. We aim tounderstand how the approaches compare to eachother for each type of evaluation and in aggregate,and how the performance of a single approach mayvary across different evaluation types.

5.1 Experiments on SCAN and GEOQUERY

For our main experiments, we focus on evaluationacross multiple splits of two datasets with compo-sitional queries: SCAN (Lake and Baroni, 2018)and GEOQUERY (Zelle and Mooney, 1996; Tangand Mooney, 2001). The two datasets have beenwidely used to study compositional generalizationand robustness to natural language variation, re-spectively. Both datasets are closed-domain andhave outputs with straightforward syntax, enabling

us to make clear comparisons between synthetic vs.non-synthetic setups.

Approaches For NQG-T5, to assess the effect ofmodel size, we compare two sizes of the underlyingT5 model: Base (220 million parameters) and 3B (3billion parameters). To evaluate NQG individually,we treat any example where no output is providedas incorrect when computing accuracy.

We select strong approaches from prior workthat have performed well in at least one setting.We group them into two families of approachesdescribed in Figure 1. First, for general-purposemodels that have shown strong ability to handlenatural language variation, we consider T5, a pre-trained seq2seq model, in both Base and 3B sizes.

Second, for specialized methods with strongcompositional biases, we consider approaches thathave been developed for SCAN. Some previousapproaches for SCAN require task-specific infor-mation such as the mapping of atoms (Lake, 2019;Gordon et al., 2019) or a grammar mimicking thetraining data (Nye et al., 2020), and as such are dif-ficult to adapt to non-synthetic datasets. Among theapproaches that do not need task-specific resources,we evaluate two models with publicly availablecode: Syntactic Attention (Russin et al., 2019) andCGPS (Li et al., 2019). We report results on SCANfrom the original papers as well as new results onour proposed data splits.

Datasets For the SCAN dataset, we evaluate us-ing the length split and two primitive splits, jumpand turn left, included in the original dataset (Lakeand Baroni, 2018). We also evaluate using theSCAN MCD splits from Keysers et al. (2020).

GEOQUERY (Zelle and Mooney, 1996) containsnatural language questions about US geography.Similarly to prior work (Dong and Lapata, 2016,2018), we replace entity mentions with placehold-ers. We use a variant of Functional Query Lan-guage (FunQL) as the target representation (Kateet al., 2005). In addition to the standard split ofZettlemoyer and Collins (2005), we generate multi-ple splits focusing on compositional generalization:a new split based on query length and a TMCDsplit, each consisting of 440 train and 440 test ex-amples. We also generate a new template splitconsisting of 441 train and 439 test examples.7

7We generate a new template split rather than use the GEO-QUERY template split of Finegan-Dollak et al. (2018) to avoidoverlapping templates between the train and test sets whenmapping from SQL to FunQL.

928

SCAN GEOQUERY Avg.RankSystem Jump Turn Left Len. MCD Standard Template Len. TMCD

LANE (Liu et al., 2020) 100 — 100 100 — — — — —NSSM (Chen et al., 2020) 100 — 100 — — — — — —Syntactic Attn. (Russin et al., 2019) 91.0 99.9 15.2 2.9 77.5 70.6 23.6 0.0 3.9CGPS (Li et al., 2019) 98.8 99.7 20.3 2.0 62.1 32.8 9.3 32.3 4.4GECA (Andreas, 2020) 87.0 — — — 78.0† — — — —SBSP (Herzig and Berant, 2020) 100 100 100 100 86.1† — — — —SBSP −lexicon 100 100 100 100 78.9† — — — —

T5-Base (Raffel et al., 2020) 99.5 62.0 14.4 15.4 92.9 87.0 39.1 54.3 2.9T5-3B (Raffel et al., 2020) 99.0 65.1 3.3 11.6 93.2 83.1 36.8 51.6 —

NQG-T5-Base 100 100 100 100 92.9 88.8 52.2 56.6 1.0NQG-T5-3B 100 100 100 100 93.7 85.0 51.4 54.1 —NQG 100 100 100 100 76.8 61.9 37.4 41.1 2.3

Table 2: Main Results. Existing approaches do not excel on a diverse set of evaluations across synthetic andnon-synthetic tasks, but NQG-T5 obtains significant improvements. For comparison, we report the average rankamong 5 approaches across all 8 evaluations. Gray cells are previously reported results. † indicates differences inGEOQUERY settings (see discussion in § 5.1). Boldfaced results are within 1.0 points of the best result.

We report exact-match accuracy for bothdatasets.8 Hyperparameters and pre-processing de-tails can be found in Appendix B.

Results The results are presented in Table 2.The results for T5 on SCAN are from Furreret al. (2020). Additionally, we include resultsfor GECA9 (Andreas, 2020), a data augmentationmethod, as well as LANE (Liu et al., 2020) andNSSM (Chen et al., 2020)10. We also compare withSpanBasedSP11 (Herzig and Berant, 2020).

From the results, we first note that the rela-tive performance of approaches on compositionalsplits of SCAN is not very predictive of their rela-tive performance on compositional splits of GEO-QUERY. For example, GGPS is better than T5on the length split of SCAN but is significantlyworse than T5 on the length split of GEOQUERY.Similarly, the ranking of most methods is different

8For GEOQUERY we report the mean of 3 runs for NQG,with standard deviations reported in Appendix B.5

9GECA reports GEOQUERY results on a setting with Pro-log logical forms and without anonymization of entities. Notethat the performance of GECA depends on both the quality ofthe generated data and the underlying parser (Jia and Liang,2016), which can complicate the analysis.

10These SCAN-motivated approaches both include aspectsof discrete search and curriculum learning, and have not beendemonstrated to scale effectively to non-synthetic parsingtasks. Moreover, the code is either not yet released (NSSM)or specialized to SCAN (LANE).

11SpanBasedSP preprocesses SCAN to add program-levelsupervision. For GEOQUERY, they similarly use FunQL, butuses slightly different data preprocessing and report denotationaccuracy. We computed NQG-T5’s denotation accuracy to be2.1 points higher than exact-match accuracy on the standardsplit of GeoQuery.

on the (T)MCD splits of the two datasets. Sec-ond, the proposed NQG-T5 approach combines thestrengths of T5 and NQG to achieve superior re-sults across all evaluations. It improves over T5 oncompositional generalization for both synthetic andnon-synthetic data while maintaining T5’s perfor-mance on handling in-distribution natural languagevariation, leading to an average rank of 1.0 com-pared to 2.9 for T5. (To the best of our knowledge,both T5 and NQG-T5 achieve new state-of-the-artaccuracy on the standard split of GEOQUERY.)

Finally, we note that there is substantial roomfor improvement on handling both compositionalgeneralization and natural language variation.

5.2 Experiments on SPIDER

We now compare the approaches on SPIDER (Yuet al., 2018), a non-synthetic text-to-SQL datasetthat includes the further challenges of schema link-ing and modeling complex SQL syntax.

SPIDER contains 10,181 questions and 5,693unique SQL queries across 138 domains. Theprimary evaluation is in the cross-database set-ting, where models are evaluated on examples fordatabases not seen during training. The primarychallenge in this setting is generalization to newdatabase schemas, which is not our focus. There-fore, we use a setting where the databases areshared between train and test examples.12 We gen-

12This is similar to the “example split” discussed in Yuet al. (2018). However, we only consider examples in theoriginal training set for databases with more than 50 examplesto ensure sufficient coverage over table and column names in

929

SPIDER-SSP

System Rand. Templ. Len. TMCD

T5-Base −schema 76.5 45.3 42.5 42.3T5-Base 82.0 59.3 49.0 60.9T5-3B 85.6 64.8 56.7 69.6

NQG-T5-Base 81.8 59.2 49.0 60.8NQG-T5-3B 85.4 64.7 56.7 69.5NQG 1.3 0.5 0.0 0.5

Table 3: Results on Spider-SSP. While the text-to-SQLtask is not modeled well by the NQG grammar due toSQL’s complex syntax, NQG-T5 still performs well byrelying on T5.

erate 3 new splits consisting of 3,282 train and1,094 test examples each: a random split, a splitbased on source length, and a TMCD split. We alsogenerate a template split by anonymizing integersand quoted strings, consisting of 3,280 train and1,096 test examples. We adopt the terminologyof Suhr et al. (2020) and use SPIDER-SSP to re-fer to these same-database splits, and use SPIDER-XSP to refer to the standard cross-database setting.

We prepend the name of the target database tothe source sequence. For T5, we also serialize thedatabase schema as a string and append it to thesource sequence similarly to Suhr et al. (2020). Wereport exact set match without values, the standardSpider evaluation metric (Yu et al., 2018).

Results Table 3 shows the results of T5 andNQG-T5 on different splits of SPIDER-SSP.We also show T5-Base performance without theschema string appended. The text-to-SQL map-ping is not well modeled by NQG. Nevertheless,the performance of NQG-T5 is competitive withT5, indicating a strength of the hybrid approach.

Table 4 shows the results on SPIDER-XSP,which focuses on handling unseen schema ratherthan compositional generalization. To our surprise,T5-3B proves to be competitive with the state-of-the-art (Choi et al., 2020) for approaches withoutaccess to database contents beyond the table andcolumn names. As NQG-T5 simply uses T5’s out-put when the induced grammar lacks coverage, ittoo is competitive.

6 Analysis

6.1 Comparison of Data SplitsTable 6 compares the compound divergence, thenumber of test examples with unseen atoms, and

the training data. This includes 51 databases.

SPIDER-XSPSystem Dev

RYANSQL v2 (Choi et al., 2020) 70.6

T5-Base 57.1T5-3B 70.0

NQG-T5-Base 57.1NQG-T5-3B 70.0NQG 0.0

Table 4: Although Spider-XSP is not our focus, T5 andNQG-T5 are competitive with the state-of-the-art.

the accuracy of T5-Base across various splits. ForGEOQUERY, the TMCD split is significantly morechallenging than the template split. However, forSPIDER, the template and TMCD splits are simi-larly challenging. Notably, template splits do nothave an explicit atom constraint. We find that forthe SPIDER template split, T5-Base accuracy is53.9% for the 30.3% of test set examples that con-tain an atom not seen during training, and 61.6%on the remainder, indicating that generalization tounseen atoms can contribute to the difficulty oftemplate splits.13 Length splits are also very chal-lenging, but they lead to a more predictable errorpattern for seq2seq models, as discussed next.

6.2 T5 Analysis

We analyze NQG-T5’s components, starting withT5. On length splits, there is a consistent pat-tern to the errors. T5’s outputs on the test set arenot significantly longer than the maximum lengthobserved during training, leading to poor perfor-mance. This phenomenon was explored by New-man et al. (2020).

Diagnosing the large generalization gap on the(T)MCD splits is more challenging, but we noticedseveral error patterns. For T5-Base on the GEO-QUERY TMCD split, in 52 of the 201 incorrectpredictions (26%), the first incorrectly predictedsymbol occurs when the gold symbol has 0 prob-ability under a trigram language model fit to thetraining data. This suggests that the decoder’s im-plicit target language model might have over-fittedto the distribution of target sequences in the train-ing data, hampering its ability to generate novelcompositions. Non-exclusively with these errors,53% of the incorrect predictions occur when thegold target contains an atom that is seen in only 1

13Future work could explore different choices for construct-ing template and TMCD splits, such as alternative compounddefinitions and atom constraints.

930

SCAN GEOQUERY SPIDER-SSP

Metric Jump Turn L. Len. MCD Stand. Templ. Len. TMCD Rand. Templ. Len. TMCD

NQG Coverage 100 100 100 100 80.2 64.5 43.3 43.7 1.5 0.5 0.0 0.6NQG Precision 100 100 100 100 95.7 95.8 86.4 94.1 87.5 83.3 — 85.7

Table 5: NQG coverage and precision. NQG-T5 outperforms T5 when NQG has higher precision than T5 over thesubset of examples it covers.

Dataset Split %ZA DC T5-Base

GEOQUERY Standard 0.3 0.03 92.9

GEOQUERY Random 1.4 0.03 91.1GEOQUERY Template 0.9 0.07 87.0GEOQUERY Length 4.3 0.17 39.1GEOQUERY TMCD 0 0.19 54.3

SPIDER-SSP Random 6.2 0.03 82.0SPIDER-SSP Template 30.3 0.08 59.2SPIDER-SSP Length 27.4 0.08 49.0SPIDER-SSP TMCD 0 0.18 60.9

Table 6: Percentage of test examples with atoms not in-cluded in the training set (%ZA), compound divergence(DC), and T5-Base accuracy for various dataset splits.

example during training, suggesting that T5 strug-gles with single-shot learning of new atoms. Inother cases, the errors appear to reflect over-fittingto spurious correlations between inputs and outputs.Some error examples are shown in Appendix B.6.

6.3 NQG Analysis

To analyze NQG, we compute its coverage (frac-tion of examples where NQG produces an output)and precision (fraction of examples with a correctoutput among ones where an output is produced)on different data splits. The results in Table 5 showthat NQG has high precision but struggles at cover-age on some data splits.

There is a significant difference in the effective-ness of the grammar induction procedure amongthe three datasets. Induction is particularly unsuc-cessful for SPIDER, as SQL has complicated syn-tax and often requires complex coordination acrossdiscontinuous clauses. Most of the induced rulesare limited to simply replacing table and columnnames or value literals with non-terminals, such asthe rule shown in Table 1, rather than representingnested sub-structures. The degree of span-to-spancorrespondence between natural language and SQLis seemingly lower than for other formalisms suchas FunQL, which limits the effectiveness of gram-mar induction. Intermediate representations forSQL such as SemQL (Guo et al., 2019) may help

increase the correspondence between source andtarget syntax.

For both GEOQUERY and SPIDER, NQG is lim-ited by the expressiveness of QCFGs and the simplegreedy search procedure used for grammar induc-tion, which can lead to sub-optimal approxima-tions of the induction objective. Notably, QCFGscannot directly represent relations between sourcestrings, such as semantic similarity, or relationsbetween target strings, such as logical equivalence(e.g. intersect(a,b)⇔ intersect(b,a)), thatcould enable greater generalization. However, suchextensions pose additional scalability challenges,requiring new research in more flexible approachesfor both learning and inference.

7 Conclusions

Our experiments and analysis demonstrate thatNQG and T5 offer different strengths. NQG gen-erally has higher precision for out-of-distributionexamples, but is limited by the syntactic constraintsof the grammar formalism and by requiring exactlexical overlap with induced rules in order to pro-vide a derivation at inference time. T5’s coverageis not limited by such constraints, but precision canbe significantly lower for out-of-distribution exam-ples. With NQG-T5, we offer a simple combinationof these strengths. While accuracy is still limitedfor out-of-distribution examples where NQG lackscoverage, we believe it sets a strong and simplebaseline for future work.

More broadly, our work highlights that evaluat-ing on a diverse set of benchmarks is important,and that handling both out-of-distribution composi-tional generalization and natural language variationremains an open challenge for semantic parsing.

Acknowledgements

We thank Kenton Lee, William Cohen, JeremyCole, and Luheng He for helpful discussions.Thanks also to Emily Pitler, Jonathan Herzig, andthe anonymous reviewers for their comments andsuggestions.

931

Ethical Considerations

This paper proposed to expand the set of bench-marks used to evaluate compositional generaliza-tion in semantic parsing. While we hope that en-suring semantic parsing approaches perform wellacross a diverse set of evaluations, including onesthat test out-of-distribution compositional gener-alization, would lead to systems that generalizebetter to languages not well represented in smalltraining sets, we have only evaluated our methodson semantic parsing datasets in English.

Our NQG-T5 method uses a pre-trained T5model, which is computationally expensive in fine-tuning and inference, especially for larger mod-els (see Appendix B.1 for details on running timeand compute architecture). Our method does notrequire pre-training of large models, as it usespre-existing model releases. NQG-T5-base out-performs or is comparable in accuracy to T5-3B onthe non-SQL datasets, leading to relative savingsof computational resources.

ReferencesAlfred V Aho and Jeffrey D Ullman. 1972. The the-

ory of parsing, translation, and compiling, volume 1.Prentice-Hall Englewood Cliffs, NJ.

Jacob Andreas. 2020. Good-enough compositionaldata augmentation. In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics, pages 7556–7566, Online. Associationfor Computational Linguistics.

Jacob Andreas, Andreas Vlachos, and Stephen Clark.2013. Semantic parsing as machine translation. InProceedings of the 51st Annual Meeting of the As-sociation for Computational Linguistics (Volume 2:Short Papers), pages 47–52, Sofia, Bulgaria. Associ-ation for Computational Linguistics.

Jasmijn Bastings, Marco Baroni, Jason Weston,Kyunghyun Cho, and Douwe Kiela. 2018. Jumpto better conclusions: Scan both left and right. InProceedings of the 2018 EMNLP Workshop Black-boxNLP: Analyzing and Interpreting Neural Net-works for NLP, pages 47–55.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst,Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Ma-teusz Malinowski, Andrea Tacchetti, David Raposo,Adam Santoro, Ryan Faulkner, et al. 2018. Rela-tional inductive biases, deep learning, and graph net-works. arXiv preprint arXiv:1806.01261.

Phil Blunsom, Trevor Cohn, and Miles Osborne. 2008.A discriminative latent variable model for statisti-cal machine translation. In Proceedings of ACL-08:

HLT, pages 200–208, Columbus, Ohio. Associationfor Computational Linguistics.

Xinyun Chen, Chen Liang, Adams Wei Yu, DawnSong, and Denny Zhou. 2020. Compositional gen-eralization via neural-symbolic stack machines. Ad-vances in Neural Information Processing Systems,33.

David Chiang. 2007. Hierarchical phrase-based trans-lation. computational linguistics, 33(2):201–228.

DongHyun Choi, Myeongcheol Shin, EungGyun Kim,and Dong Ryeol Shin. 2020. RYANSQL: recur-sively applying sketch-based slot fillings for com-plex text-to-sql in cross-domain databases. arXivpreprint arXiv:2004.03125.

JK Chung, PL Kannappan, CT Ng, and PK Sahoo.1989. Measures of distance between probability dis-tributions. Journal of mathematical analysis and ap-plications, 138(1):280–292.

John Cocke. 1969. Programming languages and theircompilers: Preliminary notes. New York Univer-sity.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language to logi-cal form with neural attention. In Proceedings of the54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages33–43.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine de-coding for neural semantic parsing. In Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 731–742, Melbourne, Australia. Associationfor Computational Linguistics.

Greg Durrett and Dan Klein. 2015. Neural crf parsing.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages302–312.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld,Li Zhang, Karthik Ramanathan, Sesh Sadasivam,Rui Zhang, and Dragomir Radev. 2018. Improvingtext-to-SQL evaluation methodology. In Proceed-ings of the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 351–360, Melbourne, Australia. Asso-ciation for Computational Linguistics.

https://doi.org/10.18653/v1/2020.acl-main.676


https://www.aclweb.org/anthology/P13-2009



https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.18653/v1/P18-1068

https://doi.org/10.18653/v1/P18-1068

https://doi.org/10.18653/v1/P18-1033

https://doi.org/10.18653/v1/P18-1033

932

Daniel Furrer, Marc van Zee, Nathan Scales, andNathanael Schärli. 2020. Compositional generaliza-tion in semantic parsing: Pre-training vs. specializedarchitectures. arXiv preprint arXiv:2007.08970.

Jonathan Gordon, David Lopez-Paz, Marco Baroni,and Diane Bouchacourt. 2019. Permutation equiv-ariant models for compositional generalization inlanguage. In International Conference on LearningRepresentations.

Peter Grünwald. 1995. A minimum description lengthapproach to grammar inference. In InternationalJoint Conference on Artificial Intelligence, pages203–216. Springer.

Peter Grunwald. 2004. A tutorial introduction tothe minimum description length principle. arXivpreprint math/0406077.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao,Jian-Guang Lou, Ting Liu, and Dongmei Zhang.2019. Towards complex text-to-sql in cross-domaindatabase with intermediate representation. In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 4524–4535.

Jonathan Herzig and Jonathan Berant. 2019. Don’tparaphrase, detect! rapid and effective data collec-tion for semantic parsing. In Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 3801–3811.

Jonathan Herzig and Jonathan Berant. 2020. Span-based semantic parsing for compositional general-ization. arXiv preprint arXiv:2009.06040.

Theo MV Janssen and Barbara H Partee. 1997. Com-positionality. In Handbook of logic and language,pages 417–473. Elsevier.

Robin Jia and Percy Liang. 2016. Data recombinationfor neural semantic parsing. In Proceedings of the54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages12–22.

T. Kasami. 1965. An efficient recognition and syntaxanalysis algorithm for context-free languages. Tech-nical Report AFCRL-65-758, Air Force CambridgeResearch Laboratory, Bedford, MA.

Rohit J Kate, Yuk Wah Wong, and Raymond J Mooney.2005. Learning to transform natural to formal lan-guages. In Proceedings of the National Conferenceon Artificial Intelligence, volume 20, page 1062.Menlo Park, CA; Cambridge, MA; London; AAAIPress; MIT Press; 1999.

Daniel Keysers, Nathanael Schärli, Nathan Scales,Hylke Buisman, Daniel Furrer, Sergii Kashubin,Nikola Momchev, Danila Sinopalnikov, LukaszStafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang,

Marc van Zee, and Olivier Bousquet. 2020. Measur-ing compositional generalization: A comprehensivemethod on realistic data. In ICLR.

Najoung Kim and Tal Linzen. 2020. COGS: A com-positional generalization challenge based on seman-tic interpretation. In Proceedings of the 2020 Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP), pages 9087–9105, Online. As-sociation for Computational Linguistics.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and LukeZettlemoyer. 2013. Scaling semantic parsers withon-the-fly ontology matching. In Proceedings of the2013 conference on empirical methods in naturallanguage processing, pages 1545–1556.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Gold-water, and Mark Steedman. 2010. Inducing proba-bilistic ccg grammars from logical form with higher-order unification. In Proceedings of the 2010 con-ference on empirical methods in natural languageprocessing, pages 1223–1233. Association for Com-putational Linguistics.

B. M. Lake, T. Linzen, and M. Baroni. 2019. Humanfew-shot learning of compositional instructions. InProceedings of the 41st Annual Conference of theCognitive Science Society.

Brenden Lake and Marco Baroni. 2018. Generalizationwithout systematicity: On the compositional skillsof sequence-to-sequence recurrent networks. In In-ternational Conference on Machine Learning, pages2873–2882.

Brenden M Lake. 2019. Compositional generalizationthrough meta sequence-to-sequence learning. In Ad-vances in Neural Information Processing Systems,pages 9788–9798.

Brenden M Lake, Tomer D Ullman, Joshua B Tenen-baum, and Samuel J Gershman. 2017. Building ma-chines that learn and think like people. Behavioraland brain sciences, 40.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettle-moyer. 2017. End-to-end neural coreference reso-lution. In Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processing,pages 188–197.

Junhui Li, Muhua Zhu, Wei Lu, and Guodong Zhou.2015. Improving semantic parsing with enrichedsynchronous context-free grammar. In Proceed-ings of the 2015 Conference on Empirical Meth-ods in Natural Language Processing, pages 1455–1465, Lisbon, Portugal. Association for Computa-tional Linguistics.

Yuanpeng Li, Liang Zhao, Jianyu Wang, and Joel Hes-tness. 2019. Compositional generalization for prim-itive substitutions. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural Language

https://arxiv.org/pdf/1912.09713.pdf



https://doi.org/10.18653/v1/2020.emnlp-main.731



https://doi.org/10.18653/v1/D15-1170

https://doi.org/10.18653/v1/D15-1170

https://doi.org/10.18653/v1/D19-1438

https://doi.org/10.18653/v1/D19-1438

933

Processing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 4284–4293, Hong Kong, China. As-sociation for Computational Linguistics.

Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen,Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, andDongmei Zhang. 2020. Compositional generaliza-tion by learning analytical expressions. Advances inNeural Information Processing Systems, 33.

Richard Montague. 1970. Universal grammar. Theo-ria, 36(3):373–398.

Benjamin Newman, John Hewitt, Percy Liang, andChristopher D. Manning. 2020. The EOS decisionand length extrapolation. In Proceedings of theThird BlackboxNLP Workshop on Analyzing and In-terpreting Neural Networks for NLP, pages 276–291,Online. Association for Computational Linguistics.

Maxwell I Nye, Armando Solar-Lezama, Joshua BTenenbaum, and Brenden M Lake. 2020. Learn-ing compositional rules via neural program synthe-sis. arXiv preprint arXiv:2003.05562.

Inbar Oren, Jonathan Herzig, Nitish Gupta, Matt Gard-ner, and Jonathan Berant. 2020. Improving com-positional generalization in semantic parsing. InProceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing: Findings,pages 2482–2495.

Hoifung Poon, Colin Cherry, and Kristina Toutanova.2009. Unsupervised morphological segmentationwith log-linear models. In Proceedings of HumanLanguage Technologies: The 2009 Annual Confer-ence of the North American Chapter of the Associa-tion for Computational Linguistics, pages 209–217,Boulder, Colorado. Association for ComputationalLinguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, KatherineLee, Sharan Narang, Michael Matena, Yanqi Zhou,Wei Li, and Peter J Liu. 2020. Exploring the lim-its of transfer learning with a unified text-to-texttransformer. Journal of Machine Learning Research,21:1–67.

Sujith Ravi and Kevin Knight. 2009. Minimized mod-els for unsupervised part-of-speech tagging. In Pro-ceedings of the Joint Conference of the 47th AnnualMeeting of the ACL and the 4th International JointConference on Natural Language Processing of theAFNLP, pages 504–512, Suntec, Singapore. Associ-ation for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin,and Sameer Singh. 2020. Beyond accuracy: Be-havioral testing of NLP models with CheckList. InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 4902–4912, Online. Association for Computational Lin-guistics.

Jorma Rissanen. 1978. Modeling by shortest data de-scription. Automatica, 14(5):465–471.

Jake Russin, Jason Jo, Randall C O’Reilly, and YoshuaBengio. 2019. Compositional generalization in adeep seq2seq model by separating syntax and seman-tics. arXiv preprint arXiv:1904.09708.

Markus Saers, Karteek Addanki, and Dekai Wu. 2013.Unsupervised transduction grammar induction viaminimum description length. In Proceedings of theSecond Workshop on Hybrid Approaches to Transla-tion, pages 67–73, Sofia, Bulgaria. Association forComputational Linguistics.

David A Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projec-tion of syntactic dependencies. In Proceedings onthe Workshop on Statistical Machine Translation,pages 23–30.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. Aminimal span-based neural constituency parser. InProceedings of the 55th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 818–827, Vancouver, Canada.Association for Computational Linguistics.

Alane Suhr, Ming-Wei Chang, Peter Shaw, and Ken-ton Lee. 2020. Exploring unexplored generalizationchallenges for cross-database semantic parsing. InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 8372–8388.

Lappoon R Tang and Raymond J Mooney. 2001. Us-ing multiple clause constructors in inductive logicprogramming for semantic parsing. In EuropeanConference on Machine Learning, pages 466–477.Springer.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and KristinaToutanova. 2019. Well-read students learn better:On the importance of pre-training compact models.arXiv preprint arXiv:1908.08962.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

Yuk Wah Wong and Raymond Mooney. 2007. Learn-ing synchronous grammars for semantic parsingwith lambda calculus. In Proceedings of the 45thAnnual Meeting of the Association of ComputationalLinguistics, pages 960–967.

Yuk Wah Wong and Raymond J Mooney. 2006. Learn-ing for semantic parsing with statistical machinetranslation. In Proceedings of the main conferenceon Human Language Technology Conference of theNorth American Chapter of the Association of Com-putational Linguistics, pages 439–446. Associationfor Computational Linguistics.

https://doi.org/10.18653/v1/2020.blackboxnlp-1.26

https://doi.org/10.18653/v1/2020.blackboxnlp-1.26

https://www.aclweb.org/anthology/N09-1024

https://www.aclweb.org/anthology/N09-1024





https://www.aclweb.org/anthology/W13-2810

https://www.aclweb.org/anthology/W13-2810

http://aclweb.org/anthology/P17-1076

http://aclweb.org/anthology/P17-1076

934

Daniel H Younger. 1967. Recognition and parsing ofcontext-free languages in time n3. Information andcontrol, 10(2):189–208.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,Dongxu Wang, Zifan Li, James Ma, Irene Li,Qingning Yao, Shanelle Roman, Zilin Zhang,and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing, pages3911–3921, Brussels, Belgium. Association forComputational Linguistics.

John M Zelle and Raymond J Mooney. 1996. Learn-ing to parse database queries using inductive logicprogramming. In Proceedings of the thirteenth na-tional conference on Artificial intelligence-Volume 2,pages 1050–1055.

Luke Zettlemoyer and Michael Collins. 2007. Onlinelearning of relaxed ccg grammars for parsing to log-ical form. In Proceedings of the 2007 Joint Con-ference on Empirical Methods in Natural LanguageProcessing and Computational Natural LanguageLearning (EMNLP-CoNLL), pages 678–687.

Luke S Zettlemoyer and Michael Collins. 2005. Learn-ing to map sentences to logical form: structured clas-sification with probabilistic categorial grammars. InProceedings of the Twenty-First Conference on Un-certainty in Artificial Intelligence, pages 658–666.AUAI Press.

Hao Zheng and Mirella Lapata. 2020. Compositionalgeneralization via semantic tagging. arXiv preprintarXiv:2010.11818.

https://doi.org/10.18653/v1/D18-1425

https://doi.org/10.18653/v1/D18-1425

https://doi.org/10.18653/v1/D18-1425

935

AppendixWe organize the appendix into two sections:

• Additional details for NQG in Appendix A.

• Additional experimental details and analysisin Appendix B.

A NQG Details

In this section we describe the NQG grammarinduction algorithm and parsing model in detail,starting with relevant background and notation forQCFGs.

A.1 Background: SCFGs and QCFGsSynchronous Context-Free Grammars (SCFGs)have been used to model the hierarchical mappingbetween pairs of strings in areas such as compilertheory (Aho and Ullman, 1972) and natural lan-guage processing.

Informally, SCFGs can be viewed as an exten-sion of Context-Free Grammars (CFGs) that syn-chronously generate strings in both a source andtarget language. We write SCFG rules as:

S → 〈α, β〉

Where S is a non-terminal symbol, and α and βare strings of non-terminal and terminal symbols.An SCFG rule can be viewed as two CFG rules,S → α and S → β with a pairing between theoccurrences of non-terminal symbols in α and β.This pairing is indicated by assigning each non-terminal in α and β an index ∈ N. Non-terminalssharing the same index are called linked. Followingconvention, we denote the index for a non-terminalusing a boxed subscript, e.g. NT[1]. A completeSCFG derivation is a pair of parse trees, one for thesource language and one for the target language.An example derivation is shown in Figure 4.

The ⇒r operator refers to a derives relation,such that 〈α1, β1〉 ⇒r 〈α2, β2〉 states that thestring pair 〈α2, β2〉 can be generated from 〈α1, β1〉by applying the rule r. We write ⇒ to leave therule unspecified, assuming the set of possible rulesis clear from context. We write⇒⇒ to indicate achain of 2 rule applications, omitting the interme-diate string pair. Finally, we write ∗⇒ to denote thereflexive transitive closure of⇒.

Quasi-Synchronous Context-Free Grammars(QCFGs) QCFGs generalize SCFGs in variousways, notably relaxing the restriction on a strict

one-to-one alignment between source and targetnon-terminals (Smith and Eisner, 2006).

Compositionality Notably, grammar for-malisms such SCFGs and QCFGs capture theformal notion of the principle of compositionalityas a homomorphism between source and targetstructures (Montague, 1970; Janssen and Partee,1997).

A.2 NQG Grammar Induction DetailsHaving defined the codelength scoring functionthat we use to compare grammars in section 4.1.1,we describe our greedy search algorithm that findsa grammar that approximately minimizes this ob-jective.14

Initialization We initialize R to be {NT →〈x,y〉 | x,y ∈ D}. We also add identity rulesfor substrings that exactly match between sourceand target examples, e.g. NT → 〈k, k〉 where k isa substring of both x and y for some x,y ∈ D.15

Optimization Algorithm Our algorithm was de-signed with simplicity in mind, and therefore usesa simple greedy search process that could likelybe significantly improved upon by future work.At a high level, our greedy algorithm iterativelyidentifies a rule to be added to R that decreasesthe codelength by enabling ≥ 1 rules in R to beremoved while maintaining the invariant that Gallows for deriving all of the training examples,i.e. 〈NT,NT 〉 ∗⇒ 〈x,y〉 for every x,y ∈ D. Thesearch completes when no rule that decreases L(R)can be identified.

To describe the implementation, first let us defineseveral operations over rules and sets of rules. Wedefine the set of rules that can be derived from agiven set of rules,R:

d(R) = {NT → 〈α, β〉 | 〈NT,NT 〉 ∗⇒ 〈α, β〉}

We define an operation SPLIT that generatespossible choices for splitting a rule into 2 rules:

SPLIT(NT → 〈α, β〉) = {g, h |〈NT,NT 〉 ⇒g⇒h 〈α, β〉 ∨〈NT,NT 〉 ⇒h⇒g 〈α, β〉},

14The induction objective contains hyperparameters repre-senting the bitlength of terminal and non-terminal symbols.For all experiments we use lN = 1. For GEOQUERY andSPIDER we use lT = 8, and use lT = 32 for SCAN.

15These initialization rules are used for GEOQUERY andSPIDER, but SCAN does not contain any exact token overlapbetween source and target languages.

936

NT → 〈how many NT[1] pass through NT[2], answer ( count ( intersection ( NT[1] , loc_1 ( NT[2] ) ) ) )〉NT → 〈rivers, river〉NT → 〈the largest NT[1], largest ( NT[1] )〉NT → 〈state, state〉

NT

NT[2]

NT[1]

state

largestthe

throughpassNT[1]

rivers

manyhow

NT

))))NT[2]

)NT[1]

state

(largest

(loc_1,NT[1]

river

(intersection(count(answer

Figure 4: An example QCFG derivation. Each non-terminal in the source derivation (blue) corresponds to a non-terminal in the target derivation (green). The QCFG rules used in the derivation are shown above.

where g and h is a pair of new rules that wouldmaintain the invariant that 〈NT,NT 〉 ∗⇒ 〈x,y〉for every x,y ∈ D, even if the provided rule iseliminated.16

SPLIT can be implemented by consider-ing pairs of sub-strings in α and β to re-place with a new indexed non-terminal sym-bol. For example, the rule “NT →〈largest state, largest ( state )〉” can be split into therules “NT → 〈largest NT[1], largest ( NT[1] )〉”and “NT → 〈state, state〉”. This step can requirere-indexing of non-terminals.

During our greedy search, we only split ruleswhen one of the two resulting rules can alreadybe derived given R. Therefore, we define a func-tion NEW that returns a set of candidate rules toconsider:

NEW(R) =

{g | g, h ∈ SPLIT(f) ∧ f ∈ R ∧ h ∈ d(R)}

Similarly, we can compute the set of rules thatare made redundant and can be eliminated by intro-ducing one these candidate rules, f :

ELIM(R, f) =

{h | f, g ∈ SPLIT(h) ∧ g ∈ d(R) ∧ h ∈ R}16We optionally allow SPLIT to introduce repeated target

non-terminals when the target string has repeated substrings.Otherwise, we do not allow SPLIT to replace a repeatedsubstring with a non-terminal, as this can lead to an ambiguouschoice. We enable this option for SCAN and SPIDER but notfor GEOQUERY, as FunQL does not require such repetitions.

We can then define the codelength reduction ofadding a particular rule, −∆L(R, f) = L(R) −L(R′) whereR′ = (R ∪ f) \ELIM(R, f).17 Fi-nally, we can select the rule with the largest −∆L:

MAX(R) = argmaxf∈NEW(R)

−∆L(R, f)

Conceptually, after initialization, the algorithmthen proceeds as:

while |NEW(R)| > 0 dor ← MAX(R)if −∆L(R, r) < 0 then

breakend ifR ← (R ∪ r) \ ELIM(R, r)

end while

For efficiency, we select the shortest N exam-ples from the training dataset, and only considerthese during the induction procedure. Avoidinglonger examples is helpful as the number of can-didates returned by SPLIT is polynomial with re-spect to source and target length. Once induction

17The last term of the codelength objective described insection 4.1.1 is related to the increase in the proportion ofincorrect derivations due to introducing f . Rather than com-puting this exactly, we estimate this quantity by sampling upto k examples from D that contain all of the sub-strings ofsource terminal symbols in f such that f could be used in aderivation, and estimating the increase in incorrect derivationsover this sample only. We sample k = 10 examples for allexperiments.

937

has completed, we then determine which of thelonger examples cannot be derived based on theset of induced rules, and add rules for these exam-ples.18

Our algorithm maintains a significant amount ofstate between iterations to cache computations thatare not affected by particular rule changes, basedon overlap in terminal symbols. We developed thealgorithm and selected some hyperparameters byassessing the size of the induced grammars overthe training sets of SCAN and GEOQUERY.

Our grammar induction algorithm is similar tothe transduction grammar induction method formachine translation by Saers et al. (2013). Morebroadly, compression-based criteria have been suc-cessfully used by a variety of models for language(Grünwald, 1995; Tang and Mooney, 2001; Raviand Knight, 2009; Poon et al., 2009).

A.3 NQG Parsing Model DetailsIn this section we provide details on how we gener-ate derivation scores, s(z,x), using a neural model,as introduced in § 4.1. The derivation scores de-compose over anchored rules from our grammar:

s(z,x) =∑

(r,i,j)∈z

φ(r, i, j,x),

where r is an index for a rule in G and i and j areindices defining the anchoring in x. The anchoredrule scores, φ(r, i, j,x), are based on contextual-ized representations from a BERT (Devlin et al.,2019) encoder:

φ(r, i, j,x) = fs([wi, wj ]) + eᵀrfr([wi, wj ]),

where [wi, wj ] is the concatenation of the BERTrepresentations for the first and last wordpiece inthe anchored span, fr is a feed-forward networkwith hidden size d that outputs a vector ∈ Rd, fs isa feed-forward network with hidden size d that out-puts a scalar, and er is an embedding ∈ Rd for therule index r. Our formulation for encoding spansis similar to that used in other neural span-factoredmodels (Stern et al., 2017; Lee et al., 2017).

B Experimental Details

B.1 Model Hyperparameters and RuntimeWe selected reasonable hyperparameter values andperformed some minimal hyperparameter tuning

18We use N = 500 for SCAN and N = 1000 for SPI-DER. As the GEOQUERY training set contains < 500 uniqueexamples, we use the entire training set.

for T5 and NQG based on random splits of thetraining sets for GEOQUERY and SPIDER. Weused the same hyperparameters for all splits of agiven dataset.

For T5, we selected a learning rate of 1e−4 from[1e−3, 1e−4, 1e−5], which we used for all experi-ments. Otherwise, we used the default hyperparam-eters for fine-tuning. We fine-tune for 3, 000 stepsfor GEOQUERY and 10, 000 for SPIDER. T5-Basetrained with a learning rate of 1e−4 reached 94.2%accuracy at 3, 000 steps on a random split of thestandard GeoQuery training set into 500 trainingand 100 validation examples.

For the NQG neural model, we use the pre-trained BERT Tiny model of Turc et al. (2019)(4.4M parameters) for SCAN and SPIDER, andBERT Base (110.1M parameters) for GEOQUERY,where there is more headroom for improved scor-ing. We do not freeze pre-trained BERT parame-ters during training. For all experiments, we used = 256 dimensions for computing anchored rulescores. We fine-tune for 256 steps and use a learn-ing rate of 1e−4. We use a batch size of 256.

We train NQG on 8 V100 GPUs. Training NQGtakes < 5 minutes for SCAN and SPIDER (BERTTiny), and up to 90 minutes for GEOQUERY (BERTBase). We fine-tune T5 on 32 Cloud TPU v3cores.19 For GEOQUERY, fine-tuning T5 takesapproximately 5 and 37 hours for Base and 3B,respectively. For SPIDER, fine-tuning T5 takesapproximately 5 and 77 hours for Base and 3B,respectively.

B.2 Dataset Preprocessing

For GEOQUERY, we use the version of thedataset with variable-free FunQL logicalforms (Kate et al., 2005), and expand certainfunctions based on their logical definitions,such that state(next_to_1(state(all)))

becomes the more conventionalintersection(state, next_to_1(state)).We replace entity mentions with placeholders (e.g.“m0”, “m1”) in both the source and target.

For SPIDER, we prepend the name of the targetdatabase to the source sequence. For T5, we also se-rialize the database schema as a string and appendit to the source sequence similarly to Suhr et al.(2020). This schema string contains the names ofall tables in the database, and the names of thecolumns for each table. As we use a maximum

19https://cloud.google.com/tpu/

938

Source: how many states are next to major riversTarget: answer ( count ( intersection ( state , next_to_2 ( intersection ( major ,

river ) ) ) ) )

Prediction: answer ( count ( intersection ( state , next_to_2 ( intersection ( major ,

intersection ( river , m0 ) ) ) ) ) )

Notes: The trigram “major , intersection” occurs 28 times during training, but “major , river”occurs 0 times. In this case, T5 also hallucinates “m0” despite no entity placeholder occuring the source.

Source: which state has the highest peak in the countryTarget: answer ( intersection ( state , loc_1 ( highest ( place ) ) ) )

Prediction: answer ( highest ( intersection ( state , loc_2 ( highest ( intersection (

mountain , loc_2 ( m0 ) ) ) ) ) )

Notes: The token “highest” occurs after “answer (” in 83% of instances in which “highest” occurs inthe training set. Note that T5 also hallucinates “m0” in this case.

Table 7: Example prediction errors for T5-Base for the GEOQUERY TMCD split.

Dataset Examples Induced Rules Ratio

SCAN 16727 21 796.5GEOQUERY 600 234 2.6SPIDER-SSP 3282 4155 0.79

Table 8: Sizes of induced grammars.

Std. Templ. Len. TMCD

NQG-T5-3B Acc. 0.6 1.2 1.2 0.4NQG-T5-Base Acc. 0.5 1.4 1.1 0.4NQG Acc. 1.2 4.5 1.5 0.4

NQG Coverage 0.7 3.4 1.8 0.1NQG Precision 0.7 1.9 1.7 1.2

Table 9: Standard deviation of NQG for GEOQUERY.

source sequence length of 512 for T5, this leadsto some schema strings being truncated (affectingabout 5% of training examples).

SCAN did not require any dataset-specific pre-processing.

B.3 Atom and Compound Definitions

For GEOQUERY, the tree structure of FunQL isgiven by explicit bracketing. We define atomsas individual FunQL symbols, and compounds ascombinations between parent and child symbolsin the FunQL tree. Example atoms are longest,river, and exclude and example compounds arelongest(river) and exclude(longest(_), _).

For SPIDER, we tokenize the SQL string anddefine atoms as individual tokens. To define com-pounds, we parse the SQL string using an unam-biguous CFG, and define compounds from the re-sulting parse tree. We define compounds over bothfirst and second order edges in the resulting parsetree.

B.4 Grammar SizesInduced grammar sizes for a selected split of eachdataset are shown in Table 8. For SPIDER, thenumber of induced rules is larger than the origi-nal dataset due to the identity rules added duringinitialization.

B.5 GEOQUERY VarianceIn tables 2 and 5 we report the mean of 3 runs forNQG for GEOQUERY. The standard deviations forthese runs are reported in Table 9. The reportedstandard deviations for NQG-T5 use the same fine-tuned T5 checkpoint, so they do not reflect anyadditional variance from different fine-tuned T5checkpoints.

B.6 T5 GEOQUERY ErrorsWe include several example T5-Base errors on theGEOQUERY TMCD split in Table 7.

compositional generalization and natural language

Documents