Frugal Paradigm Completion

Alexander Erdmann†, Tom Kenter, Markus Becker and Christian Schallhart
*Google, UK    †Ohio State University, USA

[email protected], {tomkenter,mabecker,schallhart}@google.com

Abstract

Lexica distinguishing all morphologically related forms of each lexeme are crucial to many language technologies, yet building them is expensive. We propose Frugal Paradigm Completion, an approach that predicts all related forms in a morphological paradigm from as few manually provided forms as possible. It induces typological information during training which it uses to determine the best sources at test time. We evaluate our language-agnostic approach on 7 diverse languages. Compared to popular alternative approaches, our Frugal Paradigm Completion approach reduces manual labor by 16-63% and is the most robust to typological variation.

1 Introduction

From syntactic parsing (Seeker and Kuhn, 2013) to text-to-speech (Zen et al., 2016; Wan et al., 2019), many linguistic technologies rely on accurate lexica decorated with morphological information. Yet, building such lexica requires much human effort (Buckwalter, 2002; Tadić and Fulgosi, 2003; Forsberg et al., 2006; Sagot, 2010; Eskander et al., 2013). We present a language-agnostic method for minimizing the manual labor required to add new paradigms to an existing lexicon.

Formally, let each lexicon entry, or realization, be a triple (P, C, f). P marks membership in some paradigm P of morphologically related words, C defines a cell in P as a bundle of morphosyntactic features, and f is the form realizing C in P. Hence, paradigm SING can be expressed (in the UniMorph schema (Kirov et al., 2018)) as a set of realizations: {(SING, NFIN, sing), (SING, 3.SG.PRES, sings), ...}.

For each paradigm to be added to the lexicon, e.g., FLY, we aim to select as few sources as possible to be manually realized, e.g., {(FLY, NFIN, fly), (FLY, PST, flew)}, such that the forms realizing the remaining cells can be predicted, i.e., flies, flying, flown. Here, sources are manually provided realizations. Targets are realizations whose forms must be predicted from sources. Our work differs from traditional paradigm completion (Durrett and DeNero, 2013) in that sources are not given blindly, but the system must strategically select which sources it wants to be given at test time.

*This work was carried out during the first author's internship at Google UK in 2019.

Paradigm completion from one source is typically non-deterministic due to multiple inflection classes realizing different exponents in some cells, e.g., suffixing +ed generates the past tense for WALK, but not for SING or FLY, which are members of different classes. Hence, many works discuss paradigm completion in the context of (implicit) inflection class disambiguation (Ackerman et al., 2009; Montermini and Bonami, 2013; Beniamine et al., 2018). Finkel and Stump (2007) propose three approaches to select the fewest sources required to deterministically identify class. Yet neural sequence models can often complete paradigms accurately from fewer sources without fully disambiguating inflection class (Kann and Schütze, 2016; Aharoni and Goldberg, 2017; Wu and Cotterell, 2019). See Elsner et al. (2019) for an overview of the application of neural sequence models to morphological theory.

We propose Frugal Paradigm Completion (FPC), inspired by work on inflection class disambiguation and neural sequence modeling. We train a source selection agent (SSA) to induce typological knowledge regarding the distribution of complexity in paradigms and use this to request informative source cells to be realized by an oracle. Sources are fed to a predictor to generate target forms. For each paradigm, SSA iteratively requests sources until the oracle confirms all cells have been realized correctly.


We introduce a novel metric, auto-rate, to quantify the manual labor (performed by the oracle) needed to complete each paradigm. Using this metric, we demonstrate that FPC reduces labor by 63% over predicting targets from lemmata, and 47% over predicting them from the smallest set of sources that fully disambiguates inflection class. We propose a new typology for discussing the organization of complexity in paradigms which helps explain why strategies perform better or worse on certain languages, while FPC, being sensitive to typological variation, performs robustly.

After discussing related paradigm completion approaches in Section 2, we describe FPC in Section 3. Section 4 covers all data and experimental setup details. We discuss results in Section 5 and analyze FPC's behavior in Section 6.

2 Paradigm Completion Approaches

Here we discuss several paradigm completion approaches related to FPC.

Lemma-based Paradigm Completion The standard paradigm completion approach does not select sources, but assumes one source: the lemma (Dreyer and Eisner, 2011), whose distinction is ultimately arbitrary. Yet many have shown that more informative sources can be chosen (Finkel and Stump, 2007; Cotterell et al., 2017b; Kann and Schütze, 2018).

Most Informative Source For each target form to be predicted, Kann and Schütze (2018) select the source most likely to predict that form. Unlike FPC, they do not attempt to minimize the number of unique sources that must be manually realized.

Static Principal Parts To minimize the sources required to fully disambiguate inflection class, Finkel and Stump (2007); Stump and Finkel (2013) propose three approaches: static, dynamic, and adaptive. In the static approach, the same sources must be used for every paradigm (these sources are referred to as principal parts in a much older pedagogical tradition dating back to ancient Rome with Varro's De Lingua Latina (Grinstead, 1916; Ahern, 1990)). Cotterell et al. (2017b) train a model on static sources and attain near 100% accuracy in Latin verb paradigm completion. However, they do not consider that one paradigm may require fewer sources than another, nor that paradigm completion may require fewer sources than inflection class disambiguation.

Dynamic Principal Parts Finkel and Stump (2007)'s dynamic approach selects a minimal set of sources necessary to fully disambiguate inflection class, which can be unique to that inflection class. While efficient, this is impractical in that it requires oracular knowledge of class prior to seeing any forms.

Adaptive Principal Parts Finkel and Stump (2007)'s adaptive approach, like our FPC method, chooses the same first source cell for each paradigm P. Subsequent sources are selected conditional on the set of inflection classes P could belong to given the sources realized so far. Hence, the number of sources required per paradigm is upper bounded by the static approach and lower bounded by the dynamic.

Our FPC approach is a neural update, inspired by their adaptive approach. While their implementation tracks viable inflection classes explicitly with rules operating on oracularly segmented affixes, we use sequence models operating on whole words to remove reliance on oracular segmentation and leverage stem-internal phonology known to correlate with inflection class (Aronoff, 1992; Dressler and Thornton, 1996; Dawdy-Hesterberg and Pierrehumbert, 2014).

3 Frugal Paradigm Completion

This section describes the interactions of the three FPC components. As illustrated in Figure 1, the predictor takes a source cell and its realizing form as input, e.g., 3.SG.PRES: sings, or cell 2: form 2 in the figure. The predictor is composed of as many sub-predictors as there are cells in the paradigm, each of which is trained to predict the entire paradigm from one source cell's realization. Cell 2 in the paradigm is grayed out in the figure, as this was provided as input, so it does not have to be predicted. The predicted paradigm is evaluated by the oracle. If there are no errors, we are done. Otherwise, based on previous sources, SSA chooses a new cell to be realized by the oracle and gives it to the predictor as the next source. Because cell 3 is chosen in the figure, sub-predictor 3 will be used to predict the paradigm going forward, and cells 2 and 3 will both be grayed out. The process continues like this until all cells have been correctly predicted by at least one sub-predictor.

Crucially, during inference, each test paradigm is empty, i.e., no realization has been seen during training and no source is available to inflect from a priori. Our setup aims to minimize the number of sources which the SSA must request from the oracle (typically a human in the loop at inference time) to predict the remaining paradigm slots correctly.

Figure 1: Schematic representation of the flow of Frugal Paradigm Completion at inference time.
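To make the Figure 1 loop concrete, here is a minimal Python sketch of how inference could be organized. All class and method names (ssa.next_cell, oracle.realize, predictor.predict, oracle.all_correct) are hypothetical; the paper does not release an implementation.

def complete_paradigm(ssa, predictor, oracle, cells):
    """Sketch of the FPC inference loop: request sources until the
    oracle confirms every cell is realized correctly."""
    sources = {}    # cell -> form manually provided by the oracle
    predicted = {}  # cell -> form predicted by a sub-predictor
    while True:
        # SSA picks the next source cell, conditioned on prior sources.
        cell = ssa.next_cell(sources)
        sources[cell] = oracle.realize(cell)  # one unit of manual labor
        # The sub-predictor for that cell re-predicts all remaining targets.
        for target in cells:
            if target not in sources:
                predicted[target] = predictor.predict(
                    src_cell=cell, src_form=sources[cell], trg_cell=target)
        paradigm = {**sources, **predicted}
        if oracle.all_correct(paradigm):  # oracle signals "done"
            return paradigm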

3.1 Predictor

The predictor outputs a target form given its cell, a source form, and the source form's cell as input. To train the predictor, for each possible source cell, we train a sub-predictor to predict every possible target form in every paradigm in the training data given the realization of that source cell in that paradigm. Details of all sequence model architectures are provided in Section 4.

3.2 Source Selection Agent

SSA's choice of a cell for a given paradigm depends on all previously selected cells for that paradigm and their corresponding forms. This allows SSA to learn, e.g., that given a previous English PST source, PST.PTCP should only be requested as a subsequent source if the PST form did not take the regular -ed suffix. Otherwise, PST.PTCP is likely to be regular and unlikely to contribute new information.

To induce such knowledge, we train SSA on an oracle policy of ideal source selections extracted from the train set (Ross et al., 2011; Ross and Bagnell, 2014; Welleck et al., 2019).¹ To extract the oracle policy, we divide the training lexicon into two folds and train one predictor on each, allowing us to cross-validate each predictor on its held-out fold. For each training paradigm, we test which target forms can be correctly predicted by which source cells' sub-predictors. As shown for SING in Figure 2, we use this information to extract minimum set covers, i.e., the fewest source cells such that the union of the subsets they predict correctly covers the entire paradigm. These covers constitute the oracle policy used to train SSA.

¹While we borrow the term oracle policy from Imitation Learning (Ross et al., 2011; Ross and Bagnell, 2014; Welleck et al., 2019), we mimic the oracle policy with simple sequence learning. Our analysis suggests even this may be more machinery than necessary.

The minimum set cover problem is NP-complete (Lund and Yannakakis, 1994; Kuhn et al., 2005), but we approximate it in O(log_e |P|) by iteratively selecting the cell whose subset most enlarges the union. We break ties by averaging predictiveness (Equation 1) over both folds, where fold F contains |F| paradigms; P_m, |P_m| cells; and Acc(P_m, C_trg, C_src) returns 1 if using C_src's realization as a source correctly predicts the form realizing cell C_trg in paradigm P_m.

predictiveness(C_{src}, F) = \frac{\sum_{m=1}^{|F|} \sum_{j=1}^{|P_m|} Acc(P_m, C_j, C_{src})}{\sum_{m=1}^{|F|} |P_m|}    (1)

At this stage, paradigm covers are dynamic in that no single cell need be shared by all covers. Yet, when selecting the first source, SSA has no previous sources to condition on, making it impossible to predict the first cell. Thus, we get adaptive minimum set covers by designating the start cell to be that which occurs in the most dynamic covers. Then we re-approximate all covers such that each includes this cell.² Finally, we rank cells within each cover by the total number of covers in which they appear. For each cell in each cover, we train SSA to predict said cell from all higher ranked cells and their realizing forms (holding out 2% of them for development).

²We train and test on a single part-of-speech for each language, so each paradigm should contain the start cell. For defective paradigms lacking said cell, we back off to the most frequent cell that exists in the paradigm.
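A minimal sketch of this greedy approximation, assuming `predicts` maps each candidate source cell to the set of cells its cross-validated sub-predictor gets right for one training paradigm (the tie-breaking by average predictiveness, Equation 1, is elided):

def greedy_min_cover(predicts, paradigm_cells, start_cell=None):
    """Greedily approximate the minimum set cover of a paradigm.

    predicts: dict mapping each source cell to the set of cells it
    predicts correctly; paradigm_cells: all cells to be covered;
    start_cell: optional cell every adaptive cover must include.
    """
    cover, covered = [], set()
    if start_cell is not None:
        cover.append(start_cell)
        covered |= predicts.get(start_cell, set())
    while covered < set(paradigm_cells):
        # Choose the cell whose subset most enlarges the covered union.
        best = max(predicts, key=lambda c: len(predicts[c] - covered))
        if not predicts[best] - covered:
            break  # no remaining cell adds coverage
        cover.append(best)
        covered |= predicts[best]
    return cover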

Figure 2: Minimum set cover example for SING, which is {NFIN, PST}.

3.3 Oracle

The oracle represents a human-in-the-loop during inference, providing requested source realizations to the predictor and informing SSA when a paradigm is complete and accurate (Figure 1). In our implementation, the oracle does not specify which individual predictions are incorrect, so it must resolve any discrepancies when two sub-predictors disagree after the fact. We do not attempt to model the additional cost this incurs, as it is unclear how to combine it with the presumably more expensive cost of correcting errors, which we model instead. This is worth revisiting in future work.

4 Experimental Details

We evaluate 4 paradigm completion approaches on 7 languages. Here we discuss implementation, data, and evaluation details.

4.1 Prediction Architecture

All sequence models in all implementations of any paradigm completion approach use the Transformer architecture (Vaswani et al., 2017). Here we describe the formatting of inputs and outputs as well as our hyperparameters.

Input and Output Formats Following Kann and Schütze (2016), input sequences combine characters and morphosyntactic features. The following is a sample input and output for a single-source FPC sub-predictor specializing in the cell NFIN:

Input: f l y out_V.PTCP out_PST

Output: f l o w n

For any inflected-form-predicting sequence model whose input is not limited to realizations of a single cell (as in, e.g., the static principal parts approach), source cell features are prepended to the input as such:

Input: in_NFIN f l y out_V.PTCP out_PST

Output: f l o w n

For multi-source sequence models, the features of each source are inserted into the input and the target features are listed after the first source. We experimented with several different multi-source representations and the Transformer performed fairly similarly with all of them.

Input: in_NFIN f l y out_V.PTCP out_PST in_PST f l e w

Output: f l o w n

The FPC's SSA predicts not a form, but a cell, conditional on any previously realized sources. To predict the first source, it is given nothing and will thus deterministically select the best starting cell as determined by the oracle policy (see Section 3.2). To predict any subsequent source, it conditions on the realizations of all previously requested sources for that paradigm. The following exemplifies SSA inputs and outputs when predicting the second source for paradigm FLY:

Input: in_NFIN f l y

Output: in_V.PTCP in_PST
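For illustration, serializing a source realization and target features into the sequences above could look like the following; `encode_predictor_input` is a hypothetical helper, and the exact tokenization used in the paper may differ.

def encode_predictor_input(src_form, trg_feats, src_feats=None):
    """Build a Kann and Schütze (2016)-style sequence: optional
    in_-prefixed source features, the source form's characters,
    then out_-prefixed target features."""
    tokens = ['in_' + f for f in (src_feats or [])]  # omitted for
    tokens += list(src_form)                         # single-cell models
    tokens += ['out_' + f for f in trg_feats]
    return ' '.join(tokens)

# encode_predictor_input('fly', ['V.PTCP', 'PST'], src_feats=['NFIN'])
# -> 'in_NFIN f l y out_V.PTCP out_PST'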

Wu et al. (2018) and others have achieved improvements by embedding morphosyntactic features separately and concatenating them to the encoder output prior to feeding it to the decoder. Our error analysis, however, suggests Transformers handle Kann and Schütze (2016)-style input well. More sophisticated feature handling may not be necessary, but should be investigated in future work.

Hyperparameters We train all Transformer models for 100 epochs in batches of 64 with 0.1 dropout probability. The final model is restored from the epoch with the highest dev accuracy. We stop early if there is no improvement for 20 epochs. The only exception is during FPC cross-validation, where sub-predictor models are trained for only 50 epochs with early stopping after 5 epochs without improvement; this is just to reduce computational cost, as it is sufficient to induce an oracle policy. The final sub-predictor models, however (those used at inference time, not those used to induce the oracle policy), are trained on the full training data using the full 100 epochs with 20 epochs patience for early stopping. As for Transformer-specific hyperparameters, using the original notation of Vaswani et al. (2017), we set N = 4, d_model = 128, d_ff = 512, and h = 8, scaling down the hyperparameters recommended for machine translation, as our task is less expensive (Aharoni and Goldberg, 2017; Wu et al., 2018).

                     Train     Dev    Test
Arabic nouns
  paradigms           1006     100     100
  instances          24160    2260    2352
German verbs
  paradigms           1031     100     100
  instances          27762    2690    2692
English verbs
  paradigms           2908     200     201
  instances          14522    1000    1001
Russian nouns
  paradigms           3289     100     100
  instances          37423    1133    1137
Latin nouns
  paradigms           1630     100     100
  instances          19150    1180    1174
Hungarian nouns
  paradigms           1405     100     100
  instances          47689    3400    3383
Irish nouns
  paradigms            549     100     100
  instances           6460    1197    1195

Table 1: Number of paradigms and instances by split for every language and POS considered.

4.2 Data Preparation

For every language and part of speech (POS) considered, we extract train, dev, and test sets from UniMorph (Kirov et al., 2018). Each split contains full paradigms, though the cells realized in each may vary due to defectiveness (Corbett, 2005; Sims, 2015). We filter many gold errors by removing paradigms for which no realization can be attested in actual text. We use Universal Dependencies (UD) (Nivre et al., 2016) to check for attestations. We also filter overabundant realizations (multiple forms realizing one cell), keeping only the most frequent form, as attested in UD. While some languages allow for overabundance (Thornton, 2010, 2011), in UniMorph this often indicates a gold error.
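A sketch of this filtering step, under the assumption that a helper `ud_freq` returns a form's attestation count in UD (both the helper and the paradigm layout are illustrative, not the paper's actual code):

def filter_paradigm(paradigm, ud_freq):
    """Drop unattested paradigms and resolve overabundance.

    paradigm: dict mapping each cell to a list of candidate forms.
    Returns None if no realization is attested in UD (likely a gold
    error); otherwise keeps only the most frequent form per cell.
    """
    attested = any(ud_freq(form) > 0
                   for forms in paradigm.values() for form in forms)
    if not attested:
        return None
    return {cell: max(forms, key=ud_freq)
            for cell, forms in paradigm.items()}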

We randomly divide paradigms into splits such that train is maximally large and dev and test contain at least 100 paradigms and 1,000 realizations. Exact quantities are displayed in Table 1. Arabic, German, English, and Russian were used for development, while Irish, Hungarian, and Latin were only evaluated after fixing hyperparameters. The languages considered represent 3 families and 4 diverse Indo-European branches. They exhibit multiple non-canonical behaviors (Corbett, 2005) and present diverse challenges, from non-concatenative morphology to complex inflection class systems.

4.3 Evaluation

Paradigm completion is usually evaluated via exact match accuracy on held out target forms (Cotterell et al., 2016, 2017a, 2018; McCarthy et al., 2019). Yet we use as many sources as are necessary to reach 100% accuracy in predicting the remaining slots, so accuracy is not a meaningful metric for the FPC. Some theoretical works focus on the sources required to unambiguously complete a paradigm given some implicit knowledge of viable inflection classes (Finkel and Stump, 2007; Ackerman and Malouf, 2013). Yet these tend not to propose actual paradigm completion models or evaluate their decisions in ambiguous cases. To evaluate our system and bridge these traditions, we propose auto-rate:

auto\text{-}rate = \frac{\sum_{i=1}^{n} auto(P_i)}{\sum_{i=1}^{n} |P_i|}    (2)

where auto(P) denotes the number of realizations correctly predicted while not having been provided as sources for paradigm P by the oracle.

Intuitively, auto-rate is like accuracy, but it counts oracularly provided sources as additional errors, since both errors and sources require labor, i.e., sources require manual input and errors, post-correction. We also report manual cells per paradigm, i.e., sources plus errors. Of course, FPC resolves all errors eventually, but other systems can make errors requiring post-correction.
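Concretely, both metrics could be computed as follows, under the simplifying assumption that for each completed paradigm we record its total cells, oracle-provided sources, and residual errors, so that auto(P) = |P| minus sources minus errors:

def auto_rate(paradigms):
    """Equation 2: share of realizations produced automatically.

    paradigms: list of dicts with keys 'cells', 'sources', 'errors'.
    """
    auto = sum(p['cells'] - p['sources'] - p['errors'] for p in paradigms)
    return auto / sum(p['cells'] for p in paradigms)

def manual_cells_per_paradigm(paradigms):
    """Sources plus errors, averaged over paradigms."""
    manual = sum(p['sources'] + p['errors'] for p in paradigms)
    return manual / len(paradigms)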

4.4 Baselines

We compare the FPC method to three baselines. One is a variant of FPC using a random SSA. This allows us to distinguish the benefit of a smart SSA from that of simply receiving additional feedback from an oracle in the loop. Each time a source must be selected, random SSA chooses randomly without replacement. Its performance is averaged over two runs. The lemma approach baseline predicts all paradigm forms from one designated source: the lemma. Finally, for the static approach baseline, we considered two static approach implementations. The single-source implementation predicts each target from the source that is, in theory, its best predictor (Kann and Schütze, 2018). The multi-source implementation concatenates these sources, predicting each target from the concatenated input. As results are nearly identical for either implementation, we report results only for single-source, with the exception of Latin, as explained presently.

For some languages, there is little theoretical or pedagogical literature to help identify the best sources for the static approach. Our single-source static approach for Arabic nouns predicts singular and dual forms from SG;NDEF;NOM and plurals from PL;NDEF;NOM. In theory, any non-plural plus any plural should be sufficient (Brustad et al., 2005; Habash, 2010). For German verbs, we predict present and imperative forms from NFIN and past forms from IND;PST;1SG (Grebe et al., 1966). We predict English present forms from NFIN; PST and V.PTCP;PST predict themselves. For Russian nouns, Zaliznyak (1980) argues for five sources, yet Parker (2016) demonstrates that three are usually sufficient. We follow the latter, predicting all nominative or accusative forms from ACC;SG, all other singulars from INS;SG, and all other plurals from GEN;PL. In preliminary experiments, we found this to match the accuracy of the five-source approach, thus achieving a higher auto-rate. For Latin, we could not evaluate a single-source static implementation, as it is unclear which source cell best predicts each target. The multi-source static approach for Latin nouns predicts all forms from NOM;SG and GEN;SG (following the classical grammatical analyses of Varro, Priscian, and the Roman ars grammatica). For Irish and Hungarian, we do not evaluate a static approach, as we lack the requisite linguistic knowledge to determine the best sources.

                   Accuracy      Auto-rate      Mcpp
                   Dev   Test    Dev   Test    Dev   Test
Arabic nouns
  Lemma            62.0  58.8    59.3  56.5    9.6   10.7
  Static           95.9  99.4    89.5  93.1    2.9    2.1
  Random Ag.        -     -      90.2  90.9    2.2    2.2
  FPC               -     -      90.2  93.6    2.2    1.5
German verbs
  Lemma            87.6  89.0    84.1  85.8    4.3    4.0
  Static           94.1  96.4    86.7  88.9    3.6    3.0
  Random Ag.        -     -      90.0  92.1    2.4    1.9
  FPC               -     -      91.8  92.5    2.0    1.8
English verbs
  Lemma            96.5  94.0    76.7  74.2    1.2    1.3
  Static           99.7  98.4    39.7  38.4    3.0    3.0
  Random Ag.        -     -      76.0  73.3    1.2    1.4
  FPC               -     -      77.3  74.3    1.1    1.3
Russian nouns
  Lemma            97.1  95.6    88.3  87.5    1.3    1.5
  Static           98.4  98.3    72.6  72.3    3.2    3.2
  Random Ag.        -     -      86.1  84.3    1.6    1.8
  FPC               -     -      88.5  89.1    1.3    1.2
Latin nouns
  Lemma            65.5  51.6    63.6  49.6    5.1    6.7
  Static           97.7  96.8    80.8  79.7    2.3    2.4
  Random Ag.        -     -      85.9  84.7    1.7    1.8
  FPC               -     -      89.0  87.8    1.3    1.4
Hungarian nouns
  Lemma            95.6  90.9    92.8  88.0    2.5    4.1
  Random Ag.        -     -      95.0  94.6    1.7    1.9
  FPC               -     -      95.5  95.2    1.5    1.6
Irish nouns
  Lemma            63.5  66.9    56.1  59.6    5.4    5.0
  Random Ag.        -     -      64.9  68.2    4.2    3.8
  FPC               -     -      72.1  69.6    3.3    3.6

Table 2: Evaluation of paradigm completion approaches with metrics defined in Section 4. We do not report accuracy for FPC or its random agent variant (Random Ag.), as it is trivially 100% (see Section 4.3). Mcpp stands for manual cells per paradigm.

5 Results and Discussion

As shown in Table 2, FPC always ties or beats the next best approach, while the next best approach varies by language. On average, FPC reduces labor by 63% over the lemma approach, 47% over static, 16% over random agent, and 13% over the next best approach. Its success is mainly due to (1) making predictions from fewer sources than are required for fully disambiguating inflection class and (2) receiving feedback after each source.

Surprisingly, training a sophisticated SSA does not improve much over using a random agent. We argue this is due to an unexpectedly large margin of error in the agent's source selection task. Despite the complexity of source selection strategies required for inflection class disambiguation, FPC uses lexical frequencies to expect regularity and stem-internal clues to anticipate irregular classes, requiring a median of just one source per paradigm for all languages except under-resourced Irish. Furthermore, inspection of the source selection minimum set covers reveals that it is often the case that a paradigm can be completed correctly from any single source. This is surprising in light of the precise strategies required for completely deterministic paradigm completion in Finkel and Stump (2007)'s framework and in light of Albright (2002)'s case for the privileged status of a single form per paradigm, though in our framework, with full words and full paradigms for training, it seems that many sources can often serve as good enough singleton principal parts. This supports Bonami and Beniamine (2016)'s proposal of gradient principal part analyses.

6 Analysis

Here, we discuss patterns relating SSA's first and second sources chosen (Figures 3a-b and 4a-b) to the inter-predictability of cells represented by heat maps (3c and 4c). Maps display the average accuracies with which each target (column) can be predicted from each source (row). We analyze specific SSA choices and predictor errors in Arabic and Latin.

The maps (for all languages, see the Appendix) suggest complexity can be distributed within paradigms in systematically distinct ways. Ackerman and Malouf (2013) propose integrative (I-) complexity, using average conditional entropy to describe paradigmatic organization, but this has been criticized for obscuring differences in the predictability of sub-paradigm regions (Cotterell et al., 2019; Elsner et al., 2019). To remedy this, we propose a typology for measuring the extent to which I-complexity is realized via different organizational strategies, which is useful for discussing source selection strategies. Our typology describes paradigms in terms of mutual predictability, the correlation of a map and its transpose, and entropy predictiveness, the negative correlation of cells' average predictiveness (see Equation 1) and average predictability, defined here in comparable terms as:

predictability(C_{trg}, F) = \frac{\sum_{m=1}^{|F|} \sum_{j=1}^{|P_m|} Acc(P_m, C_{trg}, C_j)}{\sum_{m=1}^{|F|} |P_m|}    (3)

Intuitively, a paradigm is mutually predictable if the fact that cell A predicts cell B means that B is likely to predict A. Such paradigms often feature regions of mutually predictable cells (as in 3c), such that an optimal strategy avoids picking multiple sources from one region. For entropy predictive paradigms, if A is generally more difficult to predict than B, A is likely to be a better predictor of the remaining cells (following the information theoretic logic that surprisal is informative (Shannon, 1948; Jaeger, 2010)). For such paradigms, the optimal strategy selects the source which would have been the most difficult target to predict.
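Both diagnostics can be read off an inter-predictability matrix. A sketch, assuming M[i, j] holds the accuracy of predicting target cell j from source cell i (a heat-map approximation of Equations 1 and 3, which average over paradigms):

import numpy as np

def typology_scores(M):
    """Return (mutual predictability, entropy predictiveness)."""
    # Mutual predictability: correlation of the map with its transpose,
    # ignoring the diagonal (a cell trivially predicts itself).
    off_diag = ~np.eye(M.shape[0], dtype=bool)
    mutual = np.corrcoef(M[off_diag], M.T[off_diag])[0, 1]
    # Entropy predictiveness: hard-to-predict cells (low column mean)
    # should be strong predictors (high row mean).
    predictiveness = M.mean(axis=1)  # rows as sources, cf. Equation 1
    predictability = M.mean(axis=0)  # columns as targets, cf. Equation 3
    entropy_pred = -np.corrcoef(predictiveness, predictability)[0, 1]
    return mutual, entropy_pred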

Unlike Sims (2020)'s graph-theoretic typology for describing inflection class structure, our typology is a two-dimensional description of how the optimal paradigm completion strategy is affected by underlying class structure. In this sense, our typology is complementary to hers, and future work might investigate the relationship between traits in her typology and mutual predictability or entropy predictiveness. Furthermore, our typology might be updated to consider the impact of type frequency (Sims and Parker, 2016) in a framework where distributional data is available.

Figure 5 demonstrates that cross-linguistic variation is vast with respect to our typology, as some languages even exhibit negative entropy predictiveness or mutual predictability. This partly explains why non-FPC approaches perform erratically: if paradigmatic organization varies by language, source selection strategies must be able to adapt to the data.

6.1 Arabic Error Analysis

Arabic nouns are mutually predictable (Figure 5). Any singular or dual form can predict another. Plural forms also predict each other. Yet, in general, plurals are less predictive/able (Figure 3c) due to several inflection classes varying in the plural. The sound plurals take suffixes while broken plural classes are realized via non-concatenative processes. For example, rAkb, rider, from root r k b, takes the broken plural pattern _ _ A _, becoming rkAb. Yet, having heard only singular realizations, a human might posit a sound plural, i.e., *rAkbwn, realizing the more productive exponent.

Figure 3: Arabic analysis. (a) and (b) define how likely each cell is to be a source (white), correctly predicted (gray), or an error (black) after one (a) or two (b) sources. (c) shows the predictiveness/ability of source (rows) and target (columns) cells. Darker cells are less predictive/able. For a more detailed rendering of this graphic, please see the appendix.

SSA learns an ideal strategy, requesting a singular source (Figure 3a) and then a plural (3b). Interestingly, 6 of 18 sound feminine plurals (the most frequent single class) require multiple sources and 8 of 28 broken plurals do not. Thus, the predictor does not default to regularity, but uses stem-internal phonology to anticipate irregularity. Most errors made from the first source posit a viable broken plural, just not the right one. In future work, modeling semantics can fix such errors, e.g., knowing that rAkb is animate makes plural *rwAkb unlikely, as animate nouns seldom take that inflection class.

Figure 4: Latin analysis. (a) and (b) define how likely each cell is to be a source (white), correctly predicted (gray), or an error (black) after one (a) or two (b) sources. (c) shows the predictiveness/ability of source (rows) and target (columns) cells. Darker cells are less predictive/able. For a more detailed rendering of this graphic, please see the appendix.

For future work, we can pre-train on raw corpora to give our model access to such information (Devlin et al., 2019). Indeed, Erdmann and Habash (2018) found distributional information to benefit inflectional paradigm clustering in Arabic, though the benefits should generalize, as semantics correlates with inflection class in many languages (Wurzel, 1989; Aronoff, 1992; Harris, 1992; Noyer, 1992; Carstairs-McCarthy, 1994; Corbett and Fraser, 2000; Kastner, 2019).

Figure 5: Integrative typology for describing paradigms by the mutual predictability of cells and the predictive power of hard-to-predict cells. (Axes: entropy predictiveness on x, mutual predictability on y.)

Cell       1st     2ndM    2ndN    3rd       4thM    4thN
NOM;SG     a       us      um      varies    us      ū
GEN;SG     ae      ī       ī       is        ūs      ūs
DAT;SG     ae      ō       ō       ī         uī      ū
ACC;SG     am      um      um      em        um      ū
ABL;SG     ā       ō       ō       e         ū       ū
VOC;SG     generally matches NOM;SG
NOM;PL     ae      ī       a       ēs        ūs      ua
GEN;PL     ārum    ōrum    ōrum    um        uum     uum
DAT;PL     īs      īs      īs      ibus      ibus    ibus
ACC;PL     ās      ōs      a       ēs        ūs      ua
ABL;PL     īs      īs      īs      ibus      ibus    ibus
VOC;PL     generally matches NOM;PL

Table 3: Plat of the suffixes taken for each Latin nominal declension.

6.2 Latin Error Analysis

Latin is not mutually predictable, with moderate entropy predictiveness. SSA's choices are, at first, opaque, but Table 3 shows that ACC;PL narrows the inflection class to variants of one declension. Remaining ambiguity mostly involves 3rd declension nominative and vocative realizations, which can usually be predicted from the preferred second source cell, VOC;SG. 44 of 100 test paradigms were 3rd declension, which required multiple sources at the highest rate (16 of 44; 2nd masculine declension was next highest at 3 of 15). There was no correlation between declension and second source chosen, yet the high auto-rate suggests SSA's choices may not need to condition on previously realized source forms, but only their cells.

While 77 of 100 paradigms were completed from a single source, we found paradigms requiring three sources that might be completable from two using a multi-source FPC implementation. For example, gregēs, flocks, realizes GREX.ACC;PL, but the predictor mistakenly posits *gregium for GEN;PL from this source, guessing the wrong 3rd declension variant. While second source VOC;SG grex corrects this, it obscures the underlying stem, as x can be an allophone of g or c. Thus, we still get an error, *grecum. A multi-source predictor could avoid forgetting the underlying allophone g after seeing the second source.³ That said, multi-source FPC is not as simple as multi-source static. Heuristic sampling of training instances based on the oracle policy yields predictors that only attend to one source or make bad predictions when only given one. This is worth exploring further in future work, as the linguistic literature offers more evidence of paradigms that are difficult to handle without jointly encoding sources (Corbett, 2005; Bonami and Beniamine, 2016).

7 Conclusion

We presented Frugal Paradigm Completion, which reduces the manual labor required to expand a morphological lexicon by 16-63% over competitive approaches across 7 languages. We demonstrated that typologically distinct morphological systems require unique treatment and benefit from our SSA, which learns its strategy from data. We found that inducing this strategy is not as challenging as previously suggested (Finkel and Stump, 2007). Thus, SSA might be replaced with a less costly architecture, while our model might be improved by conditioning on semantics and jointly decoding from a variable number of sources.

Acknowledgments

We are indebted to helpful conversations with Shijie Wu, Ryan Cotterell, Katharina Kann, Andrea Sims, and Olivier Bonami. We would also like to acknowledge insightful feedback from Google's Word Graph and TTS teams, as well as four anonymous reviewers.

³See rex–regis, king, or pax–pacis, peace, which are technically conditioned on preceding vowel quality, though there are probably not enough training examples for the model to learn that.


References

Farrell Ackerman, James P Blevins, and Robert Malouf. 2009. Parts and wholes: Implicative patterns in inflectional paradigms. Analogy in Grammar: Form and Acquisition, pages 54–82.

Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, pages 429–464.

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2004–2015.

Thomas Ahern. 1990. The role of linguistics in Latin instruction. The Classical World, pages 505–510.

Adam Albright. 2002. The identification of bases in morphological paradigms. Ph.D. thesis, UCLA.

Mark Aronoff. 1992. Noun classes in Arapesh. In Yearbook of Morphology 1991, pages 21–32. Springer.

Sacha Beniamine, Olivier Bonami, and Benoît Sagot. 2018. Inferring inflection classes with description length. Journal of Language Modelling, 5(3):465–525.

Olivier Bonami and Sarah Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure, 9(2):156–182.

Kristen Brustad, Abbas Al-Tonsi, and Mahmoud Al-Batal. 2005. Al-Kitaab Fii Ta'Aallum Al-'Arabiyya: A Textbook for Arabic. Georgetown University Press.

Tim Buckwalter. 2002. Arabic morphological analyzer (AraMorph). Linguistic Data Consortium, Philadelphia.

Andrew Carstairs-McCarthy. 1994. Inflection classes, gender, and the principle of contrast. Language, pages 737–788.

Greville G Corbett. 2005. The canonical approach in typology. Linguistic Diversity and Language Theories, 25:49.

Greville G Corbett and Norman M Fraser. 2000. Gender assignment: a typology and a model. Systems of Nominal Classification, 4:293–325.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner. 2019. On the complexity and typology of inflectional morphological systems. Transactions of the Association for Computational Linguistics, 7:327–342.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D McCarthy, Katharina Kann, Sebastian Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. Proceedings of the CoNLL SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017a. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. CoRR, abs/1706.09031.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the Workshop of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), Berlin, Germany.

Ryan Cotterell, John Sylak-Glassman, and Christo Kirov. 2017b. Neural graphical models over strings for principal parts morphological paradigm completion. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 759–765.

Lisa Garnand Dawdy-Hesterberg and Janet Breckenridge Pierrehumbert. 2014. Learnability and generalisation of Arabic broken plural nouns. Language, Cognition and Neuroscience, 29(10):1268–1282.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Wolfgang U Dressler and Anna M Thornton. 1996. Italian nominal inflection. Wiener Linguistische Gazette, pages 55–57.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 616–627. Association for Computational Linguistics.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Micha Elsner, Andrea D Sims, Alexander Erdmann, Antonio Hernandez, Evan Jaffe, Lifeng Jin, Martha Booker Johnson, Shuan Karim, David L King, Luana Lamberti Nunes, et al. 2019. Modeling morphological learning, typology, and change: What can the neural sequence-to-sequence framework contribute? Journal of Language Modelling, 7(1):53–98.

Alexander Erdmann and Nizar Habash. 2018. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65.

Ramy Eskander, Nizar Habash, and Owen Rambow. 2013. Automatic extraction of morphological lexicons from morphologically annotated corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1032–1043.

Raphael Finkel and Gregory Stump. 2007. Principal parts and morphological typology. Morphology, 17(1):39–75.

Markus Forsberg, Harald Hammarström, and Aarne Ranta. 2006. Morphological lexicon extraction from raw text data. In International Conference on Natural Language Processing (in Finland), pages 488–499. Springer.

Paul Grebe, Helmut Gipper, and Günther Drosdowski, editors. 1966. Duden Grammatik der deutschen Gegenwartssprache, 2., verm. und verb. Aufl. edition. Bibliogr. Inst., Dudenverl., Mannheim.

Wren Jones Grinstead. 1916. An Introduction to the Psychology and Pedagogy of High School Latin: With Special Reference to the Value of a One-year Course, volume 1. University of Wisconsin–Madison.

Nizar Y Habash. 2010. Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies, 3(1):1–187.

James W Harris. 1992. The form classes of Spanish substantives. In Yearbook of Morphology 1991, pages 65–88. Springer.

T Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1):23–62.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70.

Katharina Kann and Hinrich Schütze. 2018. Neural transductive learning and beyond: Morphological generation in the minimal-resource setting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3254–3264.

Itamar Kastner. 2019. Templatic morphology as an emergent property. Natural Language & Linguistic Theory, 37(2):571–619.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J Mielke, Arya McCarthy, Sandra Kübler, et al. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Fabian Kuhn, Pascal Von Rickenbach, Roger Wattenhofer, Emo Welzl, and Aaron Zollinger. 2005. Interference in cellular networks: The minimum membership set cover problem. In International Computing and Combinatorics Conference, pages 188–198. Springer.

Carsten Lund and Mihalis Yannakakis. 1994. On the hardness of approximating minimization problems. Journal of the ACM (JACM), 41(5):960–981.

Arya D McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian Mielke, Jeffrey Heinz, et al. 2019. The SIGMORPHON 2019 shared task: Crosslinguality and context in morphology. In Proceedings of the 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Fabio Montermini and Olivier Bonami. 2013. Stem spaces and predictability in verbal inflection. Lingue e linguaggio, 12(2):171–190.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.

Robert Rolf Noyer. 1992. Features, positions and affixes in autonomous morphological structure. Ph.D. thesis, Massachusetts Institute of Technology.

Jeff Parker. 2016. Inflectional Complexity and Cognitive Processing: An Experimental and Corpus-based Investigation of Russian Nouns. Ph.D. thesis, Ohio State University.

Stephane Ross and J Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635.

Benoît Sagot. 2010. The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In 7th International Conference on Language Resources and Evaluation (LREC 2010).

Wolfgang Seeker and Jonas Kuhn. 2013. Morphological and syntactic case in statistical dependency parsing. Computational Linguistics, 39(1):23–55.

Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423.

Andrea D Sims. 2015. Inflectional Defectiveness, volume 148. Cambridge University Press.

Andrea D Sims. 2020. Inflectional networks: Graph-theoretic tools for inflectional typology. Proceedings of the Society for Computation in Linguistics, 3:10.

Andrea D Sims and Jeff Parker. 2016. How inflection class systems work: On the informativity of implicative structure. Word Structure, 9(2):215–239.

Gregory Stump and Raphael A Finkel. 2013. Morphological Typology: From Word to Paradigm, volume 138. Cambridge University Press.

Marko Tadić and Sanja Fulgosi. 2003. Building the Croatian morphological lexicon. In Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages, pages 41–46. Association for Computational Linguistics.

Anna Thornton. 2011. Overabundance (multiple forms realizing the same cell): A non-canonical phenomenon in Italian verb morphology.

Anna M Thornton. 2010. Towards a typology of overabundance. In Décembrettes 7: International Conference on Morphology, University of Toulouse, pages 2–3.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vincent Wan, Chun-an Chan, Tom Kenter, Jakub Vit, and Rob Clark. 2019. CHiVE: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. In ICML 2019.

Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. In International Conference on Machine Learning, pages 6716–6726.

Shijie Wu and Ryan Cotterell. 2019. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438.

Wolfgang Ullrich Wurzel. 1989. Inflectional Morphology and Naturalness, volume 9. Springer Science & Business Media.

A. A. Zaliznyak. 1980. The grammar dictionary of the Russian language.

Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, and Przemysław Szczepaniak. 2016. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association.

A Expanded Results

The figures in this appendix demonstrate the coverage after 1 and 2 sources for every language considered, as well as their inter-predictability heat maps. Figures are enlarged to show all individual cells for the reader's convenience. Hence, the Arabic and Latin figures in this appendix correspond to Figures 3 and 4 in Section 6, but show more detail.

Figure 6: Coverage of Arabic target cells after SSA chooses the first two sources.

Figure 7: Inter-predictability heat map of Arabic cells.

Figure 8: Coverage of German target cells after SSA chooses the first two sources.

Figure 9: Inter-predictability heat map of German cells.

Figure 10: Coverage of English target cells after SSA chooses the first two sources.

Figure 11: Inter-predictability heat map of English cells.

Figure 12: Coverage of Russian target cells after SSA chooses the first two sources.

Figure 13: Inter-predictability heat map of Russian cells.

Figure 14: Coverage of Latin target cells after SSA chooses the first two sources.

Figure 15: Inter-predictability heat map of Latin cells.

Figure 16: Coverage of Hungarian target cells after SSA chooses the first two sources.

Figure 17: Inter-predictability heat map of Hungarian cells.

Figure 18: Coverage of Irish target cells after SSA chooses the first two sources.

Figure 19: Inter-predictability heat map of Irish cells.