


Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate

Christo Kirov
Johns Hopkins University
Baltimore, MD
[email protected]

Ryan Cotterell
Johns Hopkins University
Baltimore, MD
[email protected]

Abstract

Can advances in NLP help advance cognitive modeling? We examine the role of artificial neural networks, the current state of the art in many common NLP tasks, by returning to a classic case study. In 1986, Rumelhart and McClelland famously introduced a neural architecture that learned to transduce English verb stems to their past tense forms. Shortly thereafter, Pinker and Prince (1988) presented a comprehensive rebuttal of many of Rumelhart and McClelland's claims. Much of the force of their attack centered on the empirical inadequacy of the Rumelhart and McClelland (1986) model. Today, however, that model is severely outmoded. We show that the Encoder-Decoder network architectures used in modern NLP systems obviate most of Pinker and Prince's criticisms without requiring any simplification of the past tense mapping problem. We suggest that the empirical performance of modern networks warrants a reexamination of their utility in linguistic and cognitive modeling.

1 Introduction

In their famous 1986 opus, Rumelhart and McClelland (R&M) describe a neural network capable of transducing English verb stems to their past tense. The strong cognitive claims in the article fomented a veritable brouhaha in the linguistics community that eventually led to the highly influential rebuttal of Pinker and Prince (1988) (P&P). P&P highlighted the extremely poor empirical performance of the R&M model and pointed out a number of theoretical issues with the model that they suggested would apply to any neural network, contemporarily branded connectionist approaches. Their critique was so successful that many linguists and cognitive scientists to this day do not consider neural networks a viable approach to modeling linguistic data and human cognition.

In the field of natural language processing (NLP), however, neural networks have experienced a renaissance. With novel architectures, large new datasets available for training, and access to extensive computational resources, neural networks now constitute the state of the art in many NLP tasks. However, NLP as a discipline has a distinct practical bent and more often concerns itself with the large-scale engineering applications of language technologies. As such, the field's findings are not always considered relevant to the scientific study of language, i.e., the field of linguistics. Recent work, however, has indicated that this perception is changing, with researchers, for example, probing the ability of neural networks to learn syntactic dependencies like subject-verb agreement (Linzen et al., 2016).

Moreover, in the domains of morphology and phonology, both NLP practitioners and linguists have considered virtually identical problems, seemingly unbeknownst to each other. For example, both computational and theoretical morphologists are concerned with how different inflected forms in the lexicon are related and how one can learn to generate such inflections from data. Indeed, the original R&M network focuses on such a generation task, i.e., generating English past tense forms from their stems. R&M's network, however, was severely limited and did not generalize correctly to held-out data. In contrast, state-of-the-art morphological generation networks used in NLP, built from the modern evolution of Recurrent Neural Networks (RNNs) explored by Elman (1990) and others, solve the same problem almost perfectly (Cotterell et al., 2016). This level of performance on a cognitively relevant problem suggests that it is time to consider further incorporating network modeling into the study of linguistics and cognitive science.

Crucially, we wish to sidestep one of the issues that framed the original debate between P&P and R&M—whether or not neural models learn and use 'rules'. From our perspective, any system that picks up systematic, predictable patterns in data may be referred to as rule-governed. We focus instead on an empirical assessment of the ability of a modern state-of-the-art neural architecture to learn linguistic patterns, asking the following questions: (i) Does the learner induce the full set of correct generalizations about the data? Given a range of novel inputs, to what extent does it apply the correct transformations to them? (ii) Does the behavior of the learner mimic humans? Are the errors human-like?

In this work, we run new experiments examining the ability of the Encoder-Decoder architecture developed for machine translation (Sutskever et al., 2014; Bahdanau et al., 2014) to learn the English past tense. The results suggest that modern nets absolutely meet the first criterion above, and often meet the second. Furthermore, they do this given limited prior knowledge of linguistic structure: the networks we consider do not have phonological features built into them and must instead learn their own representations for input phonemes. The design and performance of these networks invalidate many of the criticisms in Pinker and Prince (1988). We contend that, given the gains displayed in this case study, which is characteristic of problems in the morpho-phonological domain, researchers across linguistics and cognitive science should consider evaluating modern neural architectures as part of their modeling toolbox.

This paper is structured as follows. Section 2 describes the problem under consideration, the English past tense. Section 3 lays out the original Rumelhart and McClelland model from 1986 in modern machine-learning parlance, and compares it to a state-of-the-art Encoder-Decoder architecture. A historical perspective on alternative approaches to modeling, both neural and non-neural, is provided in Section 4. The empirical performance of the Encoder-Decoder architecture is evaluated in Section 5. Section 6 provides a summary of which of Pinker and Prince's original criticisms have effectively been resolved, and which ones still require further consideration. Concluding remarks follow.

2 The English Past Tense

Many languages mark words with syntactico-semantic distinctions. For instance, English marks the distinction between present and past tense verbs, e.g., walk and walked. English verbal morphology is relatively impoverished, distinguishing maximally five forms for the copula to be and only three forms for most verbs. In this work, we consider learning to conjugate the English verb forms, rendered as phonological strings. As it is the focus of the original Rumelhart and McClelland (R&M) study, we focus primarily on English past tense formation.

orthographic (stem / past / part.) | IPA (stem / past / part.) | infl. type
go / went / gone       | goU / wEnt / gon     | suppletive
sing / sang / sung     | sIN / sæN / sUN      | ablaut
swim / swam / swum     | swIm / swæm / swUm   | ablaut
sack / sacked / sacked | sæk / sækt / sækt    | [-t]
sag / sagged / sagged  | sæg / sægd / sægd    | [-d]
pat / patted / patted  | pæt / pætId / pætId  | [-Id]
pad / padded / padded  | pæd / pædId / pædId  | [-Id]

Table 1: Examples of inflected English verbs.

Both regular and irregular patterning exist in English. Orthographically, the canonical regular suffix is -ed, which, phonologically, may be rendered as one of three phonological strings: [-Id], [-d] or [-t]. The choice among the three is deterministic, depending only on the phonological properties of the previous segment. English selects [-Id] where the previous phoneme is a [t] or [d], e.g., [pæt] → [pætId] (pat → patted) and [pæd] → [pædId] (pad → padded). In other cases, English enforces voicing agreement: it opts for [-d] when the preceding phoneme is a voiced consonant or a vowel, e.g., [sæg] → [sægd] (sag → sagged) and [SoU] → [SoUd] (show → showed), and for [-t] when the preceding phoneme is an unvoiced consonant, e.g., [sæk] → [sækt] (sack → sacked). English irregulars are either suppletive, e.g., [goU] → [wEnt] (go → went), or exist in sub-regular islands defined by processes like ablaut, e.g., sing → sang, that may contain several verbs (Nelson, 2010): see Table 1.
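Because the choice of regular allomorph depends only on the stem-final segment, it can be written down directly as a three-way rule. The Python sketch below illustrates this; the vowel and voiced-consonant inventories are simplified stand-ins for illustration, not the representation used by any model discussed here.

```python
# Minimal sketch of the regular English past-tense allomorphy rule.
# The phoneme classes below are illustrative, not a full IPA inventory.

VOWELS = set("aeiouæɪʊʌɛɔ")               # assumption: a reduced vowel set
VOICED_CONSONANTS = set("bdgvzðʒmnŋlrwj")  # assumption: a reduced voiced-consonant set

def past_suffix(stem_ipa: str) -> str:
    """Return the regular past-tense allomorph for an IPA stem."""
    final = stem_ipa[-1]
    if final in ("t", "d"):
        return "ɪd"                        # pat -> patted, pad -> padded
    if final in VOWELS or final in VOICED_CONSONANTS:
        return "d"                         # sag -> sagged, show -> showed
    return "t"                             # sack -> sacked

if __name__ == "__main__":
    for stem in ["pæt", "pæd", "sæg", "ʃoʊ", "sæk"]:
        print(stem, "->", stem + past_suffix(stem))
```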

Single versus Dual Route. A frequently discussed cognitive aspect of past tense processing concerns whether or not irregular forms have their own processing pipeline in the brain. Pinker and Prince (1988) proposed separate modules for regular and irregular verbs; regular verbs go through a general, rule-governed transduction mechanism, and exceptional irregulars are produced via simple memory-lookup.1 While some studies, e.g., (Marslen-Wilson and Tyler, 1997; Ullman et al., 1997), provide corroborating evidence from speakers with selective impairments to regular or irregular verb production, others have called these results into doubt (Stockall and Marantz, 2006). From the perspective of this paper, a complete model of the English past tense should cover both regular and irregular transformations. The neural network approaches we advocate for achieve this goal, but do not clearly fall into either the single- or dual-route category—internal computations performed by each network remain opaque, so we cannot at present say whether two separable computation paths are present.

1 Note that irregular lookup can simply be recast as the application of a context-specific rule.

2.1 Acquisition of the Past Tense

The English past tense is of considerable theoretical interest due to the now well-studied acquisition patterns of children. As first shown by Berko (1958) in the so-called wug-test, knowledge of English morphology cannot be attributed solely to memorization. Indeed, both adults and children are fully capable of generalizing the patterns to novel words, e.g., [w2g] → [w2gd] (wug → wugged). During acquisition, only a few types of errors are common; children rarely blend regular and irregular forms, e.g., the past tense of come is either produced as comed or came, but rarely camed (Pinker, 1999).

Acquisition Patterns for Irregular Verbs. It is widely claimed that children learning the past tense forms of irregular verbs exhibit a 'U-shaped' learning curve. At first, they correctly conjugate irregular forms, e.g., come → came; then they regress during a period of overregularization, producing the past tense as comed as they acquire the general past tense formation. Finally, they learn to produce both the regular and irregular forms. Plunkett & Marchman, however, observed a more nuanced form of this behavior. Rather than a macro U-shaped learning process that applies globally and uniformly to all irregulars, they noted that many irregulars oscillate between correct and overregularized productions (Marchman, 1988). These oscillations, which Plunkett & Marchman refer to as a micro U-shape, further apply at different rates for different verbs (Plunkett and Marchman, 1991). Interestingly, while the exact pattern of irregular acquisition may be disputed, children rarely overirregularize, i.e., misconjugate a regular verb as if it were irregular, such as ping → pang.

3 1986 vs Today

In this section, we compare the original R&M architecture from 1986 to today's state-of-the-art neural architecture for morphological transduction, the Encoder-Decoder model.

3.1 Rumelhart and McClelland (1986)

For many linguists, the face of neural networks to this day remains the work of R&M. Here, we describe in detail their original architecture, using modern machine learning parlance whenever possible. Fundamentally, R&M were interested in designing a sequence-to-sequence network for variable-length input using a small feed-forward network. From an NLP perspective, this work constitutes one of the first attempts to design a network for a task reminiscent of popular NLP tasks today that require variable-length input, e.g., part-of-speech tagging, parsing and generation.

Wickelphones and Wickelfeatures. Unfortunately, a fixed-sized feed-forward network is not immediately compatible with the goal of transducing sequences of varying lengths. Rumelhart and McClelland decided to get around this limitation by representing each string as the set of its constituent phoneme trigrams. Each trigram is termed a Wickelphone (Wickelgren, 1969). As a concrete example, the IPA string [#kæt#], marked with a special beginning- and end-of-string character, contains three distinct Wickelphones: [#kæ], [kæt], [æt#]. In fact, Rumelhart and McClelland went one step further and decomposed Wickelphones into component Wickelfeatures, or trigrams of phonological features, one for each Wickelphone phoneme. For example, the Wickelphone [ipt] is represented by the Wickelfeatures 〈+vowel, +unvoiced, +interrupted〉 and 〈+high, +stop, +stop〉. Since there are far fewer Wickelfeatures than Wickelphones, words could be represented with fewer units (of key importance for 1986 hardware) and more shared Wickelfeatures potentially meant better generalization.

We can describe R&M's representations using the modern linear-algebraic notation standard among researchers in neural networks. First, we assume that the language under consideration contains a fixed set of phonemes Σ, plus an edge symbol # marking the beginning and end of words. Then, we construct the set of all Wickelphones Φ and the set of all Wickelfeatures F by enumeration. The first layer of the R&M neural network consists of two deterministic functions: (i) φ : Σ* → B^{|Φ|} and (ii) f : B^{|Φ|} → B^{|F|}, where we define B = {−1, 1}. The first function φ maps a phoneme string to the set of Wickelphones that fire, as it were, on that string, e.g., φ([#kæt#]) = {[#kæ], [kæt], [æt#]}. The output subset of Φ may be represented by a binary vector of length |Φ|, where a 1 means that the Wickelphone appears in the string and a −1 that it does not.2 The second function f maps a set of Wickelphones to its corresponding set of Wickelfeatures.
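As an illustration of the encoding function φ, the following sketch builds the Wickelphone set of a padded string and its ±1 vector over an enumerated inventory; the toy alphabet is an assumption for illustration, and the Wickelfeature map f is omitted for brevity.

```python
from itertools import product

# Illustrative sketch of the first layer of the R&M network:
# phi maps a padded phoneme string to its set of Wickelphones (trigrams),
# which is then rendered as a +/-1 vector over an enumerated inventory.

ALPHABET = list("#kætslɪ")      # assumption: a toy phoneme set plus the edge symbol '#'
INVENTORY = ["".join(t) for t in product(ALPHABET, repeat=3)]   # enumerated Wickelphones

def phi(string: str) -> set:
    """Wickelphone set of a string, padded with the edge symbol."""
    padded = "#" + string + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def encode(string: str) -> list:
    """Binary (+1/-1) vector of length |Phi|, as in the R&M input layer."""
    active = phi(string)
    return [1 if w in active else -1 for w in INVENTORY]

print(phi("kæt"))                                # {'#kæ', 'kæt', 'æt#'}
print(sum(1 for v in encode("kæt") if v == 1))   # 3 active Wickelphones
```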

Pattern Associator Network. Here we define the complete network of R&M. We denote strings of phonemes as x ∈ Σ*, where x_i is the i-th phoneme in a string. Given source and target phoneme strings x^{(i)}, y^{(i)} ∈ Σ*, R&M optimize the following objective, a sum over the individual losses for each of the i = 1, ..., N training items:

∑_{i=1}^{N} || max{ 0, −π(y^{(i)}) ⊙ (W π(x^{(i)}) + b) } ||_1,     (1)

where max{·} is taken point-wise, ⊙ is point-wise multiplication, W ∈ R^{|F|×|F|} is a projection matrix, b ∈ R^{|F|} is a bias term, and π = f ◦ φ is the composition of the Wickelphone and Wickelfeature encoding functions. Using modern terminology, the architecture is a linear model for a multi-label classification problem (Tsoumakas and Katakis, 2006): the goal is to predict the set of Wickelfeatures in the target form y^{(i)} given the input form x^{(i)} using a point-wise perceptron loss (hinge loss without a margin), i.e., a binary perceptron predicts each feature independently, but there is one set of parameters {W, b}. The total loss incurred is the sum of the per-feature losses, hence the use of the L1 norm. The model is trained with stochastic sub-gradient descent (the perceptron update rule) (Rosenblatt, 1958; Bertsekas, 2015) with a fixed learning rate.3 Later work augmented the architecture with multiple layers and nonlinearities (Marcus, 2001, Table 3.3).

2 We have chosen −1 instead of the more traditional 0 so that the objective function that Rumelhart and McClelland optimize may be more concisely written.

3 Follow-up work, e.g., Plunkett and Marchman (1991), has speculated that the original experiments in Rumelhart and McClelland (1986) may not have converged. Indeed, convergence may not be guaranteed depending on the fixed learning rate chosen. As Equation (1) is jointly convex in its parameters {W, b}, there exist convex optimization algorithms that will guarantee convergence, albeit often with a decaying learning rate.
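Equation (1) amounts to a margin-less hinge loss applied point-wise over Wickelfeature units; the NumPy sketch below shows the per-example loss and a perceptron-style sub-gradient update under that reading. The feature dimension and the random toy vectors are placeholders, not R&M's actual setup.

```python
import numpy as np

# Sketch of the objective in Equation (1): a point-wise, margin-less hinge loss
# over Wickelfeature units, with one shared weight matrix W and bias b.

F = 460                                     # placeholder size for the Wickelfeature inventory
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(F, F))
b = np.zeros(F)

def loss(x_feat: np.ndarray, y_feat: np.ndarray) -> float:
    """x_feat, y_feat are +/-1 vectors of length F, i.e., pi(x) and pi(y)."""
    scores = W @ x_feat + b
    return np.maximum(0.0, -y_feat * scores).sum()   # L1 norm of the point-wise hinge

def perceptron_update(x_feat, y_feat, lr=1.0):
    """Sub-gradient step: adjust only the units with a non-positive margin."""
    scores = W @ x_feat + b
    mistakes = (y_feat * scores) <= 0
    W[mistakes] += lr * np.outer(y_feat[mistakes], x_feat)
    b[mistakes] += lr * y_feat[mistakes]

# toy example with random +/-1 feature vectors
x = rng.choice([-1.0, 1.0], size=F)
y = rng.choice([-1.0, 1.0], size=F)
before = loss(x, y)
perceptron_update(x, y)
print(before, "->", loss(x, y))
```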

Decoding. Decoding the R&M network necessitates solving a tricky optimization problem. Given an input phoneme string x^{(i)}, we then must find the corresponding y′ ∈ Σ* that minimizes:

|| π(y′) − threshold{ W π(x^{(i)}) + b } ||_0,     (2)

where threshold is a step function that maps all non-positive reals to −1 and all positive reals to 1. In other words, we seek the phoneme string y′ that shares the most features with the maximum a-posteriori (MAP) decoded binary vector. This problem is intractable, and so Rumelhart and McClelland (1986) provide an approximation. For each test stem, they hand-selected a set of likely past tense candidate forms, e.g., good candidates for the past tense of break would be {break, broke, brake, braked}, and choose the form with Wickelfeatures closest to the network's output. This manual approximate decoding procedure is not intended to be cognitively plausible.
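A hedged sketch of that candidate-ranking step is given below: among hand-supplied candidates, pick the string whose feature vector is closest to the thresholded network output. The featurization function pi and the toy feature space in the demo are assumptions for illustration.

```python
import numpy as np

def decode_by_candidates(x_feat, candidates, pi, W, b):
    """Approximate decoding in the spirit of R&M: among hand-selected candidate
    strings, return the one whose feature vector is closest to the thresholded
    network output.

    x_feat     : +/-1 feature vector of the input stem
    candidates : list of candidate past-tense strings (chosen by hand)
    pi         : function mapping a string to its +/-1 feature vector
    W, b       : trained network parameters
    """
    output = np.where(W @ x_feat + b > 0, 1, -1)                 # thresholded prediction
    return min(candidates, key=lambda c: np.sum(pi(c) != output))  # Hamming-closest candidate

if __name__ == "__main__":
    # toy demo with a made-up 4-unit feature space
    toy_pi = {"broke": np.array([1, -1, 1, 1]),
              "breaked": np.array([-1, 1, 1, -1])}
    pi = lambda s: toy_pi[s]
    W, b = np.eye(4), np.zeros(4)
    x = np.array([1, -1, 1, 1])
    print(decode_by_candidates(x, ["broke", "breaked"], pi, W, b))  # -> broke
```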

Architectural Limitations. R&M used Wickelphones and Wickelfeatures in order to help with generalization and limit their network to a tractable size. However, this came at a significant cost to the network's ability to represent unique strings—the encoding is lossy: two words may have the same set of Wickelphones or features. The easiest way to see this shortcoming is to consider morphological reduplication, which is common in many of the world's languages. P&P provide an example from the Australian language Oykangand, which distinguishes between algal 'straight' and algalgal 'ramrod straight'; both of these strings have the identical Wickelphone set {[#al], [alg], [lga], [gal], [al#]}. Moreover, P&P point out that phonologically related words such as [slIt] and [sIlt] have disjoint sets of Wickelphones: {[#sl], [slI], [lIt], [It#]} and {[#sI], [sIl], [Ilt], [lt#]}, respectively. These two words differ only by an instance of metathesis, or swapping the order of nearby sounds. The use of Wickelphone representations imposes the strong claim that they have nothing in common phonologically, despite sharing all phonemes. P&P suggest this is unlikely to be the case. As one point of evidence, metathesis of the kind that differentiates [slIt] and [sIlt] is a common diachronic change. In English, for example, [horse] evolved from [hross], and [bird] from [brid] (Jespersen, 1942).
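Both failure cases can be checked mechanically. The short snippet below, using romanized stand-ins for the IPA forms, shows that algal and algalgal receive identical trigram sets while slit and silt share none.

```python
def wickelphones(s: str) -> set:
    """Trigram (Wickelphone) set of a string padded with '#'."""
    p = "#" + s + "#"
    return {p[i:i + 3] for i in range(len(p) - 2)}

# Oykangand: 'straight' vs. 'ramrod straight' collapse to the same representation
print(wickelphones("algal") == wickelphones("algalgal"))   # True

# Metathesis: romanized stand-ins for [slIt] and [sIlt] share no Wickelphones
print(wickelphones("slit") & wickelphones("silt"))         # set()
```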


3.2 Encoder-Decoder Architectures

The NLP community has recently developed an analogue to the past tense generation task originally considered by Rumelhart and McClelland (1986): morphological paradigm completion (Durrett and DeNero, 2013; Nicolai et al., 2015; Cotterell et al., 2015; Ahlberg et al., 2015; Faruqui et al., 2016). The goal is to train a model capable of mapping the lemma (stem in the case of English) to each form in the paradigm. In the case of English, the goal would be to map a lemma, e.g., walk, to its past tense word walked as well as its gerund and third person present singular, walking and walks, respectively. This task generalizes the R&M setting in that it requires learning more mappings than simply lemma to past tense.

By definition, any system that solves the more general morphological paradigm completion task must also be able to solve the original R&M task. As we wish to highlight the strongest currently available alternative to R&M, we focus on the state of the art in morphological paradigm completion: the Encoder-Decoder network architecture (ED) (Cotterell et al., 2016). This architecture consists of two recurrent neural networks (RNNs) coupled together by an attention mechanism. The encoder RNN reads each symbol in the input string one at a time, first assigning it a unique embedding, then processing that embedding to produce a representation of the phoneme given the rest of the phonemes in the string. The decoder RNN produces a sequence of output phonemes one at a time, using the attention mechanism to peek back at the encoder states as needed. Decoding ends when a halt symbol is output. Formally, the ED architecture encodes the probability distribution over forms

p(y | x) = ∏_{i=1}^{N} p(y_i | y_1, ..., y_{i−1}, c_i)     (3)
         = ∏_{i=1}^{N} g(y_{i−1}, s_i, c_i),     (4)

where g is a non-linear function (in our case it is a multi-layer perceptron), s_i is the hidden state of the decoder RNN, y = (y_1, ..., y_N) is the output sequence (a sequence of N = |y| characters), and finally c_i is an attention-weighted sum of the encoder RNN hidden states h_k, using the attention weights α_k(s_{i−1}) that are computed based on the previous decoder hidden state: c_i = ∑_{k=1}^{|x|} α_k(s_{i−1}) h_k.

In contrast to the R&M network, the ED network optimizes the log-likelihood of the training data, i.e., ∑_{i=1}^{M} log p(y^{(i)} | x^{(i)}) for the i = 1, ..., M training items. We refer the reader to Bahdanau et al. (2014) for the complete architectural specification of the specific ED model we apply in this paper.
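The attention step behind Equations (3) and (4) reduces to a softmax over per-position scores followed by a weighted sum of encoder states. The NumPy sketch below uses a simple dot-product scorer as a stand-in; Bahdanau et al. (2014) compute the scores with a small MLP instead.

```python
import numpy as np

# Sketch of the attention step: given the previous decoder state s_{i-1} and the
# encoder states h_1..h_|x|, compute attention weights alpha_k and the context c_i.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev: np.ndarray, H: np.ndarray) -> np.ndarray:
    """s_prev: decoder state, shape (d,); H: encoder states, shape (|x|, d)."""
    scores = H @ s_prev        # one score per input position (dot-product stand-in)
    alphas = softmax(scores)   # attention weights over the input
    return alphas @ H          # c_i: weighted sum of encoder states

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))    # 5 input phonemes, hidden size 8
s_prev = rng.normal(size=8)
print(attention_context(s_prev, H).shape)   # (8,)
```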

Theoretical Improvements. While there are a number of possible architectural variants of the ED architecture (Luong et al., 2015),4 they all share several critical features that make up for many of the theoretical shortcomings of the feed-forward R&M model. The encoder reads in each phoneme sequentially, preserving identity and order and allowing any string of arbitrary length to receive a unique representation. Despite this encoding, a flexible notion of string similarity is also maintained, as the ED model learns a fixed embedding for each phoneme that forms part of the representation of all strings that share the phoneme. When the network encodes [sIlt] and [slIt], it uses the same phoneme embeddings—only the order changes. Finally, the decoder permits sampling and scoring arbitrary-length fully formed strings in polynomial time (forward sampling), so there is no need to determine which string a non-unique set of Wickelfeatures represents. However, we note that decoding the 1-best string from a sequence-to-sequence model is likely NP-hard. (1-best string decoding is even hard for weighted FSTs (Goodman, 1998).)

Multi-Task Capability. A single ED model is easily adapted to multi-task learning (Caruana, 1997; Collobert et al., 2011), where each task is a single transduction, e.g., stem → past. Note that R&M would need a separate network for each transduction, e.g., stem → gerund and stem → past participle. In fact, the current state of the art in NLP is to learn all morphological transductions in a paradigm jointly. The key insight is to construct a single network p(y | x, t) to predict all inflections, marking the transformation in the input string, i.e., we feed the network the string “w a l k <gerund>” as input, augmenting the alphabet Σ to include morphological descriptors. We refer the reader to Kann and Schütze (2016) for the encoding details. Thus, one network predicts all forms, e.g., p(y | x=walk, t=past) yields a distribution over past tense forms for walk and p(y | x=walk, t=gerund) yields a distribution over gerunds.

4 For the experiments in this paper, we use the variant of Bahdanau et al. (2014), which has explicitly been shown to be state-of-the-art in morphological transduction (Cotterell et al., 2016).
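Concretely, the multi-task setup only changes how training pairs are written: each source string is spelled out character by character and suffixed with a tag naming the requested transduction. The tag names in the sketch below are illustrative.

```python
# Sketch of the input encoding for multi-task training: each source string is
# space-separated into characters and suffixed with a tag naming the requested
# transduction. Tag names here are illustrative.

def tag_input(lemma: str, transduction: str) -> str:
    return " ".join(lemma) + " <" + transduction + ">"

pairs = [
    ("walk", "PST",  "walked"),
    ("walk", "GER",  "walking"),
    ("take", "PST",  "took"),
    ("take", "PTCP", "taken"),
]

training_data = [(tag_input(lemma, t), target) for lemma, t, target in pairs]
for src, tgt in training_data:
    print(src, "->", tgt)     # e.g., "w a l k <PST> -> walked"
```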

4 Related Work

In this section, we first describe direct followups to the original 1986 R&M model, employing various neural architectures. Then we review competing non-neural systems of context-sensitive rewrite rules in the style of the Sound Pattern of English (SPE) (Halle and Chomsky, 1968), as favored by Pinker and Prince.

4.1 Followups to Rumelhart and McClelland (1986) Over the Years

Following R&M, a cottage industry devoted to cognitively plausible connectionist models of inflection learning sprouted in the linguistics and cognitive science literature. We provide a summary listing of the various proposals, along with broad-brush comparisons, in Table 2.

While many of the approaches apply more modern feed-forward architectures than R&M, introducing multiple layers connected by nonlinear transformations, most continue to use feed-forward architectures with limited ability to deal with variable-length inputs and outputs, and remain unable to produce and assign probability to arbitrary output strings.

MacWhinney and Leinbach (1991), Plunkett and Marchman (1991, 1993), and Plunkett and Juola (1999) map phonological strings to phonological strings using feed-forward networks, but rather than turning to Wickelphones to imprecisely represent strings of any length, they use fixed-size input and output templates, with units representing each possible symbol at each input and output position. For example, Plunkett and Marchman (1991, 1993) simplify the past-tense mapping problem by only considering a language of artificially generated words of exactly three syllables and a limited set of constructed past-tense formation patterns. MacWhinney and Leinbach (1991) and Plunkett and Juola (1999) additionally modify the input template to include extra units marking particular transformations (e.g., past or gerund), enabling their networks to learn multiple mappings.

Some proposals simplify the problem even further, mapping fixed-size inputs into a small finite set of categories, solving a classification problem rather than a transduction problem. Hahn and Nakisa (2000) and Nakisa and Hahn (1996) classify German noun stems into their appropriate plural inflection classes. Plunkett and Nakisa (1997) do the same for Arabic stems.

Hoeffner (1992), Hare and Elman (1995), and Cottrell and Plunkett (1994) also solve an alternative problem—mapping semantic representations (usually one-hot vectors with one unit per possible word type, and one unit per possible inflection) to phonological outputs. As these networks use unstructured semantic inputs to represent words, they must act as memories—the phonological content of any word must be memorized. This precludes generalization to word types that were not seen during training.

Of the proposals that map semantics to phonology, the architecture of Hoeffner (1992) is unique in that it uses an attractor network rather than a feed-forward network, with the main difference being training with Hebbian learning rather than the standard backpropagation algorithm. Cottrell and Plunkett (1994) present an early use of a simple recurrent network (Elman, 1990) to decode output strings, making their model capable of variable-length output.

Bullinaria (1997) proposes one of the few models that can deal with variable-length inputs. They use a derivative of the NETtalk pronunciation model (Sejnowski and Rosenberg, 1987) that would today be considered a convolutional network. Each input phoneme in a stem is read independently along with its left and right context phonemes within a limited context window (i.e., a convolutional kernel). Each kernel is then mapped to zero or more output phonemes within a fixed template. Since each output fragment is independently generated, the architecture is limited to learning only local constraints on output string structure. Similarly, the use of a fixed context window also means that inflectional patterns that depend on long-distance dependencies between input phonemes cannot be captured.

Finally, the model of Westermann and Goebel (1995) is arguably the most similar to a modern ED architecture, relying on simple recurrent networks to both encode input strings and decode output strings. However, the model was intended to explicitly instantiate a dual-route mechanism and contains an additional explicit memory component to memorize irregulars. Despite the addition of this memory, the model was unable to fully learn the mapping from German verb stems to their participle forms, failing to capture the correct form for strong training verbs, including the copula sein → gewesen. As the authors note, this may be due to the difficulty of training simple recurrent networks, which tend to converge to poor local minima. Modern RNN varieties, such as the LSTMs in the ED model, were specifically designed to overcome these training limitations (Hochreiter and Schmidhuber, 1997).

4.2 Non-neural Learners

P&P describe several basic ideas that underlie a traditional, symbolic rule-learner. Such a learner produces SPE-style rewrite rules that may be applied to deterministically transform the input string into the target. Rule candidates are found by comparing the stem and the inflected form, treating the portion that changes as the rule which governs the transformation. This is typically a set of non-copy edit operations. If multiple stem-past pairs share similar changes, these may be collapsed into a single rule by generalizing over the shared phonological features involved in the changes. For example, if multiple stems are converted to the past tense by the addition of the suffix [-d], the learner may create a collapsed rule that adds the suffix to all stems that end in a [+voice] sound. Different rules may be assigned weights (e.g., probabilities) derived from how many stem-past pairs exemplify the rules. These weights decide which rules to apply to produce the past tense.
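A minimal sketch of this rule-extraction idea appears below: the changed suffix of each stem/past pair becomes a candidate rule, weighted by how many pairs exemplify it. The orthographic examples and the omission of feature-level generalization are simplifications relative to the learners discussed here.

```python
from collections import Counter
from os.path import commonprefix

# Skeleton of symbolic rule extraction: the changed portion of each stem/past
# pair becomes a candidate rewrite rule, weighted by how many pairs exemplify it.
# Feature-level generalization (as in the MGL) is omitted.

def extract_rule(stem: str, past: str) -> tuple:
    shared = commonprefix([stem, past])
    return (stem[len(shared):], past[len(shared):])   # (material removed, material added)

pairs = [("sag", "sagged"), ("show", "showed"), ("sack", "sacked"),
         ("pat", "patted"), ("sing", "sang")]

rules = Counter(extract_rule(s, p) for s, p in pairs)
for (removed, added), count in rules.most_common():
    print(f"-{removed!r} +{added!r}  (supported by {count} pairs)")
```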

Several systems that follow the rule-based template above have been developed in NLP. While the SPE itself does not impose detailed restrictions on rule structure, these systems generate rules that can be compiled into finite-state transducers (Kaplan and Kay, 1994; Ahlberg et al., 2015). While these systems generalize well, even the most successful variants have been shown to perform significantly worse than state-of-the-art neural networks at morphological inflection, often with a >10% differential in accuracy on held-out data (Cotterell et al., 2016).

In the linguistics literature, the most straightforward, direct machine-implemented instantiation of the P&P proposal is, arguably, the Minimal Generalization Learner (MGL) of Albright and Hayes (2003) (cf. Allen and Becker, 2015; Taatgen and Anderson, 2002). This model takes a mapping of phonemes to phonological features and makes feature-level generalizations like the post-voice [-d] rule described above. For a detailed technical description, see Albright and Hayes (2002). We treat the MGL as a baseline in §5.

Unlike Taatgen and Anderson (2002), which explicitly accounts for dual-route processing by including both memory retrieval and rule application submodules, Albright and Hayes (2003) and Allen and Becker (2015) rely on discovering and correctly weighting (using weights learned by log-linear regression) highly stem-specific rules to account for irregular transformations.

Within the context of rule-based systems, several proposals focus on the question of rule generalization, rather than rule synthesis. That is, given a set of pre-defined rules, these systems implement metrics to decide whether rules should generalize to novel forms, depending on the number of exceptions in the data set. Yang (2016) defines the 'tolerance principle,' a threshold for exceptionality beyond which a rule will fail to generalize. O'Donnell (2011) treats the question of whether a rule will generalize as one of optimal Bayesian inference.

5 Evaluation of the ED Learner

We evaluate the performance of the ED architecture in light of the criticisms Pinker and Prince (P&P) levied against the original R&M model. We show that in most cases, these criticisms no longer apply.5

The most potent line of attack P&P use against the R&M model is that it simply does not learn the English past tense very well. While the non-deterministic, manual, and non-precise decoding procedure used by R&M makes it difficult to obtain exact accuracy numbers, P&P estimate that the model only prefers the correct past tense form for about 67% of English verb stems. Furthermore, many of the errors made by the R&M network are unattested in human performance. For example, the model produces blends of regular and irregular past-tense formation, e.g., eat → ated, that children do not produce unless they mistake ate for a present stem (Pinker, 1999). Moreover, the R&M model frequently produces irregular past tense forms when a regular formation is expected, e.g., ping → pang. Humans are more likely to overregularize. These behaviors suggest the R&M model learns the wrong kind of generalizations. As shown below, the ED architecture seems to avoid these pitfalls, while outperforming a P&P-style non-neural baseline.

5 Datasets and code for all experiments are available at https://github.com/ckirov/RevisitPinkerAndPrince

Type of Model | Reference | Input | Output
Feedforward Network | Rumelhart and McClelland (1986) | Wickelphones | Wickelphones
Feedforward Network | MacWhinney and Leinbach (1991) | Fixed Size Phonological Template | Fixed Size Phonological Template
Feedforward Network | Plunkett and Marchman (1991) | Fixed Size Phonological Template | Fixed Size Phonological Template
Attractor Network | Hoeffner (1992) | Semantics | Fixed Size Phonological Template
Feedforward Network | Plunkett & Marchman (1993) | Fixed Size Phonological Template | Fixed Size Phonological Template
Recurrent Neural Network | Cottrell & Plunkett (1994) | Semantics | Phonological String
Feedforward Network | Hare, Elman, & Daugherty (1995) | Fixed Size Phonological Template | Inflection Class
Feedforward Neural Network | Hare & Elman (1995) | Semantics | Fixed Size Phonological Template
Recurrent Neural Network | Westermann & Goebel (1995) | Phonological String | Phonological String
Feedforward Neural Network | Nakisa & Hahn (1996) | Fixed Size Phonological Template | Inflection Class
Convolutional Neural Network | Bullinaria (1997) | Phonological String | Phonological String
Feedforward Neural Network | Plunkett & Nakisa (1997) | Fixed Size Phonological Template | Inflection Class
Feedforward Neural Network | Plunkett & Juola (1999) | Fixed Size Phonological Template | Fixed Size Phonological Template
Feedforward Neural Network | Hahn & Nakisa (2000) | Fixed Size Phonological Template | Inflection Class

Table 2: A curated list of related work, categorized by aspects of the technique. Based on a similar list found in Marcus (2001, page 82).

5.1 Experiment 1: Learning the Past Tense

In the first experiment, we seek to show: (i) the ED model successfully learns to conjugate both regular and irregular verbs in the training data, and generalizes to held-out data at convergence; and (ii) the pattern of errors the model exhibits is compatible with attested speech errors.

CELEX Dataset. Our base dataset consists of 4039 verb types in the CELEX database (Baayen et al., 1993). Each verb is associated with a present tense form (stem) and past tense form, both in IPA. Each verb is also marked as regular or irregular (Albright and Hayes, 2003). 168 of the 4039 verb types were marked as irregular. We assigned verbs to train, development, and test sets according to a random 80-10-10 split. Each verb appears in exactly one of these sets once. This corresponds to a uniform distribution over types, since every verb has an effective frequency of 1.

In contrast, the original R&M model was trained, and tested (data was not held out), on a set of 506 stem/past pairs derived from Kucera and Francis (1967). 98 of the 506 verb types were marked as irregular.

Types vs Tokens. In real human communication, words follow a Zipfian distribution, with many irregular verbs being exponentially more common than regular verbs. While this condition is more true to the external environment of language learning, it may not accurately represent the psychological reality of how that environment is processed. A body of psycholinguistic evidence (Bybee, 1995, 2001; Pierrehumbert, 2001) suggests that human learners generalize phonological patterns based on the count of word types they appear in, ignoring the token frequency of those types. Thus, we chose to weigh all verb types equally for training, effecting a uniform distribution over types as described above.

Hyperparameters and Other Details. Our architecture is nearly identical to that used in Bahdanau et al. (2014), with hyperparameters set following Kann and Schütze (2016, §4.1.1). Each input character has an embedding size of 300 units. The encoder consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with two layers. There is a dropout value of 0.3 between the layers. The decoder is a unidirectional LSTM with two layers. Both the encoder and decoder have 100 hidden units. Training was done using the Adadelta procedure (Zeiler, 2012) with a learning rate of 1.0 and a minibatch size of 20. We train for 100 epochs to ensure that all verb forms in the training data are adequately learned. We decode the model with beam search (k = 12). The code for our experiments is derived from the OpenNMT package (Klein et al., 2017). We use accuracy as our metric of performance. We train the MGL as a non-neural baseline, using the code distributed with Albright and Hayes (2003) with default settings.
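For reference, the reported training configuration can be collected into a single dictionary; this is only a restatement of the hyperparameters above, not the actual OpenNMT configuration file used for the experiments.

```python
# Restatement of the hyperparameters reported in the text; this is not the
# actual OpenNMT configuration used for the experiments.
ED_HYPERPARAMETERS = {
    "char_embedding_size": 300,
    "encoder": {"type": "bidirectional LSTM", "layers": 2, "hidden_units": 100},
    "decoder": {"type": "unidirectional LSTM", "layers": 2, "hidden_units": 100},
    "dropout_between_layers": 0.3,
    "optimizer": "Adadelta",
    "learning_rate": 1.0,
    "minibatch_size": 20,
    "epochs": 100,
    "beam_size": 12,
}
```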

Results and Discussion. The non-neural MGL baseline unsurprisingly learns the regular past-tense pattern nearly perfectly, given that it is imbued with knowledge of phonological features as well as a list of phonologically illegal phoneme sequences to avoid in its output. However, in our testing of the MGL, the preferred past-tense output for all verbs was never an irregular formulation. This was true even for irregular verbs that were observed by the learner in the training set. One might say that the MGL is only intended to account for the regular route of a dual-route system. However, the intended scope of the MGL seems to be wider. The model is billed as accurately learning 'islands of subregularity' within the past tense system, and Albright & Hayes use the model to make predictions about which irregular forms of novel verb stems are preferable to human speakers (see discussion of wugs below).

                   |   all                 |   regular             |   irregular
                   | train   dev    test   | train   dev    test   | train   dev    test
Single-Task (MGL)  | 96.0    96.0   94.5   | 99.9    100.0  100.0  | 0.0     0.0    0.0
Single-Task (Type) | 99.8†   97.4   95.1   | 99.9    99.2   98.9   | 97.6†   53.3†  28.6†
Multi-Task (Type)  | 100.0†  96.9   95.1   | 100.0   99.5   99.7   | 99.2†   33.3†  28.6†

Table 3: Results on held-out data in English past tense prediction for single- and multi-task scenarios. The MGL achieves perfect accuracy on regular verbs, and 0 accuracy on irregular verbs. † indicates that a neural model's performance was found to be significantly different (p < 0.05) from the MGL.

In contrast, the ED model, despite no built-in knowledge of phonology, successfully learns to conjugate nearly all the verbs in the training data, including irregulars—no reduction in scope is needed. This capacity to account for specific exceptions to the regular rule does not result in overfitting. We note similarly high accuracy on held-out regular data—98.9%-99.2% at convergence depending on the condition. We report the full accuracy in all conditions in Table 3. The † indicates when a neural model's performance was found to be significantly different (p < 0.05) from the MGL according to a χ2 test. The ED model achieves near-perfect accuracy on regular verbs and on irregular verbs seen during training, as well as substantial accuracy on irregular verbs in the dev and test sets. This behavior jointly results in better overall performance for the ED model when all verbs are considered. Figure 1 shows learning curves for regular and irregular verb types in different conditions.

An error analysis of held-out data shows that the errors made by this network do not show any of the problems of the R&M architecture. There are no blend errors of the eat → ated variety. Indeed, the only error the network makes on irregulars is overregularization (e.g., throw → throwed). In fact, the lower accuracy we observe for irregular verbs in development and test, caused by overregularization, is expected and desirable; it matches the human tendency to treat novel words as regular, lacking knowledge of irregularity (Albright and Hayes, 2003).

While most held-out irregulars are regularized, as expected, the ED model does, perhaps surprisingly, correctly conjugate a handful of irregular forms it hasn't seen during training—5 in the test set. However, three of these are prefixed versions of irregulars that exist in the training set (retell → retold, partake → partook, withdraw → withdrew). One (sling → slung) is an analogy to similar training words (fling, cling). The final conjugation, forsake → forsook, is an interesting combination, with the prefix 'for', but an unattested base form 'sake' that is similar to 'take'.6

[Figure 1: Single-task versus multi-task. Learning curves for the English past tense. The x-axis is the number of epochs (one complete pass over the training data) and the y-axis is the accuracy on the training data (not the metric of optimization). Curves shown: Multi-Task Regulars, Single-Task Regulars, Multi-Task All Verbs, Single-Task All Verbs, Multi-Task Irregulars, Single-Task Irregulars.]

From the training data, the only regular verb with an error is 'compartmentalize', whose past tense is predicted to be 'compartmantalized', with a spurious vowel change that would likely be ironed out with additional training. Among the regular verbs in the development and test sets, the errors also consisted of single vowel changes (the full set of these errors was 'thin' → 'thun', 'try' → 'traud', 'institutionalize' → 'instititionalized', and 'drawl' → 'drooled').

Overall then, the ED model performs extremely well, a far cry from the ≈67% accuracy of the R&M model. It exceeds any reasonable standard of empirical adequacy, and shows human-like error behavior.

Acquisition Patterns. R&M made several claims that their architecture modeled the detailed acquisition of the English past tense by children. The core claim was that their model exhibited a macro U-shaped learning curve as in §2 above. Irregulars were initially produced correctly, followed by a period of overregularization preceding a final correct stage. However, P&P point out that R&M only achieve this pattern by manipulating the input distribution fed into their network. They trained only on irregulars for a number of epochs, before flooding the network with regular verb forms. R&M justify this by claiming that young children's vocabulary consists disproportionately of irregular verbs early on, but P&P cite contrary evidence. A survey of child-directed speech shows that the ratio of regular to irregular verbs a child hears is constant while they are learning their language (Slobin, 1971). Furthermore, psycholinguistic results suggest that there is no early skew towards irregular verbs in the vocabulary children understand or use (Brown, 1973).

6 [s] and [t] are both coronal consonants, a fricative and a stop, respectively.

English                       | Network | MG
Regular (rife ∼ rifed, n=58)  | 0.48    | 0.35
Irregular (rife ∼ rofe, n=74) | 0.45    | 0.36

Table 4: Spearman's ρ of human wug production probabilities with MG scores and ED probability estimates.

While we do not wish to make a strong claim that the ED architecture accurately mirrors children's acquisition, only that it ultimately learns the correct generalizations, we wanted to see if it would display a child-like learning pattern without changing the training inputs fed into the network over time; i.e., in all of our experiments, the data sets remained fixed for all epochs, unlike in R&M. While we do not clearly see a macro U-shape, we do observe Plunkett & Marchman's predicted oscillations for irregular learning—the so-called micro U-shaped pattern. As shown in Table 5, individual verbs oscillate between correct production and overregularization before they are fully mastered.

Wug Testing. As a further test of the MGL as a cognitive model, Albright & Hayes created a set of 74 nonce English verb stems with varying levels of similarity to both regular and irregular verbs. For each stem (e.g., rife), they picked one regular output form (rifed) and one irregular output form (rofe). They used these stems and potential past-tense variants to perform a wug test with human participants. For each stem, they had 24 participants freely attempt to produce a past tense form. They then counted the percentage of participants that produced the pre-chosen regular and irregular forms (production probability). The production probabilities for each pre-chosen regular and irregular form could then be correlated with the predicted scores derived from the MGL. In Table 4, we compare the correlations based on their model scores with correlations comparing the human scores to the output probabilities given by an ED model. As the wug data provided with Albright and Hayes (2003) uses a different phonetic transcription than the one we used above, we trained a separate ED model for this comparison. Model architecture, training verbs, and hyperparameters remained the same. Only the transcription used to represent input and output strings was changed to match Albright and Hayes (2003). Following the original paper, we correlate the probabilities for regular and irregular transformations separately. We apply Spearman's rank correlation, as we do not necessarily expect a linear relationship. We see that the ED model probabilities are slightly more correlated than the MGL's scores.

CLING         | MISLEAD           | CATCH       | FLY
#   output    | #   output        | #   output  | #   output
5   [klINd]   | 8   [mIsli:dId]   | 7   [kætS]  | 6   [flaId]
11  [kl2N]    | 19  [mIslEd]      | 31  [kætS]  | 31  [flu:]
13  [klIN]    | 21  [mIslEd]      | 43  [kOt]   | 40  [flaId]
14  [klINd]   | 23  [mIslEd]      | 44  [kætS]  | 42  [fleI]
18  [kl2N]    | 24  [mIsli:dId]   | 51  [kætS]  | 47  [flaId]
21  [klINd]   | 29  [mIslEd]      | 52  [kOt]   | 56  [flu:]
28  [kl2N]    | 30  [mIsli:dId]   | 66  [kætS]  | 62  [flaId]
40  [kl2N]    | 41  [mIslEd]      | 73  [kOt]   | 70  [flu:]

Table 5: Here we evince the oscillating development of single words in our corpus. For each stem, e.g., CLING, we show the past form produced at each change point, to illustrate the diversity of alternation. Beyond the last epoch displayed, each verb was produced correctly.
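The correlation computation itself is a one-liner with scipy; the sketch below uses made-up production probabilities and model scores, not the actual wug data.

```python
from scipy.stats import spearmanr

# Sketch of the wug-test correlation: Spearman's rho between human production
# probabilities and model scores. The numbers below are made up for illustration;
# the real values come from the Albright & Hayes (2003) wug data and the ED model.

human_production_prob = [0.75, 0.60, 0.40, 0.20, 0.10]   # per pre-chosen wug form
model_scores          = [0.80, 0.55, 0.45, 0.25, 0.05]

rho, p_value = spearmanr(human_production_prob, model_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```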

5.2 Experiment 2: Joint Multi-Task Learning

Another objection levied by P&P is R&M's focus on learning a single morphological transduction: stem to past tense. Many phonological patterns in a language, however, are not restricted to a single transduction—they make up a core part of the phonological system and take part in many different processes. For instance, the voicing assimilation patterns found in the past tense also apply to the third person singular: we see the affix -s rendered as [-s] after voiceless consonants and [-z] after voiced consonants and vowels.

P&P argue that the R&M model would not be able to take advantage of these shared generalizations. Assuming a different network would need to be trained for each transduction, e.g., stem to gerund and stem to past participle, it would be impossible to learn that they have any patterns in common. However, as discussed in §3.2, a single ED model can learn multiple types of mapping, simply by tagging each input-output pair in the training set with the transduction it represents. A network trained in such a way shares the same weights and phoneme embeddings across tasks, and thus has the capacity to generalize patterns across all transductions, naturally capturing the overall phonology of the language. Since different transductions mutually constrain each other (e.g., English in general does not allow sequences of identical vowels), we actually expect faster learning of each individual pattern, which we test in the following experiment.

We trained a model with an architecture identical to that used in Exp. 1, but this time to jointly predict four mappings associated with English verbs (past, gerund, past participle, third-person singular).

Data. For each of the verb types in our base training set from Exp. 1, we added the three remaining mappings. The gerund, past-participle, and third-person singular forms were identified in CELEX according to their labels in Wiktionary (Sylak-Glassman et al., 2015). The network was trained on all individual stem → inflection pairs in the new training set, with each input string modified with additional characters representing the current transduction (Kann and Schütze, 2016): take <PST> → took, but take <PTCP> → taken.7

Results. Table 3 and Figure 1 show the results. Overall, accuracy is >99% after convergence on train. While the difference in final performance is never statistically significant compared to single-task learning, the learning curves are much steeper, so this level of performance is achieved much more quickly. This provides evidence for our intuition that cross-task generalization facilitates individual task learning due to shared phonological patterning, i.e., jointly generating the gerund hastens past tense learning.

6 Summary of Resolved and Outstanding Criticisms

In this paper, we have argued that the Encoder-Decoder architecture obviates many of the criticisms P&P levied against R&M. Most importantly, the empirical performance of neural models is no longer an issue. The past tense transformation is learned nearly perfectly, compared to an approximate accuracy of 67% for R&M. Furthermore, the ED architecture solves the problem in a fully general setting. A single network can easily be trained on multiple mappings at once (and appears to generalize knowledge across them). No representational kludges such as Wickelphones are required—ED networks can map arbitrary-length strings to arbitrary-length strings. This permits training and evaluating the ED model on realistic data, including the ability to assign an exact probability to any arbitrary output string, rather than 'representative' data designed to fit in a fixed-size neural architecture (e.g., fixed input and output templates). Evaluation shows that the ED model does not appear to display any of the degenerate error types P&P note in the output of R&M (e.g., regular/irregular blends of the ate → ated variety).

7 Without input annotation to mark the different mappings the network must learn, it would treat all input/output pairs as belonging to the same mapping, with each inflected form of a single stem as an equally likely output variant associated with that mapping. It is not within the scope of this network architecture to solve problems other than morphological transduction, such as discovering the range of morphological paradigm slots.

Despite this litany of successes, some outstanding criticisms of R&M still remain to be addressed. On the trivial end, P&P correctly point out that the R&M model does not handle homophones: write → wrote, but right → righted. This is because it only takes the phonological make-up of the input string into account, without concern for its lexical identity. This issue affects the ED models we discuss in this paper as well—lexical disambiguation is outside of their intended scope. However, even the rule-learner P&P propose does not have such functionality. Furthermore, if lexical markings were available, we could incorporate them into the model just as with different transductions in the multi-task set-up, i.e., by adding the disambiguating markings to the input.

More importantly, we need to limit any claims regarding treating ED models as proxies for child language learners. P&P criticized such claims from R&M because they manipulated the input data distribution given to their network over time to effect a U-shaped learning curve, despite no evidence that the manipulation reflected children's perception or production capabilities. We avoid this criticism in our experiments, keeping the input distribution constant. We even show that the ED model captures at least one observed pattern of child language development—Plunkett & Marchman's predicted oscillations for irregular learning, the micro U-shaped pattern. However, we did not observe a macro U-shape, nor was the micro effect consistent across all irregular verbs. More study is needed to determine the ways in which ED architectures do or do not reflect children's behavior. Even if nets do not match the development patterns of any individual, they may still be useful if they ultimately achieve a knowledge state that is comparable to that of an adult or, possibly, the aggregate usage statistics of a population of adults.

Along this vein, P&P note that the R&M model is able to learn highly unnatural patterns that do not exist in any language. For example, it is trivial to map each Wickelphone to its reverse, effectively creating a mirror image of the input, e.g., brag → garb. Although an ED model could likely learn linguistically unattested patterns as well, some patterns may be more difficult to learn than others, e.g., they might require increased time-to-convergence. It remains an open question for future research to determine which patterns RNNs prefer, and which changes are needed to account for over- and underfitting. Indeed, any sufficiently complex learning system (including rule-based learners) would have learning biases which require further study.
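One concrete way to probe such biases, sketched below under our own assumptions rather than as an experiment reported here, is to construct an unattested mapping like P&P's mirror-image example and compare the network's time-to-convergence on it against the natural past tense data.

```python
# Sketch of a hypothetical "unnatural pattern" probe in the spirit of P&P's
# mirror-image example (brag -> garb): train the same ED architecture on
# string-reversal pairs and compare time-to-convergence with the natural
# past tense mapping. This is an illustration, not an experiment we ran.

def reversal_pairs(stems):
    """Map each stem to its mirror image, e.g. brag -> garb."""
    return [(" ".join(s), " ".join(reversed(s))) for s in stems]

if __name__ == "__main__":
    for src, tgt in reversal_pairs(["brag", "walk", "sing"]):
        print(src, "->", tgt)   # e.g. "b r a g -> g a r b"
```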

There are promising directions from which to approach this study. Networks are in a way analogous to animal models (McCloskey, 1991), in that they share interesting properties with human learners, as shown empirically, but are much easier and less costly to train and manipulate across multiple experiments. Initial experiments could focus on default architectures, as we do in this paper, effectively treating them as inductive baselines (Gildea and Jurafsky, 1996) and measuring their performance given limited domain knowledge. Our ED networks, for example, have no built-in knowledge of phonology or morphology. Failures of these baselines would then point the way towards the biases required to learn human language, and models modified to incorporate these biases could be tested.

7 Conclusion

We have shown that the application of the ED architecture to the problem of learning the English past tense obviates many, though not all, of the objections levied by P&P against the first neural network proposed for the task, suggesting that the criticisms do not extend to all neural models, as P&P imply. Compared to a non-neural baseline, the ED model accounts for both regular and irregular past tense formation in observed training data and generalizes to held-out verbs, all without built-in knowledge of phonology. While not necessarily intended to act as a proxy for a child learner, the ED model also shows one of the developmental patterns that have been observed in children, namely a micro U-shaped (oscillating) learning curve for irregular verbs. The accurate and substantially human-like performance of the ED model warrants consideration of its use as a research tool in theoretical linguistics and cognitive science.

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029, Denver, Colorado. Association for Computational Linguistics.

Adam Albright and Bruce Hayes. 2002. Modeling English past tense intuitions with minimal generalization. In Proceedings of the Workshop on Morphological and Phonological Learning, volume 6, pages 58–69. Association for Computational Linguistics.

Adam Albright and Bruce Hayes. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90:119–161.

Blake Allen and Michael Becker. 2015. Learning alternations from surface forms with sublexical phonology. Technical report, University of British Columbia and Stony Brook University.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical database on CD-ROM.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2-3):150–177.

Dimitri P. Bertsekas. 2015. Convex Optimization Algorithms. Athena Scientific.


Roger Brown. 1973. A First Language: The Early Stages. Harvard University Press, Cambridge, MA.

John A. Bullinaria. 1997. Modeling reading, spelling, and past tense learning with artificial neural networks. Brain and Language, 59(2):236–266.

Joan Bybee. 1995. Regular morphology and the lexicon. Language and Cognitive Processes, 10:425–455.

Joan Bybee. 2001. Phonology and Language Use. Cambridge University Press, Cambridge, UK.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task–morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics, 3(1):433–447.

Garrison W. Cottrell and Kim Plunkett. 1994. Acquiring the mapping from meaning to sounds. Connection Science, 6(4):379–412.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics.

Daniel Gildea and Daniel Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics, 22(4):497–530.

Joshua Goodman. 1998. Parsing inside-out. Harvard Computer Science Group Technical Report TR-07-98.

Ulrike Hahn and Ramin Charles Nakisa. 2000. German inflection: Single route or dual route? Cognitive Psychology, 41(4):313–360.

Morris Halle and Noam Chomsky. 1968. The Sound Pattern of English. Harper & Row.

Mary Hare and Jeffrey L. Elman. 1995. Learning and morphological change. Cognition, 56(1):61–98.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James Hoeffner. 1992. Are rules a thing of the past? The acquisition of verbal morphology by an attractor network. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 861–866.

Otto Jespersen. 1942. A Modern English Grammar on Historical Principles, volume 6. George Allen & Unwin Ltd., London, UK.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70, Berlin, Germany. Association for Computational Linguistics.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.


Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Henry Kucera and Winthrop Nelson Francis. 1967. Computational Analysis of Present-day American English. Dartmouth Publishing Group.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics (TACL), 4:521–535.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Brian MacWhinney and Jared Leinbach. 1991. Implementations are not conceptualizations: Revising the verb learning model. Cognition, 40(1):121–157.

Virginia Marchman. 1988. Rules and regularities in the acquisition of the English past tense. Center for Research in Language Newsletter, 2(4):04.

Gary Marcus. 2001. The Algebraic Mind. MIT Press.

William D. Marslen-Wilson and Lorraine K. Tyler. 1997. Dissociating types of mental computation. Nature, 387(6633):592.

Michael McCloskey. 1991. Networks and theories: The place of connectionism in cognitive science. Psychological Science, 2(6):387–395.

Ramin Charles Nakisa and Ulrike Hahn. 1996. Where defaults don't help: The case of the German plural system. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, pages 177–182.

Gerald Nelson. 2010. English: An Essential Grammar. Routledge.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado. Association for Computational Linguistics.

Timothy J. O'Donnell. 2011. Productivity and Reuse in Language. Ph.D. thesis, Harvard University, Cambridge, MA.

Janet Pierrehumbert. 2001. Stochastic phonology. GLOT, 5:1–13.

Steven Pinker. 1999. Words and Rules: The Ingredients of Language. Basic Books.

Steven Pinker and Alan Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1):73–193.

Kim Plunkett and Patrick Juola. 1999. A connectionist model of English past tense and plural morphology. Cognitive Science, 23(4):463–490.

Kim Plunkett and Virginia Marchman. 1991. U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38(1):43–102.

Kim Plunkett and Virginia Marchman. 1993. From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48(1):21–69.

Kim Plunkett and Ramin Charles Nakisa. 1997. A connectionist model of the Arabic plural system. Language and Cognitive Processes, 12(5-6):807–836.

Frank Rosenblatt. 1958. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

David E. Rumelhart and James L. McClelland. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 216–271. MIT Press.


Terrence J. Sejnowski and Charles R. Rosenberg. 1987. Parallel networks that learn to pronounce English text. Complex Systems, 1(1):145–168.

D. I. Slobin. 1971. On the learning of morphological rules: A reply to Palermo and Eberhart. In The Ontogenesis of Grammar: A Theoretical Symposium. Academic Press, New York.

Linnaea Stockall and Alec Marantz. 2006. A single route, full decomposition model of morphological complexity: MEG evidence. The Mental Lexicon, 1(1):85–123.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674–680, Beijing, China. Association for Computational Linguistics.

Niels A. Taatgen and John R. Anderson. 2002. Why do children learn to say broke? A model of learning the past tense without feedback. Cognition, 86(2):123–155.

Grigorios Tsoumakas and Ioannis Katakis. 2006. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13.

Michael T. Ullman, Suzanne Corkin, Marie Coppola, Gregory Hickok, John H. Growdon, Walter J. Koroshetz, and Steven Pinker. 1997. A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9(2):266–276.

Gert Westermann and Rainer Goebel. 1995. Connectionist rules of language. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, pages 236–241.

Wayne A. Wickelgren. 1969. Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76(1):1.

Charles Yang. 2016. The Price of Linguistic Productivity: How Children Learn to Break the Rules of Language. MIT Press.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.