
A Distributional and Orthographic Aggregation Model for English Derivational Morphology

Daniel Deutsch∗, John Hewitt∗ and Dan Roth
Department of Computer and Information Science
University of Pennsylvania
{ddeutsch,johnhew,danroth}@seas.upenn.edu

Abstract

Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering. In this work, we tackle the task of derived word generation. That is, given the word “run,” we attempt to generate the word “runner” for “someone who runs.” We identify two key problems in generating derived words from root words and transformations: suffix ambiguity and orthographic irregularity. We contribute a novel aggregation model of derived word generation that learns derivational transformations both as orthographic functions using sequence-to-sequence models and as functions in distributional word embedding space. Our best open-vocabulary model, which can generate novel words, and our best closed-vocabulary model, show 22% and 37% relative error reductions over current state-of-the-art systems on the same dataset.

1 Introduction

The explicit modeling of morphology has been shown to improve a number of tasks (Seeker and Çetinoğlu, 2015; Luong et al., 2013). In a large number of the world’s languages, many words are composed through morphological operations on subword units. Some languages are rich in inflectional morphology, characterized by syntactic transformations like pluralization. Similarly, languages like English are rich in derivational morphology, where the semantics of words are composed from smaller parts.

∗ These authors contributed equally; listed alphabetically.

Figure 1: Diagram depicting the flow of our aggregation model. Two models generate a hypothesis according to orthogonal information; then one is chosen as the final model generation. Here, the hypothesis from the distributional model is chosen.

The AGENT derivational transformation, for example, answers the question, what is the word for ‘someone who runs’? with the answer, a runner.1 Here, AGENT is spelled out as suffixing -ner onto the root verb run.

We tackle the task of derived word generation. In this task, a root word x and a derivational transformation t are given to the learner. The learner’s job is to produce the result of the transformation on the root word, called the derived word y. Table 1 gives examples of these transformations.

Previous approaches to derived word generation model the task as a character-level sequence-to-sequence (seq2seq) problem (Cotterell et al., 2017b). The letters from the root word and some encoding of the transformation are given as input to a neural encoder, and the decoder is trained to produce the derived word, one letter at a time. We identify the following problems with these approaches:

First, because these models are unconstrained, they can generate sequences of characters that do not form actual words.

1 We use the verb run as a demonstrative example; the transformation can be applied to most verbs.

x            t          y
wise         ADVERB   → wisely
simulate     RESULT   → simulation
approve      RESULT   → approval
overstate    RESULT   → overstatement
yodel        AGENT    → yodeler
survive      AGENT    → survivor
intense      NOMINAL  → intensity
effective    NOMINAL  → effectiveness
pessimistic  NOMINAL  → pessimism

Table 1: The goal of derived word generation is to produce the derived word, y, given both the root word, x, and the transformation t, as demonstrated here with examples from the dataset.

We argue that requiring the model to generate a known word is a reasonable constraint in the special case of English derivational morphology, and doing so avoids a large number of common errors.

Second, sequence-based models can only generalize string manipulations (such as “add -ment”) if they appear frequently in the training data. Because of this, they are unable to generate derived words that do not follow typical patterns, such as generating truth as the nominative derivation of true. We propose to learn a function for each transformation in a low dimensional vector space that corresponds to mapping from representations of the root word to the derived word. This eliminates the reliance on orthographic information, unlike related approaches to distributional semantics, which operate at the suffix level (Gupta et al., 2017).

We contribute an aggregation model of derived word generation that produces hypotheses independently from two separate learned models: one from a seq2seq model with only orthographic information, and one from a feed-forward network using only distributional semantic information in the form of pretrained word vectors. The model learns to choose between the hypotheses according to the relative confidence of each. This system can be interpreted as learning to decide between positing an orthographically regular form or a semantically salient word. See Figure 1 for a diagram of our model.

We show that this model helps with two open problems with current state-of-the-art seq2seq derived word generation systems, suffix ambiguity and orthographic irregularity (Section 2). We also improve the accuracy of seq2seq-only derived word systems by adding external information through constrained decoding and hypothesis rescoring. These methods provide orthogonal gains to our main contribution.

We evaluate models in two categories: open-vocabulary models that can generate novel words unattested in a preset vocabulary, and closed-vocabulary models, which cannot. Our best open-vocabulary and closed-vocabulary models demonstrate 22% and 37% relative error reductions over the current state of the art.

2 Background: Derivational Morphology

Derivational transformations generate novel words that are semantically composed from the root word and the transformation. We identify two unsolved problems in derived word transformation, each of which we address in Sections 3 and 4.

First, there are many plausible choices of suffix for a single pair of root word and transformation. For example, for the verb ground, the RESULT transformation could plausibly take any of several forms:2

(ground, RESULT) → grounding
(ground, RESULT) → *groundation
(ground, RESULT) → *groundment
(ground, RESULT) → *groundal

However, only one is correct, even though each suffix appears often in the RESULT transformation of other words. We will refer to this problem as “suffix ambiguity.”

Second, many derived words seem to lack a generalizable orthographic relationship to their root words. For example, the RESULT of the verb speak is speech. It is unlikely, given an orthographically similar verb creak, that the RESULT be creech instead of, say, creaking. Seq2seq models must grapple with the problem of derived words that are the result of unlikely or potentially unseen string transformations. We refer to this problem as “orthographic irregularity.”

3 Sequence Models and Corpus Knowledge

In this section, we introduce the prior state-of-the-art model, which serves as our baseline system. Then we build on top of this system by incorporating a dictionary constraint and rescoring the model’s hypotheses with token frequency information to address the suffix ambiguity problem.

2 The * indicates a non-word.


3.1 Baseline Architecture

We begin by formalizing the problem and defining some notation. For source word $\mathbf{x} = x_1, x_2, \ldots, x_m$, a derivational transformation $t$, and target word $\mathbf{y} = y_1, y_2, \ldots, y_n$, our goal is to learn some function from the pair $(\mathbf{x}, t)$ to $\mathbf{y}$. Here, $x_i$ and $y_j$ are the $i$th and $j$th characters of the input strings $\mathbf{x}$ and $\mathbf{y}$. We will sometimes use $x_{1:i}$ to denote $x_1, x_2, \ldots, x_i$, and similarly for $y_{1:j}$.

The current state-of-the-art model for derived-form generation approaches this problem by learning a character-level encoder-decoder neural network with an attention mechanism (Cotterell et al., 2017b; Bahdanau et al., 2014).

The input to the bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) is the sequence $\#, x_1, x_2, \ldots, x_m, \#, t$, where $\#$ is a special symbol to denote the start and end of a word, and the encoding of the derivational transformation $t$ is concatenated to the input characters. The model is trained to minimize the cross entropy of the training data. We refer to our reimplementation of this model as SEQ.

For a more detailed treatment of neural sequence-to-sequence models with attention, we direct the reader to Luong et al. (2015).
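To make the encoder input concrete, the following is a minimal Python sketch, not taken from the released implementation, of how such an input sequence might be assembled; the function name and exact tokenization are illustrative assumptions.

# A minimal sketch, not the authors' code: wrap the root word's characters in
# boundary symbols and append the transformation tag as one extra input token.
def encode_input(root, transformation):
    # e.g. ("run", "AGENT") -> ["#", "r", "u", "n", "#", "AGENT"]
    return ["#"] + list(root) + ["#", transformation]

print(encode_input("run", "AGENT"))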

3.2 Dictionary Constraint

The suffix ambiguity problem poses challenges for models which rely exclusively on input characters for information. As previously demonstrated, words derived via the same transformation may take different suffixes, and it is hard to select among them based on character information alone. Here, we describe a process for restricting our inference procedure to only generate known English words, which we call a dictionary constraint. We believe that for English morphology, a large enough corpus will contain the vast majority of derived forms, so while this approach is somewhat restricting, it removes a significant amount of ambiguity from the problem.

To describe how we implemented this dictionary constraint, it is useful first to discuss how decoding in a seq2seq model is equivalent to solving a shortest path problem. The notation is specific to our model, but the argument is applicable to seq2seq models in general.

The goal of decoding is to find the most probable structure $\mathbf{y}$ conditioned on some observation $\mathbf{x}$ and transformation $t$. That is, the problem is to solve

$$\mathbf{y} = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \; p(\mathbf{y} \mid \mathbf{x}, t) \qquad (1)$$

$$\;\;\; = \operatorname*{argmin}_{\mathbf{y} \in \mathcal{Y}} \; -\log p(\mathbf{y} \mid \mathbf{x}, t) \qquad (2)$$

where $\mathcal{Y}$ is the set of valid structures. Sequential models have a natural ordering $\mathbf{y} = y_1, y_2, \ldots, y_n$ over which $-\log p(\mathbf{y} \mid \mathbf{x}, t)$ can be decomposed:

$$-\log p(\mathbf{y} \mid \mathbf{x}, t) = \sum_{i=1}^{n} -\log p(y_i \mid y_{1:i-1}, \mathbf{x}, t) \qquad (3)$$

Solving Equation 2 can be viewed as solving a shortest path problem from a special starting state to a special ending state via some path which uniquely represents $\mathbf{y}$. Each vertex in the graph represents some sequence $y_{1:i}$, and the weight of the edge from $y_{1:i}$ to $y_{1:i+1}$ is given by

$$-\log p(y_{i+1} \mid y_{1:i}, \mathbf{x}, t) \qquad (4)$$

The weight of the path from the start state to the end state via the unique path that describes $\mathbf{y}$ is exactly equal to Equation 3. When the vocabulary size is too large, the exact shortest path is intractable, and approximate search methods, such as beam search, are used instead.

In derived word generation, Y is an infinite setof strings. Since Y is unrestricted, almost all ofthe strings in Y are not valid words. Given a dic-tionary YD, the search space is restricted to onlythose words in the dictionary by searching over thetrie induced from YD, which is a subgraph of theunrestricted graph. By limiting the search spaceto YD, the decoder is guaranteed to generate someknown word. Models which use this dictionary-constrained inference procedure will be labeledwith +DICT. Algorithm 1 has the pseudocode forour decoding procedure.
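As an illustration of the dictionary constraint, the Python sketch below builds a character trie over a toy dictionary and checks prefix membership; it is a minimal example consistent with the description above, not the released implementation, and the class and method names are our own.

# A minimal sketch of the trie induced from the dictionary Y_D: only prefixes
# present in the trie may be extended during decoding.
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def has_prefix(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

dictionary = Trie()
for word in ["runner", "running", "runs"]:
    dictionary.add(word)

# During decoding, a hypothesis like "runn" is kept (it can still become
# "runner"), while "runx" is pruned immediately.
assert dictionary.has_prefix("runn") and not dictionary.has_prefix("runx")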

We discuss specific details of the search procedure and interesting observations of the search space in Section 6. Section 5.2 describes how we obtained the dictionary of valid words.

3.3 Word Frequency Knowledge through Rescoring

We also consider the inclusion of explicit word frequency information to help solve suffix ambiguity, using the intuition that “real” derived words are likely to be frequently attested. This permits a high-recall, potentially noisy dictionary.

We are motivated by the very high top-10 accuracy compared to top-1 accuracy, even among dictionary-constrained models. By rescoring the hypotheses of a model using word frequency (a word-global signal) as a feature, we attempt to recover a portion of this top-10 accuracy.

When a model has been trained, we query it for its top-10 most likely hypotheses. The union of all hypotheses for a subset of the training observations forms the training set for a classifier that learns to predict whether a hypothesis generated by the model is correct. Each hypothesis is labelled with its correctness, a value in {±1}. We train a simple combination of two scores: the seq2seq model score for the hypothesis, and the log of the word frequency of the hypothesis.

To permit a nonlinear combination of word frequency and model score, we train a small multi-layer perceptron with the model score and the frequency of a derived word hypothesis as features.

At testing time, the 10 hypotheses generated by a single seq2seq model for a single observation are rescored. The new model top-1 hypothesis, then, is the argmax over the 10 hypotheses according to the rescorer. In this way, we are able to incorporate word-global information, e.g. word frequency, that is ill-suited for incorporation at each character prediction step of the seq2seq model. We label models that are rescored in this way +FREQ.
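The rescorer can be summarized with the following hedged sketch: a tiny multi-layer perceptron over the two features (seq2seq score and log word frequency) used to rerank a 10-best list. The helper names, feature scaling, and initialization are illustrative assumptions rather than details from the paper.

import math
import numpy as np

def features(model_score, word_freq):
    # Two features: the seq2seq score and the log word frequency.
    return np.array([model_score, math.log(word_freq + 1)])

class TinyMLP:
    """1-hidden-layer perceptron over the two features. Shown untrained here;
    in the paper it is trained to predict hypothesis correctness."""
    def __init__(self, hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden, 2))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=hidden)

    def score(self, x):
        return float(self.w2 @ np.tanh(self.W1 @ x + self.b1))

def rescore(mlp, hypotheses):
    # Pick the new top-1 hypothesis from a 10-best list under the rescorer.
    return max(hypotheses, key=lambda h: mlp.score(features(h["score"], h["freq"])))

# Toy usage with made-up scores and frequencies.
hyps = [{"word": "groundation", "score": -1.1, "freq": 12},
        {"word": "grounding", "score": -1.3, "freq": 500000}]
print(rescore(TinyMLP(), hyps)["word"])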

4 Distributional Models

So far, we have presented models that learn derivational transformations as orthographic operations. Such models struggle by construction with the orthographic irregularity problem, as they are trained to generalize orthographic information. However, the semantic relationships between root words and derived words are the same even when the orthography is dissimilar. It is salient, for example, that the irregular word speech is related to its root speak in about the same way as how exploration is related to the word explore.

We model distributional transformations as functions in dense distributional word embedding spaces, crucially learning a function per derivational transformation, not per suffix pair. In this way, we aim to explicitly model the semantic transformation, not the orthographic information.

4.1 Feed-forward derivational transformations

For all source words $\mathbf{x}$ and all target words $\mathbf{y}$, we look up static distributional embeddings $v_x, v_y \in \mathbb{R}^d$. For each derivational transformation $t$, we learn a function $f_t : \mathbb{R}^d \to \mathbb{R}^d$ that maps $v_x$ to $v_y$. $f_t$ is parametrized as a two-layer perceptron, trained using a squared loss,

$$L = \mathbf{b}^T \mathbf{b} \qquad (5)$$
$$\mathbf{b} = f_t(v_x) - v_y \qquad (6)$$

We perform inference by nearest neighbor search in the embedding space. This inference strategy requires a subset of strings for our embedding dictionary, $\mathcal{Y}_V$.

Upon receiving $(\mathbf{x}, t)$ at test time, we compute $f_t(v_x)$ and find the most similar embeddings in $\mathcal{Y}_V$. Specifically, we find the top-k most similar embeddings, and take the most similar derived word that starts with the same 4 letters as the root word, and is not identical to it. This heuristic filters out highly implausible hypotheses.

We use the single-word subset of the Google News vectors (Mikolov et al., 2013) as $\mathcal{Y}_V$, so the size of the vocabulary is 929k words.
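For concreteness, the numpy sketch below mirrors the DIST model as described: a two-layer perceptron $f_t$ (one such network per transformation) with the squared loss of Equations 5 and 6, and nearest-neighbor inference with the four-letter prefix heuristic. The dimensions, initialization, and function names are illustrative assumptions; this is not the released DyNet implementation.

import numpy as np

d, h = 300, 100  # embedding and hidden dimensions (illustrative)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(scale=0.01, size=(d, h)), np.zeros(d)

def f_t(v):
    # Two-layer perceptron with a tanh hidden layer, mapping R^d -> R^d.
    return W2 @ np.tanh(W1 @ v + b1) + b2

def squared_loss(v_x, v_y):
    # L = b^T b with b = f_t(v_x) - v_y (Equations 5-6).
    b = f_t(v_x) - v_y
    return b @ b

def predict(root, v_root, vocab, embeddings, k=10):
    """vocab: list of words; embeddings: matrix of their vectors, row-aligned."""
    query = f_t(v_root)
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-9)
    # Among the top-k neighbors, keep the first word that shares the root's
    # first four letters and is not identical to the root.
    for idx in np.argsort(-sims)[:k]:
        cand = vocab[idx]
        if cand != root and cand[:4] == root[:4]:
            return cand
    return None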

4.2 SEQ and DIST Aggregation

The seq2seq and distributional models we have presented learn with disjoint information to solve separate problems. We leverage this intuition to build a model that chooses, for each observation, whether to generate according to orthographic information via the SEQ model, or produce a potentially irregular form via the DIST model.

To train this model, we use a held-out portion of the training set, and filter it to only observations for which exactly one of the two models produces the correct derived form. Finally, we make the strong assumption that the probability of a derived form being generated correctly according to one model as opposed to the other depends only on the unnormalized model score from each. We model this as a logistic regression (t is omitted for clarity):

$$P(\cdot \mid \mathbf{y}_D, \mathbf{y}_S, \mathbf{x}) = \mathrm{softmax}\big(W_e \, [\mathrm{DIST}(\mathbf{y}_D \mid \mathbf{x}); \, \mathrm{SEQ}(\mathbf{y}_S \mid \mathbf{x})] + b_e\big)$$

where $W_e$ and $b_e$ are learned parameters, $\mathbf{y}_D$ and $\mathbf{y}_S$ are the hypotheses of the distributional and seq2seq models, and $\mathrm{DIST}(\cdot)$ and $\mathrm{SEQ}(\cdot)$ are the models’ likelihood functions. We denote this aggregate AGGR in our results.
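A hedged sketch of the aggregation step follows: given the two hypotheses and their unnormalized scores, a softmax over a learned linear combination decides which model's output to emit. The weights below are placeholders standing in for the learned $W_e$ and $b_e$, which the paper fits on a held-out split.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def choose(dist_hyp, dist_score, seq_hyp, seq_score, W_e, b_e):
    # P(. | y_D, y_S, x) = softmax(W_e [DIST(y_D|x); SEQ(y_S|x)] + b_e)
    probs = softmax(W_e @ np.array([dist_score, seq_score]) + b_e)
    return dist_hyp if probs[0] > probs[1] else seq_hyp

W_e = np.array([[1.0, -0.5], [-0.5, 1.0]])  # placeholder parameters
b_e = np.zeros(2)
print(choose("speech", -1.2, "speakation", -3.4, W_e, b_e))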

5 Datasets

In this section we describe the derivational morphology dataset used in our experiments and how we collected the dictionary and token frequencies used in the dictionary constraint and rescorer.

5.1 Derivational Morphology

In our experiments, we use the derived word generation derivational morphology dataset released in Cotterell et al. (2017b). The dataset, derived from NomBank (Meyers et al., 2004), consists of 4,222 training, 905 validation, and 905 test triples of the form (x, t, y). The transformations are from the following categories: ADVERB (ADJ → ADV), RESULT (V → N), AGENT (V → N), and NOMINAL (ADJ → N). Examples from the dataset can be found in Table 1.

5.2 Dictionary and Token Frequency Statistics

The dictionary and token frequency statistics used in the dictionary constraint and frequency reranking come from the Google Books NGram corpus (Michel et al., 2011). The unigram frequency counts were aggregated across years, and any tokens which appear fewer than approximately 2,000 times, do not end in a known possible suffix, or contain a character outside of our vocabulary were removed.

The frequency threshold was determined using development data, optimizing for high recall. We collect a set of known suffixes from the training data by removing the longest common prefix between the source and target words from the target word. The result is a dictionary with frequency information for around 360k words, which covers 98% of the target words in the training data.3
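The suffix-collection step can be illustrated with a short sketch (our own, not the released code): strip the longest common prefix of each training pair from the derived word and keep the remainder as a known suffix.

def known_suffix(root, derived):
    # Remove the longest common prefix of root and derived from derived.
    i = 0
    while i < min(len(root), len(derived)) and root[i] == derived[i]:
        i += 1
    return derived[i:]

pairs = [("approve", "approval"), ("wise", "wisely"), ("yodel", "yodeler")]
suffixes = {known_suffix(x, y) for x, y in pairs}
print(suffixes)  # e.g. {'al', 'ly', 'er'}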

6 Inference Procedure Discussion

In many sequence models where the vocabulary size is large, exact inference by finding the true shortest path in the graph discussed in Section 3.2 is intractable. As a result, approximate inference techniques such as beam search are often used, or the size of the search space is reduced, for example, by using a Markov assumption. We, however, observed that exact inference via a shortest path algorithm is not only tractable in our model, but only slightly more expensive than greedy search and significantly less expensive than beam search.

3 The remaining 2% is mostly words with hyphens or mistakes in the dataset.

Method          Accuracy   Avg. #States
GREEDY          75.9       11.8
BEAM            76.2       101.2
SHORTEST        76.2       11.8
DICT+GREEDY     77.2       11.7
DICT+BEAM       82.6       91.2
DICT+SHORTEST   82.6       12.4

Table 2: The average accuracies and number of states explored in the search graph of 30 runs of the SEQ model with various search procedures. The BEAM models use a beam size of 10.


To quantify this claim, we measured the accuracy and number of states explored by greedy search, beam search, and shortest path with and without a dictionary constraint on the development data. Table 2 shows the results averaged over 30 runs. As expected, beam search and shortest path have higher accuracies than greedy search and explore more of the search space. Surprisingly, beam search and shortest path have nearly identical accuracies, but shortest path explores significantly fewer hypotheses.

At least two factors contribute to the tractability of exact search in our model. First, our character-level sequence model has a vocabulary size of 63, which is significantly smaller than token-level models, in which a vocabulary of 50k words is not uncommon. The search space of sequence models is dependent upon the size of the vocabulary, so the model’s search space is dramatically smaller than for a token-level model.

Second, the inherent structure of the task makes it easy to eliminate large subgraphs of the search space. The first several characters of the input word and output word are almost always the same, so the model assigns very low probability to any sequence with different starting characters than the input. Then, the rest of the search procedure is dedicated to deciding between suffixes. Any suffix which does not appear frequently in the training data receives a low score, leaving the search to decide between a handful of possible options. The result is that the learned probability distribution is very spiked; it puts very high probability on just a few output sequences. It is empirically true that the top few most probable sequences have significantly higher scores than the next most probable sequences, which supports this hypothesis.

In our subsequent experiments, we decode using exact inference by running a shortest path algorithm (see Algorithm 1).

Algorithm 1 The decoding procedure uses a shortest-path algorithm to find the most probable output sequence. The dictionary constraint is (optionally) implemented on line 9 by only considering prefixes that are contained in some trie T.

1: procedure DECODE(x, t, V, T)
2:   H ← Heap()
3:   H.insert(0, #)
4:   while H is not empty do
5:     y ← H.remove()
6:     if y is a complete word then return y
7:     for c ∈ V do
8:       y′ ← y + c
9:       if y′ ∈ T then
10:        s ← FORWARD(x, t, y′)
11:        H.insert(s, y′)

For reranking models, instead of using a beam of size k as is typical, we use the top k most probable sequences.
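To illustrate the decoding procedure, the sketch below implements a best-first shortest-path search over prefixes in the spirit of Algorithm 1, with a toy next-character distribution standing in for the trained seq2seq FORWARD scores and an optional trie check playing the role of the dictionary constraint. All names and the toy scorer are illustrative assumptions, not the released code.

import heapq
import math

def decode(score_next, is_complete, in_trie=lambda prefix: True, max_len=30):
    """score_next(prefix) -> {char: prob}; returns the best complete string."""
    heap = [(0.0, "")]  # (cumulative negative log-probability, prefix)
    while heap:
        cost, prefix = heapq.heappop(heap)
        if is_complete(prefix):
            return prefix
        if len(prefix) >= max_len:
            continue
        for ch, prob in score_next(prefix).items():
            new_prefix = prefix + ch
            # Dictionary constraint: only extend prefixes found in the trie.
            if prob > 0 and in_trie(new_prefix):
                heapq.heappush(heap, (cost - math.log(prob), new_prefix))
    return None

# Toy example: a scorer that prefers spelling out "runner" followed by "#".
target = "runner#"
def toy_score_next(prefix):
    nxt = target[len(prefix)] if len(prefix) < len(target) else "#"
    return {nxt: 0.9, "x": 0.1}

print(decode(toy_score_next, is_complete=lambda p: p.endswith("#")))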

7 Results

In all of our experiments, we use the training, development, and testing splits provided by Cotterell et al. (2017b) and average over 30 random restarts. Table 3 displays the accuracies and average edit distances on the test set of each of the systems presented in this work and the state-of-the-art model from Cotterell et al. (2017b).

First, we observed that SEQ outperforms the results reported in Cotterell et al. (2017b) by a large margin, despite the fact that the model architectures are the same. We attribute this difference to better hyperparameter settings and improved learning rate annealing.

Then, it is clear that the accuracy of the distributional model, DIST, is significantly lower than any seq2seq model. We believe the orthography-informed models perform better because most observations in the dataset are orthographically regular, providing low-hanging fruit.

Open-vocabulary models Our open-vocabulary aggregation model AGGR improves performance by 3.8 points of accuracy over SEQ, indicating that the sequence models and the distributional model are contributing complementary signals. AGGR is an open-vocabulary model like Cotterell et al. (2017b) and improves upon it by 6.3 points, making it our best comparable model. We provide an in-depth analysis of the strengths of SEQ and DIST in Section 7.1.

Model                      Accuracy   Edit
Cotterell et al. (2017b)   71.7       0.97
DIST                       54.9       3.23
SEQ                        74.2       0.88
AGGR                       78.0       0.83
SEQ+FREQ                   79.3       0.71
AGGR+FREQ                  82.0       0.64
SEQ+DICT                   80.4       0.72
AGGR+DICT                  81.0       0.78
SEQ+FREQ+DICT              81.2       0.71
AGGR+FREQ+DICT             82.4       0.67

Table 3: The accuracies and edit distances of the models presented in this paper and prior work. For edit distance, lower is better. The dictionary-constrained models are in the lower half of the table.


Closed-vocabulary models We now consider closed-vocabulary models that improve upon the seq2seq model in AGGR. First, we see that restricting the decoder to only generate known words is extremely useful, with SEQ+DICT improving over SEQ by 6.2 points. Qualitatively, we note that this constraint helps solve the suffix ambiguity problem, since orthographically plausible incorrect hypotheses are pruned as non-words. See Table 6 for examples of this phenomenon. Additionally, we observe that the dictionary-constrained model outperforms the unconstrained model according to top-10 accuracy (see Table 5).

Rescoring (+FREQ) provides a further improvement of 0.8 points, showing that the decoding dictionary constraint provides a higher-quality beam that still has room for top-1 improvement. Altogether, AGGR+FREQ+DICT provides a 4.4 point improvement over the best open-vocabulary model, AGGR. This shows the disambiguating power of assuming a closed vocabulary.

Edit Distance One interesting side effect of the dictionary constraint appears when comparing AGGR+FREQ with and without the dictionary constraint. Although the accuracy of the dictionary-constrained model is better, the average edit distance is worse. The unconstrained model is free to put invalid words which are orthographically similar to the target word in its top-k, however the constrained model can only choose valid words. This means it is easier for the unconstrained model to generate words which have a low edit distance to the ground truth, whereas the constrained model can only do that if such a word exists.

           Cotterell et al. (2017b)   AGGR           AGGR+FREQ+DICT
           acc     edit               acc    edit    acc    edit
NOMINAL    35.1    2.67               68.0   1.32    62.1   1.40
RESULT     52.9    1.86               59.1   1.83    69.7   1.29
AGENT      65.6    0.78               73.5   0.65    79.1   0.57
ADVERB     93.3    0.18               94.0   0.18    95.0   0.22

Table 4: The accuracies and edit distances of our best open-vocabulary and closed-vocabulary models, AGGR and AGGR+FREQ+DICT, compared to prior work, evaluated at the transformation level. For edit distance, lower is better.

The result is a more accurate, yet more orthographically diverse, set of hypotheses.

Results by Transformation Next, we compare our best open-vocabulary and closed-vocabulary models to previous work across each derivational transformation. These results are in Table 4.

The largest improvement over the baseline system is for NOMINAL transformations, in which AGGR has a 49% reduction in error. We attribute most of this gain to the difficulty of this particular transformation. NOMINAL is challenging because there are several plausible endings (e.g. -ity, -ness, -ence) which occur at roughly the same rate. Additionally, NOMINAL examples are the least frequent transformation in the dataset, so it is challenging for a sequential model to learn to generalize. The distributional model, which does not rely on suffix information, does not have this same weakness, so the aggregation model AGGR has better results.

The performance of AGGR+FREQ+DICT is worse than AGGR, however. This is surprising because, in all other transformations, adding dictionary information improves the accuracies. We believe this is due to the ambiguity of the ground truth: many root words have multiple seemingly plausible nominal transformations, such as rigid → {rigidness, rigidity} and equivalent → {equivalence, equivalency}. The dictionary constraint produces a better set of hypotheses to rescore, as demonstrated in Table 5. Therefore, the dictionary-constrained model is likely to have more of these ambiguous cases, which makes the task more difficult.

7.1 Strengths of SEQ and DIST

In this subsection we explore why AGGR improves consistently over SEQ even though it maintains an open vocabulary. We have argued that DIST is able to correctly produce derived words that are orthographically irregular or infrequent in the training data.

           Cotterell et al. (2017b)   SEQ           SEQ+DICT
           top-10-acc                 top-10-acc    top-10-acc
NOMINAL    70.2                       73.7          87.5
RESULT     72.6                       79.9          90.4
AGENT      82.2                       88.4          91.6
ADVERB     96.5                       96.9          96.9

Table 5: The accuracies of the top-10 best outputs for SEQ, SEQ+DICT, and prior work demonstrate that the dictionary constraint significantly improves the overall candidate quality.

Figure 2: Aggregating across 30 random restarts, we tallied when SEQ and DIST correctly produced derived forms of each suffix. The y-axis shows the logarithm of the difference, per suffix, between the tally for DIST and the tally for SEQ. On the x-axis is the logarithm of the frequency of derived words with each suffix in the training data. A linear regression line is plotted to show the relationship between log suffix frequency and log difference in correct predictions. Suffixes that differ only by the first letter, as with -ger and -er, have been merged and represented by the more frequent of the two.

Figure 2 quantifies this phenomenon, analyzing the difference in accuracy between the two models, and plotting this in relationship to the frequency of the suffix in the training data. The plot shows that SEQ excels at generating derived words ending in -ly, -ion, and other suffixes that appeared frequently in the training data. DIST’s improvements over SEQ are generally much less frequent in the training data, or as in the case of -ment, are less frequent than other suffixes for the same transformation (like -ion). By producing derived words whose suffixes show up rarely in the training data, DIST helps solve the orthographic irregularity problem.

8 Prior Work

There has been much work on the related task of inflected word generation (Durrett and DeNero, 2013; Rastogi et al., 2016; Hulden et al., 2014).

x              t         DIST          SEQ            AGGR           AGGR+DICT
approve        RESULT    approval      approvation    approval       approval
bankrupt       NOMINAL   bankruptcy    bankruption    bankruptcy     bankruptcy
irretrievable  ADVERB    irreparably   irretrievably  irretrievably  irretrievably
connect        RESULT    connectivity  connection     connection     connection
stroll         AGENT     strolls       stroller       stroller       stroller
emigrate       SUBJECT   emigre        emigrator      emigrator      emigrant
ubiquitous     NOMINAL   ubiquity      ubiquit        ubiquit        ubiquity
hinder         AGENT     hinderer      hinderer       hinderer       hinderer
vacant         NOMINAL   vacance       vacance        vacance        vacance

Table 6: Sample output from a selection of models. The words in bold mark the correct derivations. “Hindrance” and “vacancy” are the expected derived words for the last two rows.

It is a structurally similar task to ours, but does not pose the same challenges (Cotterell et al., 2017a,b), which we have addressed in our work. The paradigm completion for derivational morphology dataset we use in this work was introduced in Cotterell et al. (2017b). They apply the model that won the 2016 SIGMORPHON shared task on inflectional morphology to derivational morphology (Kann and Schütze, 2016; Cotterell et al., 2016). We use this as our baseline.

Our implementation of the dictionary constraint is an example of a special constraint which can be directly incorporated into the inference algorithm at little additional cost. Roth and Yih (2004, 2007) propose a general inference procedure that naturally incorporates constraints through recasting inference as solving an integer linear program.

Beam or hypothesis rescoring to incorporate an expensive or non-decomposable signal into search has a history in machine translation (Huang and Chiang, 2007). In inflectional morphology, Nicolai et al. (2015) use this idea to rerank hypotheses using orthographic features and Faruqui et al. (2016) use a character-level language model. Our approach is similar to Faruqui et al. (2016) in that we use statistics from a raw corpus, but at the token level.

There have been several attempts to use distributional information in morphological generation and analysis. Soricut and Och (2015) collect pairs of words related by any morphological change in an unsupervised manner, then select a vector offset which best explains their observations. There has been subsequent work exploring the vector offset method, finding it unsuccessful in capturing derivational transformations (Gladkova et al., 2016). However, we use more expressive, nonlinear functions to model derivational transformations and report positive results. Gupta et al. (2017) then learn a linear transformation per orthographic rule to solve a word analogy task. Our distributional model learns a function per derivational transformation, not per orthographic rule, which allows it to generalize to unseen orthography.

9 Implementation Details

Our models are implemented in Python using the DyNet deep learning library (Neubig et al., 2017). The code is freely available for download.4

Sequence Model The sequence-to-sequence model uses character embeddings of size 20, which are shared across the encoder and decoder, with a vocabulary size of 63. The hidden states of the LSTMs are of size 40.

For training, we use Adam with an initial learning rate of 0.005, a batch size of 5, and train for a maximum of 30 epochs. If after one epoch of the training data, the loss on the validation set does not decrease, we anneal the learning rate by half and revert to the previous best model.

During decoding, we find the top 1 most probable sequence as discussed in Section 6 unless rescoring is used, in which case we use the top 10.

Rescorer The rescorer is a 1-hidden-layer perceptron with a tanh nonlinearity and 4 hidden units. It is trained for a maximum of 5 epochs.

Distributional Model The DIST model is a 1-hidden-layer perceptron with a tanh nonlinearity and 100 hidden units. It is trained for a maximum of 25 epochs.

4 https://github.com/danieldeutsch/derivational-morphology

10 Conclusion

In this work, we present a novel aggregation model for derived word generation. This model learns to choose between the predictions of orthographically- and distributionally-informed models. This ameliorates suffix ambiguity and orthographic irregularity, the salient problems of the generation task. Concurrently, we show that derivational transformations can be usefully modeled as nonlinear functions on distributional word embeddings. The distributional and orthographic models contribute orthogonal information to the aggregate, as shown by substantial improvements over state-of-the-art results, and qualitative analysis. Two ways of incorporating corpus knowledge, constrained decoding and rescoring, demonstrate further improvements to our main contribution.

Acknowledgements

We would like to thank Shyam Upadhyay, Jordan Kodner, and Ryan Cotterell for insightful discussions about derivational morphology. We would also like to thank our anonymous reviewers for helpful feedback on clarity and presentation.

This work was supported by Contract HR0011-15-2-0025 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017a. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection. Association for Computational Linguistics, Vancouver, pages 1–30. http://www.aclweb.org/anthology/K17-2001.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task: Morphological reinflection. ACL 2016, page 10.

Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, and David Yarowsky. 2017b. Paradigm completion for derivational morphology. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 714–720. https://www.aclweb.org/anthology/D17-1074.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 1185–1195. http://www.aclweb.org/anthology/N13-1138.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 634–643. http://www.aclweb.org/anthology/N16-1077.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, San Diego, California, pages 8–15. http://www.aclweb.org/anthology/N16-2002.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.

Arihant Gupta, Syed Sarfaraz Akhtar, Avijit Vajpayee, Arjit Srivastava, Madan Gopal Jhanwar, and Manish Shrivastava. 2017. Exploiting morphological regularities in distributional word representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 292–297. https://www.aclweb.org/anthology/D17-1028.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 144–151. http://www.aclweb.org/anthology/P07-1019.

Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pages 569–578. http://www.aclweb.org/anthology/E14-1060.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, pages 555–560. http://anthology.aclweb.org/P16-2090.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1412–1421. http://aclweb.org/anthology/D15-1166.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Sofia, Bulgaria, pages 104–113. http://www.aclweb.org/anthology/W13-3512.

A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank project: An interim report. In A. Meyers, editor, HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation. Association for Computational Linguistics, Boston, Massachusetts, USA, pages 24–31.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR Workshop Papers. Scottsdale, Arizona. https://arxiv.org/pdf/1301.3781.pdf.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 922–931. http://www.aclweb.org/anthology/N15-1093.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 623–633. http://www.aclweb.org/anthology/N16-1076.

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, Proc. of the Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, pages 1–8. http://cogcomp.org/papers/RothYi04.pdf.

D. Roth and W. Yih. 2007. Global inference for entity and relation identification via a linear programming formulation. http://cogcomp.org/papers/RothYi07.pdf.

Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. Transactions of the Association for Computational Linguistics 3:359–373.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 1627–1637. http://www.aclweb.org/anthology/N15-1186.