MG Parsing as a Model of Gradient Acceptability in Syntactic Islands

Aniello De Santo
Department of Linguistics
Institute for Advanced Computational Science
Stony Brook University
[email protected]

Abstract

It is well-known that the acceptability judgments at the core of current syntactic theories are continuous. However, an open debate is whether the source of such gradience is situated in the grammar itself, or can be derived from extra-grammatical factors. In this paper, we propose the use of a top-down parser for Minimalist grammars (Stabler, 2013; Kobele et al., 2013; Graf et al., 2017) as a formal model of how gradient acceptability can arise from categorical grammars. As a test case, we target the acceptability judgments for island effects collected by Sprouse et al. (2012a).

1 Introduction

The human judgments linguists use to evaluate the adequacy of syntactic theories fall in a wide, non-binary spectrum of acceptability — a fact well-known from the early days of generative grammar (Chomsky, 1956, 1965, a.o.). Nonetheless, mainstream syntax has long claimed that grammatical knowledge is, at its core, categorical, and that gradience in acceptability judgments comes from extra-grammatical factors (Sprouse, 2007, a.o.). However, the rise of experimental methods in theoretical syntax has renewed the question of whether gradience should be integrated in grammatical theories directly, for instance in the form of probabilistic models (Keller, 2000; Crocker and Keller, 2005; Sorace and Keller, 2005; Lau et al., 2014, 2015, 2017).

As the relation between grammaticality and acceptability is not transparent, constructing a well-specified theory of how gradient acceptability arises from grammatical knowledge is clearly valuable. From an empirical perspective, however, categorical approaches seem to be at a disadvantage when compared to gradient grammatical models rooted in quantitative, probabilistic frameworks.

There is an abundance of well-known proposals about the way syntactic structure and cognitive resources can be integrated to derive connections between acceptability and processing difficulty (e.g., Yngve, 1960; Wanner and Maratsos, 1978; Rizzi, 1990; Rambow and Joshi, 2015; Gibson, 2000; McElree et al., 2003; Lewis and Vasishth, 2005, a.o.). However, few models based on current grammatical formalisms have been implemented in precise computational frameworks (cf. Boston, 2010). In order to have a complete theory of how acceptability judgments correlate with categorical grammars, what seems to be necessary is a formal model of the syntactic structures licensed by said grammars, and a theory of how such structures interact with extra-grammatical factors to derive differences in acceptability. This would make it possible to test how assumptions about fine-grained syntactic details lead to quantifiable predictions for the gradient acceptability of individual sentences (Stabler, 2013; Sprouse et al., 2018).

Here, we suggest that a parser for Minimalist grammars (MGs; Stabler, 2013), coupled with complexity metrics measuring memory usage (Kobele et al., 2013; Graf et al., 2017, a.o.), is an effective model to address these issues. The MG parser has been used in the past to study which aspects of grammar drive processing cost for a vast set of off-line processing asymmetries cross-linguistically (Gerth, 2015; Graf et al., 2017; Zhang, 2017). Given the ability of MGs to encode rich syntactic analyses, the MG parser is especially sensitive to fine-grained grammatical information, and thus is able to generate quantitative predictions especially suited to our purposes.

In particular, we relate sentence acceptability to sentence structure by specifying: 1) a formalized theory of syntax in the form of MGs; 2) a parser as a model of how the structural representation of a sentence is built from its linear form; 3) a linking theory between structural complexity and acceptability in the form of metrics measuring memory usage. As a proof-of-concept for the validity of the linking theory, we model the acceptability judgments for three types of syntactic islands, using as a baseline the judgments reported in Sprouse et al. (2012a).

Importantly, our main aim is not to settle the debate of whether gradience should be found in the grammar itself, or in the interaction between grammar and external factors (if such a debate could ever be settled). What we offer is a formalized, testable model of the latter hypothesis, in the hope of providing ground for a more principled investigation of categorical grammaticality and continuous acceptability.

2 MG Parsing

2.1 MGs

MGs (Stabler, 1997, 2011) are a lexicalized, mildly context-sensitive formalism incorporating the structurally rich analyses of Minimalist syntax — the most recent version of Chomsky's transformational grammar.

An MG grammar is a set of lexical items (LIs) consisting of a phonetic form and a finite, non-empty string of features. LIs are assembled via two feature-checking operations: Merge and Move. Intuitively, Merge encodes subcategorization, while Move encodes long-distance movement dependencies. Here, we avoid most of the technical details of the formalism, and we limit our discussion to a general description of the data structures defined by these grammars.
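As a concrete illustration of the kind of object an MG lexicon is, the sketch below encodes a handful of LIs as plain Python records, using the common Stabler-style feature notation (=f for selectors, f for categories, +f for licensors, -f for licensees). The particular feature assignments are only an illustrative guess and not an analysis endorsed by the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalItem:
    """An MG lexical item: a phonetic form plus a string of features.
    Stabler-style notation: '=f' selects an f-phrase (Merge), 'f' is the
    item's own category, '+f' attracts a mover, '-f' marks a mover (Move)."""
    phon: str
    features: tuple

# A toy lexicon for "Which engineer does Elmo like?". The feature
# assignments here are an illustrative guess, not the paper's analysis.
LEXICON = (
    LexicalItem("which",    ("=N", "D", "-wh")),
    LexicalItem("engineer", ("N",)),
    LexicalItem("does",     ("=T", "+wh", "C")),
    LexicalItem("Elmo",     ("D", "-nom")),
    LexicalItem("",         ("=v", "+nom", "T")),  # phonetically empty T head
    LexicalItem("",         ("=V", "=D", "v")),    # little v
    LexicalItem("like",     ("=D", "V")),
)
```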

MGs' derivation trees encode the sequence of Merge and Move operations required to build the phrase structure tree for a specific sentence (Michaelis, 1998; Harkema, 2001). In a traditional derivation tree, all leaf nodes are labeled by LIs, while unary and binary branching nodes are labeled as Move or Merge, respectively. However, as the details of the feature calculus are irrelevant to us, we adopt a simpler representation that discards the feature annotation of LIs, and labels internal nodes as standard in Minimalist syntax. We also explicitly include dashed arrows indicating movement relations.¹

¹ Note that, due to the fact that intermediate landing sites for moved phrases do not affect the traversal strategy, we do not explicitly highlight them with movement arrows.

The fundamental difference between a phrase structure tree and a derivation tree is that in the latter, moved phrases remain in their base position, and their landing site must be fully reconstructed via the feature calculus (cf. Fig. 1a and Fig. 1b). As a consequence, the final word order of a sentence is not directly reflected in the order of the leaf nodes in a derivation tree.

Importantly, MG derivation trees form a regular tree language, and thus can be regarded as a simple variant of context-free grammars (CFGs), allowing us to exploit some of CFGs' more established parsing algorithms.

2.2 Top-down MG Parsing

We follow recent sentence processing results, and adopt Stabler (2013)'s top-down parser for MGs. This parser is a variant of a standard depth-first, top-down parser for CFGs: it takes as input the string representation of a sentence, hypothesizes the structure top-down, verifies that the words in the structure match the input string, and outputs an encoding of the sentence structure in the form of a derivation tree. Importantly, the order of lexical items in the derivation tree is not the surface order of the phrase structure tree. Thus, simple top-to-bottom and left-to-right scanning of the leaf nodes yields the wrong word order. While scanning the nodes, then, the MG parser must also keep track of the derivational operations which affect the linear word order.

Memory plays a crucial role in this procedure: if a node is hypothesized at step i, but cannot be worked on until step j, it must be stored for j − i steps in a priority queue. To make this traversal strategy transparent to the reader, we adopt Kobele et al. (2013)'s notation, in which each node in the tree is annotated with an index (superscript) and an outdex (subscript). Intuitively, the annotation indicates, for each node in the tree, when it is first conjectured by the parser (index) and placed in the memory queue, and at what point it is considered completed and flushed from memory (outdex). Consider the tree in Fig. 1b, explicitly annotated with the parsing steps. The node does is hypothesized at step 3. However, which engineer comes before it in the input, so does has to wait until step 12 to be flushed out of the queue.
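A minimal sketch of this bookkeeping (our own illustration, not the paper's implementation): each node simply records its index and its outdex, and the waiting time falls out as their difference. The values for does are the ones read off Fig. 1b.

```python
from collections import namedtuple

# Kobele et al. (2013)'s annotation scheme, reduced to a record: the step at
# which a node is first conjectured and enqueued (index), and the step at
# which it is flushed from the priority queue (outdex).
Annotation = namedtuple("Annotation", ["label", "index", "outdex"])

# The node 'does' in Fig. 1b: hypothesized at step 3, flushed at step 12,
# because 'which engineer' precedes it in the input string.
does = Annotation("does", 3, 12)

# Steps spent waiting in the queue (j - i in the text).
print(does.outdex - does.index)  # 9
```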

[Figure 1: Phrase structure tree (a) and annotated MG derivation tree (b) for Which engineer does Elmo like?. Boxed nodes in (b) are those with tenure value greater than 2, following Graf and Marcinek (2014).]

Finally, note that Stabler's parser was originally given a search beam discarding the most unlikely predictions. Here, though, we are not interested in the cost of choosing among alternative parsing choices, and want to focus on the specific contribution of the grammar to memory usage. Thus, we assume that the parser is equipped with a perfect oracle, which always makes the right choices when constructing a tree (Kobele et al., 2013). Essentially, the MG model employs a deterministic parsing strategy, where ambiguity has no role.

2.3 Measuring Memory Usage

Recently, Stabler (2013)'s MG parser has been used to investigate which aspects of grammatical structure affect off-line processing difficulty (Kobele et al., 2013; Graf and Marcinek, 2014; Gerth, 2015; Graf et al., 2017, a.o.).

In order to allow for psycholinguistic predictions, the behavior of the parser is related to processing difficulty via complexity metrics measuring how the structure of a tree affects memory. The MG model refers to three main notions of memory usage (Graf et al., 2017): (a) how long a node is kept in memory (tenure); (b) how many nodes must be kept in memory (payload); (c) how much information is stored in a node (size).

Tenure and payload for each node n in the tree can be easily computed via the node annotation scheme of Kobele et al.: a node's tenure is equal to the difference between its index and its outdex; the payload of a derivation tree is computed as the number of nodes with a tenure strictly greater than 2 (boxed nodes in our tree annotation scheme).² For instance, tenure for the node does in Fig. 1b is computed as 12 − 3 = 9.

² We refer to tenure values of 2 or less as trivial, since they arise naturally from the binary nature of derivation trees, and are not due to extra waiting time in the priority queue (Graf and Marcinek, 2014).

Defining size in an informal way is slightly trickier, as it was originally based on how information about movers is stored by Stabler's top-down parser (for a technical discussion, see Graf et al., 2015). In practice, size measures the hierarchical length of a movement dependency, and is computed as the index of a mover minus the index of its target site. Considering again the tree in Fig. 1b, the size of Elmo is 6 − 3 = 3.
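All three notions can be read off the index/outdex annotation. The sketch below is our own illustrative code, with nodes represented as (label, index, outdex) triples and movement dependencies as (mover index, target index) pairs; the numbers are the Fig. 1b values reported in the text.

```python
TRIVIAL_TENURE = 2  # tenure of 2 or less follows from binary branching alone

def tenure(node):
    """Tenure of an annotated node (label, index, outdex): outdex - index."""
    _label, index, outdex = node
    return outdex - index

def payload(nodes):
    """Number of nodes held in memory non-trivially (tenure > 2)."""
    return sum(1 for n in nodes if tenure(n) > TRIVIAL_TENURE)

def size(move):
    """Size of a movement dependency: index of the mover minus the index
    of its target site."""
    mover_index, target_index = move
    return mover_index - target_index

# Values reported in the text for Fig. 1b.
print(tenure(("does", 3, 12)))  # 9
print(size((6, 3)))             # 3, the subject movement of 'Elmo'
```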

In order to contrast derivations, past work has used these general concepts to define a vast set of complexity metrics measuring processing difficulty over a full tree (Kobele et al., 2013). For instance, tenure can be associated with metrics like MAXT := max({tenure-of(n)}) and SUMT := Σₙ tenure-of(n). MAXT measures the maximum amount of time any node stays in memory during processing, while SUMT measures the overall amount of memory usage for all nodes whose tenure is not trivial. It thus captures total memory usage over the course of a parse. As an illustrative example, consider one last time the tree in Fig. 1b. Tenure in this tree is mostly driven by the movement of the embedded object, thus MAXT is measured at does and is equal to 12 − 3 = 9. Similar metrics can be defined for size. For instance, in Fig. 1b SUMS is given by the length of the object movement and the length of the subject movement: (8 − 1) + (6 − 3) = 10.
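Under the same assumed representation as above, these tree-level metrics are simple aggregations over the per-node and per-dependency values; a minimal sketch:

```python
def max_tenure(nodes):
    """MAXT: the longest any single node stays in memory."""
    return max(outdex - index for _label, index, outdex in nodes)

def sum_tenure(nodes, trivial=2):
    """SUMT: summed tenure of all nodes whose tenure is not trivial."""
    return sum(outdex - index for _label, index, outdex in nodes
               if outdex - index > trivial)

def sum_size(moves):
    """SUMS: summed size over all movement dependencies of the tree."""
    return sum(mover - target for mover, target in moves)

# For Fig. 1b the text reports MAXT = 12 - 3 = 9 (at 'does') and
# SUMS = (8 - 1) + (6 - 3) = 10.
print(sum_size([(8, 1), (6, 3)]))  # 10
```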

These metrics have been surprisingly successful in accounting for a vast array of different processing phenomena, such as right embedding vs. center embedding, nested dependencies vs. crossing dependencies, as well as a set of contrasts involving relative clauses (Graf and Marcinek, 2014; Graf et al., 2015). However, Graf et al. (2015) argue that a better approach would make use of ranked metrics of the type ⟨M1, M2, ..., Mn⟩. Such rankings work in a way similar to constraint ranking in Optimality Theory (Prince and Smolensky, 2008): a lower-ranked metric matters only if all higher-ranked metrics have failed to pick out a unique winner (e.g., if two constructions result in a tie over MAXT). Following this idea, Graf et al. (2017) show that when complexity metrics are allowed to be ranked in such a way, the space of possible metrics quickly explodes (up to 1600 distinct metrics). Considering the total number of possible metrics, it is conceivable that some metric combination could explain any hypothetical processing asymmetry — thus reducing the explanatory power of the model. However, this does not seem to be the case. Graf et al. (2017) rule out the vast majority of these metrics, by showing their insufficiency in accounting for some crucial constructions across a variety of grammatical analyses.

Here, then, we rely on previous work and focus on the predictions made by a ranked version of ⟨MAXT, SUMS⟩ in comparing memory burden for contrasting sentences (Zhang, 2017; Liu, 2018; Lee, 2018; De Santo, 2019; De Santo and Shafiei, 2019). In addition, our core linking hypothesis connects processing difficulty to acceptability by assuming that higher memory cost implies lower acceptability.
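Because tuples compare lexicographically, such a ranked metric can be sketched as a tuple comparison in which SUMS is consulted only when MAXT ties. The snippet below is an illustration under our assumed representation, with lower memory cost read as higher predicted acceptability; the numeric values in the example are arbitrary.

```python
def ranked_cost(maxt, sums):
    """Memory cost of a sentence under the ranked metric <MAXT, SUMS>."""
    return (maxt, sums)

def more_acceptable(cost_a, cost_b):
    """Lexicographic comparison: SUMS is consulted only if MAXT ties.
    Lower memory cost is linked to higher acceptability."""
    if cost_a < cost_b:  # Python compares tuples lexicographically
        return "a"
    if cost_b < cost_a:
        return "b"
    return "tie"

# Example: the two sentences tie on MAXT, so SUMS decides in favor of 'a'.
print(more_acceptable(ranked_cost(11, 9), ranked_cost(11, 14)))  # a
```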

3 Gradient Acceptability in Syntactic Islands

Given the metrics' sensitivity to minor differences in syntactic structure, the MG parser's predictions are the most interpretable when used to compare the relative complexity of minimally different sentences. Careful comparisons across sentences as similar as possible in their underlying syntactic structure also seem desirable if we want to understand the source of gradient variation in acceptability judgments. For these reasons, we chose to model the data on the acceptability of syntactic islands collected by Sprouse et al. (2012a) (henceforth SWP), in a first investigation of the viability of the parser as a model of gradient acceptability.

Syntactic islands are well-known in linguistics (Chomsky, 1965; Ross, 1968) as a set of phenomena in which the acceptability of a sentence is degraded, in relation to the interaction of a long-distance dependency and its syntactic context. Consider the following sentences:

(1) a. What_i did John say Bill saw t_i?
    b. What_i did John have dinner before Bill saw t_i?

In 1a, what is displaced from its lower position as the object of the verb saw to a sentence-initial position. In 1b, this same displacement cannot take place, as what is inside an adjunct clause (headed by before). Thus, 1b is considered ill-formed by native speakers of standard American English. Since displacing an element from inside an adjunct leads to ungrammaticality, adjunct clauses are a classic example of island structures.

SWP conducted an extensive investigation of the acceptability of island constructions, by collecting formal acceptability judgments for four island types using a magnitude estimation task. The acceptability contrasts in this study are optimal for our purposes for multiple reasons. First, while a categorical grammar would predict a binary split in sentence acceptability (violates an island/doesn't violate an island), the continuous scale the estimation task was based upon revealed a spectrum of gradient judgments. Second, the stimuli in SWP's design were based on a (2 × 2) factorial definition of island effects, and explicitly identify two structural factors that might affect acceptability: 1) the length of a movement dependency; 2) the presence of a so-called "island construction" (Kluender and Kutas, 1993). This careful dimensional decomposition of the test sentences, coupled with the continuous scale of the judgment task, resulted in a set of well-defined pairwise comparisons ideal for the MG parser's modeling approach.

In what follows, we test whether the gradient of acceptability shown in SWP's data is predicted by a parser grounded in a rich categorical grammar. Before proceeding with our analysis, though, it seems important to make an additional note about our aims. An expert reader might know that there is an ongoing debate in the literature about the nature of island effects (see, for instance, Hofmeister et al., 2012a; Sprouse et al., 2012b; Hofmeister et al., 2012b, and references therein), with classical syntactic accounts rooting them in grammatical constraints, and others arguing that such effects can be reduced to a conspiracy of processing factors.

Importantly, we are not attempting to reduce these effects to processing demands and, at least at this stage, it is not our purpose to directly engage with this debate. For the same reasons, we do not investigate the super-additivity found in SWP's paper, as we are not interested in modeling the grammaticality of an island violation per se. Relatedly, we do not claim that the acceptability of island violations is purely syntactic in nature, as it has been shown to be sensitive to a variety of semantic factors (Truswell, 2011; Kush et al., 2018; Kohrt et al., 2018, a.o.). Crucially, we are "just" interested in exploring the idea that the gradient component of acceptability judgments arises due to processing factors. We focus on island effects exclusively because of the optimal baseline offered by SWP's data.

We will return in Sec. 5 to the question of whether our model could offer any insight into separating processing and grammatical contributions to island effects.

4 Modeling Results

SWP focused on English wh-movement dependencies to explore four types of island constructions: Subject, Adjunct, Complex NP, and Whether islands. Since the MG parser is only sensitive to structural differences, in this paper we ignore the case of Whether islands and concentrate on the remaining three cases. Table 1 presents a summary of all modeling contrasts in the paper, compared with the experimental results of SWP.³

4.1 Subject Island: Case 1

First, we model Subject islands as in SWP's Experiment 1, comparing 4 sentence types across 2 conditions: subject/object extraction, and island/non-island. Note that here island does not imply a violation, but refers to the presence of an island structure (Kluender and Kutas, 1993).

³ All scripts are available at https://github.com/CompLab-StonyBrook/mgproc.

Island Type               Sprouse et al. (2012a)   MG Parser
Subject Island (Case 1)   2b > 2a                  ✓
                          2b > 2d                  ✓
                          2b > 2c                  ✓
                          2a > 2c                  ✓
                          2a > 2d                  ✓
                          2c > 2d                  2c < 2d
Subject Island (Case 2)   3a > 3b                  ✓
                          3a > 3c                  ✓
                          3a > 3d                  ✓
                          3b > 3d                  ✓
                          3c > 3b                  ✓
                          3c > 3d                  ✓
Adjunct Island            4a > 4b                  ✓
                          4a > 4c                  ✓
                          4a > 4d                  ✓
                          4b > 4d                  ✓
                          4c > 4b                  ✓
                          4c > 4d                  ✓
Complex NP Island         5a > 5b                  ✓
                          5a = 5c                  ✓
                          5a > 5d                  ✓
                          5b > 5d                  ✓
                          5c > 5b                  ✓
                          5c > 5d                  ✓

Table 1: Summary of results (as pairwise comparisons) from Sprouse et al. (2012a), and corresponding parser predictions (x > y: x more acceptable than y; ✓ marks a match with the experimental result).

(2) a. What do you think the speech interrupted t? [Obj/Non Island]
    b. What do you think t interrupted the show? [Subj/Non Island]
    c. What do you think the speech about global warming interrupted the show about t? [Obj/Island]
    d. What do you think the speech about t interrupted the show about global warming? [Subj/Island]

Annotated MG derivation trees for these sentences are shown in Fig. 2 (object/subject with no island) and Fig. 3 (with island).⁴ The parser's predictions (via MAXT) overall match the experimental results (see Table 1).⁵

⁴ Due to space constraints, annotated derivations are provided just for the Subject island case, as an illustrative example. Derivations for all other island types can be easily reconstructed from standard minimalist analyses of the test sentences (e.g., Adger, 2003). Source files can also be found at https://github.com/aniellodesanto/mgproc/tree/master/islands.

⁵ When a wh-element is displaced from an embedded position, we avoid intermediate landing sites due to successive cyclicity. As intermediate movement steps do not affect the traversal strategy, this choice does not significantly change our results (cf. Zhang, 2017).


[Figure 2: Annotated derivation trees for (a) 2a (object, non-island) and (b) 2b (subject, non-island).]

The factorial design of the original study helps us understand the model's predictions. The contrast between 2b and 2a, 2d is correctly captured by MAXT. This is due to the wh-element spanning a longer, more complex structure comprising the whole embedded DP subject in the Island cases. Compare 2a and 2b, both with highest tenure on do (14 and 11, respectively — cf. Tbl. 2). In 2a, do is conjectured after what has been scanned from the input. But then it cannot be flushed out of memory until what is confirmed in its base position as the embedded complement. In 2b, do only has to wait until the embedded subject position is reached, and then it is discarded from memory.

Consider now 2c. Here the highest tenure is on the embedded T head, which has to wait for the wh-element in object position, and then for the whole complex DP in subject position, before it can finally be flushed out of the queue. The longer wh-dependency in the object case explains once again why 2b is preferred over 2c, and the additional complexity of the DP subject is crucial in driving the 2b > 2c contrast.

Clause Type         Ex. #   MAXT    SUMS
Obj./Non Island     2a      14/do   19
Subj./Non Island    2b      11/do   14
Obj./Island         2c      23/T2   22
Subj./Island        2d      15/do   20
Short/Non Island    3a      5/C     9
Long/Non Island     3b      11/do   14
Short/Island        3c      11/T2   9
Long/Island         3d      17/T2   20

Table 2: Summary of MAXT (value/node) and SUMS by test sentence for Subject islands in Case 1 and Case 2 (T2 marks the embedded T head).
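For concreteness, ranking the Case 1 sentences by the MAXT values in Table 2 reproduces the order the parser predicts, including the one reversal relative to SWP discussed next (an illustrative sketch; lower MAXT is read as more acceptable):

```python
# MAXT values from Table 2 for the Case 1 sentences.
case1_maxt = {"2a": 14, "2b": 11, "2c": 23, "2d": 15}

# Lower MAXT is linked to higher acceptability.
ranking = sorted(case1_maxt, key=case1_maxt.get)
print(" > ".join(ranking))  # 2b > 2a > 2d > 2c
```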

Finally, there is one case in which the parser's predictions and the experimental data disagree: the contrast between subject and object extraction in the island condition (2c vs. 2d). The parser predicts that 2d should be more acceptable than 2c (Subj/Island > Obj/Island). This is not surprising, as the memory metrics pick up on the additional length of the extraction in the object case, and thus obviously predict the preference for a subject gap. However, SWP show Obj/Island > Subj/Island — which is expected from a theoretical perspective, since 2d is the ungrammatical condition (i.e., there is an extraction out of an island).

[Figure 3: Annotated derivation trees for the test sentences in (a) 2c (object, island) and (b) 2d (subject, island).]

We will come back to the significance of this mismatch in Sec. 5. Crucially for our main claim, though, the parser correctly predicts the gradient of acceptability for those conditions that, according to a categorical grammar, should all be equivalent (i.e., those containing no forbidden extraction).

4.2 Subject Island: Case 2

The previous section suggests that, when a grammatical violation coincides with processing factors (e.g., length of a dependency), parser and human judgments should match on all contrasts. Luckily, SWP offer us the chance to test such a prediction, with a second set of subject island sentences. SWP's Experiment 2 compares a short dependency and a long dependency (matrix vs. embedded extraction in the original paper), again in an island and a non-island condition.

(3) a. Who t thinks the speech interrupted the primetime TV show? [Short/Non Island]
    b. What do you think t interrupted the primetime TV show? [Long/Non Island]
    c. Who t thinks the speech about global warming interrupted the primetime TV show? [Short/Island]
    d. What do you think the speech about t interrupted the primetime TV show? [Long/Island]

As expected, the parser's preferences and the experimental data fully match in this case, as the ungrammatical condition (3d) is also the one in which the movement dependency is the longest. Here, however, deriving the correct preferences requires the ranking ⟨MAXT, SUMS⟩, instead of just MAXT alone (note also that SUMS by itself would not suffice, as it would not predict 3a > 3c, cf. Tbl. 2). Such a ranking also preserves the results in the previous section, which fully relied on MAXT. Interestingly, note how MAXT values for 3b (Long/Non Island) and 3c (Short/Island) tie here, as the additional structural complexity of 3c does not interact with the main movement dependency (who raising from Spec,TP to Spec,CP). Moreover, the Short/Non Island (3a) and Short/Island (3c) conditions have very similar structures (with an extraction out of the main subject). Nonetheless, the memory metrics are able to capture subtle differences in the way the parser goes through the two sentences (arguably capturing the "island construction" cost of Kluender and Kutas (1993)).
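To see the ranking at work on this data, one can plug the MAXT and SUMS values of Table 2 into the lexicographic comparison sketched in Section 2.3 (again illustrative code, with lower cost read as higher predicted acceptability):

```python
# (MAXT, SUMS) pairs from Table 2 for the Case 2 sentences.
case2 = {
    "3a": (5, 9),    # Short / Non-Island
    "3b": (11, 14),  # Long / Non-Island
    "3c": (11, 9),   # Short / Island
    "3d": (17, 20),  # Long / Island
}

# Tuples compare lexicographically, so MAXT is consulted first and SUMS
# breaks ties: 3b and 3c tie on MAXT (11), and SUMS then prefers 3c.
ranking = sorted(case2, key=case2.get)
print(" > ".join(ranking))  # 3a > 3c > 3b > 3d
```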

4.3 Adjunct and Complex NP Islands

So far, we have been successful in replicating SWP's acceptability judgments via the MG parser. However, we might wonder whether this success is due to something peculiar in the way the Subject island test cases interact with the MG parsing strategy. Thus, we tested the MG parser on Adjunct and Complex NP islands, again using as a baseline the results in SWP's Experiment 1. The test sentences for the adjunct case were as follows:

(4) a. Who t thinks that John left his briefcase at the office? [Short/Non Island]
    b. What do you think that John left t at the office? [Long/Non Island]
    c. Who t laughs if John leaves his briefcase at the office? [Short/Island]
    d. What do you laugh if John leaves t at the office? [Long/Island]

As for Subject islands in Case 2, ⟨MAXT, SUMS⟩ correctly predicts the pattern of acceptability reported by SWP, matching the empirical results across all conditions (cf. Tbl. 1). Similar results are obtained for the Complex NP island, with test sentences as follows:

(5) a. Who t claimed that John bought a car? [Short/Non Island]
    b. What did you claim that John bought t? [Long/Non Island]
    c. Who t made the claim that John bought a car? [Short/Island]
    d. What did you make the claim that John bought t? [Long/Island]

Clause Type         Ex. #   MAXT     SUMS
Short/Non Island    4a      13/PP    10
Long/Non Island     4b      17/PP    18
Short/Island        4c      13/PP    11
Long/Island         4d      21/PP    28
Short/Non Island    5a      5/C      9
Long/Non Island     5b      13/did   19
Short/Island        5c      5/C      9
Long/Island         5d      15/did   21

Table 3: Adjunct Island and Complex NP Island: MAXT (value/node) and SUMS values by test sentence.

Once more, the parser matches the acceptability preferences reported in SWP correctly in all conditions. Particularly interesting is the absence of a contrast between 5a and 5c. This is again due to the absence of a real interaction between the additional structural complexity of the island and the main movement dependency. The fact that this results in a tie stresses how movement dependencies and structural complexity conspire with the top-down strategy of the MG parser in non-trivial ways to drive memory cost.

5 Discussion

This paper argues for an MG parser as a good, non-probabilistic formal model of how gradient acceptability can be derived from categorical grammars. In doing so, we provide one of the first quantitative models of how processing factors and fine-grained, minimalist-like grammatical information can conspire to modulate acceptability. As a proof-of-concept, we replicated the gradient acceptability scores for the island effects in Sprouse et al. (2012a). These results are certainly preliminary, but the success of the parser on this baseline is encouraging.

As mentioned in the Introduction, many hypotheses have been formulated in the past about the way memory and grammatical factors conspire to produce processing differences across sentences. Thus, it is reasonable to wonder about the benefits of the particular linking hypothesis implemented here. As we pointed out before, one of the main advantages of our model is the tight connection between the parser's behavior and the rich grammatical information encoded in the MG derivation trees. This allows for rigorous evaluations of the cognitive claims made by modern syntactic theories.

In line with recent work using the MG parser as a model of processing difficulty, Section 4 focused on the predictions made by MAXT and SUMS. Clearly, one could easily conceive of metrics that take different syntactic information into account (for example, by counting the number of bounding nodes or phases). However, tenure and size arguably rely on the simplest possible connection between memory, structure, and parsing behavior — as they exclusively refer to the geometry of a derivation tree, without additional assumptions about the nature of its nodes.

Of course, a question remains about the cognitive plausibility of such metrics. While this model is certainly not the first to formalize memory cost as associated with the length of movement dependencies, the previous discussion highlighted how size-centered metrics do not simply depend on the length of a movement step. Instead, they pick up on the non-trivial changes in the behavior of the parser, based on how long-distance dependencies interact with local structural configurations. Thus, they cannot trivially be identified with other length-based measures (cf. Gibson, 1998; Rambow and Joshi, 2015, a.o.). As previous work points out, in the future it will be important to explore the relation between these complexity metrics and psychological insights about the nature of human memory mechanisms (De Santo, 2019).

Similarly, as one reviewer suggests, it would be interesting to see whether SWP's results can be derived from different cognitive hypotheses; for instance, by implementing in the MG model the variety of constraints explored by Boston (2012) for a dependency parser. Moreover, in this study we employ a deterministic parser to exclusively focus on the relation between structural complexity and memory usage. However, it is known that structural and lexical frequency influence islands' acceptability (Chaves and Dery, 2019, a.o.). Thus, informative insights would come from implementing information-theoretical complexity metrics over the MG parser (Hale, 2016), and exploring the predictions of expectation-based approaches.

Obviously, the target judgments modeled here are part of a restricted set. Future studies in this sense will benefit from wider comparisons among minimally different variants of acceptable and unacceptable sentences (cf. Sprouse et al., 2013, 2016). As mentioned, the nature of the model makes comparisons beyond pairs of minimal sentences hard to interpret. However, in the future it might be possible to define normalization measures for memory metrics computed over sentences with widely different underlying structures.

Finally, in Section 3 we avoided discussing the nature of island effects, as we do not mean for the MG model to address the debate of whether island violations are reducible to processing factors, or are instead tied to core grammatical constraints. Importantly, while this approach might superficially be construed as a reductionist theory, it is not: for instance, the MG parser by itself is not able to explain the difference between sentences that are simply hard to process, and sentences considered unacceptable/ungrammatical. Thus, the model is theoretically neutral with respect to grammatical or reductionist frameworks.

However, consider the first case of Subject islands we analyzed in Sec. 4. The parser produced the right predictions for all test sentences except when, in the presence of an island construction, the longest movement dependency and the island violation did not coincide (2c and 2d). This mismatch is not only explained, but is actually expected, if we embrace a grammatical theory of island constraints. Under such a theory, 2d is preferable from a processing perspective (as it involves shorter dependencies), but its acceptability is lowered by the fact that it violates a grammatical constraint, while 2c does not.

While we have to be careful in formulating hypotheses based on a single data point, this contrast suggests that the MG model could help us investigate those aspects of acceptability that are fundamentally tied to grammatical constraints.

Acknowledgements

Sincere thanks to Thomas Graf, Jon Sprouse, and three anonymous reviewers for insightful feedback on this work.

References

David Adger. 2003. Core Syntax: A Minimalist Approach, volume 33. Oxford: Oxford University Press.

Marisa Ferrara Boston. 2010. The role of memory in superiority violation gradience. In Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics, pages 36–44. Association for Computational Linguistics.

Marisa Ferrara Boston. 2012. A Computational Model of Cognitive Constraints in Syntactic Locality. Ph.D. thesis, Cornell University.

Rui P. Chaves and Jeruen E. Dery. 2019. Frequency effects in subject islands. Journal of Linguistics, 55(3):475–521.

Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2:113–124.

Noam Chomsky. 1965. Aspects of the Theory of Syntax, volume 11. MIT Press.

Matthew W. Crocker and Frank Keller. 2005. Probabilistic grammars as models of gradience in language processing. In Gradience in Grammar: Generative Perspectives. Citeseer.

Aniello De Santo. 2019. Testing a Minimalist grammar parser on Italian relative clause asymmetries. In Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics (CMCL) 2019, June 6 2019, Minneapolis, Minnesota.

Aniello De Santo and Nazila Shafiei. 2019. On the structure of relative clauses in Persian: Evidence from computational modeling and processing effects. Talk at the 2nd North American Conference in Iranian Linguistics (NACIL2), April 19–21 2019, University of Arizona.

Sabrina Gerth. 2015. Memory Limitations in Sentence Comprehension: A Structural-based Complexity Metric of Processing Difficulty, volume 6. Universitätsverlag Potsdam.

Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1):1–76.

Edward Gibson. 2000. The dependency locality theory: A distance-based theory of linguistic complexity. In Image, Language, Brain: Papers from the First Mind Articulation Project Symposium, pages 95–126. MIT Press.

Thomas Graf, Brigitta Fodor, James Monette, Gianpaul Rachiele, Aunika Warren, and Chong Zhang. 2015. A refined notion of memory usage for minimalist parsing. In Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015), pages 1–14, Chicago, USA. Association for Computational Linguistics.

Thomas Graf and Bradley Marcinek. 2014. Evaluating evaluation metrics for minimalist parsing. In Proceedings of the 2014 ACL Workshop on Cognitive Modeling and Computational Linguistics, pages 28–36.

Thomas Graf, James Monette, and Chong Zhang. 2017. Relative clauses as a benchmark for Minimalist parsing. Journal of Language Modelling, 5:57–106.

John Hale. 2016. Information-theoretical complexity metrics. Language and Linguistics Compass, 10(9):397–412.

Henk Harkema. 2001. A characterization of minimalist languages. In International Conference on Logical Aspects of Computational Linguistics, pages 193–211. Springer.

Philip Hofmeister, Laura Staum Casasanto, and Ivan A. Sag. 2012a. How do individual cognitive differences relate to acceptability judgments? A reply to Sprouse, Wagers, and Phillips. Language, pages 390–400.

Philip Hofmeister, Laura Staum Casasanto, and Ivan A. Sag. 2012b. Misapplying working-memory tests: A reductio ad absurdum. Language, 88(2):408–409.

Frank Keller. 2000. Gradience in Grammar: Experimental and Computational Aspects of Degrees of Grammaticality. Ph.D. thesis.

Robert Kluender and Marta Kutas. 1993. Subjacency as a processing phenomenon. Language and Cognitive Processes, 8(4):573–633.

Gregory M. Kobele, Sabrina Gerth, and John Hale. 2013. Memory resource allocation in top-down minimalist parsing. In Formal Grammar, pages 32–51. Springer.

Annika Kohrt, Trey Sorensen, and Dustin A. Chacon. 2018. The real-time status of semantic exceptions to the adjunct island constraint. In Proceedings of WECOL 2018: Western Conference on Linguistics.

Dave Kush, Terje Lohndal, and Jon Sprouse. 2018. Investigating variation in island effects. Natural Language & Linguistic Theory, 36(3):743–779.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2014. Measuring gradience in speakers' grammaticality judgements. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2015. Unsupervised prediction of acceptability judgements. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, pages 1618–1628.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 41(5):1202–1241.

So Young Lee. 2018. A minimalist parsing account of attachment ambiguity in English and Korean. Journal of Cognitive Science, 19(3):291–329.

Richard L. Lewis and Shravan Vasishth. 2005. An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29(3):375–419.

Lei Liu. 2018. Minimalist parsing of Heavy NP Shift. In Proceedings of PACLIC 32: The 32nd Pacific Asia Conference on Language, Information and Computation, The Hong Kong Polytechnic University, Hong Kong SAR.

Brian McElree, Stephani Foraker, and Lisbeth Dyer. 2003. Memory structures that subserve sentence comprehension. Journal of Memory and Language, 48(1):67–91.

Jens Michaelis. 1998. Derivational minimalism is mildly context-sensitive. In International Conference on Logical Aspects of Computational Linguistics, pages 179–198. Springer.

Alan Prince and Paul Smolensky. 2008. Optimality Theory: Constraint Interaction in Generative Grammar. John Wiley & Sons.

Owen Rambow and Aravind K. Joshi. 2015. A processing model for free word-order languages. In Perspectives on Sentence Processing.

Luigi Rizzi. 1990. Relativized Minimality. The MIT Press.

J. R. Ross. 1968. Constraints on Variables in Syntax. Ph.D. dissertation, MIT.

Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115(11):1497–1524.

Jon Sprouse. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. Biolinguistics, 1:123–134.

Jon Sprouse, Ivano Caponigro, Ciro Greco, and Carlo Cecchetto. 2016. Experimental syntax and the variation of island effects in English and Italian. Natural Language & Linguistic Theory, 34(1):307–344.

Jon Sprouse, Carson T. Schütze, and Diogo Almeida. 2013. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua, 134:219–248.

Jon Sprouse, Matt Wagers, and Colin Phillips. 2012a. A test of the relation between working-memory capacity and syntactic island effects. Language, 88(1):82–123.

Jon Sprouse, Matt Wagers, and Colin Phillips. 2012b. Working-memory capacity and island effects: A reminder of the issues and the facts. Language, 88(2):401–407.

Jon Sprouse, Beracah Yankama, Sagar Indurkhya, Sandiway Fong, and Robert C. Berwick. 2018. Colorless green ideas do sleep furiously: Gradient acceptability and the nature of the grammar. The Linguistic Review, 35(3):575–599.

Edward P. Stabler. 1997. Derivational minimalism. In International Conference on Logical Aspects of Computational Linguistics, pages 68–95. Springer.

Edward P. Stabler. 2011. Computational perspectives on minimalism. In The Oxford Handbook of Linguistic Minimalism.

Edward P. Stabler. 2013. Two models of minimalist, incremental syntactic analysis. Topics in Cognitive Science, 5(3):611–633.

Robert Truswell. 2011. Events, Phrases, and Questions, volume 33. Oxford University Press.

Heinz Wanner and Michael P. Maratsos. 1978. An ATN approach to comprehension. In Linguistic Theory and Psychological Reality. MIT Press.

Victor H. Yngve. 1960. A model and an hypothesis for language structure. Proceedings of the American Philosophical Society, 104(5):444–466.

Chong Zhang. 2017. Stacked Relatives: Their Structure, Processing and Computation. Ph.D. thesis, State University of New York at Stony Brook.
