Grammatical Noriegas: interaction in corpora and treebanks. ICAME 30, Lancaster, 27-31 May 2009. Sean Wallis, Survey of English Usage, University College London. [email protected]


Page 1: Grammatical Noriegas: interaction in corpora and treebanks

ICAME 30

Lancaster, 27-31 May 2009

Sean Wallis, Survey of English Usage

University College London

[email protected]

Page 2: Outline

• The probability of Noriega

• What can a parsed corpus tell us?

• Individual choices

• Repeating choices

• Potential sources of interaction

• Case interaction

• LITEs

• What use is interaction evidence?

Page 3: The probability of Noriega (Church 2000)

• Ken Church looked at word frequency in corpus data
  – Method
    • Find the probability of a word occurring overall, pr(w)
    • Divide each text into two halves: T1, T2
    • Q: What is the probability of the word in T2 if it has already been found in T1, pr(w in T2 | w in T1)?
  – Result
    • ‘Content words’ like Noriega leap in probability if seen before: pr(w in T2 | w in T1) >> pr(w in T2)
    • Pronouns, determiners, etc.: no change

[Figure: a text divided into two halves, T1 and T2]
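The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Church's code or data: it assumes each text is a list of tokens split at its midpoint, and the function name `adaptation` and the toy texts are hypothetical. Both probabilities are estimated as proportions of texts.

```python
def adaptation(texts, word):
    """Return (pr(w in T2), pr(w in T2 | w in T1)) estimated over texts.

    Assumption: each text is a token list, split into halves at its midpoint.
    """
    halves = [(set(t[:len(t) // 2]), set(t[len(t) // 2:])) for t in texts]
    in_t1 = sum(1 for t1, _ in halves if word in t1)
    in_t2 = sum(1 for _, t2 in halves if word in t2)
    in_both = sum(1 for t1, t2 in halves if word in t1 and word in t2)
    prior = in_t2 / len(halves)
    # The conditional is undefined if the word never occurs in any first half.
    conditional = in_both / in_t1 if in_t1 else float("nan")
    return prior, conditional


# Toy texts, invented for illustration only.
TEXTS = [
    "noriega fled and the army said noriega was gone".split(),
    "the army said the talks were over".split(),
    "the talks and the army were quiet today".split(),
    "the vote was held and the count began".split(),
]

print(adaptation(TEXTS, "noriega"))  # content word
print(adaptation(TEXTS, "the"))      # determiner
```

In this toy data the content word's conditional probability is well above its prior, while the determiner's is unchanged, which mirrors the pattern the slide reports.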

Page 4: What can a parsed corpus tell us?

• Parsed corpora contain (lots of) trees
  – Use Fuzzy Tree Fragment (FTF) queries to get data
  – An FTF
  – A matching case in a tree
  – Using ICECUP

Page 5: What can a parsed corpus tell us?

• Three kinds of evidence may be obtained from a parsed corpus
  – Frequency evidence of a particular known rule, structure or linguistic event
  – Coverage evidence of new rules, etc.
  – Interaction evidence of the relationship between rules, structures and events
• Evidence is necessarily framed within a particular grammatical scheme
  – So… (an obvious question) how might we evaluate this grammar?

Page 6: Individual choices (Nelson, Wallis & Aarts 2002)

• What factors affect a lexical / grammatical choice?
  – experiment: does IV affect DV?
    • Independent Variable (IV) = sociolinguistic or grammatical
    • Dependent Variable (DV) = grammatical alternation
  – carry out a χ² test
  – e.g. does the type of preceding NP head affect the choice between relative and non-finite postmodification?
      people who live in Hawaii vs. those living in Hawaii
  – a significant but small interaction
  – for more complex experiments, repeat with multiple variables (ICECUP IV)

                   DV: postmodifying clause
IV: NP head        non-fin.     rel.       Total
N                  6,790        6,193      12,983
PRON                 771          446       1,217
Total              7,561        6,639      14,200
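The χ² test on this table can be reproduced with the standard library alone. The sketch below uses the four cell counts from the slide. The function name `chi_square_2x2` is illustrative, and the choice of a plain Pearson χ² without continuity correction is an assumption, since the slide does not say which variant was used.

```python
import math

# Cell counts from the slide: rows are the IV (N head, PRON head),
# columns are the DV in the order printed on the slide.
TABLE = [[6790, 6193],
         [771, 446]]


def chi_square_2x2(table):
    """Pearson chi-square for a 2 x 2 table (no continuity correction: an assumption)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


chi2 = chi_square_2x2(TABLE)
n = sum(map(sum, TABLE))
phi = math.sqrt(chi2 / n)  # effect size for a 2 x 2 table

CRITICAL_1DF_005 = 3.841   # chi-square critical value, 1 degree of freedom, p = 0.05
print(f"chi2 = {chi2:.2f}, significant: {chi2 > CRITICAL_1DF_005}, phi = {phi:.3f}")
```

The statistic is far above the critical value while the effect size stays small (about 0.06), which is consistent with the slide's "significant but small interaction".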

Page 7: Repeating choices (Wallis, submitted)

• Construction often involves repetition
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the tall white ship

Page 8: Repeating choices (Wallis, submitted)

• Construction often involves repetition
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the tall white ship

[Figure: the ship, + AJP gives the tall ship, + AJP gives the tall white ship]

Page 9: Repeating choices (Wallis, submitted)

• Construction often involves repetition
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the tall white ship
• Sequential probability analysis
  – calculate probability of adding each AJP

[Figure: the ship, + AJP gives the tall ship, + AJP gives the tall white ship]

Page 10: Repeating choices (Wallis, submitted)

• Construction often involves repetition
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the tall white ship
• Sequential probability analysis
  – calculate probability of adding each AJP
  – probability falls
    • second < first
    • third < second
    • fourth < second
  – choices interact
  – a feedback loop

[Figure: line plot of probability (y-axis, 0.00 to 0.25) against x-axis values 0 to 5]
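One way to read "calculate probability of adding each AJP" is as a ratio of successive cumulative frequencies: the probability of adding the k-th AJP is the number of NPs with at least k AJPs divided by the number with at least k-1. The sketch below assumes that reading. The function name and the counts are hypothetical and are not taken from the corpus.

```python
def sequential_probabilities(at_least):
    """at_least[k] = number of NPs with at least k attributive AJPs.

    Returns the probability of adding the 1st, 2nd, 3rd... AJP, each
    conditional on the previous one having been added (assumed definition).
    """
    return [at_least[k] / at_least[k - 1] for k in range(1, len(at_least))]


# Hypothetical counts, for illustration only.
counts = [1000, 200, 20, 1]
probs = sequential_probabilities(counts)
print(probs)

# If successive choices were independent, these values would be roughly equal;
# a steady fall suggests that each choice affects the next.
falling = all(later < earlier for earlier, later in zip(probs, probs[1:]))
print("probability falls at every step:", falling)
```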

Page 11: Repeating choices - more examples

Adjectives before a noun
  • similar to AJPs before a noun NP head

AVPs before a verb
  • no interaction

NP postmodification, embedded vs. multiple
  • both interact
  • the probability of postmodification of the same head falls faster than that for embedding

[Figure: line plot of probability (y-axis, 0.00 to 0.06) against x-axis values 0 to 4, with two series: multiple and embedded]

Page 12: Potential sources of interaction

• shared context
  – topic or ‘content words’ (Noriega)
• idiomatic conventions
  – semantic ordering of attributive adjectives (tall white ship)
• logical semantic constraints
  – exclusion of incompatible adjectives (?tall short ship)
• communicative constraints
  – brevity on repetition (just say ship next time)
• psycholinguistic processing constraints
  – attention and memory of speakers

Page 13: Case interaction (new research)

• Individual choice experiments
  – measure interaction between variables
  – statistics assume that cases are independent
    • we know AJPs in an NP interact: what if we study AJPs?
• Cases from same text may also interact

[Figure labels: variables, cases]

Page 14: Case interaction (new research)

• Cases should be independent: what can we do?
  – ignore problem
  – discount ‘obvious’ duplicate cases
  – randomly subsample
  – take only one case per text
  – score each case by the degree to which it interacts with others from the same text
• We need a model of case interaction
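For comparison with the scoring approach, the simpler option of taking only one case per text can be sketched as follows. This is a minimal illustration: the `(text_id, case)` representation, the function name and the sample cases are all hypothetical. The text identifiers only imitate the style of corpus text codes.

```python
import random
from collections import defaultdict


def one_case_per_text(cases, seed=0):
    """Pick one case at random from each text.

    cases: iterable of (text_id, case) pairs (a hypothetical representation).
    A fixed seed keeps the subsample reproducible.
    """
    rng = random.Random(seed)
    by_text = defaultdict(list)
    for text_id, case in cases:
        by_text[text_id].append(case)
    return {text_id: rng.choice(group) for text_id, group in by_text.items()}


# Invented cases, for illustration only.
CASES = [
    ("S1A-001", "people who live in Hawaii"),
    ("S1A-001", "those living in Hawaii"),
    ("S1A-002", "the man who called"),
    ("W2B-010", "anyone wishing to apply"),
    ("W2B-010", "students who apply late"),
]
sample = one_case_per_text(CASES)
print(sample)
```

This guarantees that no two sampled cases share a text, but it discards most of the data, which is part of the motivation for a model that scores cases instead.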

Page 15: Case interaction (new research)

• An a posteriori model of case interaction
  – classify grammatical relationships between A and B

[Figure: tree with nodes A and B]

Page 16: Case interaction (new research)

• An a posteriori model of case interaction
  – classify grammatical relationships between A and B
  – measure interaction strength dp(A, B) between A and B in each relationship

[Figure: tree with nodes A and B]

Page 17: Case interaction (new research)

• An a posteriori model of case interaction
  – classify grammatical relationships between A and B
  – measure interaction strength dp(A, B) between A and B in each relationship
  – compute marginal probability for each case A from dependent probabilities dp(A, B), dp(A, C)...

[Figure: tree with nodes A and B]

Page 18: Classify grammatical relationships

• Order
  – word order, dominance (parent-child vs. child-parent), etc.
• Topology
  – basic relationship: word, sibling, dominance etc.
• Grammar
  – subclassify topology by grammar
  – e.g. distinguishing co-ordination from other clauses
• Distance
  – steps along an axis and how steps are measured
  – e.g. whether to include all intermediate elements

[Figure: tree with nodes A and B]

Page 19: Measure interaction strength

• Previous experiments involved single events
  – Bayesian probability differences (‘swing’)
    • Noriega ‘content words’: pr(a | b) - pr(a)
    • Repeating choices: pr(a2 | a1) - pr(a1 | a0)
• Interaction between two groups of (alternate) events
  – Difference in probabilities of choice

Page 20: Measure interaction strength

• Previous experiments involved single events
  – Bayesian probability differences (‘swing’)
    • Noriega ‘content words’: pr(a | b) - pr(a)
    • Repeating choices: pr(a2 | a1) - pr(a1 | a0)
• Interaction between two groups of (alternate) events
  – Difference in probabilities of choice
  – Bayesian dependence dpB
    • sum relative probability difference
  – Cramér’s φc
    • based on chi-square (χ²)
    • not affected by direction

[Figure: 2 × 2 case; dpB(A, B), dpB(B, A) and φc (y-axis, 0 to 1) plotted against p (x-axis, 0 to 1)]
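The contrast between a directional probability difference and a direction-free measure can be illustrated with a small sketch. Cramér's φ is computed here with its standard definition, sqrt(χ² / (N × (k - 1))) where k is the smaller table dimension. The `swing` helper is a simplified stand-in for a directional difference and is not the slides' dpB, whose formula is not given. All names and counts are hypothetical.

```python
import math


def chi_square(table):
    """Pearson chi-square for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def cramers_phi(table):
    """Cramér's phi: sqrt(chi2 / (N * (k - 1))), k = min(rows, columns)."""
    n = sum(map(sum, table))
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (k - 1)))


def swing(table):
    """pr(column 0 | row 0) - pr(column 0): a simple directional difference."""
    n = sum(map(sum, table))
    return table[0][0] / sum(table[0]) - sum(r[0] for r in table) / n


# Hypothetical counts; transposing swaps the roles of the two variables.
t = [[30, 10], [20, 40]]
t_swapped = [list(col) for col in zip(*t)]

print(cramers_phi(t), cramers_phi(t_swapped))  # identical in both directions
print(swing(t), swing(t_swapped))              # differs with direction
```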

Page 21: Compute marginal probability

• Find the probability that A is dependent on other cases
  – Suppose two other cases B and C exist with dependent probabilities dp(A, B), dp(A, C), and B and C also interact with Cramér’s φc(B, C)

[Figure: nodes A, B and C, with links labelled dp(A, B), dp(A, C) and φc(B, C)]

Page 22: Compute marginal probability

• Find the probability that A is dependent on other cases
  – Suppose two other cases B and C exist with dependent probabilities dp(A, B), dp(A, C), and B and C also interact with Cramér’s φc(B, C)
  – if φc(B, C) = 1 then dp(A) = maximum dp
  – if φc(B, C) = 0 then dp(A) = area
  – interpolate for other values of φc

[Figure: unit square with sides dp(A, B) and dp(A, C), a region labelled dp(A, B)dp(A, C), and regions labelled dependent and independent]

[Figure: nodes A, B and C, with links labelled dp(A, B), dp(A, C) and φc(B, C)]

Page 23: Compute marginal probability

• Find the probability that A is dependent on other cases
  – Suppose two other cases B and C exist with dependent probabilities dp(A, B), dp(A, C), and B and C also interact with Cramér’s φc(B, C)
  – if φc(B, C) = 1 then dp(A) = maximum dp
  – if φc(B, C) = 0 then dp(A) = area
  – interpolate for other values of φc
• Then compute marginal probability
  – ip(A) = 1 - dp(A) + {dp(A) / 2+φc(B, C)}
• Extend to more than three cases!

[Figure: unit square with sides dp(A, B) and dp(A, C), a region labelled dp(A, B)dp(A, C), and regions labelled dependent and independent]

[Figure: nodes A, B and C, with links labelled dp(A, B), dp(A, C) and φc(B, C)]
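The two boundary conditions for dp(A) and the instruction to interpolate can be sketched as below, under two loudly stated assumptions that the slides do not confirm. First, "area" is taken to be the part of the unit square covered by either dependence, dp(A, B) + dp(A, C) - dp(A, B) × dp(A, C). Second, the interpolation between the two end points is taken to be linear in the interaction between B and C. The function name is hypothetical.

```python
def combined_dependence(dp_ab, dp_ac, phi_bc):
    """Estimate dp(A) from dp(A, B), dp(A, C) and the B-C interaction.

    phi_bc = 1: B and C behave as one case, so take the larger dependence.
    phi_bc = 0: B and C are independent, so take the covered "area".
    """
    maximum = max(dp_ab, dp_ac)
    # ASSUMPTION: "area" read as the union of two independent proportions.
    area = dp_ab + dp_ac - dp_ab * dp_ac
    # ASSUMPTION: linear interpolation between the two end points.
    return phi_bc * maximum + (1 - phi_bc) * area


for phi in (0.0, 0.5, 1.0):
    print(phi, combined_dependence(0.4, 0.2, phi))
```

Under these assumptions dp(A) shrinks from the union value towards the single largest dependence as B and C become more strongly related, so two closely related neighbours count for little more than one.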

Page 24: LITEs (new research)

• Case interaction models
  – classify grammatical relationships
  – measure interaction strength between two choices
• A legitimate experimental method?

Page 25: LITEs (new research)

• Case interaction models
  – classify grammatical relationships
  – measure interaction strength between two choices
• A legitimate experimental method?
  – cf. transmission experiments in physics

[Figure: emitter, medium, receiver]

Page 26: LITEs (new research)

• Case interaction models
  – classify grammatical relationships
  – measure interaction strength between two choices
• A legitimate experimental method?
  – cf. transmission experiments in physics
    • Linguistic interaction transmission experiments?

[Figure: emitter, medium, receiver]

[Figure: tree with nodes B and A, labelled emitter, receiver and medium]

Page 27: LITEs (new research)

• A LITE investigates the interaction between two choices in a defined relationship
  – emitter/receiver
    • non-finite vs. relative clauses
  – medium: up+down distance d via a clause C
    • co-ordinated clauses; other clauses

[Figure: nodes B and A, each a {non-finite, relative} choice, linked via a clause C]

Page 28: LITEs (new research)

• A LITE investigates the interaction between two choices in a defined relationship
  – emitter/receiver
    • non-finite vs. relative clauses
  – medium: up+down distance d via a clause C
    • co-ordinated clauses; other clauses
  – Plot φc over d
    • skip intermediate co-ordination nodes
  – Result
    • co-ordination exhibits >1.5x interaction for this choice

[Figure: φc (y-axis, 0 to 1) against distance d (x-axis, 0 to 9), with two series: co-ordinated clauses and other clauses]
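The plotting step can be sketched as a grouping exercise: collect emitter/receiver pairs by distance d, build a 2 × 2 table for each distance, and compute φ for each table. The sketch below assumes pairs have already been extracted from the corpus as `(d, emitter_choice, receiver_choice)` tuples. That representation, the function names and the data are hypothetical.

```python
import math
from collections import defaultdict

CHOICES = ("non-finite", "relative")


def phi_2x2(table):
    """Phi for a 2 x 2 table: |ad - bc| / sqrt(product of the four marginals)."""
    (a, b), (c, d) = table
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return abs(a * d - b * c) / math.sqrt(denom) if denom else float("nan")


def phi_by_distance(pairs):
    """Group (d, emitter, receiver) observations by d; return {d: phi}."""
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for d, emitter, receiver in pairs:
        tables[d][CHOICES.index(emitter)][CHOICES.index(receiver)] += 1
    return {d: phi_2x2(t) for d, t in sorted(tables.items())}


# Invented observations: strong agreement at d = 1, none at d = 2.
PAIRS = (
    [(1, "non-finite", "non-finite")] * 8 + [(1, "relative", "relative")] * 8
    + [(1, "non-finite", "relative")] * 2 + [(1, "relative", "non-finite")] * 2
    + [(2, e, r) for e in CHOICES for r in CHOICES] * 5
)
print(phi_by_distance(PAIRS))
```

Running the same function separately over pairs linked through co-ordinated clauses and pairs linked through other clauses would give the two series compared on the slide.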

Page 29: What use is interaction evidence?

• New methods for evaluating interaction along grammatical axes
  – General purpose, robust, structural
  – Based on grammar in corpus
  – Classifying grammatical relationships allows us to experiment with the corpus grammar
• Methods have philosophical implications
  – Grammar structure framing linguistic choices
  – Linguistics as an evaluable observational science
    • Signature (trace) of language production decisions
  – A unification of theoretical and corpus linguistics?

Page 30: What use is interaction evidence?

• Corpus linguistics
  – Optimising existing grammar
    • e.g. co-ordination, compound nouns
• Theoretical linguistics
  – Comparing different grammars, same language
  – Comparing different languages or periods
• Psycholinguistics
  – Search for evidence of language production constraints in spontaneous speech corpora
    • speech and language therapy
    • language acquisition and development

Page 31: More information

• Useful links
  – Survey of English Usage
    • www.ucl.ac.uk/english-usage
  – Fuzzy Tree Fragments
    • www.ucl.ac.uk/english-usage/resources/ftfs
  – Individual choice experiments with FTFs
    • www.ucl.ac.uk/english-usage/resources/ftfs/experiment.htm
  – To obtain ICE-GB (or DCPSE)
    • www.ucl.ac.uk/english-usage/resources/sales.htm

• References

Church, K. 2000. Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p². Proceedings of Coling-2000. 180-186.

Nelson, G., Wallis, S.A. & Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.

Wallis, S.A. (submitted). Capturing linguistic interaction in a grammar: a method for empirically evaluating the grammar of a parsed corpus. Language. Available from www.ucl.ac.uk/english-usage/staff/sean/resources/analysing-grammatical-interaction.pdf