corpus linguistics (l615) - computational...

33
Corpus Linguistics Syntactic Annotation and Treebanks Treebanks Remarks Guidelines Treebank issues Theory-dependency Partial analysis Spoken language Constituency & Dependency Examples English treebanks References Corpus Linguistics (L615) Syntactic Annotation and Treebanks Markus Dickinson Department of Linguistics, Indiana University Spring 2013 1 / 31

Upload: vodien

Post on 18-Apr-2018

230 views

Category:

Documents


2 download

TRANSCRIPT

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Corpus Linguistics(L615)

Syntactic Annotation and Treebanks

Markus Dickinson

Department of Linguistics, Indiana UniversitySpring 2013

1 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebanks

A treebank is a syntactically annotated corpus

As with other corpora, they have several general issues:

◮ spoken vs. written language?◮ Spoken language faces unique structural challenges

◮ manual vs. automatic annotation?◮ Parsers are not more than 90% precise, generally

speaking◮ theory-neutral vs. theory-dependent?

◮ Every decision is a theoretical decision

2 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Penn WSJ Treebank – Example

( (S (NP-SBJ (NP Pierre Vinken),(ADJP (NP 61 years)

old),)

(VP will(VP join

(NP the board)(PP-CLR as

(NP a nonexecutive director))(NP-TMP Nov. 29)))

.))

3 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Some remarks about treebanking

◮ treebanking is extremely labor-intensive (i.e. costly)

◮ good planning is therefore necessary

◮ good tools are crucial◮ they speed up the process◮ they help with consistency

◮ a detailed stylebook & thorough training are essential

4 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Some syntactic annotation tools

◮ DTAG:http://code.google.com/p/copenhagen-dependency-treebank/wiki/DT

◮ Brat: http://brat.nlplab.org◮ Annotate:

http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/annotate◮ XCDG:

http://nats-www.informatik.uni-hamburg.de/CDG/DownloadPage◮ CLaRK: http://www.bultreebank.org/clark/

Can find more at, e.g., http://www.clarin.eu/view tools

5 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Guidelines examples

From Bies et al. (1995, p. 12-13)1.1.3 Level of attachment◮ The following are attached at S-level: subject NP,

highest VP, fronted constituents, initial and finalpunctuation, and most modifiers that precede theverb phrase. When there is no VP (as in “smallclauses”), the predicate is labled -PRD, and it andany following adjuncts are attached at S-level.

◮ VP-level:1. Almost all modifiers that follow the verb are attached

under the lowest appropriate VP. When there isconjunction and the modifer applies to both VPs, themodifier is attached at conjunction level.

2. An exception is made for modifiers that areinterpreted as appositives to the event or thepredicate itself. Such modifiers are adjoined to VP.Some of them may also have a -ADV tag.

6 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Guidelines example (cont.)

(S (NP-SBJ Investors)(VP might(VP (VP appear

(ADJP-PRD unenthusiastic(PP about

(NP the new issue))))(SBAR (WHNP-1 which)

(S (NP-SBJ *T*-1)(VP might

(VP force(S (NP-SBJ the government)

(VP to(VP raise

(NP the coupon)(PP-CLR to

(NP (QP more than 7)%))))))))))))

7 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebank issues

Various issues for treebanks:◮ theory-dependency vs. theory-neutrality◮ complete analysis vs. partial analysis

◮ Syntactic chunks are easier to annotate more reliablyand can be sued for a variety of purposes

◮ Chunks are generally non-recursive NPs and PPs

◮ analysis for written vs. spoken language◮ constituency vs. dependency annotation

◮ Within constituency annotation: should we annotategrammatical functions?

8 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Theory-dependent treebanks

Some treebanks encode a great deal of theory, e.g.,:

◮ Prague Dependeny Treebank◮ based on Dependency Grammar

◮ The Redwoods HPSG Treebank◮ based on Head-Driven Phrase Structure Grammar

◮ CCGbank◮ translation of the Penn Treebank into a corpus of

Combinatory Categorial Grammar derivations

9 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Theory-neutral treebanks

Theory-neutral treebanks:◮ do not adhere to any particular linguistic theory◮ encode those grammatical properties that are

distinguished by most grammatical frameworks

Advantage:◮ Widely usable◮ Less dependent on particular grammatical theory at

the time when the treebank annotation scheme wasdetermined

Examples:◮ Penn Treebank, Negra treebank, Tubingen treebanks

10 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Partial analysisAdding to the PTB

In a sense, the PTB annotation is only partial, at least insome spots, and some researchers have added structure

e.g., Vadas and Curran (2007) add intenal noun phrasestructure◮ Instead of this:

(NP (NN lung) (NN cancer) (NNS deaths))

◮ we have this:

(NP (NML (NN lung) (NN cancer) )(NNS deaths) )

You can get this standoff annotation at:

http://www.cs.usyd.edu.au/∼dvadas1/?Noun Phrases

11 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebanks for spoken language

Here’s an example from the Switchboard (swbd) corpusfrom the PTB:

( (CODE (SYM SpeakerA1) (. .) ))( (INTJ (UH Okay) (. .) (-DFL- E_S) ))( (CODE (SYM SpeakerB2) (. .) ))( (SBARQ

(INTJ (UH So) )(, ,)(WHNP-1 (WP who) )(SQ (BES ’s)(NP-SBJ (PRP$ your)(, ,)(INTJ (UH uh) )(, ,) (JJ favorite) (NN team) )

(NP-PRD (-NONE- *T*-1) ))(. ?) (-DFL- E_S) ))

12 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebanks for spoken language (cont.)

Things to note:

◮ Need to have explicit notation for speakers◮ Have explicit disfluency tags (e.g., -DFL-)◮ Interjections (INTJ) are prevalent, so bracketing

guidelines need to know where to put them◮ Also, questions are much more common than in

newspaper text

13 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Constituency annotation

◮ Constituency annotation describes phrase structureand clause structure

◮ e.g., noun phrases, adjectival phrases, adverbialphrases, clauses

◮ Structures are often recursive◮ For languages like German, this might also include

notions such as topological fields

14 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Penn WSJ Treebank (example)

15 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Discontinuous constituents

◮ Discontinuous constituents (or equivalents) havebeen proposed in a wide range of syntacticframeworks

◮ They are also used in the two German treebanks:Verbmobil (Hinrichs et al. 2000) and the TIGERcorpus (Brants et al. 2002)

◮ German extraposition example (Brants et al. 2002):

(1)

Eina

Mannman

kommtcomes

,,derwho

lachtlaughs

‘A man who laughs comes.’

Ein Mann der lacht labeled as discontinuous NP

16 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Discontinuous contituent example

A sentence with crossing branches in TIGER:

17 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Dependency grammar

Dependency grammar is interested in grammaticalrelations between words of a sentence

Dependency grammars do not propose a recursivestructure but rather a network of relations◮ The verb is the part of the sentence on which

everything ultimately depends◮ The direction of a link represents the dependency,

the angle represents the word order

18 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Dependency grammar example

the little bald man likes the present

likes

man

the little bald

present

the

19 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Extending Dependency Grammar

◮ dependency grammars are often extended by labelsthat denote the grammatical function that thedependent word has with regard to its governor

Example:

I saw a man with a dog and a cat in the park

subj spec adjn spec spec spec

cmpl cmpl cmpl

adjn

from (Lin 1995)20 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Properties of dependency annotation

◮ Generally, you can have non-adjacent arcs◮ You also get non-projective structures, where the

dependency arcs cross◮ These structures generally have correlates with

discontinuous constituents◮ Difficult aspects to cover without constituency:

◮ Coordination: what is the head of a coordinatephrase?

◮ Verbal modification: no distinction between sentential& VP adjuncts

21 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

The Prague Dependency Treebank (PDT)

The PDT has different layers to handle syntactic andsemantic relations

1. Morphemic layer: tag assigned to each word form (c.3000 tag values)

2. Analytic tree structures: dependency relations forevery word form & puncutation (forms a roooted tree)

3. Tectogrammatical tree structures: underlyingsentence representations

Both analytic & tectogrammatical layers aredependency-based, but capture different information

22 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

The tectogrammatical level

Tectogrammatical structures are more semantic and havethe following aspects:◮ only lexical words serve as tree nodes

◮ Auxiliaries & prepositions are attached as indices tolexical items

◮ Nodes are added for surface deletions◮ Non-projectivity is not allowed◮ Analytic functions (e.g., subject) are replaced with

tectogrammatical ones◮ e.g., Actor/Bearer, Patient, Addressee, ...

◮ Topic-Focus information is added

23 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Dependency & constituency annotation

In principle, if a constituency treebank containsinformation about what the head is, then it can also havedependencies

◮ The Talbanken05 corpus of Swedish is a goodexample(http://w3.msi.vxu.se/∼nivre/research/Talbanken05.html)

◮ 1976 corpus converted into one with bothannotations:

1. Original flat annotation converted to bare phrasestructure

2. Bare phrase structure extended to full phrasestructure

3. Full phrase structure converted to dependencyannotation, with grammatical functions as edgelabels

24 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

English treebanksPenn Treebank

25 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

English treebanksICE Treebank

[<#6:1> <sent>]PU,CL(main,montr,pass,pres)

SU,NPNPHD,PRON(pers,sing) {It}

VB,VP(montr,pres,pass)OP,AUX(pass,pres) {is}MVB,V(montr,edp) {chosen}

A,PPP,PREP(ge) {for}PC,NP(coordn)CJ,NP

NPHD,N(com,sing) {comfort}COOR,CONJUNC(coord) {and}CJ,NP

NPHD,N(com,sing) {ease}NPPO,PP

P,PREP(ge) {of}PC,NPNPHD,N(com,sing) {washing}

COOR,CONJUNC(coord) {rather than}CJ,NP

NPHD,N(com,sing) {stylishness}PUNC,PUNC(per) {.}

26 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

English treebanksVerbmobil Treebank

27 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Some other treebanks for English

◮ Penn Treebank◮ BLLIP Treebank◮ The Penn-Helsinki Parsed Corpus of Middle English◮ Susanne Corpus and Christine Project◮ International Corpus of English (ICE)◮ Lancaster Treebank◮ The Redwoods HPSG Treebank◮ ...

28 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebank projects

◮ Arabic◮ Penn Arabic Treebank

◮ Bulgarian◮ HPSG-based Syntactic Treebank of Bulgarian

(BulTreeBank)◮ Catalan

◮ CAT3LB project◮ Chinese

◮ Penn Chinese Treebank◮ Sinica Treebank

◮ Czech◮ Prague Dependency Treebank

29 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebank projects (2)

◮ Danish◮ Danish Dependency Treebank

◮ Dutch◮ The Alpino Treebank

◮ French◮ Project TALANA

◮ German◮ NeGra Project - NeGra Corpus◮ Project TIGER◮ Verbmobil Treebank of Spoken German (TuBa-D/S)◮ The Tubingen Treebank of Written German

(TuBa-D/Z)

30 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Treebank projects (3)

◮ Italian◮ Turin University Treebank TUT◮ Italian Syntactic-Semantic Treebank

◮ Japanese◮ Verbmobil Treebank of Spoken Japanese (TuBa-J/S)

◮ Portuguese◮ The Floresta Sinta(c)tica project

◮ Swedish◮ Talbanken05, Swedish Treebank

◮ Turkish◮ METU treebank

31 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

References

Bies, Ann, Mark Ferguson, Karen Katz and Robert MacIntyre (1995).Bracketing Guidelines for Treebank II Style Penn Treebank Project.University of Pennsylvania.

Brants, Sabine, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius and GeorgeSmith (2002). The TIGER Treebank. In Proceedings of TLT-02. Sozopol,Bulgaria.

Hinrichs, Erhard, Julia Bartels, Yasuhiro Kawata, Valia Kordoni and HeikeTelljohann (2000). The Tubingen Treebanks for Spoken German, English,and Japanese. In Wolfgang Wahlster (ed.), Verbmobil: Foundations ofSpeech-to-Speech Translation, Berlin: Springer, pp. 552–576.

McCawley, James D. (1982). Parentheticals and discontinuous constituentstructure. Linguistic Inquiry 13(1), 91–106.

Platek, Martin, Tomas Holan, Vladimir Kubon and Karel Oliva (2001).Word-Order Relaxations and Restrictions within a Dependency Grammar.In G. Satta (ed.), Proceedings of the Seventh International Workshop onParsing Technologies (IWPT). Beijing: Tsinghua University Press, pp.237–240.

Rambow, Owen and Aravind Joshi (1994). A Formal Look at DependencyGrammars and Phrase-Structure Grammars, with Special Consideration ofWord-Order Phenomena. In L. Wanner (ed.), Current Issues inMeaning-Text-Theory , London: Pinter. http://arxiv.org/abs/cmp-lg/9410007.

Reape, Mike (1993). A Formal Theory of Word Order: A Case Study in WestGermanic. Ph.D. thesis, University of Edinburgh, Edinburgh.

31 / 31

Corpus Linguistics

SyntacticAnnotation and

Treebanks

TreebanksRemarks

Guidelines

Treebank issuesTheory-dependency

Partial analysis

Spoken language

Constituency &Dependency

ExamplesEnglish treebanks

References

Vadas, David and James Curran (2007). Adding Noun Phrase Structure to thePenn Treebank. In Proceedings of the 45th Annual Meeting of theAssociation of Computational Linguistics. Prague, Czech Republic:Association for Computational Linguistics, pp. 240–247.

31 / 31