Annotation of Grammatemes in the Prague Dependency Treebank 2.0

Magda Razímová

Zdeněk Žabokrtský

Institute of Formal and Applied Linguistics

Charles University

Prague, Czech Republic

{razimova,zabokrtsky}@ufal.mff.cuni.cz

LREC 2006, Annotation Science

Outline of the talk

• Introduction
• Prague Dependency Treebank 2.0
• Annotation of grammatemes
  • Motivation
  • Grammateme attributes
  • Two-level node hierarchy
  • Examples of grammateme value assignment
• Final remarks


Introduction

grammatemes in the PDT 2.0
• one type of attributes of nodes of a deep-syntactic tree
• capture morphological meanings that are semantically indispensable
  • number for nouns, degree of comparison for adjectives, tense for verbs, etc.

annotation of grammatemes
• the last task in the PDT 2.0 annotation procedure
• can be assigned automatically, profiting from the already available annotation:
  • annotation of the same sentence at the lower layers
  • already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)


Historical background and development of the PDT project

• mid 1960s: Praguian Functional Generative Description (Petr Sgall et al.)
• 1994: Czech National Corpus
• 1995: PDT started
• 1998: PDT 0.5 pre-release
• 2001: PDT 1.0 released by LDC
  • manual annotation of morphology and surface syntax
• 2006: PDT 2.0 to be released by LDC
  • interlinked morphological, surface-syntactic and complex deep-syntactic annotation
  • including annotation of grammatemes



Layers of annotation

• tectogrammatical layer (t-layer): deep-syntactic dependency tree
• analytical layer (a-layer): surface-syntactic dependency tree
• morphological layer (m-layer): m-lemma and m-tag associated with each token
• word layer: original text, segmented on word boundaries

Example: lit.: He-was would went to forest. ("He would have gone to the forest.")


Interlinking the layers

• every unit at every layer has an ID that is unique within the PDT
• neighboring layers are connected by top-down pointers

Example: lit.: He-was would went to forest. ("He would have gone to the forest.")
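The linking scheme can be pictured with a small sketch. It assumes a simplified in-memory representation: the class `Unit`, its fields and the ID strings are hypothetical and do not reproduce the actual PDT 2.0 file format. The verbal form byl by šel (with t-lemma jít) serves as the example.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """One unit (token or node) at some annotation layer (hypothetical model)."""
    uid: str      # unique across the whole treebank
    layer: str    # "w", "m", "a" or "t"
    label: str
    lower: list = field(default_factory=list)  # top-down pointers: IDs one layer below

def build_index(units):
    """Map every ID to its unit and reject duplicate IDs."""
    index = {}
    for unit in units:
        if unit.uid in index:
            raise ValueError(f"duplicate ID: {unit.uid}")
        index[unit.uid] = unit
    return index

def descend(index, uid):
    """Follow top-down pointers and collect the word-layer labels a unit covers."""
    unit = index[uid]
    if unit.layer == "w":
        return [unit.label]
    words = []
    for lower_id in unit.lower:
        words.extend(descend(index, lower_id))
    return words

# Illustrative IDs only; one t-node covers three a-nodes.
units = [
    Unit("w1", "w", "byl"), Unit("w2", "w", "by"), Unit("w3", "w", "šel"),
    Unit("m1", "m", "byl", ["w1"]), Unit("m2", "m", "by", ["w2"]), Unit("m3", "m", "šel", ["w3"]),
    Unit("a1", "a", "byl", ["m1"]), Unit("a2", "a", "by", ["m2"]), Unit("a3", "a", "šel", ["m3"]),
    Unit("t1", "t", "jít", ["a1", "a2", "a3"]),
]
index = build_index(units)
covered = descend(index, "t1")
```

Because every pointer leads one layer down, information stored at the m-layer (such as the m-tag) stays reachable from any t-node.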


Size of the PDT 2.0 data (i)

7,129 manually annotated textual documents
• all documents annotated at the m-layer
  • 116,065 sentences with 1,960,657 tokens
• 75 % of the m-layer data annotated at the a-layer
  • 5,338 documents, 87,980 sentences, 1,504,847 tokens
• 44 % of the m-layer data annotated also at the t-layer
  • 3,168 documents, 49,442 sentences, 833,357 tokens
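The quoted percentages can be checked against the raw counts. The short sketch below (the helper name `share` is illustrative) suggests that 75 % and 44 % correspond most closely to the document counts, while the token-based shares differ slightly.

```python
# Counts as given on the slide.
m_layer = {"documents": 7129, "tokens": 1960657}
a_layer = {"documents": 5338, "tokens": 1504847}
t_layer = {"documents": 3168, "tokens": 833357}

def share(part, whole, key):
    """Percentage of the m-layer data covered by a higher layer, to one decimal."""
    return round(100 * part[key] / whole[key], 1)

for name, layer in (("a-layer", a_layer), ("t-layer", t_layer)):
    print(name,
          "documents:", share(layer, m_layer, "documents"),
          "tokens:", share(layer, m_layer, "tokens"))
```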


Size of the PDT 2.0 data (ii)

• training data (80 %)
• development test data (10 %)
• evaluation test data (10 %)


M-layer

• sentence represented as a sequence of tokens
• each token lemmatized and tagged (attributes m-lemma and m-tag)
• positional m-tag: 15 characters
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...

Example: lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer.

"Some contours of the problem seem to be clearer after the resurgence by Havel's speech."
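Since the m-tag is positional, it can be read by simple indexing. The sketch below decodes only the five positions named on the slide; the key names and the function `decode_mtag` are illustrative, and the tag is the one quoted later in the talk for les (forest).

```python
# Names for positions 1-5 as listed on the slide; positions 6-15 are not covered here.
POSITIONS = ("pos", "detailed_pos", "gender", "number", "case")

def decode_mtag(tag):
    """Return the first five positions of a 15-character positional m-tag."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    return dict(zip(POSITIONS, tag[:5]))

decoded = decode_mtag("NNIS2-----A----")
print(decoded)
```

A hyphen at a position means that the category does not apply to the given word form.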


A-layer

• rooted ordered tree with labeled nodes and edges
• a-nodes
  • each token of the m-layer is represented by exactly one a-node
  • labeled with a-lemmas (identical with word forms)
• a-edges
  • represent dependency relations (Sb, Obj, Adv, Atr)
  • represent non-dependency relations (Coord)
  • the analytical function appears as an a-node attribute

Example: "Some contours of the problem seem to be clearer after the resurgence by Havel's speech."


T-layer

• rooted ordered tree with labeled nodes and edges
• t-nodes
  • complex typed feature structures
  • represent auto-semantic words
  • function words do not have nodes of their own
  • artificially added nodes
• t-edges
  • dependency relations (functor)
  • non-dependency relations (coordination constructions)
  • the functor appears as a t-node attribute

Example: "Some contours of the problem seem to be clearer after the resurgence by Havel's speech."


Areas of annotation at the t-layer

• tree structure
• t-lemma attribute
• dependency relation (functor and subfunctor)
• topic-focus attributes
• co-reference attributes
• node typing attributes (nodetype and sempos)
• grammateme attributes

Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
lit.: [To] all was handed over a certificate of successful graduation from the course.
"They all received a certificate of successful graduation from this course."



Grammatemes: Motivation

grammatemes
• t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
• semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)

Example 1: Peter met her youngest brother.
meet (PRED, tense=ant)
  Peter (ACT)
  brother (PAT, number=sg)
    #PersPron (APP)
    young (RSTR, degree=sup)

Example 2: Peter will meet her young brothers.
meet (PRED, tense=post)
  Peter (ACT)
  brother (PAT, number=pl)
    #PersPron (APP)
    young (RSTR, degree=pos)


Grammateme attributes

15 grammatemes:
• indeftype, numertype, negation, degcmp
• tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness
• number, gender, person, politeness


Conditioned presence/absence of grammatemes

• obviously, not all grammatemes are relevant for all nodes
  • no tense for dog, no degree of comparison for (he) waits, etc.
• how to formally declare the presence/absence of a given grammateme attribute in a given node?
  • the need for node typing
• chosen solution: two-level typing
  • 1st level: 8 more general types of nodes
    • grammatemes are relevant only for one of them
  • 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech


Presence/absence of grammateme values: Two-level t-node hierarchy

• 1st level: attribute nodetype
• 2nd level: attribute sempos

[Figure: t-nodes are divided by nodetype into complex, atom, qcomplex, list, coap, dphr, fphr and root; complex nodes are divided by sempos into semantic verbs, semantic nouns, semantic adverbs and semantic adjectives; the branch of semantic adjectives is expanded into its six subtypes (adj.denot, adj.pron.indef, adj.pron.def.demon, adj.quant.def, adj.quant.indef, adj.quant.grad), each shown with its relevant grammatemes and Czech examples.]


First level of the hierarchy: attribute nodetype

• 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
• fully automatic annotation, using
  • the tree structure: root
  • t-attributes
    • t-lemma: qcomplex | list
    • functor: atom | coap | dphr | fphr
  • else: complex

Example: Levnější benzín na Východě, dražší na Západě
"Cheaper gasoline in the East, more expensive one in the West"
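The decision order on this slide (tree position first, then t-lemma, then functor, otherwise complex) can be sketched as a small function. This is not the authors' Perl implementation: the function name is hypothetical, and the two lookup tables are illustrative, incomplete assumptions, since the slide does not list which t-lemmas and functors trigger which value.

```python
# Illustrative, incomplete tables (assumptions, not taken from the slides).
LEMMA_TYPES = {"#Gen": "qcomplex", "#Forn": "list"}
FUNCTOR_TYPES = {"RHEM": "atom", "CONJ": "coap", "DPHR": "dphr", "FPHR": "fphr"}

def assign_nodetype(is_root, t_lemma, functor,
                    lemma_types=LEMMA_TYPES, functor_types=FUNCTOR_TYPES):
    """Return one of the 8 nodetype values following the order given on the slide."""
    if is_root:                       # decided by the tree structure
        return "root"
    if t_lemma in lemma_types:        # decided by the t-lemma
        return lemma_types[t_lemma]
    if functor in functor_types:      # decided by the functor
        return functor_types[functor]
    return "complex"                  # everything else

print(assign_nodetype(False, "benzín", "ACT"))
```

Only nodes that fall through to `complex` go on to receive a sempos value and grammatemes.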


Second level of the hierarchy: attribute sempos

• only complex nodes are grouped into semantic parts of speech
• 19 values of the attribute sempos: n. ... | adj. ... | adv. ... | v. ...
• fully automatic annotation, using
  • m-tag
  • t-lemma
  • other t-attributes
• the sempos value delimits the set of relevant grammatemes

semantic adjectives (sempos value; relevant grammatemes; examples)
• denotative: adj.denot; (degcmp, negation); hezký (nice), psí (canine), čokoládový (chocolate)
• pronominal
  • definite, demonstrative: adj.pron.def.demon; Ø; ten (učitel) (that (teacher)), takový (such)
  • indefinite: adj.pron.indef; (indeftype); jaký (what kind of), který (which)
• quantificative
  • definite: adj.quant.def; (numertype); tři (děti) (three (children)), tolik (so many)
  • indefinite: adj.quant.indef; (numertype, indeftype); kolik (how many)
  • gradable: adj.quant.grad; (numertype, degcmp); hodně (much/many), málo (little/few)
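The way a sempos value delimits the set of relevant grammatemes can be written down as a lookup table. The sketch below covers only the six adjectival subtypes shown on this slide; the function name `check_node` and the grammateme values in the usage lines are illustrative.

```python
# sempos value -> grammatemes relevant for it (adjectival subtypes from the slide).
RELEVANT = {
    "adj.denot": ("degcmp", "negation"),
    "adj.pron.def.demon": (),
    "adj.pron.indef": ("indeftype",),
    "adj.quant.def": ("numertype",),
    "adj.quant.indef": ("numertype", "indeftype"),
    "adj.quant.grad": ("numertype", "degcmp"),
}

def check_node(sempos, grammatemes):
    """Return (missing, superfluous) grammateme names for a node of the given sempos."""
    expected = set(RELEVANT[sempos])
    present = set(grammatemes)
    return sorted(expected - present), sorted(present - expected)

# Illustrative values; only the attribute names matter for the check.
ok = check_node("adj.denot", {"degcmp": "sup", "negation": "neg0"})
bad = check_node("adj.quant.def", {"numertype": "basic", "degcmp": "pos"})
print(ok, bad)
```

A table of this kind makes the presence or absence of each grammateme a formal property of the node type instead of a convention left to annotators.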


Values of nodetype and sempos in the PDT 2.0: an overview

[Table: nodetype values and sempos values; the table contents are not preserved in this transcript.]


Grammateme value assignment

• n-tred environment for processing the PDT data (http://ufal.mff.cuni.cz/~pajas)
• automatic annotation
  • 2000 lines of Perl code
    • crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
  • rules using a special economic notation
    • 2000 lines written in a text file
  • lexical resources
    • special-purpose lists of adverbs / verbs
• manual annotation of special problems
  • two annotators working in parallel
  • simplified annotation environment: treebank positions extracted into simple HTML forms


Simple HTML-based environment for manual annotation

Example: lit.: The difference [you] would have to pay yourself.


Automatic vs. manual assignment

• at the t-layer of the PDT 2.0: 1,594,333 grammateme values assigned at 550,947 complex nodes
• manually assigned: 17,520 grammateme values
  • inter-annotator agreement: 70-85 %


Grammateme assignment and m-tag

number grammateme: values sg | pl
• assigned automatically using the m-tag
  • e.g. les (forest)
    • m-layer: tag NNIS2-----A----
    • t-layer: number=sg
• manual assignment
  • nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
  • e.g. dveře (door/doors)
    • m-layer: always plural
    • t-layer: annotator decision sg | pl

Example: lit.: He-was would went to forest. ("He would have gone to the forest."); the t-node for forest: n.denot, number=sg
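The split between automatic and manual assignment of number can be sketched as follows. The function name, the one-item stand-in for the dictionary-derived list, and the m-tag used for dveře are illustrative; the sketch also assumes that S and P at the fourth tag position encode singular and plural.

```python
# Stand-in for the list of plural-only nouns extracted from the dictionary.
PLURAL_ONLY = {"dveře"}

NUMBER_CODES = {"S": "sg", "P": "pl"}  # assumed codes at the 4th m-tag position

def number_from_mtag(lemma, mtag):
    """Return 'sg' or 'pl', or None when a human annotator has to decide."""
    if lemma in PLURAL_ONLY:
        return None               # the m-layer is always plural, so the tag cannot decide
    return NUMBER_CODES.get(mtag[3])

auto = number_from_mtag("les", "NNIS2-----A----")
manual = number_from_mtag("dveře", "NNFP1-----A----")  # illustrative tag
print(auto, manual)
```

Returning `None` instead of guessing keeps the cases for manual annotation explicit, so they can be exported to the annotators.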


Grammateme assignment and tree structure

mood grammateme verbmod: values ind | imp | cdn
• assigned automatically
  • one-word verbal forms
    • e.g. jde (goes)
    • m-tag information
  • verbal forms consisting of more word forms (represented by a single node at the t-layer)
    • e.g. byl by šel (would have gone)
    • the corresponding a-layer subtree involves the node by
    • m-tag of the node by

Example: lit.: He-was would went to forest. ("He would have gone to the forest."); the t-node for the verb: v, verbmod=cdn
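A rough sketch of this rule: the t-node's verbmod is derived from the m-tags of all a-nodes that the t-node represents. It assumes that the detailed-POS code (second tag position) is `c` for the conditional auxiliary by and `i` for imperative forms; the function name and the concrete tags are illustrative, not quoted from the treebank.

```python
def assign_verbmod(a_nodes):
    """a_nodes: (word_form, m_tag) pairs of the a-nodes merged into one t-node."""
    detailed_pos = [tag[1] for _, tag in a_nodes]
    if "c" in detailed_pos:    # assumed code of the conditional auxiliary "by"
        return "cdn"
    if "i" in detailed_pos:    # assumed code of imperative forms
        return "imp"
    return "ind"

# Illustrative tags for the two examples from the slide.
compound = [("byl", "VpYS---XR-AA---"), ("by", "Vc-X---3-------"), ("šel", "VpYS---XR-AA---")]
simple = [("jde", "VB-S---3P-AA---")]
print(assign_verbmod(compound), assign_verbmod(simple))
```

The compound form shows why the inter-layer links matter: the information that decides the mood sits on an auxiliary that has no t-node of its own.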


Grammateme assignment and co-reference

• the grammatemes gender, number and person of relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited from the antecedents")

Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.
"From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America."
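How an `inher` value could be resolved when a concrete value is needed is sketched below. The dictionary-based node representation, the IDs, and the function name are hypothetical; the example follows the co-reference link from the relative pronoun který to its antecedent mléko.

```python
def resolve(node, antecedent_of, grammateme):
    """Follow co-reference links until the grammateme has a value other than 'inher'."""
    seen = set()
    while node["gram"].get(grammateme) == "inher":
        if node["id"] in seen or node["id"] not in antecedent_of:
            return None            # cycle or missing link: nothing to inherit
        seen.add(node["id"])
        node = antecedent_of[node["id"]]
    return node["gram"].get(grammateme)

# Illustrative nodes: the antecedent and the relative pronoun referring to it.
mleko = {"id": "t1", "t_lemma": "mléko", "gram": {"number": "sg", "gender": "neut"}}
ktery = {"id": "t2", "t_lemma": "který", "gram": {"number": "inher", "gender": "inher"}}
antecedent_of = {"t2": mleko}

print(resolve(ktery, antecedent_of, "number"), resolve(ktery, antecedent_of, "gender"))
```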



Final remarks

achievements:
• two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
• automatic procedure for capturing the node classification and the grammateme attributes
• verification of the procedure on large-scale data

experience:
• it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make the procedure of grammateme assignment more or less automatic


http://ufal.mff.cuni.cz/pdt2.0/