Page 1: PDT 2.0  Prague Dependency Treebank 2.0 Zdeněk Žabokrtský Dept. of Formal and Applied Linguistics Charles University, Prague

http://ufal.mff.cuni.cz/pdt2.0

PDT 2.0

Prague Dependency Treebank 2.0

Zdeněk Žabokrtský

Dept. of Formal and Applied Linguistics

Charles University, Prague

Outline of the talk

Introduction

Layers of annotation

Data

Software tools

Documentation

Tour through the CD-ROM

Final remarks

Introduction

treebank: a syntactically annotated corpus (a “bank” of syntactic trees)

Prague Dependency Treebank:

a collection of linguistically annotated Czech texts (2 MW), software tools and documentation

morphological, surface-syntactic and deep-syntactic dependency-oriented sentence analyses

About Czech

western group of Slavic languages

rich inflectional morphology

(relatively) free word order language

Latin alphabet extended with accents

(příliš žluťoučký kůň)

spoken in the Czech Republic

10+ million speakers

Historical background and development of PDT

1920’s – Prague Linguistic Circle founded

1930-50’s – influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer

mid 1960’s – Petr Sgall’s Functional Generative Description

1992 – Penn Treebank

1994 – Czech National Corpus

1995 – PDT started

1998 – PDT 0.5 pre-release

2001 – PDT 1.0 released by LDC

2006 – PDT 2.0 to be released by LDC

Layers of annotation

Layered annotation scheme

tectogrammatical layer: deep-syntactic dependency tree

analytical layer: surface-syntactic dependency tree

morphological layer: morphological lemma and tag associated with each token

word layer: original text, segmented on word boundaries

He would have gone intoforest.

M-layer

sentence represented as a sequence of tokens

each token lemmatized and tagged (attributes lemma and tag)

15-character positional morphological tag:

1. (main) POS

2. detailed POS

3. gender

4. number

5. case

...
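A minimal Python sketch of how a positional tag can be read: every character position carries one category. Only the first five position names come from the slide; the remaining ten get neutral placeholder names here, and the sample tag value is an illustrative assumption.

```python
# The first five names follow the slide; positions 6-15 are not listed there,
# so they receive placeholder names instead of guessed ones.
POSITION_NAMES = ["pos", "detailed_pos", "gender", "number", "case"] + [
    f"position_{i}" for i in range(6, 16)
]


def decode_tag(tag):
    """Split a 15-character positional tag into one value per position."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    return dict(zip(POSITION_NAMES, tag))


# Illustrative tag value (an assumption, not taken from the slides).
decoded = decode_tag("NNMS1-----A----")
print(decoded["pos"], decoded["gender"], decoded["number"], decoded["case"])
# prints: N M S 1
```

Because the tag has a fixed length, a category can be looked up by index alone, without parsing.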

A-layer (1) - nodes and edges

sentence represented as a rooted ordered tree with labeled nodes and edges

edges labeled with analytical functions:

dependency relations (Sb, Obj, Adv, Atr)

non-dependency relations (Coord)

auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)

special treatment of coordination constructions

A-layer (2) - coordination

intricate interplay between dependency and coordination relations

PDT solution: both conjuncts (members of the coordination) and shared modifiers are attached below the coordinating conjunction; conjuncts are distinguished from shared modifiers by the special attribute is_member

direct parent vs. effective parent (contrasted on the slide in a tree figure)
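The difference between direct and effective parents can be sketched in Python. This is a simplified illustration on a toy English tree, not the PDT implementation: the class name ANode and both helper functions are hypothetical, and auxiliary nodes (AuxP, AuxC) and appositions are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class ANode:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["ANode"] = None
    children: List["ANode"] = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def expand_members(node):
    """Return the node itself, or its conjuncts if it is a coordination."""
    if node.afun != "Coord":
        return [node]
    members = []
    for child in node.children:
        if child.is_member:
            members.extend(expand_members(child))
    return members


def effective_parents(node):
    """Linguistic parents of a node, looking through coordination nodes."""
    parent = node.parent
    if parent is None:
        return []
    if parent.afun == "Coord":
        if node.is_member:
            # A conjunct depends on whatever governs the whole coordination.
            return effective_parents(parent)
        # A shared modifier modifies every conjunct.
        return expand_members(parent)
    return [parent]


# Toy tree for "old men and women came": all three words hang below "and".
came = ANode("came", "Pred")
conj = came.add(ANode("and", "Coord"))
men = conj.add(ANode("men", "Sb", is_member=True))
women = conj.add(ANode("women", "Sb", is_member=True))
old = conj.add(ANode("old", "Atr"))  # shared modifier, is_member stays False

print(men.parent.form)                              # and
print([p.form for p in effective_parents(men)])     # ['came']
print([p.form for p in effective_parents(old)])     # ['men', 'women']
```

The direct parent of all three words is the conjunction; only the is_member flag tells a conjunct from a shared modifier.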

T-layer (1) - nodes

t-nodes:

complex typed feature structures

nodes represent autosemantic words

functional words do not have nodes of their own

artificially added nodes (e.g. for pro-drops)

node attributes:

tectogrammatical lemma

dependency relation – functor and subfunctor

grammateme attributes (representing morphological meanings)

attributes for topic-focus articulation

attributes for coreference relations

T-layer (2) - dependency relations

according to FGD, two types of functors:

actants (arguments):

ACT - actor

PAT - patient

ADDR - addressee

EFF - effect

ORIG - origin

free modifiers (adjuncts):

various types of temporal modifiers - TWHEN, TTIL, TSIN...

spatial and directional modifiers - LOC, DIR1, DIR2, DIR3

MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...

additional functors for representing non-dependency relations:

coordinations - CONJ, DISJ, ADVS ...

appositions - APPS

parenthetical constructions - PAR

expressions in foreign language - FPHR
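The grouping above can be written down as a small lookup, sketched here in Python. It covers only the functors named on the slide (the lists there are open-ended), and it assumes that the capitalized prefixes such as BEN or CAUS are the functor labels; the function name is hypothetical.

```python
# Only functors named on the slide; the slide's lists end with "...".
ACTANTS = {"ACT", "PAT", "ADDR", "EFF", "ORIG"}
FREE_MODIFIERS = {
    "TWHEN", "TTIL", "TSIN",             # temporal
    "LOC", "DIR1", "DIR2", "DIR3",       # spatial and directional
    "MEANS", "BEN", "CAUS", "REG", "EXT", "MAT", "COND",
}
NON_DEPENDENCY = {"CONJ", "DISJ", "ADVS", "APPS", "PAR", "FPHR"}


def functor_class(functor):
    """Classify a functor label into one of the three groups on the slide."""
    if functor in ACTANTS:
        return "actant"
    if functor in FREE_MODIFIERS:
        return "free modifier"
    if functor in NON_DEPENDENCY:
        return "non-dependency"
    return "not listed on the slide"


for label in ("PAT", "DIR3", "APPS"):
    print(label, "->", functor_class(label))
```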

T-layer (3) - valency

all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX

individual valency frames roughly correspond to individual senses of the given verb

valency frame ~ a sequence of frame slots; for each slot, its functor, its obligatoriness and its possible surface realizations are specified
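A valency frame as described above can be modeled as a list of slots. The Python sketch below is only an illustration: the class names, the surface-form strings and the sample frame are hypothetical and do not reproduce the actual PDT-VALLEX format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameSlot:
    functor: str               # e.g. ACT, PAT, ADDR
    obligatory: bool
    surface_forms: List[str]   # possible surface realizations (free-text here)


@dataclass
class ValencyFrame:
    lemma: str
    slots: List[FrameSlot]

    def missing_obligatory(self, functors_in_tree):
        """Functors of obligatory slots that are not among the given functors."""
        present = set(functors_in_tree)
        return [s.functor for s in self.slots
                if s.obligatory and s.functor not in present]


# Hypothetical frame for one sense of an English verb, for illustration only.
give = ValencyFrame("give", [
    FrameSlot("ACT", True, ["nominative"]),
    FrameSlot("PAT", True, ["accusative"]),
    FrameSlot("ADDR", True, ["dative"]),
])

print(give.missing_obligatory(["ACT", "PAT"]))  # ['ADDR']
```

Keeping one frame object per verb sense lets each verb occurrence in a t-tree point to exactly one frame.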

T-layer (3) - coreference

two types of coreference according to FGD:

grammatical (verbs of control, relative clauses, reflexive pronouns...)

textual (personal pronouns, incl. elided ones)

coreference in PDT:

binary relation between t-nodes

depicted as a “non-tree” arc (arrow)

T-layer (4) - grammatemes

grammatemes: t-node attributes representing morphological meanings

motivation

number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...

Peter met her youngest brother.

meet (PRED, tense=ant)
    Peter (ACT)
    brother (PAT, number=sg)
        #PersPron (APP)
        young (RSTR, degree=sup)

Peter will meet her young brothers.

meet (PRED, tense=post)
    Peter (ACT)
    brother (PAT, number=pl)
        #PersPron (APP)
        young (RSTR, degree=pos)
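The point of the two example sentences can be checked mechanically: the trees share lemmas, functors and shape, and differ only in grammateme values. The Python sketch below encodes both trees as nested dictionaries; the helper names are hypothetical and the encoding is an assumption made for illustration.

```python
def tnode(t_lemma, functor, grammatemes=None, children=()):
    """Build a toy t-node as a plain dictionary."""
    return {"t_lemma": t_lemma, "functor": functor,
            "grammatemes": dict(grammatemes or {}), "children": list(children)}


def build(tense, number, degree):
    """The shared tree of both sentences, parameterized by three grammatemes."""
    return tnode("meet", "PRED", {"tense": tense}, [
        tnode("Peter", "ACT"),
        tnode("brother", "PAT", {"number": number}, [
            tnode("#PersPron", "APP"),
            tnode("young", "RSTR", {"degree": degree}),
        ]),
    ])


def skeleton(node):
    """Lemmas, functors and shape only, with grammatemes stripped."""
    return (node["t_lemma"], node["functor"],
            tuple(skeleton(c) for c in node["children"]))


def grammateme_diff(a, b, out=None):
    """List (lemma, grammateme, value_a, value_b) wherever two trees differ."""
    out = [] if out is None else out
    for key in sorted(set(a["grammatemes"]) | set(b["grammatemes"])):
        va, vb = a["grammatemes"].get(key), b["grammatemes"].get(key)
        if va != vb:
            out.append((a["t_lemma"], key, va, vb))
    for child_a, child_b in zip(a["children"], b["children"]):
        grammateme_diff(child_a, child_b, out)
    return out


first = build("ant", "sg", "sup")     # Peter met her youngest brother.
second = build("post", "pl", "pos")   # Peter will meet her young brothers.

print(skeleton(first) == skeleton(second))  # True
print(grammateme_diff(first, second))
```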

T-layer (5) - node typing

presence/absence of a given attribute? Hence the need for node typing

two-level hierarchy of t-layer node types used in PDT 2.0:

tectogrammatical node types: complex, atom, qcomplex, list, coap, dphr, fphr, root

complex nodes: semantic nouns, semantic adjectives, semantic adverbs, semantic verbs

semantic nouns (with the grammatemes relevant for each type):

denotative: n.denot (number, gender)

denotative with negation: n.denot.neg (number, gender, negation)

pronominal, indefinite: n.pron.indef (number, gender, person, indeftype)

pronominal, definite, demonstrative: n.pron.def.demon (number, gender)

pronominal, definite, personal: n.pron.def.pers (number, gender, person, politeness)

quantificative, definite: n.quant.def (number, gender, numertype)
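Node typing answers the question of which grammatemes may appear on a node. The Python sketch below turns the semantic-noun part of the hierarchy into a lookup table and flags grammatemes that a given type does not license; the table and function names are hypothetical, and only the noun types shown on the slide are covered.

```python
# Grammatemes licensed by each semantic-noun type, as listed on the slide.
NOUN_TYPE_GRAMMATEMES = {
    "n.denot": {"number", "gender"},
    "n.denot.neg": {"number", "gender", "negation"},
    "n.pron.indef": {"number", "gender", "person", "indeftype"},
    "n.pron.def.demon": {"number", "gender"},
    "n.pron.def.pers": {"number", "gender", "person", "politeness"},
    "n.quant.def": {"number", "gender", "numertype"},
}


def unexpected_grammatemes(node_type, grammatemes):
    """Names of grammatemes present on a node but not licensed by its type."""
    allowed = NOUN_TYPE_GRAMMATEMES[node_type]
    return sorted(set(grammatemes) - allowed)


# Illustrative grammateme values; a denotative noun should not carry "person".
node = {"number": "sg", "gender": "anim", "person": "3"}
print(unexpected_grammatemes("n.denot", node))          # ['person']
print(unexpected_grammatemes("n.pron.def.pers", node))  # []
```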

Interlinking the layers

every unit at every layer has an ID that is unique across PDT

neighboring layers connected by top-down pointers
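Since IDs are unique across layers, a pointer can be resolved with a single lookup. The Python sketch below follows the pointers from a t-layer unit down to the word layer; the IDs, the attribute name lower and the single-pointer simplification are assumptions made for illustration (a real unit may refer to several lower-layer units).

```python
# Hypothetical units, keyed by their unique IDs. "lower" is the pointer to the
# corresponding unit one layer below; word-layer units have no such pointer.
UNITS = {
    "t1": {"layer": "t", "t_lemma": "forest", "lower": "a1"},
    "a1": {"layer": "a", "afun": "Adv", "lower": "m1"},
    "m1": {"layer": "m", "lemma": "forest", "lower": "w1"},
    "w1": {"layer": "w", "token": "forest"},
}


def follow_down(unit_id):
    """Return the chain of IDs reached by following the pointers downward."""
    chain = [unit_id]
    while "lower" in UNITS[unit_id]:
        unit_id = UNITS[unit_id]["lower"]
        chain.append(unit_id)
    return chain


print(follow_down("t1"))  # ['t1', 'a1', 'm1', 'w1']
print([UNITS[u]["layer"] for u in follow_down("a1")])  # ['a', 'm', 'w']
```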

Data

Sources of text

texts provided by the Czech National Corpus

7000 articles (or article fragments) from Czech newspapers and journals:

Lidové noviny (daily newspapers)

Mladá fronta Dnes (daily newspapers)

Českomoravský profit (business weekly)

Vesmír (scientific journal)

Amount of annotated data

m-layer data: 1.96 MW (million words) in 116 kS (thousand sentences)

a-layer data (75 % of m-layer): 1.5 MW in 88 kS

t-layer data (59 % of a-layer): 0.8 MW in 49 kS

Division into files

1 XML file per document and annotation layer
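With one file per document and layer, the files of a document can be grouped by name. The Python sketch below assumes a naming pattern of the form document.layer.gz, modeled on the *.t.gz files used elsewhere in this talk; the document names and the .a.gz suffix are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_document(paths):
    """Map each document name to a {layer: path} dictionary."""
    documents = defaultdict(dict)
    for path in paths:
        name = PurePosixPath(path).name
        # Assumed pattern: <document>.<layer>.gz
        document, layer, _extension = name.rsplit(".", 2)
        documents[document][layer] = path
    return dict(documents)


# Hypothetical file names, for illustration only.
files = [
    "data/sample/doc01.t.gz",
    "data/sample/doc01.a.gz",
    "data/sample/doc02.t.gz",
]
grouped = group_by_document(files)
print(sorted(grouped["doc01"]))  # ['a', 't']
print(grouped["doc02"]["t"])     # data/sample/doc02.t.gz
```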

Train/test data

train : devtest : evaltest = 8 : 1 : 1
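An 8 : 1 : 1 division can be produced deterministically, for instance by cycling through ten buckets. The Python sketch below shows one such scheme; it is an assumption made for illustration, not the procedure actually used to divide PDT 2.0, and the document names are hypothetical.

```python
def split_8_1_1(documents):
    """Assign documents to train/devtest/evaltest in an 8 : 1 : 1 ratio."""
    parts = {"train": [], "devtest": [], "evaltest": []}
    for index, document in enumerate(documents):
        bucket = index % 10          # buckets 0-7 train, 8 devtest, 9 evaltest
        if bucket < 8:
            parts["train"].append(document)
        elif bucket == 8:
            parts["devtest"].append(document)
        else:
            parts["evaltest"].append(document)
    return parts


parts = split_8_1_1([f"doc{i:03d}" for i in range(100)])
print({name: len(docs) for name, docs in parts.items()})
# {'train': 80, 'devtest': 10, 'evaltest': 10}
```

Splitting by whole documents rather than by sentences keeps every text inside a single portion.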

Full vs. sample data

sample data:

500 sentences

a freely available subset of the full data

also converted to HTML (can be viewed in any WWW browser, no tree editor needed)

the whole PDT 2.0 except for the full data (but including the sample data, all tools, and docs) is available on the web

the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium

Software tools

Tree editor TrEd

general customizable tree editor implemented in Perl

the main editing and browsing tool in the PDT project

Batch processing of the data

btred – batch processing version of tred

ntred – networked (parallelized) version of btred

$ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep {$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q
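Read as a query, the Perl expression in the command above prints the t_lemma of every node that hangs directly below the root and has at least one child whose functor starts with DIR. The Python sketch below applies the same condition to a toy tree; it does not use btred or read PDT files, and the tree encoding and function names are hypothetical.

```python
import re


def tnode(t_lemma, functor, children=()):
    """Build a toy t-node as a plain dictionary."""
    return {"t_lemma": t_lemma, "functor": functor, "children": list(children)}


def lemmas_with_directional_child(root):
    """t_lemmas of the root's children that govern a DIR* modifier."""
    hits = []
    for node in root["children"]:           # nodes whose parent is the root
        if any(re.match(r"DIR", child["functor"]) for child in node["children"]):
            hits.append(node["t_lemma"])
    return hits


# Toy tree: "go" governs an actor and a directional modifier.
tree = tnode("#root", "", [
    tnode("go", "PRED", [tnode("he", "ACT"), tnode("forest", "DIR3")]),
])
print(lemmas_with_directional_child(tree))  # ['go']
```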

Netgraph

client-server application for on-line PDT search

implemented in Java

Tools for post-annotation consistency checking

hundreds of btred scripts of various types:

technical tests, e.g.:

each sentence contains at least one token

all identifiers are unique, all referred identifiers exist...

m-layer tests:

locative (6th case) cannot occur without a preposition

improbable word forms (e.g. imperatives haš, tel)

a-layer tests:

not more than one subject in a clause

attributes (afun Atr) should not appear directly below verbs

t-layer tests:

surface forms of verb arguments match the specifications in the valency lexicon

relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
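The technical tests listed above are the simplest to illustrate. The Python sketch below checks the three conditions on a toy data structure; the actual checks were btred scripts, and the dictionary layout, the attribute name refs and the function name are hypothetical.

```python
def technical_errors(sentences):
    """Check three technical conditions on a list of toy sentences.

    Each sentence is {"id": ..., "tokens": [{"id": ..., "refs": [...]}, ...]}.
    """
    errors = []
    seen = set()
    for sentence in sentences:
        if not sentence["tokens"]:
            errors.append(f"{sentence['id']}: sentence has no tokens")
        for unit in [sentence] + sentence["tokens"]:
            if unit["id"] in seen:
                errors.append(f"{unit['id']}: duplicate identifier")
            seen.add(unit["id"])
    # References are checked last, once every identifier is known.
    for sentence in sentences:
        for token in sentence["tokens"]:
            for ref in token.get("refs", []):
                if ref not in seen:
                    errors.append(f"{token['id']}: reference to unknown identifier {ref}")
    return errors


# Toy data containing one error of each kind.
data = [
    {"id": "s1", "tokens": [{"id": "s1w1", "refs": ["s1w2"]}, {"id": "s1w1"}]},
    {"id": "s2", "tokens": []},
]
for message in technical_errors(data):
    print(message)
```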

Tools for automatic annotation

chain of tools for automatic text processing (from a raw text to a-layer trees):

1. sentence segmentation and tokenization

2. morphological analysis

3. morphological disambiguation

4. dependency parsing (adapted Collins)

5. analytical function assignment

Tools for format conversions

conversion not only between PDT data formats, but also from other treebanks’ formats

example: constituency trees from Negra viewed in TrEd (screenshot on the slide)

Documentation

PDT 2.0 Documentation

PDT Guide:

overview of all parts of PDT 2.0

mirrors the directory structure of the PDT 2.0 CD-ROM

Annotation guidelines:

m-layer (~100 pages)

a-layer (~250 pages)

t-layer (~800 pages)

Publications: conference and journal papers, technical reports, theses ...

Technical documentation (software tools and data formats)

Tour through the CD-ROM

Final remarks

Want to experiment with...

tagging?

dependency parsing?

semantic-role labeling?

frame semantics?

word-sense disambiguation?

anaphora resolution?

information structure?

...

Use PDT 2.0, it’s all there!!!

Annotation scheme not limited to Czech

T-layer in English

T-layer in German

A-layer in German

A-layer in Arabic

A-layer in Slovene

A-layer in Romanian

Some of those involved

Thank you!

BTW, anyone interested in beta-testing?