Resources for linguistically motivated
Multilingual Anaphora Resolution
Kepa Joseba Rodríguez
Advisor: Massimo Poesio
18 January 2011
Outline
1 Motivation of the research
2 Contributions of this dissertation
3 Limitations of previous annotation schemes
4 Annotation scheme proposal
5 Annotated data
6 Usability of the data for anaphora resolution
7 Use of the data
8 Conclusions
Motivation
Linguistic research: cross-linguistic studies of anaphora (Poesio et al. 2004)
Applications: summarization (Steinberger et al. 2007)
Applications: machine translation
1 German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.
2 English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.
3 English (Google translate): Peter gave Mary flowers to his casting. Then she let them dry up.
4 English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.
Contributions
Development of a linguistically motivated annotation scheme for anaphoric relations.
Implementation of the scheme for manual annotation of English and Italian data.
Creation of annotated data for English and Italian.
Use of the corpora for feature extraction and development of anaphora resolution systems in English and Italian.
Participation of the systems in SemEval 2010.
Limitations of previous schemes (1)
Coverage of the annotation.
Annotation of reference.
Identification and annotation of discontinuity of semantic material.
Problem of multiple interpretations: ambiguity.
Limitations of previous schemes (2)
Coverage of the annotation:
Annotated relations: only identity
ACE-like annotation schemes constrain the annotation to noun phrases from a list of semantic types.
Genres: most annotation schemes focus on a few genres.
Limitations of previous schemes (3)
Annotation of reference
Expletives: they are not considered.
There are two people waiting for the interview.
Predication:
MUC, ACE: no distinction between predication and identity relation.
OntoNotes: no semantic criteria to decide which noun phrase is referring and which is a predicate.
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Coordination: coordinated items are considered referring expressions in corpora like MUC or OntoNotes.
[Milosevic or anyone else]
Nominals and proper names in premodifier position.
Limitations of previous schemes (4)
Identification of discontinuous semantic material.
Bill and Hillary Clinton
black cars and bikes
Multiple interpretations are not captured
[The house] is on [a long street]. [It] is very dirty.
Annotation scheme
Annotation of all noun phrases
Distinction between referring and non-referring expressions
Annotation of clitics attached to the verb and empty pronouns
Introduction of ambiguity
Introduction of discontinuous markables
Annotation of different kinds of relations: identity, discourse deixis and bridging.
Reference
Markables are classified as referring or non-referring
Non-referring markables are annotated with the type of non-referring expression
Referring markables are annotated with:
Information status: new or old.
Semantic type
Reference
Types of non-referring expressions
Expletives
[There] are two people waiting for the interview
The new car is [there]
Predicate: semantic criteria to distinguish predicate and referring expression.
[Il presidente della Repubblica, [Giorgio Napolitano]]
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Quantifiers:
[All of [the box cars]]
Coordination.
Idiomatic expressions
by [the nape of [the neck]]
Semantic types
1 Person
2 Animate
3 Organization
4 Facility
5 Geopolitical entity (GPE)
6 Location
7 Temporal
8 Numerical
9 Concrete
10 Abstract
11 Event
12 Other
13 Unknown
Annotation of ambiguity
Not always a unique interpretation for a markable.
1 Be careful hooking up [the engine] to [the boxcar] because [it] is faulty.
2 [The house] is on [a long street]. [It] is very dirty.
In case of ambiguity, we tag the markable as ambiguous and we annotate the possible interpretations.
Other possible ambiguities are:
Information status: between new and old.
Between old and non-referring.
List of annotated features
Agreement features
Gender
Number
Person
Grammatical function
Reference and information status
Semantic type
Type of non-referring
Link to antecedent
Ambiguity
Bridging
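The feature list above could be held in a record like the following. This is only an illustrative sketch: the class name `Markable`, its field names, and the choice to represent ambiguity as more than one candidate antecedent are assumptions made for the example, not the storage format used for ARRAU or LMC.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Markable:
    """Hypothetical container for the features annotated on one markable."""
    markable_id: str
    # Token offsets (start, end); more than one span models a discontinuous markable.
    spans: List[Tuple[int, int]]
    referring: bool = True
    non_referring_type: Optional[str] = None    # e.g. "expletive", "predicate"
    information_status: Optional[str] = None    # "new" or "old"
    semantic_type: Optional[str] = None         # e.g. "person", "concrete"
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    # Assumption: an ambiguous markable lists every possible antecedent.
    antecedents: List[str] = field(default_factory=list)
    bridging: List[str] = field(default_factory=list)

    @property
    def is_discontinuous(self) -> bool:
        return len(self.spans) > 1

    @property
    def is_ambiguous(self) -> bool:
        return len(self.antecedents) > 1


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("m1", [(0, 2)], information_status="new", semantic_type="concrete")
street = Markable("m2", [(4, 7)], information_status="new", semantic_type="concrete")
it = Markable("m3", [(8, 9)], information_status="old",
              antecedents=[house.markable_id, street.markable_id])
```

Here `it.is_ambiguous` is true because both the house and the street remain possible antecedents.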
Description of the annotated data
ARRAU (English)
Wall Street Journal texts
Trains dialogues
GNOME corpus
Pear stories
Live Memories Corpus for Italian (LMC)
Wikipedia sites
Blog sites
VENEX dataset
Description: English corpus
WSJ dataset
205 files
147,600 words in 5,585 sentences. 47,900 markables.
1% discontinuous markables, 12.6% non-referring.
Trains dialogues
35 files
26,000 words in 4,600 sentences. 5,200 markables.
GNOME corpus
5 files
21,600 words in 1,000 sentences. 6,100 markables.
PEAR stories
20 files
14,000 words in 2,000 sentences. 3,900 markables.
Description: Italian corpus
Wikipedia dataset:
144 files.
140,000 words in 4,700 sentences. 44,500 markables.
0.5% discontinuous markables, 0.5% clitics attached to the verb, 4.5% empty subjects.
13.7% non-referring.
Blogs dataset:
75 files.
53,000 words in 2,230 sentences. 16,000 markables.
VENEX corpus:
30 files.
20,300 words in 720 sentences. 6,220 markables.
Reliability of the annotation – ARRAU
Previous study on the annotation of anaphoric links published by Poesio and Artstein (2008)
Metric: Krippendorff’s α
α = 0.6-0.7
Statistics reflect the complexity of the task.
Reliability of the annotation – LMC
Metric: Siegel and Castellan’s κ
Information status and reference: old, new and non-referring
κ = 0.80
Basic annotation of the markable: new, phrase antecedent, segment antecedent, predicate, quantifier, expletive, coordination and idiom.
κ = 0.79
Main disagreement between discourse new and predicate
Semantic type
κ = 0.85
Reliability of the annotation – LMC
Link to the antecedent
κ = 0.88
Antecedent of clitics
κ = 0.84
Antecedent of empty pronouns
κ = 0.93
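For readers unfamiliar with the coefficient, the sketch below shows how a κ of this family is computed from two annotators' labels. It assumes exactly two coders (the slides do not state how many were used) and takes expected agreement from the pooled label distribution of both coders, which is how Siegel and Castellan's K is usually described. The function name and the toy labels are illustrative only and are not taken from the LMC study.

```python
from collections import Counter
from typing import Hashable, Sequence


def kappa_two_coders(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement with expected agreement from pooled labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: squared pooled proportion of each category.
    pooled = Counter(labels_a) + Counter(labels_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivial
    return (observed - expected) / (1 - expected)


# Toy example with four markables (invented labels).
coder_1 = ["old", "old", "new", "new"]
coder_2 = ["old", "new", "new", "new"]
kappa = kappa_two_coders(coder_1, coder_2)
```

In the toy example the coders agree on 3 of 4 items, and the pooled distribution gives an expected agreement of 34/64, so κ = 7/15, roughly 0.47.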
Use of the corpus for anaphora resolution (1)
Baseline proposed by Soon et al. (2001)
Classifier: MaxEnt
English data: ACE02, MUC-7 and ARRAU
Italian data: ICAB and LMC
Evaluation metrics:
MUC (Vilain et al. 1995)
CEAF (Luo, 2005)
Link based evaluation
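As a reminder of what the first metric measures, here is a minimal sketch of the link-based MUC score of Vilain et al. (1995). For each key chain it counts how many response partitions the chain is split into. The function names and the toy chains are assumptions for illustration; this is not the scorer used in the experiments.

```python
from typing import Dict, Hashable, List, Set, Tuple

Chain = Set[Hashable]


def _muc_side(gold: List[Chain], system: List[Chain]) -> float:
    """MUC recall of `system` against `gold`; swap the arguments for precision."""
    owner: Dict[Hashable, int] = {m: i for i, chain in enumerate(system) for m in chain}
    numerator = denominator = 0
    for chain in gold:
        if not chain:
            continue
        # Partitions of this chain induced by the other side: one per system
        # chain it touches, plus one singleton per mention the system missed.
        linked = {owner[m] for m in chain if m in owner}
        missing = sum(1 for m in chain if m not in owner)
        numerator += len(chain) - (len(linked) + missing)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_score(key: List[Chain], response: List[Chain]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for coreference chains given as mention sets."""
    recall = _muc_side(key, response)
    precision = _muc_side(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One key chain of four mentions that the response splits into two chains.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
precision, recall, f1 = muc_score(key, response)
```

In this example the response misses one of the three links needed to connect the key chain, so recall is 2/3, while both response links are correct, so precision is 1.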
Use of the corpus for anaphora resolution (2)
English corpora: ARRAU, ACE, MUC
Metric          ACE Carafe   MUC-7   ACE02   ARRAU
MUC             0.618        0.585   0.590   0.557
CEAF-AGGR Φ-3   0.537        0.379   0.393   0.683
CEAF-AGGR Φ-4   0.506        0.206   0.309   0.717
Link-based      0.638        0.594   0.532   0.540
Pronouns        0.686        0.492   0.597   0.558
Nominals        0.355        0.455   0.239   0.352
Names           0.638        0.817   0.784   0.763
Use of the corpus for anaphora resolution (3)
Italian corpora: LMC, ICAB
Metric          ICAB    LMC-Sys   LMC-Gold
MUC             0.494   0.456     0.619
CEAF-AGGR Φ-3   0.557   0.622     0.798
CEAF-AGGR Φ-4   0.560   0.671     0.869
Link-based      0.556   0.470     0.580
Pronouns        0.452   0.520     0.521
Nominals        0.421   0.303     0.522
Names           0.741   0.642     0.752
Use of the corpus for anaphora resolution (4)
Use of C4 decision trees to compare the impact of individual features.
The impact of the baseline features is similar for English and Italian with two exceptions:
The impact of gender matching is high in English, but has no effect for Italian.
The use of automatically computed aliases has a high impact for Italian and a low impact for English.
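To make the two features concrete, the sketch below gives one plausible way to compute them for a pair of mentions. Both functions are simplified assumptions for illustration: the alias check covers only acronym matching, and neither reproduces the feature extractors used in the experiments.

```python
from typing import Optional


def gender_match(gender_a: Optional[str], gender_b: Optional[str]) -> Optional[bool]:
    """True or False when both genders are known; None when either is unknown."""
    if gender_a is None or gender_b is None:
        return None
    return gender_a == gender_b


def acronym_alias(name_a: str, name_b: str) -> bool:
    """Check whether the shorter name is an acronym of the longer one."""
    short, long_ = sorted((name_a, name_b), key=len)
    # Initials of the capitalised words, so "Bank of Italy" yields "BI".
    initials = "".join(word[0] for word in long_.split() if word[0].isupper())
    return len(initials) > 1 and short.replace(".", "").upper() == initials.upper()


same_gender = gender_match("fem", "masc")
is_alias = acronym_alias("International Business Machines", "IBM")
```

A three-valued gender feature keeps "unknown" apart from "mismatch", so a classifier does not have to treat missing gender information as evidence against a link.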
Use of the data
5th International Workshop on Semantic Evaluations (SemEval 2010)
Task: Coreference Resolution in Multiple Languages.
Comparative research about zero-anaphora in Italian and Japanese
Training and evaluation of content extraction models in the Live Memories project.
Conclusions
Linguistically motivated annotation scheme applicable to English and Italian.
Scheme used to annotate different genres: newspapers, encyclopedic text, dialogue, narrative and weblogs.
Corpora are usable to build anaphora resolution models.
Datasets have been used for international competitions and for linguistic research.