![Page 1: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/1.jpg)
Evaluation issues in
anaphora resolution and beyond
Ruslan Mitkov
University of Wolverhampton
Faro, 27 June 2002
![Page 2: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/2.jpg)
Evaluation
Evaluation is a driving force for every NLP task/approach/application
Evaluation indicates how well a specific approach/application performs and, no less importantly, shows where it stands in comparison with other approaches/applications
Growing research in evaluation inspired by the availability of annotated corpora
![Page 3: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/3.jpg)
Major impediments to fulfilling evaluation’s mission
• Different approaches evaluated on different data
• Different approaches evaluated in different modes
• Results not independently confirmed
As a result, no comparison or objective evaluation is possible
![Page 4: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/4.jpg)
Anaphora resolution vs. coreference resolution
• Anaphora resolution has to do with tracking
down an antecedent of an anaphor
• Coreference resolution seeks to identify all
coreference classes (chains)
![Page 5: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/5.jpg)
Anaphora resolution
For nominal anaphora which involves coreference, it would be logical to regard each of the preceding noun phrases which are coreferential with the anaphor(s) as a legitimate antecedent.
Example: Computational Linguists from many different countries attended PorTAL. The participants enjoyed the presentations; they also took an active part in the discussions.
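The lenient notion of a correct antecedent described on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, and mentions are identified by their surface strings only to keep the example short (a real scorer would use mention ids or offsets).

```python
from typing import List, Sequence


def legitimate_antecedents(chain: Sequence[str], anaphor: str) -> List[str]:
    """Return every mention that precedes the anaphor in its coreference chain.

    `chain` lists the coreferential noun phrases in textual order.
    """
    position = chain.index(anaphor)
    return list(chain[:position])


def is_correct(proposed: str, chain: Sequence[str], anaphor: str) -> bool:
    """A proposed antecedent counts as correct if it is any preceding chain member."""
    return proposed in legitimate_antecedents(chain, anaphor)


# Coreference chain from the slide's example, in textual order.
chain = [
    "Computational Linguists from many different countries",
    "The participants",
    "they",
]

print(legitimate_antecedents(chain, "they"))
print(is_correct("The participants", chain, "they"))
print(is_correct("the presentations", chain, "they"))
```

Under this scoring, a system that links "they" to either preceding noun phrase in the chain is credited, while a link to "the presentations" is not.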
![Page 6: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/6.jpg)
Evaluation in anaphora resolution
Two perspectives:
• Evaluation of anaphora resolution algorithms
• Evaluation of anaphora resolution systems
![Page 7: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/7.jpg)
Recall and Precision
MUC introduced the measures recall and
precision for coreference resolution.
These measures, as defined, are not
satisfactory in terms of clarity and
coverage (Mitkov 2001).
![Page 8: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/8.jpg)
Evaluation package for anaphora resolution algorithms (Mitkov 1998; 2000)
The evaluation package for anaphora resolution algorithms consists of
(i) performance measures,
(ii) comparative evaluation tasks and
(iii) component measures.
![Page 9: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/9.jpg)
Performance measures
Success rate
Critical success rate
Critical success rate applies only to those ‘tough’
anaphors which still have more than one
candidate for antecedent after the gender and
number filters have been applied
![Page 10: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/10.jpg)
Example
• Evaluation data: 100 anaphors
• Number of anaphors correctly resolved: 80
• Number of anaphors correctly resolved by the gender and number constraints alone: 30
Success rate: 80/100 = 80%
Critical success rate: (80 - 30)/(100 - 30) = 50/70 = 71.4%
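The arithmetic of this example can be reproduced with a short sketch. It assumes that the 30 anaphors are those already resolved by the gender and number filters alone, which is the reading under which 50/70 follows; the function and parameter names are illustrative, not taken from the talk.

```python
def success_rate(correct: int, total: int) -> float:
    """Fraction of all anaphors that were resolved correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def critical_success_rate(correct: int, total: int, resolved_by_filters: int) -> float:
    """Success rate restricted to the 'tough' anaphors.

    Anaphors left with a single candidate after the gender and number
    filters are removed from both the numerator and the denominator.
    """
    tough = total - resolved_by_filters
    if tough <= 0:
        raise ValueError("no anaphors remain after filtering")
    return (correct - resolved_by_filters) / tough


print(f"Success rate: {success_rate(80, 100):.1%}")
print(f"Critical success rate: {critical_success_rate(80, 100, 30):.1%}")
```

The second figure is always the lower of the two here, because the anaphors removed from the count are exactly the ones that were trivially correct.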
![Page 11: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/11.jpg)
Comparative evaluation tasks
• Evaluation against baseline models
• Comparison to similar approaches
• Comparison with well-established approaches
Approaches frequently used for comparison:
Hobbs (1978), Brennan et al. (1987), Lappin and Leass (1994), Kennedy and Boguraev (1996), Baldwin (1997), Mitkov (1996; 1998)
![Page 12: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/12.jpg)
Component measures
• Relative importance
• Decision power (Mitkov 2001)
![Page 13: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/13.jpg)
Evaluation measures for anaphora resolution systems
• Success rate
• Critical success rate
• Resolution etiquette (Mitkov et al. 2002)
![Page 14: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/14.jpg)
Reliability of evaluation results
Evaluation results can be regarded as reliable if the evaluation
(i) covers all naturally occurring texts, or
(ii) employs sampling procedures
![Page 15: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/15.jpg)
Relative vs. absolute results
• Results may be relative with regard to a specific evaluation set or other approach
• More “absolute” figures could be obtained if there existed a measure which quantified the complexity of the anaphors to be resolved
![Page 16: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/16.jpg)
Measures quantifying complexity in anaphora resolution
Measures for complexity (Mitkov 2001):
• Knowledge required for resolution
• Distance between anaphor and
antecedent (in NPs, clauses, sentences)
• Number of competing candidates
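Two of the measures listed above, distance and number of competing candidates, lend themselves to a simple computation over an annotated sample. The sketch below is only an illustration under assumed data: the `AnnotatedAnaphor` record and its fields are hypothetical, distance is counted in sentences only, and the knowledge required for resolution is not quantified here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class AnnotatedAnaphor:
    """Hypothetical record for one annotated anaphor."""
    anaphor_sentence: int      # index of the sentence containing the anaphor
    antecedent_sentence: int   # index of the sentence containing its antecedent
    competing_candidates: int  # candidates left to choose from


def complexity_profile(sample: Sequence[AnnotatedAnaphor]) -> Dict[str, float]:
    """Average sentence distance and average number of competing candidates."""
    return {
        "mean_sentence_distance": mean(
            a.anaphor_sentence - a.antecedent_sentence for a in sample
        ),
        "mean_competing_candidates": mean(a.competing_candidates for a in sample),
    }


# Made-up sample, for illustration only.
sample = [
    AnnotatedAnaphor(anaphor_sentence=3, antecedent_sentence=3, competing_candidates=2),
    AnnotatedAnaphor(anaphor_sentence=5, antecedent_sentence=4, competing_candidates=4),
    AnnotatedAnaphor(anaphor_sentence=9, antecedent_sentence=7, competing_candidates=6),
]

print(complexity_profile(sample))
```

Reporting such a profile next to a success rate would make it easier to tell whether two evaluation sets are of comparable difficulty.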
![Page 17: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/17.jpg)
Fair evaluation
Algorithms should be evaluated on the
basis of the same
• Evaluation data
• Pre-processing tools
![Page 18: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/18.jpg)
Evaluation workbench
Evaluation workbench for anaphora resolution (Mitkov 2000; Barbu and Mitkov 2001)
• Allows the comparison of approaches sharing common principles or similar pre-processing
• Enables the ‘plugging in’ and testing of different anaphora resolution algorithms
All algorithms implemented operate in a fully automatic mode
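The idea of plugging different algorithms into one workbench, so that all of them see the same evaluation data and the same pre-processed input, can be sketched as below. This is not the architecture of the actual workbench: the `Resolver` signature, the two toy baselines and the data are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# One evaluation item: (anaphor, candidates in textual order, gold antecedents).
Item = Tuple[str, List[str], Set[str]]
# A pluggable algorithm: given an anaphor and its candidates, pick one candidate.
Resolver = Callable[[str, List[str]], str]


def most_recent_candidate(anaphor: str, candidates: List[str]) -> str:
    """Toy baseline: choose the candidate closest to the anaphor."""
    return candidates[-1]


def first_candidate(anaphor: str, candidates: List[str]) -> str:
    """Toy baseline: choose the earliest candidate."""
    return candidates[0]


def run_workbench(resolvers: Dict[str, Resolver], data: List[Item]) -> Dict[str, float]:
    """Run every plugged-in resolver on the same data and return success rates."""
    scores = {}
    for name, resolve in resolvers.items():
        correct = sum(resolve(anaphor, cands) in gold for anaphor, cands, gold in data)
        scores[name] = correct / len(data)
    return scores


# Made-up evaluation data shared by all resolvers.
data: List[Item] = [
    ("they", ["linguists", "countries", "participants", "presentations"],
     {"linguists", "participants"}),
    ("it", ["report", "committee"], {"report"}),
    ("she", ["Mary"], {"Mary"}),
]

scores = run_workbench(
    {"most_recent": most_recent_candidate, "first": first_candidate}, data
)
print(scores)
```

Because the candidate lists are produced once and passed to every resolver, differences in the scores reflect the algorithms rather than differences in pre-processing.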
![Page 19: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/19.jpg)
The need for annotated corpora
Annotated corpora are vital for training
and evaluation
Annotation should cover anaphoric or
coreferential chains, not only
anaphor-antecedent pairs
![Page 20: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/20.jpg)
Scarce commodity
Lancaster Anaphoric Treebank (100 000 words)
MUC coreference task annotated data (65 000 words)
Part of the Penn Treebank (90 000 words)
![Page 21: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/21.jpg)
Additional issues
Annotation scheme
Annotating tools
Annotation strategy
Inter-annotator (dis)agreement is a major issue!
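The slide does not say how agreement should be measured. One commonly used statistic is Cohen's kappa, which corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below assumes that each annotator assigns one label per markable; the labels and data are made up.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Made-up decisions on four markables: is the markable coreferential or not?
annotator_a = ["coref", "coref", "none", "none"]
annotator_b = ["coref", "none", "none", "none"]

print(cohen_kappa(annotator_a, annotator_b))
```

In this toy case the annotators agree on three of four items, yet half of that agreement would be expected by chance, so the chance-corrected figure is noticeably lower than the raw 75%.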
![Page 22: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/22.jpg)
The Wolverhampton coreference annotation project
A 500 000-word corpus annotated for
anaphoric and coreferential links (identity-
of-sense direct nominal anaphora)
Less ambitious in terms of coverage, but
much more consistent
![Page 23: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/23.jpg)
Watch out for the traps!
• Are all annotated data reliable?
• Are all original documents reliable?
• Are all results reported “honest”?
![Page 24: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/24.jpg)
Morale and motivation important!
If I may offer you my advice...
• Do not despair if your first evaluation results are not as high as you wanted them to be
• Be prepared to provide considerable input in exchange for a minor performance improvement
• Work hard
• Be transparent
... and you'll get there!
![Page 25: Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c0051a28abf838cc5280/html5/thumbnails/25.jpg)
Anaphora resolution projects
Ruslan Mitkov’s home page
http://www.wlv.ac.uk/~le1825
Research Group in Computational Linguistics
http://clg.wlv.ac.uk