corpus annotation for corpus linguistics (nov2009)
DESCRIPTION
Lecture on corpus annotation for corpus linguistics. Contents: DIY corpus, e-texts, character set and text encoding issues, document structure, DTDs, documentation; tools and issues in annotation procedures, good practices; examples from anaphora resolution and named entity recognition annotation campaigns; evaluation of corpus annotation.
TRANSCRIPT
Corpus Annotation for corpus linguistics
Jorge Baptista
Universidade do Algarve
L2F Spoken Language Laboratory, INESC ID Lisboa
Erasmus Mundus Master in Natural Language Processing and Human Language Technology
Universitat Autònoma de Barcelona, Campus de Bellaterra, November 10 and 12, 2009
Corpus Annotation for corpus linguistics, Jorge Baptista©2009
Plan
corpus linguistics
corpus annotation
before you get to work with your corpus
once you got your eText: character set, document structure, DTDs
evaluation of annotated corpus: gold standard, evaluation methods
annotating a corpus for Anaphora Resolution
annotating a corpus for Named Entity Recognition
annotation tools
references
http://www.visualthesaurus.com/
Corpus linguistics
corpus (a definition):
a large body of linguistic evidence, typically composed of attested language use
machine-readable form
well-organized collection of data
collected within a sampling frame, designed for the exploration of linguistic features
balance, representativeness
multifunctional resource, serving many different disciplines
McEnery 2003, in Mitkov (ed.) 2003: 448 ff.
corpus annotation
corpus ‘enhanced with linguistic information’
analysts (humans and/or computers)
linguistic analysis is imposed upon the corpus (making explicit the implicit linguistic information)
encoded by reference to a specified range of features
advantages of corpus annotation: ease of corpus exploitation, reusability, multifunctionality, explicit analysis
corpus annotation (continued)
markup
metadata: corpus information (doc id, speaker id, sex, age, date, review number and history, etc.)
information pertaining to the text as such: paragraphs, formatting (italics, bold)
annotation
linguistic information superimposed on the text: POS, NE tags, discourse-structure tags, referential information, syntactic tags, semantic tags (for WSD), etc.
corpus annotation (continued)
annotation process:
automatic (lemmatization, PoS tagging: 3% error rate)
semi-automatic (treebank)
manual (reference chains for anaphora resolution)
Before you get to work with your corpus*
Corpus-based approach to (computational) linguistics
Quality of corpora > RESULTS
Methodology and procedures for corpus collection, preparation and distribution
General remarks: true problems and difficulties lie in the details
text (whatever its support) and eText (in any digital medium)
* Thompson 2000 in Dale et al. 2000: 385 ff.
Once you got your eText …
Preparation
in an ideal scenario:
UNICODE (ISO 10646) encoding
SGML (ISO 8879) mark-up
in a real-world scenario:
raw text, different text-file types
different sources and poor metadata
different encodings
no markup at all, or mixed and inconsistent markup
character set and encoding
characters: abstract objects, as opposed to glyphs (their visual shapes)
character set: mapping from a set of integers (code points) to a set of characters
encoding: mapping from a computer-representable byte or word stream to a sequence of code points
ASCII, UNICODE, JIS, ISO-Latin-1 (ISO 8859-1), UTF-8
choosing, recoding, word-boundaries
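The difference between code points and encodings, and what recoding involves, can be sketched in a few lines of Python. The sample word and the choice of ISO-8859-1 as the source encoding are assumptions made for illustration only.

```python
# Characters are abstract; code points are integers; an encoding maps
# code points to byte sequences.
word = "Grã-Bretanha"  # sample string (assumption), chosen for its non-ASCII 'ã'

# Code point of each character (Unicode / ISO 10646).
code_points = [ord(ch) for ch in word]

# The same character sequence under two different encodings.
latin1_bytes = word.encode("iso-8859-1")  # one byte per character
utf8_bytes = word.encode("utf-8")         # 'ã' takes two bytes

# Recoding: decode with the encoding the file was written in,
# then encode in the target encoding.
recoded = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Decoding with the wrong encoding raises no error here but yields garbage.
garbled = utf8_bytes.decode("iso-8859-1")

print(hex(ord("ã")))                       # code point of 'ã'
print(len(latin1_bytes), len(utf8_bytes))  # byte lengths differ
print(recoded == utf8_bytes)
print(garbled)
```

The last line is the reason the source encoding of every file should be recorded in the corpus metadata: a wrong guess does not fail, it quietly corrupts the text.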
document structure
any eText already has some structure:
words, sentences, paragraphs, quotations, headings, …
font size and face changes
what to notate explicitly?
sentence boundaries (never replace orthographic symbols, but always add sentence boundaries)
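A minimal sketch of the "add, never replace" principle for sentence boundaries. The `<s>` tag name, the helper name `mark_sentences` and the naive splitting rule are assumptions; real text needs a proper sentence splitter that handles abbreviations, numbers and quotations.

```python
import re

def mark_sentences(text):
    """Wrap each sentence in <s>...</s> without touching the original text."""
    # Naive rule (assumption): a sentence runs from a non-space character
    # up to the next run of '.', '!' or '?'. Text that does not end in
    # such punctuation is left unwrapped. The input is assumed to be
    # plain text with no markup characters of its own.
    return re.sub(r"\S[^.!?]*[.!?]+", lambda m: f"<s>{m.group(0)}</s>", text)

original = "John arrived. He looked tired."
marked = mark_sentences(original)
print(marked)

# Stripping the added tags gives back the original text unchanged.
restored = re.sub(r"</?s>", "", marked)
print(restored == original)
```

Because punctuation and spacing are kept, the annotation can always be removed again without any loss.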
document structure (continued)
How is explicit structural information recorded?
aim: most user-friendly and reusable way
1. design your own idiosyncratic annotation syntax
2. use a database
3. use a standard markup language: SGML, XML
a. public DTD (document-type definition): TEI, CES
b. design your very own DTD
document structure (continued)
SGML (Standard Generalized Markup Language) ISO 8879
XML (eXtensible Markup Language): simplified version of SGML, originally targeted at providing flexible document markup for the WWW
low-level grammar of annotation (how is markup to be distinguished from text)
definition of the structure of families of related documents or document types
Text Encoding Initiative (TEI)
sponsored by ACL, ALLC and ACH
guidelines to facilitate data exchange
standardizing mark-up or encoding of information stored in electronic form
each text (document):
header <teiHeader>
body
each one may have several elements
TEI
Header <teiHeader>
file description <fileDesc>: full bibliographic description of an electronic file
encoding description <encodingDesc>: relates eText to its source(s)
text profile <profileDesc>: non-bibliographic description, languages, sublanguages, situation of production, participants and settings
revision history <revisionDesc>: records changes made to file
TEI
Body of document
<p>, <s>, <w>, <c>
<w POS=AT0>the</w>
simplified: <w AT0>the
TEI scheme may be expressed in different formal languages: SGML, XML (system independent)
XML (simplified SGML, for the web)
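Word-level markup of this kind can be generated rather than typed by hand. The sketch below builds one `<s>` element with `<w>` children using Python's standard library. Only the `AT0` tag comes from the slide; the other two tags are illustrative assumptions in the same style, and the output is not validated against any TEI schema.

```python
import xml.etree.ElementTree as ET

# (word form, PoS tag) pairs; AT0 is from the slide, NN1 and VVD are
# assumed here for illustration.
tagged = [("the", "AT0"), ("door", "NN1"), ("opened", "VVD")]

s = ET.Element("s")
for form, pos in tagged:
    w = ET.SubElement(s, "w", {"POS": pos})
    w.text = form

xml_string = ET.tostring(s, encoding="unicode")
print(xml_string)
```

In XML, unlike the SGML shorthand shown on the slide, attribute values are always quoted and every `<w>` is explicitly closed.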
Corpus Encoding Standard (CES)
specifically designed for encoding language corpora
EAGLES (Expert Advisory Group on Language Engineering Standards)
TEI-compliant application of SGML
available both in SGML and XML (XCES)
DTDs (document-type definitions)
context-free grammars of allowed tag structures
allowed attributes for each tag
up-translation
consistency
preexisting markup > replace > XML
sed, awk, perl scripting
record every step! (backtracking changes)
manual post-processing > context-sensitive patches
diff
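A small up-translation sketch in Python, standing in for the sed/awk/perl scripts mentioned above. The input conventions (asterisks for emphasis, a blank line between paragraphs), the `<hi>` and `<p>` target tags and the helper name `apply_step` are all assumptions for this example. Each step is logged and the result is diffed against the input, so that every change can be traced back.

```python
import difflib
import re

# Hypothetical pre-existing markup: *word* for emphasis, a blank line
# between paragraphs.
raw = "Dividir o *IRA*, eis a estratégia\n\nHugo Estenssoro, em Londres"

steps = []  # log of every transformation applied, in order

def apply_step(name, pattern, replacement, text):
    """Apply one regex substitution and record how many changes it made."""
    new_text, count = re.subn(pattern, replacement, text)
    steps.append((name, count))
    return new_text

# Step 1: ad hoc emphasis markers become explicit elements.
text = apply_step("emphasis", r"\*([^*]+)\*", r"<hi>\1</hi>", raw)

# Step 2: blank-line-separated blocks become explicit paragraphs.
text = "\n".join(f"<p>{p}</p>" for p in text.split("\n\n"))
steps.append(("paragraphs", text.count("<p>")))

# The diff shows exactly what the up-translation changed.
diff = list(difflib.unified_diff(raw.splitlines(), text.splitlines(),
                                 "before", "after", lineterm=""))
print("\n".join(diff))
print(steps)
```

Cases that the regular expressions cannot decide are the ones left for manual, context-sensitive patching, and the step log tells you where to look.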
DTDs
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE colHAREM [
<!ELEMENT colHAREM (DOC)*>
<!ATTLIST colHAREM
  versao CDATA #REQUIRED>
<!ELEMENT DOC (#PCDATA|ALT|EM|OMITIDO|P)*>
<!ATTLIST DOC
  DOCID CDATA #REQUIRED>
<!ELEMENT P (#PCDATA|ALT|EM|OMITIDO)*>
<!ELEMENT ALT (#PCDATA|EM|OMITIDO)*>
<!ELEMENT EM (#PCDATA)>
<!ATTLIST EM
  ID CDATA #REQUIRED
  CATEG CDATA #IMPLIED
  TIPO CDATA #IMPLIED
  SUBTIPO CDATA #IMPLIED
  COMENT CDATA #IMPLIED
  TIPOREL CDATA #IMPLIED
  COREL CDATA #IMPLIED
  TEMPO_REF (ENUNCIACAO|TEXTUAL) #IMPLIED
  SENTIDO (ANTERIOR|POSTERIOR|SIMULT|ANTERIOR_OU_SIMUL|POSTERIOR_OU_SIMULT) #IMPLIED
  VAL_DELTA CDATA #IMPLIED
  VAL_NORM CDATA #IMPLIED>
<!ELEMENT OMITIDO (#PCDATA|EM)*>
]>
<colHAREM versao="ColeccaoSegundoHAREM-2.0">
<DOC DOCID="cha-73943">
<P>Dividir o IRA, eis a estratégia</P>
<P>Hugo Estenssoro, em Londres</P>
<P>O IRA esteve esta semana na ofensiva, paralisando o aeroporto de Londres e causando prejuízos à temporada turística britânica, com presença obrigatória nas grandes manchetes. As bombas não explodiram, mas o IRA matou um polícia no Ulster em frente à esposa grávida. Foi uma violência anunciada: o líder do Sinn Fein -- o braço político do IRA -- falara poucos dias antes num «`show' espectacular» como resposta à iniciativa anglo-irlandesa lançada pelos primeiros-ministros da Grã-Bretanha e da República da Irlanda com a sua «declaração» de 15 de Dezembro do ano passado. Mas a campanha terrorista foi só parte da resposta.</P>
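The excerpt above shows the DTD and some unannotated paragraphs, but no `EM` elements. To illustrate how such annotations might be read back, here is a sketch over a hypothetical fragment written in the style of that DTD. The `versao`, `DOCID`, `ID` and `CATEG` values are invented for illustration. Python's `xml.etree.ElementTree` does not validate against a DTD, so the `#REQUIRED` attributes are checked by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment; all attribute values are invented.
sample = """<colHAREM versao="demo">
<DOC DOCID="demo-1">
<P><EM ID="e1" CATEG="PESSOA">Hugo Estenssoro</EM>, em
<EM ID="e2" CATEG="LOCAL">Londres</EM></P>
</DOC>
</colHAREM>"""

root = ET.fromstring(sample)

entities = []
for doc in root.iter("DOC"):
    # DOCID is declared #REQUIRED on DOC in the DTD.
    assert "DOCID" in doc.attrib, "DOC is missing the required DOCID"
    for em in doc.iter("EM"):
        # ID is declared #REQUIRED on EM in the DTD.
        assert "ID" in em.attrib, "EM is missing the required ID"
        entities.append((doc.get("DOCID"), em.get("ID"),
                         em.get("CATEG"), em.text))

for row in entities:
    print(row)
```

For real validation against the DTD, a validating parser is needed; the hand-written checks here only cover the two required attributes.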
Evaluation of annotated corpus
machine-learning techniques
evaluation of NLP systems
analysis systems (linguistic input → abstract representation or classification)
gold standard (‘correct’ output)
analysis components: segmentation, tagging, information extraction and information retrieval
Hirschman and Mani (2003) in Mitkov (ed.) 2003 : 414 ff.
gold-standard-based measures
gold-standard evaluation methods:
definition of evaluation task and an associated ‘gold-standard’ format
annotation guidelines
annotation and scoring tools
validation (inter-annotator agreement)
annotated training and test corpora
release (data+tools), evaluation
interpretation (baseline and ceiling)
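Two of the measures behind these steps can be sketched briefly. Cohen's kappa is one common way to quantify inter-annotator agreement, and precision, recall and F1 are the usual way to score system output against a gold standard. The slides do not say which measures were used, so both choices are assumptions, and all the data below is invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both always use one identical label.
    return (observed - expected) / (1 - expected)

def precision_recall_f1(gold, system):
    """Score a set of system annotations against the gold standard."""
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy data (invented): tags assigned by two annotators to five tokens.
ann_a = ["N", "V", "N", "ADJ", "N"]
ann_b = ["N", "V", "ADJ", "ADJ", "N"]
kappa = cohen_kappa(ann_a, ann_b)

# Toy data (invented): entity annotations as (start, end, type) triples.
gold = {(0, 4, "PER"), (10, 16, "LOC"), (20, 23, "ORG")}
system = {(0, 4, "PER"), (10, 16, "ORG")}
p, r, f = precision_recall_f1(gold, system)

print(round(kappa, 4), p, round(r, 4), round(f, 4))
```

Inter-annotator agreement is often read as the ceiling for system performance, since a system can hardly be expected to match the gold standard more closely than human annotators match each other.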
Annotating a corpus for Anaphora Resolution
John arrived. He looked tired.
antecedent: John; anaphor: He; anaphora: the relation linking the two
AR (continued)
John arrived. He looked tired.
<NE ID="267" TYPE="person">John</NE> arrived.
<REF TYPE="pro" COREF="267">He</REF> looked tired.
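Once the links are encoded this way, resolving an anaphor amounts to following its `COREF` pointer. A minimal sketch, assuming the attribute values are quoted and the two sentences are wrapped in a single root element so that the fragment parses as XML; the `<text>` wrapper is an assumption, not part of the slide.

```python
import xml.etree.ElementTree as ET

# The slide's example with quoted attribute values and an assumed
# <text> wrapper element.
annotated = ('<text><NE ID="267" TYPE="person">John</NE> arrived. '
             '<REF TYPE="pro" COREF="267">He</REF> looked tired.</text>')

root = ET.fromstring(annotated)

# Index the candidate antecedents by their ID.
antecedents = {ne.get("ID"): ne.text for ne in root.iter("NE")}

# Follow each anaphor's COREF pointer back to its antecedent;
# a pointer to an unknown ID yields None.
links = [(ref.text, antecedents.get(ref.get("COREF")))
         for ref in root.iter("REF")]
print(links)
```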
AR (an exercise)
identification of all the markables (NPs) in a text, regardless of whether they were coreferential or not
coref and ucoref (out of ARE)
relations marked between entities: IDENTITY, SYNONYMY, GENERALISATION and SPECIALISATION
Indirect anaphora relation was not annotated: (the house ... the door)
Hasler et al. (2006); Orasan et al. (2009)
task#1 Pronominal AR on pre-annotated texts
evaluation of pronoun algorithms
NPs annotated (known candidates)
only PRO NP were marked referential (to be resolved)
no influence from wrongly identified candidates
task#2 Coreferential chains on pre-annotated texts
cluster coreferential NPs together in coreferential chains
all referential NP were marked (to be resolved), not only PRO
NPs outside coreferential chains were not annotated
no influence from wrongly identified candidates
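Clustering pairwise coreference links into chains can be sketched with a small union-find structure. This illustrates the data structure only, not the procedure used in the exercise; the function name `build_chains`, the mention ids and the links are invented.

```python
def build_chains(mentions, links):
    """Group mentions into coreference chains from pairwise links."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative of m's chain.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    # Each link merges the chains of its two mentions.
    for a, b in links:
        parent[find(a)] = find(b)

    chains = {}
    for m in mentions:  # keeps mentions in text order within a chain
        chains.setdefault(find(m), []).append(m)
    # NPs that corefer with nothing are left out, as in task #2.
    return [chain for chain in chains.values() if len(chain) > 1]

# Invented example: unique mention ids in text order, plus pairwise links.
mentions = ["John", "He", "the house", "his", "Mary"]
links = [("He", "John"), ("his", "He")]
chains = build_chains(mentions, links)
print(chains)
```

The link from "his" to "He" places it in the same chain as "John" without any direct link between the two, which is what distinguishes chain annotation from pairwise pronoun resolution.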
an example: NER
www.linguateca.pt/avaliacaoconjunta
annotation tools
PALinkA: Perspicuous and Adjustable Links Annotator http://clg.wlv.ac.uk/projects/PALinkA/index.php
Alembic Workbench: a natural language engineering environment for the development of tagged corpora http://www.mitre.org/tech/alembic-workbench/
ATLAS: Architecture and Tools for Linguistic Analysis Systems http://www.nist.gov/speech/atlas/
CLaRK system: an XML-based system for corpora development http://www.bultreebank.org/clark/index.html
GATE: an architecture, framework and development environment for language engineering, which can also be used to annotate texts http://www.gate.ac.uk/
MMAX: a tool for multi-modal annotation in XML (the new version is no longer free) http://mmax.eml-research.de/
References
Dale, Robert; Moisl, Hermann; Somers, Harold (eds.). 2000. Handbook of Natural Language Processing. New York/Basel: Marcel Dekker, Inc.
Hasler, Laura K.; Naumann, K.; Orasan, C. 2006. Guidelines for Annotation of Within-document NP Coreference. http://clg.wlv.ac.uk/projects/NP4E/NP_guidelines_2006.pdf
Hajičová, E.; Abeillé, A.; Hajič, J.; Mirovský, J. 2010. Treebank annotation. in Indurkhya and Damerau (2010): 167-188.
Hirschman, Lynette; Mani, Inderjeet. 2003. Evaluation. in Mitkov, Ruslan (ed.) 2003, pp. 414-429.
Indurkhya, Nitin; Damerau, Fred (eds.). 2010. Handbook of Natural Language Processing (2nd ed.). Chapman & Hall/CRC.
McEnery, Tony. 2003. Corpus Linguistics. in Mitkov, Ruslan (ed.) 2003, pp. 448-463.
McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.
Mitkov, Ruslan (ed.). 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Mitkov, Ruslan; Orasan, Constantin; Evans, Richard. 1999. The importance of annotated corpora for NLP: the cases of anaphora resolution and clause splitting. TALN ’99. http://clg.wlv.ac.uk/papers/mitkov-99b.pdf
Orăsan, Constantin; Cristea, Dan; Mitkov, Ruslan; Branco, António. 2008. Anaphora Resolution Exercise: An overview. Proceedings of 6th Language Resources and Evaluation Conference (LREC’2008), Marrakesh, Morocco, 28 – 30 May. http://clg.wlv.ac.uk/papers/713_paper.pdf
Renouf, Antoinette; Kehoe, Andrew (eds.). 2009. Corpus Linguistics: Refinements and Reassessments. Amsterdam/New York: Rodopi.
Thompson, Henry S. 2000. Corpus Creation for Data-Intensive Linguistics. in Dale et al. (eds.) 2000, pp. 385-401.
Xiao, Richard. 2010. Corpus Creation. in Indurkhya and Damerau (2010): 147-166.
Resources
http://www.ldc.upenn.edu/annotation/
http://www.routledge.com/textbooks/0415286239