Enabling xComForTable Mapping to the Linguistic Annotation Framework
Marion Freese
Sony International (Europe) GmbH;
IMS, Universität Stuttgart;
hmb Datentechnik GmbH
LREC 2004, 05/29/2004, Marion Freese
Overview
xComForT – Outline
Relevance for richly annotated corpora
xComForT Features
– Adaptation to new text formats
– Integration of annotation tools
Proposal for integration into LAF
Summary
xComForT – What is it?
extensible Common Format for Text based on
– XML
– Text Encoding Initiative (TEI)
– Corpus Encoding Standard (CES / XCES)
provides extensibility and reusability
xComForT – What’s it for?
NOT
– Standard for linguistic annotation
BUT
– Standards proposal for structural annotation of primary data
– Common anchor for linguistic annotations (LA)
– Set of guidelines for LA architecture (company-internal standard)
Example: Newspaper (plain text)
[Figure: newspaper page with its structural parts labelled: byline, copyright, meta information, headline, quotation, byline, dateline, paragraph]
xComForT – Primary Document Example
<xcomfortDoc type="text" extension="SZ" version="v0.6" TEIform="TEI.2">
  <cesHeader ...> <!-- ... --> </cesHeader>
  <text xml:lang="de">
    <!-- ... -->
    zu erhalten.</p>
    <byline type="signer">
      <docAuthor type="short">mgd</docAuthor>
    </byline>
    </div>
    <div type="article" id="d19990104_a12">
      <opener id="d19990104_a12o">
        <divMeta>
          <publDate>Montag, 4. Januar 1999</publDate>
          <cat target="ns8"><hi>BAYERN</hi></cat>
          <!-- ... -->
        </divMeta>
        <head id="d19990104_a12hl1">Kafkaeskes Augsburg</head>
        <head id="d19990104_a12hl2" type="sub">Der nächste Akzent <!-- ... --></head>
        <byline type="main">Von <docAuthor type="full">Peter Richter</docAuthor></byline>
        <dateline><location>Augsburg</location> – </dateline>
        <p id="d19990104_a12p1">Auch wenn nicht <!-- ... --></p>
        <!-- ... -->
</xcomfortDoc>
xComForT – Data Architecture
[Figure: xComForT storage format. Labels recoverable from the diagram:]
level 1: base document
level 2: token level (token stream)
level 3: 1st linguistic level (e.g. PoS, lemma, pronunciation streams)
level 4: 2nd linguistic level (e.g. parse tree stream, e.g. intonation stream)
further streams: e.g. morpheme, syllable streams; e.g. sentence, chunk, mw streams
link types between streams: substring / 1:1; range-to / 1:1; 1:1 (#id)
segInfo
Relevance for richly annotated Corpora
Standoff markup
– supports huge amounts of annotation data
» alternative / concurrent / ambiguous annotations
» partial / underspecified results
» flexible merging
» various annotation types (multimodal, multimedia, metadata, …); media independence
– reduces annotation dependencies
Support for integration of external tools for annotation and exploitation
common standards-based starting point for rich annotation
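The stand-off idea above can be illustrated with a minimal Python sketch. It is not the xComForT storage format itself: the `Annotation` class, the layer names `pos:taggerA` and `pos:taggerB`, and the tag values are hypothetical and chosen only for illustration. The sketch shows how concurrent, conflicting annotations can coexist because each one only points into the unchanged base text, and how layers can be merged on demand.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    layer: str   # hypothetical layer name, e.g. "pos:taggerA"
    start: int   # 0-based offset into the base text
    end: int     # exclusive end offset
    value: str

# The base text is never modified; every annotation only refers to it.
BASE_TEXT = "Kafkaeskes Augsburg"

annotations = [
    Annotation("token", 0, 10, "tok1"),
    Annotation("token", 11, 19, "tok2"),
    Annotation("pos:taggerA", 0, 10, "ADJA"),
    Annotation("pos:taggerB", 0, 10, "NN"),   # concurrent, conflicting analysis
    Annotation("pos:taggerA", 11, 19, "NE"),  # taggerB result missing (partial)
]

def merge(annotations, layers):
    """Group the values of the selected layers by the span they annotate."""
    merged = {}
    for a in annotations:
        if a.layer in layers:
            merged.setdefault((a.start, a.end), {})[a.layer] = a.value
    return merged

view = merge(annotations, {"pos:taggerA", "pos:taggerB"})
```

Here `view` keeps both analyses for the first token side by side and only one for the second, without any change to `BASE_TEXT`.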
Comparison with CES
Structural markup and linguistic annotation are strictly separated in xComForT
– provides a common base format for arbitrary linguistic annotation
– allows for using a consistent annotation schema
Primary document DTD is easily extensible while retaining TEI conformance
xComForT provides more flexibility than CES with respect to resource formats (e.g. integration of different modalities is possible)
Creation of an extended DTD for storage
[Figure: creation of an extended DTD for storage. Labels recoverable from the diagram:]
core markup definition: xComForT.ent, xComForT.dtd
extension definition: xcomfort_new.ent, xcomfort_new.dtd (sections: class.mod, class.new, class.comments, elem.mod, elem.new, elem.comments)
template
TEI conformant storage format: xComForT_store.dtd
TEI conformant extension storage format: xComForT_store_new.dtd
Extension Definition Support
core markup definition contains an extension entity for each element and entity, e.g.
» <!ENTITY % x.byline ''>
» <!ELEMENT byline (#PCDATA | author %x.byline;)*>
an extension definition redefines the entity, e.g.
» <!ENTITY % x.byline '| interviewer'>
which results in the content model
» <!ELEMENT byline (#PCDATA | author | interviewer)*>
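The extension mechanism relies on a rule of XML: when a parameter entity is declared more than once, the first declaration read is the binding one, so an extension that is read before the core definition overrides the empty default. The following Python sketch imitates that rule with plain string substitution. It is only an illustration, not a DTD parser; the helper names `collect_entities` and `expand` are hypothetical, and the content model is written with the trailing `*` that XML requires for mixed content.

```python
import re

# Matches internal parameter entity declarations such as <!ENTITY % x.byline ''>
ENTITY_DECL = re.compile(r"<!ENTITY\s+%\s+([\w.]+)\s+(['\"])(.*?)\2\s*>")
# Matches parameter entity references such as %x.byline;
ENTITY_REF = re.compile(r"%([\w.]+);")

def collect_entities(*dtd_parts):
    """Collect parameter entities; as in XML, the first declaration read wins."""
    entities = {}
    for part in dtd_parts:
        for name, _quote, value in ENTITY_DECL.findall(part):
            entities.setdefault(name, value)
    return entities

def expand(declaration, entities):
    """Replace each %name; reference with the entity's replacement text."""
    return ENTITY_REF.sub(lambda m: entities[m.group(1)], declaration)

CORE = """<!ENTITY % x.byline ''>
<!ELEMENT byline (#PCDATA | author %x.byline;)*>"""
EXTENSION = "<!ENTITY % x.byline '| interviewer'>"

element_decl = CORE.splitlines()[1]
# Core definition alone: the extension entity is empty.
plain = expand(element_decl, collect_entities(CORE))
# Extension read first: its declaration takes precedence over the core default.
extended = expand(element_decl, collect_entities(EXTENSION, CORE))
```

With the extension in place, `extended` holds the content model that also admits `interviewer`, while the core definition file itself stays untouched.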
Integration of Annotation Tools
Toolbox support for converting annotation tool output to xComForT
[Figure: data flow around annotate.perl. Labels: xComForT document; element names, e.g. <elem>p</elem>; type of annotation, e.g. sentence; annotation stream, e.g. <s xlink:href=".."/>]
text nodes for annotation tool input:
<tn ancestors="div p" parentID="div1.p1">With</tn>
<tn ancestors="div p" parentID="div1.p1">the</tn>
...
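How such text-node input could be derived from a structurally annotated document is sketched below in Python. This is not the toolbox's annotate.perl, whose implementation is not shown on the slides. The functions `text_nodes` and `to_tn` and the sample document are assumptions made for illustration, and the sketch emits one `tn` element per text node rather than per word.

```python
import xml.etree.ElementTree as ET

def text_nodes(root):
    """Yield (ancestor_names, parent_id, text) for every non-empty text node."""
    def walk(elem, path):
        path = path + [elem.tag]
        if elem.text and elem.text.strip():
            yield path, elem.get("id"), elem.text.strip()
        for child in elem:
            yield from walk(child, path)
            # Text that follows a child element belongs to the current element.
            if child.tail and child.tail.strip():
                yield path, elem.get("id"), child.tail.strip()
    yield from walk(root, [])

def to_tn(root):
    """Serialize every text node as a <tn> element with its structural context."""
    lines = []
    for path, parent_id, text in text_nodes(root):
        tn = ET.Element("tn", {"ancestors": " ".join(path),
                               "parentID": parent_id or ""})
        tn.text = text
        lines.append(ET.tostring(tn, encoding="unicode"))
    return lines

# Hypothetical sample document, not taken from the corpus.
doc = ET.fromstring(
    '<div id="div1"><p id="div1.p1">With the <hi>new</hi> format</p></div>'
)
tn_lines = to_tn(doc)
```

Each entry in `tn_lines` carries the names of the enclosing elements and the id of the parent, so an annotation tool can work on plain text while the results remain traceable to the document structure.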
Linguistic Annotation Tools – implemented examples
input and output formats of
– Tokenizer (from IMS, University of Stuttgart)
» tokens
» sentences
– IMS TreeTagger
» lemma
» part-of-speech
Relation to current LAF standardization issues (1)
General requirements for the standard for a Linguistic Annotation Framework (LAF) (cf. Ide & Romary 2003)
xComForT conforms to these requirements, i.e. to
– Media independence
– Human readability
– Processability
Relation to current LAF standardization issues (2)
Remaining requirements are xComForT’s main features, i.e.
– Consistency
– Uniformity
– Incrementality
– Expressiveness
Two proposals for integration into the LAF
Mapping between proprietary resource formats and the LAF annotation data model
Resource reusability
Proposal to the LAF (1-1)
[Figure: LAF architecture (Ide & Romary), with the dump format highlighted]
Proposal to the LAF (1-2)
Dump Format conforming to xComForT guidelines
Advantages
– Direct mapping from/to user-defined formats
– Support for annotation tool integration
– Easy conversion into proprietary formats
Disadvantages
– xComForT is possibly not the most adequate/efficient processing format
– Different requirements of processing format vs. exchange format
Proposal to the LAF (2-1)
[Figure: LAF architecture (Ide & Romary), with an Intermediate Format between resource and LAF dump format]
Proposal to the LAF (2-2)
Intermediate Format (Common Document Format)
Disadvantages
– One more mapping step
Advantages
– Standards-based adaptation to proprietary formats
– Mapping to dump format tightly defined and targeted
– Common mapping tool, e.g. provided by the LAF
Example: Potential LAF dump format
“Jones followed him into the front room, closing the door behind him” (Ide & Romary 2001)
<struct id="s0" type="S">
  <struct id="s1" type="NP" xlink:href="xptr(substring(p/s[1]/text(),1,5))" rel="SBJ"/>
  <struct id="s2" type="VP" xlink:href="xptr(substring(p/s[1]/text(),7,8))"/>
  <struct id="s3" type="NP" xlink:href="xptr(substring(p/s[1]/text(),16,3))"/>
  <struct id="s4" type="PP" xlink:href="xptr(substring(p/s[1]/text(),20,4))" rel="DIR">
    <struct id="s5" type="NP" xlink:href="xptr(substring(p/s[1]/text(),25,14))"/>
  </struct>
  <struct id="s6" type="S" rel="ADV"> <!-- ... --> </struct>
</struct>
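The `substring(..., start, length)` references in this example use XPath conventions, where the first character has position 1 and the third argument is a length rather than an end position. A small Python check, given here only as an illustration (the helper `xpath_substring` is hypothetical and covers integer arguments only), resolves the offsets from the example against the sentence:

```python
def xpath_substring(text, start, length):
    """Mimic XPath substring() for integer arguments: 1-based start, given length."""
    return text[start - 1:start - 1 + length]

sentence = "Jones followed him into the front room, closing the door behind him"

# (start, length) pairs copied from the xlink:href values of s1 to s5.
spans = {"s1": (1, 5), "s2": (7, 8), "s3": (16, 3), "s4": (20, 4), "s5": (25, 14)}

resolved = {sid: xpath_substring(sentence, start, length)
            for sid, (start, length) in spans.items()}
```

Note that each `struct` points directly into the character data of the sentence, so the reference stays valid only as long as the primary text is left unchanged.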
Example: Possible xComForT Representation (1)
[Figure: the example in the xComForT storage format. Labels recoverable from the diagram:]
level 1: PTBraw.xml
level 2: token level, token.xml (substring links)
level 3: 1st linguistic level, segments: sentence.xml, chunk.xml (range-to links)
level 4: 2nd linguistic level, segInfo: chunk_relation.xml (1:1 (#id) links)
Example: Possible xComForT Representation (2)
chunk.xml
<segments level="ling1" type="chunk" xml:base="token.xml">
  <chunk id="div1.p1.chunk1" type="NP" xlink:href="#div1.p1.tok1"/>
  <chunk id="div1.p1.chunk2" type="VP" xlink:href="#div1.p1.tok2"/>
  <chunk id="div1.p1.chunk3" type="NP" xlink:href="#div1.p1.tok3"/>
  <chunk id="div1.p1.chunk4" type="PP" xlink:href="#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))"/>
  <chunk id="div1.p1.chunk5" type="NP" xlink:href="#xpointer(id('div1.p1.tok5')/range-to(id('div1.p1.tok7')))"/>
</segments>
chunk_relation.xml
<segInfo level="ling2" type="rel" xml:base="chunk.xml">
  <rel id="div1.p1.chunk1.rel" xlink:href="#div1.p1.chunk1">SBJ</rel>
  <rel id="div1.p1.chunk4.rel" xlink:href="#div1.p1.chunk4">DIR</rel>
</segInfo>
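In contrast to character offsets, the chunk stream refers to token ids, either to a single token (`#id`) or to a token range (`range-to`). The Python sketch below resolves both kinds of reference against a token list. It is an illustration under assumptions: the token ids `div1.p1.tok1` to `div1.p1.tok7` are assumed to stand for the first seven words of the sentence, and the regular expression only handles the two reference shapes used in the example, not XPointer in general.

```python
import re

# Assumed content of token.xml for the first seven tokens of the sentence.
TOKENS = ["Jones", "followed", "him", "into", "the", "front", "room"]
TOKEN_IDS = [f"div1.p1.tok{i}" for i in range(1, len(TOKENS) + 1)]

# Covers only "#xpointer(id('a')/range-to(id('b')))" as used in chunk.xml.
RANGE_RE = re.compile(
    r"#xpointer\(id\('([^']+)'\)/\s*range-to\(id\('([^']+)'\)\)\)"
)

def resolve(href):
    """Return the token text covered by a '#id' or range-to reference."""
    match = RANGE_RE.fullmatch(href)
    if match:
        first, last = match.group(1), match.group(2)
    else:
        first = last = href.lstrip("#")
    i, j = TOKEN_IDS.index(first), TOKEN_IDS.index(last)
    return " ".join(TOKENS[i:j + 1])

np_chunk = resolve("#div1.p1.tok1")
pp_chunk = resolve("#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))")
```

Because chunks point at token ids rather than at character positions, the chunk stream depends only on the token stream and not on how tokens were cut out of the base document.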
Summary
standards-based
– common tools available and usable
stand-off annotation
– easy plugging-in of linguistic annotation schema
easily extensible markup of primary document
– easy adaptation to arbitrary resource
Standard base format, e.g. to simplify support for mapping into the Linguistic Annotation Framework
xComForTable Mapping to the LAF
Thanks for your attention!
… Any questions?
Structural Markup improves Analysis
e.g. sentence boundary detection
plain text:
Then things would get even worse. (see also pages 4 and 11)
SHADOWS
By Leena Dhingra
I couldn’t possibly do that.
structural markup:
<p>[..]Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p></div>
<div><head>SHADOWS</head><byline>By Leena Dhingra</byline><p>I couldn’t possibly do that.</p>
tokenizer input: <p>-elements (without <rs>-elements), which gives correct sentence markup
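The effect can be reproduced with a deliberately naive sentence splitter in Python. The sketch is an assumption-laden illustration and not the IMS tokenizer: the `<body>` wrapper is added only to make the fragment well-formed, the elided text `[..]` is left out, and `split_sentences` simply breaks after `.`, `!` or `?` followed by whitespace.

```python
import re
import xml.etree.ElementTree as ET

PLAIN = """Then things would get even worse. (see also pages 4 and 11)
SHADOWS
By Leena Dhingra
I couldn’t possibly do that."""

MARKED_UP = """<body>
<div><p>Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p></div>
<div><head>SHADOWS</head><byline>By Leena Dhingra</byline><p>I couldn’t possibly do that.</p></div>
</body>"""

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def paragraph_text(p, skip=("rs",)):
    """Collect the text of a <p>, leaving out the content of skipped children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag not in skip:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts).strip()

# Without structure, the reference, headline and byline run into the next sentence.
plain_sentences = split_sentences(PLAIN)

# With structure, only <p> content (minus <rs>) reaches the splitter.
root = ET.fromstring(MARKED_UP)
sentences = [s for p in root.iter("p")
             for s in split_sentences(paragraph_text(p))]
```

On the plain text the splitter merges the cross-reference, the headline and the byline into the following sentence, whereas the paragraph-wise input yields exactly the two sentences of running text.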
Example – Discontinuous Material
CES
xComForT
<div id="d19990607_a1" type="article">
  <opener><!-- ... --></opener>
  <discontinuous id="d19990607_a1.discontinuous" type="rubbish">
    Die Gewinnzahlen
    Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
  </discontinuous>
  <closer><!-- ... --></closer>
</div>
plain text:
Die Gewinnzahlen
Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
element declaration:
<!ELEMENT discontinuous (#PCDATA)>
<!ATTLIST discontinuous
  id   ID                        #REQUIRED
  type (rubbish | editorial | ..) #IMPLIED>
Example – Meta Information
plain text:
Montag, 7. Juni 1999 NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7
CES
<opener><date>Montag, 7. Juni 1999</date> NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7</opener>
xComForT
<opener>
  <divMeta>
    <publDate>Montag, 7. Juni 1999</publDate>
    <cat target="ns1">NACHRICHTEN</cat>
    <distribution>M / F</distribution>
    <publBy>Süddeutsche Zeitung</publBy>
    <volNr>Nr. 127</volNr> / <pageNr>Seite 7</pageNr>
  </divMeta>
</opener>
(target attribute of <cat>: reference to taxonomy)