Embedding semantic annotations within texts: the FRETTA approach


Upload: silvio-peroni

Post on 22-Jun-2015


DESCRIPTION

In order to make semantic assertions about the text content of a document we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language and would be free of the classical constraints of an embedded markup language, having no limitations given by sequentiality, containment, or contiguity of text fragments. In the past years we developed EARMARK, our OWL proposal for expressing arbitrary semantic annotations about the structure and the text content of a document. In this paper we describe FRETTA, our mechanism for rendering arbitrary EARMARK annotations (including non-sequential, non-hierarchical and non-contiguous ones) in XML, bringing into a unifying framework half a dozen syntactic tricks used in the literature to handle overlapping structures in a strictly hierarchical language.

TRANSCRIPT

Page 1: Embedding semantic annotations within texts: the FRETTA approach

http://creativecommons.org/licenses/by-sa/3.0

Embedding semantic annotations within texts: the FRETTA approach

Gioele Barabucci, Silvio Peroni, Francesco Poggi, Fabio Vitali

Page 2: Embedding semantic annotations within texts: the FRETTA approach

Outline

• Conversion from an XML format into another

• Overlapping markup

• Abstract conversion framework

• FRETTA

• Evaluation

• Conclusions

Page 3: Embedding semantic annotations within texts: the FRETTA approach

Converting XML vocabularies that use syntactic workarounds

• The conversion of OpenOffice Writer documents (ODT) into Microsoft Word documents (DOCX) (and vice versa) is not a straightforward operation

• Converters exist and are included as core components of word processors

• Those converters do not implement mechanisms for a full and effective document conversion, especially when particular features are needed, e.g., information tracking document changes occurring over time

Page 4: Embedding semantic annotations within texts: the FRETTA approach

What happens to markup

OpenOffice (ODT):

<text:p>The beginning and the end.</text:p>

Microsoft Word (DOCX):

<w:p>
  <w:r><w:t>The beginning and the end.</w:t></w:r>
</w:p>

Microsoft Word (DOCX), with tracked changes:

<w:p>
  <w:pPr><w:rPr>
    <w:ins w:id="0" w:author="John Smith"
           w:date="2009-10-27T18:50:00Z"/>
  </w:rPr></w:pPr>
  <w:r><w:t>The beginning and </w:t></w:r>
</w:p>
<w:p>
  <w:ins w:id="1" w:author="John Smith"
         w:date="2009-10-27T18:50:00Z">
    <w:r><w:t>also </w:t></w:r>
  </w:ins>
  <w:r><w:t>the end.</w:t></w:r>
</w:p>

OpenOffice (ODT), with tracked changes:

<text:tracked-changes>
  <text:changed-region text:id="S1">
    <text:insertion>
      <office:change-info>
        <dc:creator>John Smith</dc:creator>
        <dc:date>2009-10-27T18:45:00</dc:date>
      </office:change-info>
    </text:insertion>
  </text:changed-region>
</text:tracked-changes>
[…]
<text:p>The beginning and
  <text:change-start text:change-id="S1"/></text:p>
<text:p>also
  <text:change-end text:change-id="S1"/> the end.</text:p>

Page 5: Embedding semantic annotations within texts: the FRETTA approach

Overlapping markup

• Overlapping markup is needed when different markup items refer to the same document fragment

Previous example in incorrect XML:

<p>The beginning and <ins></p>
<p>also </ins> the end</p>

XML formalisation via workarounds:

<p>The beginning and <ins start="foo"/></p>
<p>also <ins end="foo"/>the end</p>

• Different techniques to embed overlapping structures in XML hierarchies:

✦ milestones: a pair of empty elements representing the start and the end tags, connected to each other by special attributes

✦ fragmentation: elements separated within the primary hierarchy and connected to each other by special attributes

✦ twin documents: each hierarchy is represented by a different document which contains the same textual content

✦ stand-off: places overlapping elements in a separate resource (e.g. another file) specifying the position (down to the individual character) of each start and end location within the main structure
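As a minimal sketch of the stand-off idea, the Python snippet below stores each element as a pair of character offsets and reports which pairs overlap partially, that is, which ones cannot be nested in a single XML hierarchy. The half-open [begin, end) convention, the function name `overlaps` and the range names `p1`, `p2` and `ins` are assumptions made for illustration; they are not taken from any real stand-off format.

```python
from itertools import combinations


def overlaps(a, b):
    """True when two half-open ranges intersect but neither contains the other."""
    (a0, a1), (b0, b1) = a, b
    intersect = max(a0, b0) < min(a1, b1)
    a_in_b = b0 <= a0 and a1 <= b1
    b_in_a = a0 <= b0 and b1 <= a1
    return intersect and not a_in_b and not b_in_a


text = "The beginning and also the end"

# Hypothetical stand-off annotations: name -> (begin, end) character offsets.
standoff = {"p1": (0, 18), "p2": (18, 30), "ins": (14, 23)}

for (n1, r1), (n2, r2) in combinations(standoff.items(), 2):
    if overlaps(r1, r2):
        print(n1, "overlaps", n2)
```

Here `ins` covers "and also ", so it crosses the boundary between the two paragraphs and overlaps both of them, while the two paragraphs merely touch.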

Page 6: Embedding semantic annotations within texts: the FRETTA approach

Abstract conversion framework

XML document format 1: XML format 1 with overlapping workarounds (e.g., ODT + change tracking)

Step 1: Identification of XML overlapping workarounds and creation of a document with explicit overlap

EARMARK document format 1

Step 2: Syntactic and semantic conversion from format 1 into format 2

EARMARK document format 2

Step 3: Linearisation into an XML document with overlapping workarounds

XML document format 2: XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)

EARMARK is a non-XML markup metalanguage used as intermediate language for the conversion.

It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
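The first step of the framework (finding the workarounds and making the overlap explicit) can be sketched in Python as follows. The sketch assumes the milestone convention of the earlier example, with a `start` attribute on the opening milestone and an `end` attribute on the closing one. The function name `explicit_overlap` is hypothetical, and the result is a plain list of (name, begin, end) tuples rather than EARMARK's actual OWL model.

```python
import xml.etree.ElementTree as ET


def explicit_overlap(xml_source):
    """Return (text, ranges); each milestone pair becomes one explicit range."""
    root = ET.fromstring(xml_source)
    chunks, ranges, open_milestones = [], [], {}
    pos = 0

    def add_text(s):
        nonlocal pos
        if s:
            chunks.append(s)
            pos += len(s)

    def visit(elem):
        if "start" in elem.attrib:            # opening milestone
            open_milestones[elem.attrib["start"]] = pos
        elif "end" in elem.attrib:            # closing milestone
            begin = open_milestones.pop(elem.attrib["end"])
            ranges.append((elem.tag, begin, pos))
        else:                                 # ordinary element
            begin = pos
            add_text(elem.text)
            for child in elem:
                visit(child)
                add_text(child.tail)
            ranges.append((elem.tag, begin, pos))

    visit(root)
    return "".join(chunks), ranges


source = ('<doc><p>The beginning <ins start="foo"/>and </p>'
          '<p>also <ins end="foo"/>the end</p></doc>')
plain, found = explicit_overlap(source)
print(plain)
print(found)
```

In the result the `ins` range runs from offset 14 to 23 and therefore overlaps both `p` ranges, which is exactly the information that the milestone workaround hid inside the hierarchy.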

Today’s contribution: Step 3 (linearisation into an XML document with overlapping workarounds)

Page 7: Embedding semantic annotations within texts: the FRETTA approach

FRETTA

• FRETTA (From EARMARK To Tag) is a general and extensible Java framework for expressing EARMARK documents in an embedded XML syntax

• Users who want to convert from EARMARK into XML document formats must indicate which workarounds the target format uses

• FRETTA performs the requested conversion in four consecutive steps

EARMARK document → XML document

✦ workaround specification: the user specifies which workaround to use to represent an (EARMARK) overlapping element in XML

✦ structural conversion: purely structural conversion that produces a new EARMARK document in which overlapping elements are transformed appropriately according to the specified workarounds

✦ semantic conversion: semantic conversion that may change the current structure of the EARMARK document according to how the target XML format handles the specified workarounds

✦ linearisation: generation of the resulting XML tree with the requested workarounds
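The four steps can be illustrated with a small Python sketch. It is not FRETTA's Java API: the `Range` class, the function names, the `part` attribute and the generated identifiers are all hypothetical, the semantic step is reduced to a simple rename table, and the only container element considered is `p`.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Range:
    name: str          # element name
    begin: int         # inclusive character offset
    end: int           # exclusive character offset
    attrs: tuple = ()  # (key, value) pairs


def structural_conversion(ranges, spec, container="p"):
    """Rewrite each range according to the workaround chosen for it in spec."""
    boxes = [r for r in ranges if r.name == container]
    out = list(boxes)
    for n, r in enumerate(x for x in ranges if x.name != container):
        ident, how = f"r{n}", spec.get(r.name)
        if how == "milestones":       # two empty elements
            out.append(Range(r.name, r.begin, r.begin, (("start", ident),)))
            out.append(Range(r.name, r.end, r.end, (("end", ident),)))
        elif how == "fragmentation":  # one fragment per container crossed
            for b in boxes:
                lo, hi = max(r.begin, b.begin), min(r.end, b.end)
                if lo < hi:
                    out.append(Range(r.name, lo, hi, (("part", ident),)))
        else:                         # no workaround: assumed already nested
            out.append(r)
    return out


def semantic_conversion(ranges, rename):
    """Stand-in for the semantic step: rename elements for the target format."""
    return [Range(rename.get(r.name, r.name), r.begin, r.end, r.attrs)
            for r in ranges]


def linearise(text, ranges, root="doc"):
    """Serialise properly nested ranges as an XML string."""
    events = []
    for r in ranges:
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in r.attrs)
        if r.begin == r.end:
            events.append((r.begin, 1, 0, f"<{r.name}{attrs}/>"))
        else:
            # At equal offsets: inner closes first, then outer opens first.
            events.append((r.end, 0, -r.begin, f"</{r.name}>"))
            events.append((r.begin, 2, -r.end, f"<{r.name}{attrs}>"))
    out, last = [f"<{root}>"], 0
    for pos, _, _, tag in sorted(events):
        out.append(escape(text[last:pos]) + tag)
        last = pos
    out.append(escape(text[last:]) + f"</{root}>")
    return "".join(out)


text = "The beginning and also the end"
ranges = [Range("p", 0, 18), Range("p", 18, 30), Range("ins", 14, 23)]
for workaround in ("milestones", "fragmentation"):
    spec = {"ins": workaround}                    # workaround specification
    step2 = structural_conversion(ranges, spec)   # structural conversion
    step3 = semantic_conversion(step2, {})        # semantic conversion
    print(linearise(text, step3))                 # linearisation
```

Keeping the workaround choice in a separate `spec` mapping mirrors the idea that the user, not the converter, decides how each overlapping element is rendered in the target format.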

Page 8: Embedding semantic annotations within texts: the FRETTA approach

Evaluation

• Comparing FRETTA’s outputs against a set of twelve TEI documents (TEIDocs) written by markup experts

• The evaluation took into account four different principles

✦ well-formedness (WF): whether the framework returns well-formed XML documents

✦ validity (V): whether the framework returns valid XML documents according to the particular target XML vocabulary

✦ naturalness (N): how much the XML documents returned by the framework are structurally similar to TEIDocs

✦ minimality (M): how much the number of nodes (i.e., elements, attributes and text nodes) in the XML documents returned by the framework differs from that of TEIDocs
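Two of these principles are easy to approximate with Python's standard library; the sketch below is only an illustration and not the evaluation code used by the authors. `is_well_formed` and `count_nodes` are hypothetical helper names, whitespace-only text is counted as a text node, and validity is left out because it requires the schema of the target vocabulary.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_source):
    """Well-formedness: the document parses as XML."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


def count_nodes(xml_source):
    """Count elements, attributes and non-empty text nodes."""
    total = 0
    for elem in ET.fromstring(xml_source).iter():
        total += 1 + len(elem.attrib)                # element + its attributes
        total += bool(elem.text) + bool(elem.tail)   # text before/after children
    return total


broken = "<p>The beginning and <ins></p><p>also </ins> the end</p>"
fixed = '<p>The beginning <ins start="foo"/>and </p>'
print(is_well_formed(broken), is_well_formed(fixed), count_nodes(fixed))
```

A minimality score could then compare `count_nodes` for a generated document with the count for the corresponding expert-written document; how the two counts are compared is not specified on the slide.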

100% of the documents are well-formed and valid

67% remain natural (N) with respect to TEIDocs

83% remain minimal (M) with respect to TEIDocs

document          workarounds     WF  V   N   M
agrippine         fragmentation   ✓   ✓   ✓   ✓
agrippine         milestones      ✓   ✓   ✓   ✓
drivemycar        fragmentation   ✓   ✓   X   X
johnlovesmary     fragmentation   ✓   ✓   ✓   ✓
johnlovesmary     milestones      ✓   ✓   ✓   ✓
peergynt          fragmentation   ✓   ✓   ✓   ✓
peergynt          milestones      ✓   ✓   ✓   ✓
peterpaulhammer   milestones      ✓   ✓   ✓   ✓
thoughtalice      fragmentation   ✓   ✓   ✓   ✓
titwillow         fragmentation   ✓   ✓   X   ✓
titwillow         fragmentation   ✓   ✓   X   X
titwillow         milestones      ✓   ✓   X   ✓
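The summary percentages follow directly from the table. The short Python check below transcribes the twelve rows (1 for ✓, 0 for X) and recomputes the share of passing rows per column; the names `ROWS`, `COLUMNS` and `pass_rate` are illustrative.

```python
# Rows transcribed from the evaluation table: (document, workaround, WF, V, N, M).
ROWS = [
    ("agrippine", "fragmentation", 1, 1, 1, 1),
    ("agrippine", "milestones", 1, 1, 1, 1),
    ("drivemycar", "fragmentation", 1, 1, 0, 0),
    ("johnlovesmary", "fragmentation", 1, 1, 1, 1),
    ("johnlovesmary", "milestones", 1, 1, 1, 1),
    ("peergynt", "fragmentation", 1, 1, 1, 1),
    ("peergynt", "milestones", 1, 1, 1, 1),
    ("peterpaulhammer", "milestones", 1, 1, 1, 1),
    ("thoughtalice", "fragmentation", 1, 1, 1, 1),
    ("titwillow", "fragmentation", 1, 1, 0, 1),
    ("titwillow", "fragmentation", 1, 1, 0, 0),
    ("titwillow", "milestones", 1, 1, 0, 1),
]
COLUMNS = {"WF": 2, "V": 3, "N": 4, "M": 5}


def pass_rate(rows, column):
    """Percentage of rows with a check mark in the given column, rounded."""
    idx = COLUMNS[column]
    return round(100 * sum(r[idx] for r in rows) / len(rows))


for name in COLUMNS:
    print(name, f"{pass_rate(ROWS, name)}%")
```

Eight of the twelve rows pass naturalness and ten pass minimality, which gives the 67% and 83% reported above.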

Page 9: Embedding semantic annotations within texts: the FRETTA approach

Conclusions

• Converting XML documents with overlaps expressed via XML workarounds is not a straightforward task

• We propose an abstract framework to address this issue, composed of three consecutive steps

• FRETTA implements the third step of the conversion framework. It enables one to convert any EARMARK document (that allows multiple overlapping hierarchies at the same time) into one or more embedded XML markup structures

• Future work:

✦ developing algorithms that autonomously select the workarounds to adopt in the conversions

✦ integrating FRETTA in the broader framework for the semi-automatic and round-trip conversion from any supported XML format into another

Page 10: Embedding semantic annotations within texts: the FRETTA approach

Thanks for your attention