Embedding semantic annotations within texts: the FRETTA approach
DESCRIPTION
To make semantic assertions about the text content of a document, we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language, but would be free of the classical constraints of an embedded markup language: it would have no limitations imposed by sequentiality, containment, or contiguity of text fragments. In past years we developed EARMARK, our OWL proposal for expressing arbitrary semantic annotations about the structure and the text content of a document. In this paper we describe FRETTA, our mechanism for rendering arbitrary EARMARK annotations (including non-sequential, non-hierarchical and non-contiguous ones) in XML, bringing into a unifying framework half a dozen syntactic tricks used in the literature to handle overlapping structures in a strictly hierarchical language.
TRANSCRIPT
http://creativecommons.org/licenses/by-sa/3.0
Embedding semantic annotations within texts: the FRETTA approach
Gioele Barabucci
Silvio Peroni
Francesco Poggi
Fabio Vitali
Outline
• Conversion from an XML format into another
• Overlapping markup
• Abstract conversion framework
• FRETTA
• Evaluation
• Conclusions
Converting XML vocabularies that use syntactic workarounds
• The conversion of OpenOffice Writer documents (ODT) into Microsoft Word documents (DOCX) (and vice versa) is not a straightforward operation
• Converters exist and are included as core components of word processors
• These converters do not convert documents fully and faithfully, especially when particular features are involved, e.g., change-tracking information that records the edits made to a document over time
What happens to markup
OpenOffice (ODT):
<text:p>The beginning and the end.</text:p>

Microsoft Word (DOCX):
<w:p>
  <w:r>
    <w:t>The beginning and the end.</w:t>
  </w:r>
</w:p>
Microsoft Word (DOCX), with a tracked insertion:
<w:p>
  <w:pPr><w:rPr>
    <w:ins w:id="0" w:author="John Smith"
           w:date="2009-10-27T18:50:00Z"/>
  </w:rPr></w:pPr>
  <w:r><w:t>The beginning and </w:t></w:r>
</w:p>
<w:p>
  <w:ins w:id="1" w:author="John Smith"
         w:date="2009-10-27T18:50:00Z">
    <w:r><w:t>also </w:t></w:r>
  </w:ins>
  <w:r><w:t>the end.</w:t></w:r>
</w:p>

OpenOffice (ODT), with a tracked insertion:
<text:tracked-changes>
  <text:changed-region text:id="S1">
    <text:insertion><office:change-info>
      <dc:creator>John Smith</dc:creator>
      <dc:date>2009-10-27T18:45:00</dc:date>
    </office:change-info></text:insertion>
  </text:changed-region>
</text:tracked-changes>
[…]
<text:p>The beginning and
  <text:change-start text:change-id="S1"/></text:p>
<text:p>also
  <text:change-end text:change-id="S1"/> the end.</text:p>
Overlapping markup
• Overlapping markup is needed when different markup items refer to the same document fragment
Previous example in incorrect XML:
<p>The beginning and <ins></p><p>also </ins> the end</p>
XML formalisation via workarounds:
<p>The beginning and <ins start="foo"/></p><p>also <ins end="foo"/>the end</p>
• Different techniques to embed overlapping structures in XML hierarchies:
✦ milestones: a pair of empty elements representing the start and the end tags, connected to each other by special attributes
✦ fragmentation: elements separated within the primary hierarchy and connected to each other by special attributes
✦ twin documents: each hierarchy is represented by a different document which contains the same textual content
✦ stand-off: places overlapping elements in a separate resource (e.g. another file) specifying the position (down to the individual character) of each start and end location within the main structure
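The milestone technique can be illustrated with a minimal Python sketch. It is not part of FRETTA: the function name `milestones`, the character-offset representation of ranges, and the `start`/`end` attribute names (taken from the workaround example on this slide) are assumptions made for illustration only.

```python
from xml.sax.saxutils import escape


def milestones(text, paragraphs, overlay):
    """Serialise `text` as <p> elements, marking `overlay` with empty milestones.

    paragraphs: contiguous (start, end) character ranges of the primary hierarchy.
    overlay:    (element name, identifier, start offset, end offset); the range
                may cross paragraph boundaries.
    """
    name, ident, o_start, o_end = overlay
    pending = [(o_start, "start"), (o_end, "end")]
    out = []
    for p_start, p_end in paragraphs:
        out.append("<p>")
        cursor = p_start
        # Emit every milestone whose offset falls inside this paragraph.
        while pending and p_start <= pending[0][0] <= p_end:
            pos, kind = pending.pop(0)
            out.append(escape(text[cursor:pos]))
            out.append(f'<{name} {kind}="{ident}"/>')
            cursor = pos
        out.append(escape(text[cursor:p_end]))
        out.append("</p>")
    return "".join(out)


text = "The beginning and also the end"
print(milestones(text, [(0, 18), (18, 30)], ("ins", "foo", 18, 23)))
```

With the offsets above, the insertion starts at the end of the first paragraph and ends after "also ", so the output reproduces the milestone serialisation shown on this slide.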
Abstract conversion framework
XML document format 1: XML format 1 with overlapping workarounds (e.g., ODT + change tracking)
Step 1: Identification of XML overlapping workarounds and creation of a document with explicit overlap
EARMARK document format 1
Step 2: Syntactic and semantic conversion from format 1 into format 2
EARMARK document format 2
Step 3: Linearisation into an XML document with overlapping workarounds
XML document format 2: XML format 2 with overlapping workarounds (e.g., DOCX + change tracking)
EARMARK is a non-XML markup metalanguage used as intermediate language for the conversion.
It allows markup structures to be organized both as trees and as generic graphs with no particular limitations.
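As a rough illustration of Step 1 (identification of workarounds and creation of a document with explicit overlap), the following Python sketch recovers explicit character ranges from milestone pairs. It does not use EARMARK or its API; the function name `explicit_ranges`, the wrapper element `<doc>`, and the `start`/`end` attribute convention are assumptions borrowed from the earlier milestone example.

```python
import xml.etree.ElementTree as ET


def explicit_ranges(xml_string):
    """Return (plain text, ranges) where each range is (tag, id, start, end).

    A milestone pair is assumed to be two empty elements with the same tag,
    one carrying start="ID" and the other end="ID".
    """
    root = ET.fromstring(xml_string)
    parts = []      # text fragments in document order
    offset = 0      # number of characters seen so far
    open_at = {}    # (tag, id) -> offset of the start milestone
    ranges = []

    def visit(element):
        nonlocal offset
        if "start" in element.attrib:
            open_at[(element.tag, element.get("start"))] = offset
        elif "end" in element.attrib:
            key = (element.tag, element.get("end"))
            ranges.append((key[0], key[1], open_at.pop(key), offset))
        if element.text:
            parts.append(element.text)
            offset += len(element.text)
        for child in element:
            visit(child)
            if child.tail:
                parts.append(child.tail)
                offset += len(child.tail)

    visit(root)
    return "".join(parts), ranges


source = ('<doc><p>The beginning and <ins start="foo"/></p>'
          '<p>also <ins end="foo"/>the end</p></doc>')
print(explicit_ranges(source))
```

The result is the plain text plus a list of ranges over it, which is the kind of explicit-overlap information a graph-based format can hold without any hierarchy constraint.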
Today’s contribution
FRETTA
• FRETTA (From EARMARK To Tag) is a general and extensible Java framework for expressing EARMARK documents in an embedded XML syntax
• Users that want to convert from EARMARK into XML document formats must indicate which workarounds are used in a certain target format
• FRETTA performs the requested conversion in four consecutive steps
EARMARK document → XML document
✦ workaround specification: the user specifies which workaround to use to represent an (EARMARK) overlapping element in XML
✦ structural conversion: purely structural conversion that produces a new EARMARK document in which overlapping elements are transformed according to the specified workarounds
✦ semantic conversion: conversion that may change the current structure of the EARMARK document according to how the target XML format handles the specified workarounds
✦ linearisation: generation of the resulting XML tree with the requested workarounds
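To give a feel for what a workaround-driven transformation followed by linearisation might produce, here is a minimal Python sketch of the fragmentation workaround. It is not FRETTA's Java implementation: the function name `fragment`, the offset-based ranges, and the linking attributes `id`, `prev` and `next` are all illustrative assumptions.

```python
from xml.sax.saxutils import escape


def fragment(text, paragraphs, overlay):
    """Split `overlay` into one element per paragraph it intersects.

    The fragments are chained with hypothetical id/prev/next attributes so
    that the original overlapping element can be reassembled.
    """
    name, ident, o_start, o_end = overlay
    # Intersection of the overlay with each paragraph (None when empty).
    pieces = []
    for p_start, p_end in paragraphs:
        start, end = max(p_start, o_start), min(p_end, o_end)
        pieces.append((start, end) if start < end else None)
    total = sum(1 for piece in pieces if piece is not None)

    out = []
    index = 0
    for (p_start, p_end), piece in zip(paragraphs, pieces):
        out.append("<p>")
        if piece is None:
            out.append(escape(text[p_start:p_end]))
        else:
            index += 1
            start, end = piece
            attrs = f' id="{ident}-{index}"'
            if index > 1:
                attrs += f' prev="{ident}-{index - 1}"'
            if index < total:
                attrs += f' next="{ident}-{index + 1}"'
            out.append(escape(text[p_start:start]))
            out.append(f"<{name}{attrs}>{escape(text[start:end])}</{name}>")
            out.append(escape(text[end:p_end]))
        out.append("</p>")
    return "".join(out)


text = "The beginning and also the end"
print(fragment(text, [(0, 18), (18, 30)], ("ins", "foo", 14, 23)))
```

Here the overlay covers "and also ", which crosses the paragraph boundary, so it is split into two linked `ins` fragments, one per paragraph.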
Evaluation
• Comparing FRETTA’s outputs against a set of twelve TEI documents (TEIDocs) written by markup experts
• The evaluation took into account four different principles
✦ well-formedness (WF): whether the framework returns well-formed XML documents
✦ validity (V): whether the framework returns valid XML documents according to the particular target XML vocabulary
✦ naturalness (N): how structurally similar the XML documents returned by the framework are to TEIDocs
✦ minimality (M): how much the number of nodes (i.e., elements, attributes and text nodes) in the XML documents returned by the framework differs from TEIDocs
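Two of these principles lend themselves to a simple mechanical check. The Python sketch below is only an assumption about how such checks could be written, not the procedure used in the evaluation: `is_well_formed` relies on the standard-library parser, and `count_nodes` counts elements, attributes and non-whitespace text nodes (ignoring whitespace-only text is a choice made here).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_string):
    """WF: the string parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


def count_nodes(xml_string):
    """M: number of elements, attributes and non-whitespace text nodes."""
    total = 0
    for element in ET.fromstring(xml_string).iter():
        total += 1 + len(element.attrib)
        if element.text and element.text.strip():
            total += 1
        if element.tail and element.tail.strip():
            total += 1
    return total


overlapping = "<p>The beginning and <ins></p><p>also </ins> the end</p>"
workaround = ('<doc><p>The beginning and <ins start="foo"/></p>'
              '<p>also <ins end="foo"/>the end</p></doc>')
print(is_well_formed(overlapping), is_well_formed(workaround))
print(count_nodes(workaround))
```

A minimality comparison would then contrast `count_nodes` of the generated document with that of the corresponding expert-written document. Validity (V) needs a schema-aware validator, which is outside the scope of this sketch.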
100% well-formed and valid documents
67% remain natural (N) with respect to TEIDocs
83% remain minimal (M) with respect to TEIDocs
document         workarounds    WF  V   N   M
agrippine        fragmentation  ✓   ✓   ✓   ✓
agrippine        milestones     ✓   ✓   ✓   ✓
drivemycar       fragmentation  ✓   ✓   X   X
johnlovesmary    fragmentation  ✓   ✓   ✓   ✓
johnlovesmary    milestones     ✓   ✓   ✓   ✓
peergynt         fragmentation  ✓   ✓   ✓   ✓
peergynt         milestones     ✓   ✓   ✓   ✓
peterpaulhammer  milestones     ✓   ✓   ✓   ✓
thoughtalice     fragmentation  ✓   ✓   ✓   ✓
titwillow        fragmentation  ✓   ✓   X   ✓
titwillow        fragmentation  ✓   ✓   X   X
titwillow        milestones     ✓   ✓   X   ✓
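The summary percentages can be recomputed from the twelve rows of the table. The short Python tally below simply transcribes the table (True for ✓, False for X); the variable and function names are illustrative.

```python
# One tuple per table row: (document, workaround, WF, V, N, M).
ROWS = [
    ("agrippine", "fragmentation", True, True, True, True),
    ("agrippine", "milestones", True, True, True, True),
    ("drivemycar", "fragmentation", True, True, False, False),
    ("johnlovesmary", "fragmentation", True, True, True, True),
    ("johnlovesmary", "milestones", True, True, True, True),
    ("peergynt", "fragmentation", True, True, True, True),
    ("peergynt", "milestones", True, True, True, True),
    ("peterpaulhammer", "milestones", True, True, True, True),
    ("thoughtalice", "fragmentation", True, True, True, True),
    ("titwillow", "fragmentation", True, True, False, True),
    ("titwillow", "fragmentation", True, True, False, False),
    ("titwillow", "milestones", True, True, False, True),
]


def percentage(column):
    """Share of rows (rounded to a whole percent) that pass the given column."""
    passed = sum(1 for row in ROWS if row[column])
    return round(100 * passed / len(ROWS))


for label, column in (("WF", 2), ("V", 3), ("N", 4), ("M", 5)):
    print(label, percentage(column))
```

Eight of the twelve rows pass N and ten pass M, which gives the 67% and 83% reported in the summary.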
Conclusions
• Converting XML documents with overlaps expressed via XML workarounds is not a straightforward task
• We propose an abstract framework to address this issue, composed of three consecutive steps
• FRETTA implements the third step of the conversion framework. It enables one to convert any EARMARK document (that allows multiple overlapping hierarchies at the same time) into one or more embedded XML markup structures
• Future work:
✦ developing algorithms that autonomously select the workarounds to adopt in the conversions
✦ integrating FRETTA in the broader framework for the semi-automatic and round-trip conversion from any supported XML format into another
Thanks for your attention