Dealing with Markup Semantics
My paper presentation at i-Semantics 2011, Graz, Austria.
http://creativecommons.org/licenses/by-sa/3.0
Dealing with Markup Semantics
Silvio Peroni
Aldo Gangemi
Fabio Vitali
Summary
• Semantic markup vs. markup semantics
• Why markup semantics
• Why XML is not enough
• Markup semantics with EARMARK and Linguistic Act
• Real-world scenarios
• Conclusions
Shift of meaning
• 1990, Web of documents: First Era of the Web (WWW)
✦ Markup as document markup: it tells us something about the text or content of a document
✦ Tag as markup element: a syntactic item representing the building block of a document structure
• Today, Web of data: Second Era of the Web (Semantic Web)
✦ Markup as resource markup: it is used to identify any data added to a resource with the intention to semantically describe it
✦ Tag as keyword: a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, digital image or computer file)
Semantics and Markup
markup semantics
“what is the meaning of a markup element title contained in a document d?”
semantic markup
“the resource r has the string Dealing with Markup Semantics as title”
Markup semantics today
• Document markup is still here:
✦ many research issues are still open problems
✦ some of the partially solved issues can be addressed better with today's tools and technologies
• So, our question is:
Why has the Semantic Web not yet properly addressed markup semantics?
Possible answers:
✦ Because document markup is dead, really
✦ Because markup semantics is not an interesting research topic
✦ Because markup semantics is not a useful tool for solving valuable problems
✦ Actually, the Semantic Web has addressed markup semantics
The document markup is dead... wait, really?
• Document markup does not play any important role in today's research fields or company interests
Are we definitely sure?
Maybe not!
Research groups’ interest in markup semantics
• Does this mean that no research communities are interested in this issue? Well, actually, it is an old and still-live issue:
✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M. (2002). Towards a Semantics for XML Markup.
✦ Dubin, D. (2003). Object mapping for markup semantics.
✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M., Huitfeldt, C. (2003). XML Semantics and Digital Libraries.
✦ Simons, G. F., Lewis, W. D., Farrar, S. O., Langendoen, D. T., Fitzsimons, B., Gonzalez, H. (2004). The semantics of markup: mapping legacy markup schemas to a common semantics.
✦ Garcia, R., Celma, O. (2005). Semantic Integration and Retrieval of Multimedia Metadata.
✦ Marcoux, Y. (2006). A natural-language approach to modeling: Why is some XML so difficult to write?
✦ Van Deursen, D., Poppe, C., Martens, G., Mannens, E., Van de Walle, R. (2008). XML to RDF Conversion: a Generic Approach.
✦ Marcoux, Y., Rizkallah, E. (2009). Intertextual semantics: A semantics for information design.
✦ Sperberg-McQueen, C. M., Marcoux, Y., Huitfeldt, C. (2009). Two representations of the semantics of TEI Lite.
✦ Nuzzolese, A., Gangemi, A., Presutti, V. (2010). Gathering Lexical Linked Data and Knowledge Patterns from FrameNet.
• “The problem addressed seems old and seems to have been solved before, but actually has not [sufficiently]” (an anonymous reviewer)
Markup semantics and real-world problems
• Some advantages of having a formal and machine-readable semantics of markup:
✦ perform both syntactic and semantic validation
✦ infer facts from documents automatically
✦ simplify the federation, conversion and translation of documents among digital repositories
✦ query the structure of a document by considering its semantics
✦ create visualisations of documents that consider the semantics of their structures rather than their markup vocabularies
✦ increase the accessibility of documents’ content (see the “tag abuse” issue)
✦ guarantee better maintainability when a markup schema evolves
• Fields of interest: digital libraries and digital (and semantic) publishing
Semantic Web approaching markup semantics
• RDFa may be a valid choice for associating formal semantics with arbitrary text fragments
✦ Pros: easy to use and parse, compliant with XML-like formats
✦ Cons: we need to modify the structure of the document (more attributes, more elements)
• There are domains (e.g., those having to deal with administrative and juridical documents) in which we cannot modify the structure of documents
• How can we say that the element p in the document means “paragraph”?
Original document (1 markup element only):
<?xml version="1.0" encoding="UTF-8"?>
<p>Fabio says that overlhappens</p>
After RDFa enhancing (2 markup elements, 3 attributes):
<?xml version="1.0" encoding="UTF-8"?>
<p prefix=": http://www.example.com/ foaf: http://xmlns.com/foaf/0.1/">
  <span about=":fv" property="foaf:firstName">Fabio</span> says that overlhappens
</p>
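The structural cost of the RDFa version can be counted directly. The following is a minimal sketch using only Python's standard library; the helper name `footprint` is an illustrative assumption, and the two XML strings are the slide's own examples.

```python
import xml.etree.ElementTree as ET

# The plain paragraph from the slide.
PLAIN = "<p>Fabio says that overlhappens</p>"

# The RDFa-enhanced paragraph from the slide (prefix IRIs are the slide's examples).
RDFA = (
    '<p prefix=": http://www.example.com/ foaf: http://xmlns.com/foaf/0.1/">'
    '<span about=":fv" property="foaf:firstName">Fabio</span>'
    " says that overlhappens</p>"
)


def footprint(xml_text):
    """Return (number of elements, number of attributes) found in an XML string."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())  # includes the root element itself
    attributes = sum(len(el.attrib) for el in elements)
    return len(elements), attributes


def plain_text(xml_text):
    """Return the text content with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())


print(footprint(PLAIN))  # (1, 0)
print(footprint(RDFA))   # (2, 3)
```

The text content is identical in both versions; only the markup structure grows, which is exactly what some domains cannot accept.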
Our problems in addressing markup semantics
• Let’s use XML for defining document markup structures
✦ Pros: it is today's common format, used in lots of tools and applications
✦ Cons: it does not define a formal way of specifying markup semantics
• Let’s use OWL for defining formal semantics and then associating it with XML markup
✦ Pros: OWL was created to define semantics
✦ Cons: we have to use XML-based approaches (RDFa, GRDDL) to link semantics to XML markup, and this is not always possible
• A compromise between XML and OWL is not fully satisfying
• A solution: to elevate either the document markup formalism or the formal semantics model to the level of the other, which means:
✦ to use XML for document markup and another formalism, fully compliant with XML in all the possible scenarios, for defining its markup semantics (does it exist?), or
✦ to develop an OWL ontology for defining document markup and another OWL ontology for specifying its semantics
try to guess what we did
• The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup
✦ More expressive than XML: it allows markup structures to be organised as graphs
✦ It makes it easy to associate OWL semantics with document items: an EARMARK document is a set of OWL assertions, and all the markup items and text nodes are individuals of particular classes
✦ Lots of tools available: a Java API, and frameworks to convert XML documents into EARMARK ones and to convert complex EARMARK documents (i.e., those having a graph structure) into XML ones, applying overlapping tricks to store as much information as possible in the simple XML tree hierarchy
more information at http://palindrom.es/phd/research/earmark
An example: XML tricks
[Figure: the text “Fabio says that overlhappens” inside a p element; agent spans “Fabio”, noun spans “overlap” and verb spans “happens”]
This is not directly representable in XML (unless using tricks): “noun” and “verb” overlap
To be representable in XML it should be...
[Figure: the same text, with noun split into two fragments, “overl” and “ap”, the second nested inside verb]
XML serialisation with TEI fragmentation:
<p>
  <agent>Fabio</agent> says that
  <noun xml:id="e1" next="e2">overl</noun><verb>h<noun xml:id="e2">ap</noun>pens</verb>
</p>
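A consumer of the fragmented serialisation has to stitch the noun back together by following the next pointers. The following is a minimal sketch of that step using Python's standard library; the function name `rejoin` is an illustrative assumption, and it presumes every fragment carries an xml:id, as in the slide's example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under its expanded (namespace-qualified) name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

FRAGMENTED = (
    "<p><agent>Fabio</agent> says that "
    '<noun xml:id="e1" next="e2">overl</noun>'
    '<verb>h<noun xml:id="e2">ap</noun>pens</verb></p>'
)


def rejoin(root, tag):
    """Rebuild the logical text of each fragmented `tag` element via its next pointers."""
    by_id = {el.get(XML_ID): el for el in root.iter(tag)}
    # A fragment that is the target of some next pointer is a continuation, not a start.
    continuations = {el.get("next") for el in by_id.values() if el.get("next")}
    texts = []
    for frag_id, el in by_id.items():
        if frag_id in continuations:
            continue
        parts = []
        while el is not None:
            parts.append("".join(el.itertext()))
            el = by_id.get(el.get("next"))  # None when the chain ends
        texts.append("".join(parts))
    return texts


root = ET.fromstring(FRAGMENTED)
print(rejoin(root, "noun"))                   # ['overlap']
print("".join(root.find("verb").itertext()))  # happens
```

The noun only becomes “overlap” again after this extra pass, which illustrates why the tree serialisation counts as a trick rather than a direct representation.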
An example: EARMARK document
ex:doc a :StringDocuverse; :hasContent "Fabio says that overlhappens".
ex:r0-5 a :PointerRange; :refersTo ex:doc; :begins "0"; :ends "5".
ex:r5-16 a :PointerRange; :refersTo ex:doc; :begins "5"; :ends "16".
ex:r16-21 a :PointerRange; :refersTo ex:doc; :begins "16"; :ends "21".
ex:r22-24 a :PointerRange; :refersTo ex:doc; :begins "22"; :ends "24".
ex:r21-28 a :PointerRange; :refersTo ex:doc; :begins "21"; :ends "28".
ex:agent a :Element; :hasGeneralIdentifier "agent"; c:firstItem [c:itemContent ex:r0-5].
ex:noun a :Element; :hasGeneralIdentifier "noun"; c:firstItem [c:itemContent ex:r16-21; c:nextItem [c:itemContent ex:r22-24]].
ex:verb a :Element; :hasGeneralIdentifier "verb"; c:firstItem [c:itemContent ex:r21-28].
ex:p a :Element; :hasGeneralIdentifier "p"; c:firstItem [c:itemContent ex:agent; c:nextItem [c:itemContent ex:r5-16; c:nextItem [c:itemContent ex:noun; c:nextItem [c:itemContent ex:verb]]]].
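The pointer ranges in the Turtle above can be resolved against the docuverse string to see why the structure is a graph rather than a tree. The following is a minimal sketch in plain Python, not the EARMARK Java API; the dictionary layout and the function names are illustrative assumptions.

```python
DOC = "Fabio says that overlhappens"  # content of ex:doc

# Each element lists the (begin, end) pointer ranges it covers, in order.
ELEMENTS = {
    "agent": [(0, 5)],
    "noun": [(16, 21), (22, 24)],
    "verb": [(21, 28)],
}


def text_of(name):
    """Concatenate the text selected by an element's ranges."""
    return "".join(DOC[begin:end] for begin, end in ELEMENTS[name])


def extent(name):
    """Smallest single span that covers all the ranges of an element."""
    ranges = ELEMENTS[name]
    return min(b for b, _ in ranges), max(e for _, e in ranges)


def overlaps(a, b):
    """True when two extents intersect but neither contains the other."""
    (a0, a1), (b0, b1) = extent(a), extent(b)
    intersect = a0 < b1 and b0 < a1
    nested = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
    return intersect and not nested


print(text_of("noun"), text_of("verb"))  # overlap happens
print(overlaps("noun", "verb"))          # True
```

Because noun and verb intersect without nesting, no single XML tree can hold both, while the range-based model stores them side by side.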
Towards markup semantics
• EARMARK is suitable for expressing markup semantics straightforwardly using OWL
• What model can we use? It must:
✦ follow precise and theoretically founded principles
✦ be interoperable across different markup vocabularies
• A large number of vocabularies address the representation of terms vs. meanings vs. things, e.g., SKOS, FRBR, CIDOC, OWL-WordNet
Problems:
✦ they are too specific for particular contexts
✦ they are not interoperable
Linguistic Act ontology design pattern
• References: any individual from the world we are describing – e.g., Fabio
• Meanings: any (meta-level) object that explains something – e.g., person
• Information entities: any symbol that has a meaning or denotes one or more references – e.g., the string “Fabio”
• Linguistic acts: any communicative situation including information entities, agents, meanings, references, and a possible spatio-temporal context – e.g., to add markup to a document
http://ontologydesignpatterns.org/cp/owl/semantics.owl
Example: “Results” section of a paper
2 XML excerpts of “Results” sections:
<div class="section">
  <h1>Results</h1>
  <p>...</p>
</div>

<section>
  <info>
    <title>Results</title>
  </info>
  <para>...</para>
</section>

Related EARMARK conversions:
ex1:div a :Element;
  :hasGeneralIdentifier "div";
  c:firstItem [c:itemContent ex1:class;
    c:nextItem [c:itemContent ex1:h1;
      c:nextItem [c:itemContent ex1:p]]];
  la:expresses doco:Section, deo:Results.
...
ex1:p a :Element;
  :hasGeneralIdentifier "p";
  c:firstItem [c:itemContent ex1:someText];
  la:expresses doco:Paragraph.
...

ex2:section a :Element;
  :hasGeneralIdentifier "section";
  c:firstItem [c:itemContent ex2:info;
    c:nextItem [c:itemContent ex2:para]];
  la:expresses doco:Section, deo:Results.
...
ex2:para a :Element;
  :hasGeneralIdentifier "para";
  c:firstItem [c:itemContent ex2:someText];
  la:expresses doco:Paragraph.
...

We are using the Document Components Ontology (http://purl.org/spar/doco) and the Discourse Elements Ontology (http://purl.org/spar/deo) to specify the semantics of markup elements
Searches on heterogeneous repositories
• Problem: how to search across a large number of digital libraries that store documents as XML in different, non-interoperable formats?
• Query: give me all the markup elements that represent paragraphs of any “Results” section of any available document written by any person called Fabio
SELECT ?x WHERE {
  ?x a :Element ;
     la:expresses doco:Paragraph ;
     dc:creator [ a foaf:Person ; foaf:name "Fabio" ] ;
     (^c:itemContent/^c:item)+ [ a :Element ; la:expresses doco:Section , deo:Results ]
}
ex1:p and ex2:para are returned
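The logic of the query can be mimicked on a toy in-memory model, without a SPARQL engine. The following is a minimal sketch in plain Python: the dictionary layout, the "creator" field attached to each paragraph, and the ex3 distractor document are assumptions made for illustration and do not appear in the slides.

```python
# Toy stand-in for converted EARMARK documents: each element records the concepts
# it expresses, its parent element and, hypothetically, the name of its creator.
ELEMENTS = {
    "ex1:div": {"expresses": {"doco:Section", "deo:Results"}, "parent": None},
    "ex1:h1": {"expresses": set(), "parent": "ex1:div"},
    "ex1:p": {"expresses": {"doco:Paragraph"}, "parent": "ex1:div", "creator": "Fabio"},
    "ex2:section": {"expresses": {"doco:Section", "deo:Results"}, "parent": None},
    "ex2:info": {"expresses": set(), "parent": "ex2:section"},
    "ex2:para": {"expresses": {"doco:Paragraph"}, "parent": "ex2:section", "creator": "Fabio"},
    # Distractor: a paragraph by Fabio in a section that is not a Results section.
    "ex3:div": {"expresses": {"doco:Section"}, "parent": None},
    "ex3:p": {"expresses": {"doco:Paragraph"}, "parent": "ex3:div", "creator": "Fabio"},
}


def ancestors(name):
    """Yield every enclosing element, nearest first (the role of the + property path)."""
    parent = ELEMENTS[name]["parent"]
    while parent is not None:
        yield parent
        parent = ELEMENTS[parent]["parent"]


def results_paragraphs_by(creator):
    """Paragraph elements by `creator` that sit inside some Results section."""
    wanted = {"doco:Section", "deo:Results"}
    return sorted(
        name
        for name, el in ELEMENTS.items()
        if "doco:Paragraph" in el["expresses"]
        and el.get("creator") == creator
        and any(wanted <= ELEMENTS[a]["expresses"] for a in ancestors(name))
    )


print(results_paragraphs_by("Fabio"))  # ['ex1:p', 'ex2:para']
```

The selection never looks at the element names p or para, only at what the elements express, which is what lets one query span both vocabularies.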
Semantic format conversion
• Problem: how to convert a document from an (unknown) format into a target one, without knowing the markup vocabulary of the former but being able to query its semantics
• Convert: substitute any markup element representing a section with a new one named “sec” that contains the same elements and text content as the removed one
DELETE { ?s :hasGeneralIdentifier ?gi }
INSERT { ?s :hasGeneralIdentifier "sec" }
WHERE {
  ?s a :Element ;
     :hasGeneralIdentifier ?gi ;
     la:expresses doco:Section
}
The previous excerpts change to:
<sec class="section"><h1>Results</h1>...
<sec><info>
<title>Results</title> ...
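The effect of the DELETE/INSERT update can be sketched on a toy model in plain Python. The dictionary layout ("gi" for the general identifier, "expresses" for the concepts) and the function name `rename_sections` are assumptions for illustration only.

```python
# Toy model of the elements of both excerpts: general identifier plus expressed concepts.
ELEMENTS = {
    "ex1:div": {"gi": "div", "expresses": {"doco:Section", "deo:Results"}},
    "ex1:p": {"gi": "p", "expresses": {"doco:Paragraph"}},
    "ex2:section": {"gi": "section", "expresses": {"doco:Section", "deo:Results"}},
    "ex2:para": {"gi": "para", "expresses": {"doco:Paragraph"}},
}


def rename_sections(elements, new_name="sec"):
    """Return a copy in which every element expressing doco:Section is named `new_name`."""
    converted = {}
    for key, el in elements.items():
        gi = new_name if "doco:Section" in el["expresses"] else el["gi"]
        converted[key] = {**el, "gi": gi}  # content and semantics are left untouched
    return converted


converted = rename_sections(ELEMENTS)
print({key: el["gi"] for key, el in converted.items()})
# {'ex1:div': 'sec', 'ex1:p': 'p', 'ex2:section': 'sec', 'ex2:para': 'para'}
```

Only the general identifier changes; the expressed concepts stay in place, so the converted elements remain queryable by their semantics.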
Markup sensibility
• Problem: how to estimate whether a markup element, that is valid at the syntactical and structural level, is also valid at the semantic level
• Semantic constraints can be defined as ontological axioms of the underlying ontology, in order to understand whether a document is adhering to or in contrast with them
<akomaNtoso> ...
  <TLCPerson id="smith" href="/ontology/uk/person/JohnSmith" /> ...
  <speech id="sp_1" by="#smith" as="#mineconomy">
    <p>Honorable Members of the Parliament...</p>
  </speech> ...
</akomaNtoso>

<smith> a :Element; :hasGeneralIdentifier "TLCPerson";
  la:denotes </ontology/uk/person/JohnSmith> ...
</ontology/uk/person/JohnSmith> a akomantoso:Person.
<sp_1> a :Element; :hasGeneralIdentifier "speech";
  la:expresses akomantoso:Speech; la:denotes _:aSpeechEvent; ...
_:aSpeechEvent a akomantoso:SpeechEvent; akomantoso:hasSpeaker </ontology/uk/person/JohnSmith>.
[] a la:LinguisticAct; sit:isSettingFor <sp_1>, akomantoso:Speech, </ontology/uk/person/JohnSmith>, _:aSpeechEvent.
Verifying semantic constraints
• Verify: check whether the markup element “speech” denotes a particular speech event that involves one and only one person as speaker, who is introduced in the document through a markup element
(Element that hasGeneralIdentifier value "speech")
SubClassOf
(sit:hasSetting only
  (la:LinguisticAct that
    sit:isSettingFor exactly 1 (Element and la:InformationEntity)
    and
    sit:isSettingFor exactly 1
      ((akomantoso:SpeechEvent and la:Reference)
        that akomantoso:hasSpeaker some
          (akomantoso:Person that la:isDenotedBy some Element))
    and
    sit:isSettingFor value akomantoso:Speech))
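A much simplified, procedural reading of this constraint can be sketched in plain Python. It is not a substitute for an OWL reasoner: it only checks that a speech element denotes a speech event with at least one speaker who is a person denoted by some markup element. The dictionaries and the function name are illustrative assumptions, and the person IRI is taken from the href in the XML excerpt.

```python
# Toy facts mirroring the Akoma Ntoso example.
DENOTES = {
    "smith": "/ontology/uk/person/JohnSmith",  # TLCPerson element -> person
    "sp_1": "_:aSpeechEvent",                  # speech element -> speech event
}
TYPES = {
    "/ontology/uk/person/JohnSmith": "akomantoso:Person",
    "_:aSpeechEvent": "akomantoso:SpeechEvent",
}
SPEAKERS = {"_:aSpeechEvent": ["/ontology/uk/person/JohnSmith"]}


def speech_is_sensible(element_id):
    """Check the simplified constraint for one element identifier."""
    event = DENOTES.get(element_id)
    if TYPES.get(event) != "akomantoso:SpeechEvent":
        return False  # the element does not denote a speech event
    denoted_by_markup = set(DENOTES.values())
    # "some": at least one speaker must be a person introduced through markup.
    return any(
        TYPES.get(speaker) == "akomantoso:Person" and speaker in denoted_by_markup
        for speaker in SPEAKERS.get(event, [])
    )


print(speech_is_sensible("sp_1"))   # True
print(speech_is_sensible("smith"))  # False
```

A speech element whose speaker is never introduced by a TLCPerson element would still be valid XML, yet it would fail this check, which is the point of validating at the semantic level.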
Conclusions
• The issue of markup semantics is still an interesting research field, with many possible applications in real-world scenarios
• We proposed our approach for addressing markup semantics through Semantic Web technologies, introducing EARMARK as a new document markup meta-language and the Linguistic Act ontology design pattern for expressing the semantics of EARMARK document markup
• We showed how to use these models in real scenarios where markup semantics helps with particular tasks, such as querying heterogeneous document repositories, converting document markup across different vocabularies, and verifying the validity of markup elements at a semantic level
• Future development:
✦ a software assistant that helps users define the markup semantics of a given XML schema
✦ two applications, for the semantic validation of markup documents and for the visualisation of document parts according to their semantics
Thanks for your attention