Dealing with Markup Semantics

Silvio Peroni, Aldo Gangemi, Fabio Vitali

License: http://creativecommons.org/licenses/by-sa/3.0

Uploaded by silvio-peroni on 26-Jan-2015.


Paper presentation at i-Semantics 2011, Graz, Austria.


Page 2: Dealing with Markup Semantics

Summary

• Semantic markup vs. markup semantics

• Why markup semantics

• Why XML is not enough

• Markup semantics with EARMARK and Linguistic Act

• Real-world scenarios

• Conclusions

Page 3: Dealing with Markup Semantics

Shift of meaning

• 1990: Web of documents, the First Era of the Web (WWW)

• Today: Web of data, the Second Era of the Web (Semantic Web)

• Markup
✦ Web of documents, document markup: it tells us something about the text or content of a document
✦ Web of data, resource markup: it is used to identify any data added to a resource with the intention to semantically describe it

• Tag
✦ Web of documents, markup element: a syntactic item representing the building block of a document structure
✦ Web of data, keyword: a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, digital image or computer file)

• Semantics and Markup
✦ markup semantics: “what is the meaning of a markup element title contained in a document d?”
✦ semantic markup: “the resource r has the string Dealing with Markup Semantics as title”

Page 4: Dealing with Markup Semantics

Markup semantics today

• Document markup is still here:
✦ many research issues are still open problems
✦ some of the partially solved issues can be addressed better with today’s tools and technologies

• So, our question is:

Why has the Semantic Web not yet properly addressed markup semantics?

Possible answers:
✦ Because document markup is dead, really
✦ Because markup semantics is not an interesting research topic
✦ Because markup semantics is not a useful tool for solving valuable problems
✦ Actually, the Semantic Web has addressed markup semantics

Page 5: Dealing with Markup Semantics

The document markup is dead... wait, really?

• Document markup does not play any important role in today’s research fields and company interests

Are we definitely sure?

Maybe not!

Page 6: Dealing with Markup Semantics

Research groups’ interest in markup semantics

• Does this mean that no research communities are interested in this issue? Well, actually, it is an old and still-live issue:

✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M. (2002). Towards a Semantics for XML Markup.
✦ Dubin, D. (2003). Object mapping for markup semantics.
✦ Renear, A., Dubin, D., Sperberg-McQueen, C. M., Huitfeldt, C. (2003). XML Semantics and Digital Libraries.
✦ Simons, G. F., Lewis, W. D., Farrar, S. O., Langendoen, D. T., Fitzsimons, B., Gonzalez, H. (2004). The semantics of markup: mapping legacy markup schemas to a common semantics.
✦ Garcia, R., Celma, O. (2005). Semantic Integration and Retrieval of Multimedia Metadata.
✦ Marcoux, Y. (2006). A natural-language approach to modeling: Why is some XML so difficult to write?
✦ Van Deursen, D., Poppe, C., Martens, G., Mannens, E., Van de Walle, R. (2008). XML to RDF Conversion: a Generic Approach.
✦ Marcoux, Y., Rizkallah, E. (2009). Intertextual semantics: A semantics for information design.
✦ Sperberg-McQueen, C. M., Marcoux, Y., Huitfeldt, C. (2009). Two representations of the semantics of TEI Lite.
✦ Nuzzolese, A., Gangemi, A., Presutti, V. (2010). Gathering Lexical Linked Data and Knowledge Patterns from FrameNet.

• “The problem addressed seems old and seems to have been solved before, but actually has not [sufficiently]” – by an anonymous reviewer

Page 7: Dealing with Markup Semantics

Markup semantics and real-world problems

• Some advantages of having a formal, machine-readable semantics of markup:

✦ perform both syntactic and semantic validation
✦ infer facts from documents automatically
✦ simplify the federation, conversion and translation of documents among digital repositories
✦ query the structure of a document by considering its semantics
✦ create visualisations of documents that consider the semantics of their structures rather than their markup vocabularies
✦ increase the accessibility of documents’ content (see the “tag abuse” issue)
✦ guarantee better maintainability when a markup schema evolves

• Fields of interest: digital libraries and digital (and semantic) publishing

Page 8: Dealing with Markup Semantics

Semantic Web approaching markup semantics

• RDFa may be a valid choice for associating formal semantics with arbitrary text fragments

✦ Pros: easy to use and parse, compliant with XML-like formats
✦ Cons: we need to modify the structure of the document (more attributes, more elements)

• There are domains (e.g., those having to deal with administrative and juridical documents) in which we cannot modify the structure of documents

• How can we say that the element p in the document means “paragraph”?

Original document (1 markup element only):

<?xml version="1.0" encoding="UTF-8"?>
<p>Fabio says that overlhappens</p>

RDFa enhancing (2 markup elements, 3 attributes):

<?xml version="1.0" encoding="UTF-8"?>
<p prefix=": http://www.example.com/ foaf: http://xmlns.com/foaf/0.1/">
  <span about=":fv" property="foaf:firstName">Fabio</span> says that overlhappens
</p>
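The structural cost of the RDFa approach can be counted directly. The following is a minimal sketch in Python, using only the standard library, that counts elements and attributes in the two versions of the paragraph. The function name `markup_footprint` is an assumption made for illustration and is not part of any tool mentioned in the slides.

```python
import xml.etree.ElementTree as ET

# The paragraph before and after the RDFa enhancement shown in the slide.
PLAIN = "<p>Fabio says that overlhappens</p>"
RDFA = (
    '<p prefix=": http://www.example.com/ foaf: http://xmlns.com/foaf/0.1/">'
    '<span about=":fv" property="foaf:firstName">Fabio</span>'
    " says that overlhappens</p>"
)


def markup_footprint(xml_text):
    """Return (number of elements, number of attributes) in a document."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    attributes = sum(len(element.attrib) for element in elements)
    return len(elements), attributes


def text_content(xml_text):
    """Return the character content with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    print(markup_footprint(PLAIN))
    print(markup_footprint(RDFA))
    print(text_content(PLAIN) == text_content(RDFA))
```

The text content is the same in both versions; only the markup structure grows, which is exactly what some domains do not allow.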

Page 9: Dealing with Markup Semantics

Our problems in addressing markup semantics

• Let’s use XML for defining document markup structures
✦ Pros: it is today’s common format, used in many tools and applications
✦ Cons: it does not define a formal way of specifying markup semantics

• Let’s use OWL for defining formal semantics and then associate it with XML markup
✦ Pros: OWL was created to define semantics
✦ Cons: we have to use XML-based approaches (RDFa, GRDDL) to link semantics to XML markup, and this is not always possible

• A compromise between XML and OWL is not fully satisfying

• A solution: elevate either the document markup formalism or the formal semantics model to the level of the other, which means:
✦ using XML for document markup and another formalism, fully compliant with XML in all possible scenarios, for defining its markup semantics (does it exist?), or
✦ developing an OWL ontology for defining document markup and another OWL ontology for specifying its semantics

Try to guess what we did.

Page 10: Dealing with Markup Semantics

• The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup

✦ More expressive than XML: it allows markup structures to be organised as graphs

✦ It makes it easy to associate OWL semantics with document items: an EARMARK document is a set of OWL assertions, and all the markup items and text nodes are individuals of particular classes

✦ Many tools available: a Java API, and frameworks to convert XML documents into EARMARK ones and to convert complex EARMARK documents (i.e., those having a graph structure) into XML ones, applying overlapping tricks to store as much information as possible in the simple XML tree hierarchy

more information at http://palindrom.es/phd/research/earmark

Page 11: Dealing with Markup Semantics

An example: XML tricks

[Figure: the text “Fabio says that overlhappens” inside the element p, with the spans agent, noun and verb marked over it]

This is not directly representable in XML (unless using tricks): “noun” and “verb” overlap

To be representable in XML it should be...

[Figure: the same text inside the element p, with agent, a first noun fragment, and verb containing a second noun fragment]

XML serialisation with TEI fragmentation:

<p>
  <agent>Fabio</agent> says that
  <noun xml:id="e1" next="e2">overl</noun><verb>h<noun xml:id="e2">ap</noun>pens</verb>
</p>
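Whether a set of spans fits a single XML tree can be checked mechanically: two spans are a problem when they intersect but neither contains the other. The following is a minimal sketch in Python, assuming character offsets into the example text (end offsets exclusive), with the noun covering “overl” and “ap” and the verb covering “happens”. The names `extent`, `properly_overlap` and `overlapping_pairs` are hypothetical.

```python
TEXT = "Fabio says that overlhappens"

# Character ranges (start, end) covered by each element; end is exclusive.
SPANS = {
    "agent": [(0, 5)],
    "noun": [(16, 21), (22, 24)],
    "verb": [(21, 28)],
}


def extent(ranges):
    """Smallest single range that covers all the given ranges."""
    return min(start for start, _ in ranges), max(end for _, end in ranges)


def properly_overlap(a, b):
    """True when two extents intersect but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    intersect = a_start < b_end and b_start < a_end
    nested = (a_start <= b_start and b_end <= a_end) or (
        b_start <= a_start and a_end <= b_end
    )
    return intersect and not nested


def overlapping_pairs(spans):
    """List the pairs of element names that cannot be nested in one tree."""
    names = sorted(spans)
    return [
        (first, second)
        for index, first in enumerate(names)
        for second in names[index + 1:]
        if properly_overlap(extent(spans[first]), extent(spans[second]))
    ]


if __name__ == "__main__":
    print(overlapping_pairs(SPANS))
```

Any pair reported by `overlapping_pairs` has to be handled with a workaround such as fragmentation when serialising to XML.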

Page 12: Dealing with Markup Semantics

An example: EARMARK document

ex:doc a :StringDocuverse ;
    :hasContent "Fabio says that overlhappens" .

ex:r0-5 a :PointerRange ; :refersTo ex:doc ; :begins "0" ; :ends "5" .
ex:r5-16 a :PointerRange ; :refersTo ex:doc ; :begins "5" ; :ends "16" .
ex:r16-21 a :PointerRange ; :refersTo ex:doc ; :begins "16" ; :ends "21" .
ex:r21-28 a :PointerRange ; :refersTo ex:doc ; :begins "21" ; :ends "28" .
ex:r22-24 a :PointerRange ; :refersTo ex:doc ; :begins "22" ; :ends "24" .

ex:agent a :Element ; :hasGeneralIdentifier "agent" ;
    c:firstItem [ c:itemContent ex:r0-5 ] .

ex:noun a :Element ; :hasGeneralIdentifier "noun" ;
    c:firstItem [ c:itemContent ex:r16-21 ;
        c:nextItem [ c:itemContent ex:r22-24 ] ] .

ex:verb a :Element ; :hasGeneralIdentifier "verb" ;
    c:firstItem [ c:itemContent ex:r21-28 ] .

ex:p a :Element ; :hasGeneralIdentifier "p" ;
    c:firstItem [ c:itemContent ex:agent ;
        c:nextItem [ c:itemContent ex:r5-16 ;
            c:nextItem [ c:itemContent ex:noun ;
                c:nextItem [ c:itemContent ex:verb ] ] ] ] .
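The way ranges point into the docuverse can be illustrated outside RDF. The following is a minimal sketch in Python, assuming that `:begins` and `:ends` are character offsets with an exclusive end. The dictionaries and the function `text_of` are hypothetical stand-ins for the EARMARK individuals, not the EARMARK Java API.

```python
# The string that all ranges refer to (ex:doc in the slide).
DOCUVERSE = "Fabio says that overlhappens"

# Pointer ranges: name -> (begins, ends), end exclusive.
RANGES = {
    "r0-5": (0, 5),
    "r5-16": (5, 16),
    "r16-21": (16, 21),
    "r21-28": (21, 28),
    "r22-24": (22, 24),
}

# Elements: name -> ordered list of items (ranges or other elements).
ELEMENTS = {
    "agent": ["r0-5"],
    "noun": ["r16-21", "r22-24"],
    "verb": ["r21-28"],
    "p": ["agent", "r5-16", "noun", "verb"],
}


def text_of(item):
    """Resolve a range to its text, or an element to the text of its items."""
    if item in RANGES:
        begins, ends = RANGES[item]
        return DOCUVERSE[begins:ends]
    return "".join(text_of(child) for child in ELEMENTS[item])


if __name__ == "__main__":
    for name in ("agent", "noun", "verb"):
        print(name, repr(text_of(name)))
```

Because the noun and the verb both point at the characters “ap”, the structure is a graph over the text rather than a tree, which is what the range-based model permits.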

Page 13: Dealing with Markup Semantics

Towards markup semantics

• EARMARK is suitable for expressing markup semantics straightforwardly using OWL

• What model can we use? It must:
✦ follow precise and theoretically founded principles
✦ be interoperable across different markup vocabularies

• A large number of vocabularies address the representation of terms vs. meanings vs. things, e.g., SKOS, FRBR, CIDOC, OWL-WordNet

Problems:
✦ they are too specific for particular contexts
✦ they are not interoperable

Page 14: Dealing with Markup Semantics

Linguistic Act ontology design pattern

• References: any individual from the world we are describing – e.g., Fabio

• Meanings: any (meta-level) object that explains something – e.g., person

• Information entities: any symbol that has a meaning or denotes one or more references – e.g., the string “Fabio”

• Linguistic acts: any communicative situation including information entities, agents, meanings, references, and a possible spatio-temporal context – e.g., to add markup to a document

http://ontologydesignpatterns.org/cp/owl/semantics.owl
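The four notions of the pattern can be pictured as plain data. The following is an illustrative sketch in Python, assuming a much simplified reading of the pattern in which one act links one information entity to at most one meaning and one reference. The class and field names are hypothetical and do not reproduce the ontology’s actual axioms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InformationEntity:
    symbol: str  # e.g. the string "Fabio"


@dataclass(frozen=True)
class Meaning:
    label: str  # e.g. "person"


@dataclass(frozen=True)
class Reference:
    label: str  # e.g. the individual Fabio


@dataclass(frozen=True)
class LinguisticAct:
    """A communicative situation that ties the other notions together."""
    entity: InformationEntity
    agent: str
    meaning: Optional[Meaning] = None
    reference: Optional[Reference] = None
    context: Optional[str] = None


def describe(act):
    """Render what the information entity expresses and denotes."""
    parts = []
    if act.meaning is not None:
        parts.append(f"expresses {act.meaning.label}")
    if act.reference is not None:
        parts.append(f"denotes {act.reference.label}")
    return f'"{act.entity.symbol}" ' + " and ".join(parts)


if __name__ == "__main__":
    act = LinguisticAct(
        entity=InformationEntity("Fabio"),
        agent="document author",
        meaning=Meaning("person"),
        reference=Reference("the individual Fabio"),
    )
    print(describe(act))
```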

Page 15: Dealing with Markup Semantics

Example: “Results” section of a paper

2 XML excerpts of “Results” sections:

<div class="section">
  <h1>Results</h1>
  <p>...</p>
</div>

<section>
  <info>
    <title>Results</title>
  </info>
  <para>...</para>
</section>

Related EARMARK conversions:

ex1:div a :Element ;
    :hasGeneralIdentifier "div" ;
    c:firstItem [ c:itemContent ex1:class ;
        c:nextItem [ c:itemContent ex1:h1 ;
            c:nextItem [ c:itemContent ex1:p ] ] ] ;
    la:expresses doco:Section , deo:Results .
...
ex1:p a :Element ;
    :hasGeneralIdentifier "p" ;
    c:firstItem [ c:itemContent ex1:someText ] ;
    la:expresses doco:Paragraph .
...

ex2:section a :Element ;
    :hasGeneralIdentifier "section" ;
    c:firstItem [ c:itemContent ex2:info ;
        c:nextItem [ c:itemContent ex2:para ] ] ;
    la:expresses doco:Section , deo:Results .
...
ex2:para a :Element ;
    :hasGeneralIdentifier "para" ;
    c:firstItem [ c:itemContent ex2:someText ] ;
    la:expresses doco:Paragraph .
...

We are using the Document Components Ontology (http://purl.org/spar/doco) and the Discourse Elements Ontology (http://purl.org/spar/deo) to specify the semantics of markup elements

Page 16: Dealing with Markup Semantics

Searches on heterogeneous repositories

• Problem: how to search across a large number of digital libraries that store documents as XML in different, non-interoperable formats?

• Query: give me all the markup elements that represent paragraphs of any “Results” section of any available document that were written by any person called Fabio

SELECT ?x WHERE {
  ?x a :Element ;
     la:expresses doco:Paragraph ;
     dc:creator [ a foaf:Person ; foaf:name "Fabio" ] ;
     (^c:itemContent/^c:item)+ [ a :Element ;
                                 la:expresses doco:Section , deo:Results ]
}

ex1:p and ex2:para are returned
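The logic of the query can be mimicked on a small in-memory model. The following is a rough sketch in Python, assuming each element records its parent, the meanings it expresses, and optionally a creator name. The creator values and the element `ex3:p` are invented for illustration, and the dictionary layout is a hypothetical stand-in for a real RDF store queried with SPARQL.

```python
# Hypothetical in-memory stand-in for the two converted documents.
# The creator names and the ex3:* entries are illustrative assumptions.
ELEMENTS = {
    "ex1:div": {"parent": None, "expresses": {"doco:Section", "deo:Results"}},
    "ex1:p": {"parent": "ex1:div", "expresses": {"doco:Paragraph"},
              "creator": "Fabio"},
    "ex2:section": {"parent": None,
                    "expresses": {"doco:Section", "deo:Results"}},
    "ex2:para": {"parent": "ex2:section", "expresses": {"doco:Paragraph"},
                 "creator": "Fabio"},
    "ex3:div": {"parent": None, "expresses": {"doco:Section", "deo:Results"}},
    "ex3:p": {"parent": "ex3:div", "expresses": {"doco:Paragraph"},
              "creator": "Aldo"},
}


def ancestors(name):
    """Yield every element that (transitively) contains the given one."""
    parent = ELEMENTS[name]["parent"]
    while parent is not None:
        yield parent
        parent = ELEMENTS[parent]["parent"]


def results_paragraphs_by(creator):
    """Paragraph elements by `creator` inside any Results section."""
    wanted = {"doco:Section", "deo:Results"}
    return sorted(
        name
        for name, element in ELEMENTS.items()
        if "doco:Paragraph" in element["expresses"]
        and element.get("creator") == creator
        and any(wanted <= ELEMENTS[a]["expresses"] for a in ancestors(name))
    )


if __name__ == "__main__":
    print(results_paragraphs_by("Fabio"))
```

The selection never looks at the element names `p` or `para`, only at what the elements express, which is what makes the search work across vocabularies.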

Page 17: Dealing with Markup Semantics

Semantic format conversion

• Problem: how to convert a document from an (unknown) format into a target one, without knowing the markup vocabulary of the former but having the possibility of querying its semantics

• Convert: substitute any markup element representing a section with a new one named “sec” that contains the same elements and text content as the removed one

DELETE { ?s :hasGeneralIdentifier ?gi }
INSERT { ?s :hasGeneralIdentifier "sec" }
WHERE {
  ?s a :Element ;
     :hasGeneralIdentifier ?gi ;
     la:expresses doco:Section
}

Previous excerpts change:

<sec class="section">
  <h1>Results</h1>
  ...

<sec>
  <info>
    <title>Results</title>
  ...
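The same renaming can be pictured directly on the XML excerpts. The following is a minimal sketch in Python using the standard library, assuming a hypothetical predicate `expresses_section` that stands in for the `la:expresses doco:Section` lookup. In the approach described here that information would come from the EARMARK document, not from hard-coded tag names.

```python
import xml.etree.ElementTree as ET


def expresses_section(element):
    """Hypothetical stand-in for asking whether an element expresses
    doco:Section; here it is hard-coded for the two example vocabularies."""
    return element.tag == "section" or (
        element.tag == "div" and element.get("class") == "section"
    )


def to_sec(xml_text):
    """Rename every section-like element to 'sec', keeping its content."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if expresses_section(element):
            element.tag = "sec"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    first = '<div class="section"><h1>Results</h1><p>...</p></div>'
    second = (
        "<section><info><title>Results</title></info>"
        "<para>...</para></section>"
    )
    print(to_sec(first))
    print(to_sec(second))
```

Only the element name changes; attributes, child elements and text are left untouched, mirroring the DELETE/INSERT pair on `:hasGeneralIdentifier`.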

Page 18: Dealing with Markup Semantics

Markup sensibility

• Problem: how to determine whether a markup element that is valid at the syntactic and structural levels is also valid at the semantic level

• Semantic constraints can be defined as axioms of the underlying ontology, in order to understand whether a document adheres to them or contradicts them

<akomaNtoso>
  ...
  <TLCPerson id="smith" href="/ontology/uk/person/JohnSmith" />
  ...
  <speech id="sp_1" by="#smith" as="#mineconomy">
    <p>Honorable Members of the Parliament...</p>
  </speech>
  ...
</akomaNtoso>

<smith> a :Element ; :hasGeneralIdentifier "TLCPerson" ;
    la:denotes </ontology/uk/person/JohnSmith> ...

</ontology/uk/person/JohnSmith> a akomantoso:Person .

<sp_1> a :Element ; :hasGeneralIdentifier "speech" ;
    la:expresses akomantoso:Speech ; la:denotes _:aSpeechEvent ; ...

_:aSpeechEvent a akomantoso:SpeechEvent ;
    akomantoso:hasSpeaker </ontology/uk/person/JohnSmith> .

[] a la:LinguisticAct ;
    sit:isSettingFor <sp_1> , akomantoso:Speech , </ontology/uk/person/JohnSmith> , _:aSpeechEvent .

Page 19: Dealing with Markup Semantics

Verifying semantic constraints

• Verify: check whether the markup element “speech” denotes a particular speech event that involves only persons, and at least one, as speaker, where the speaker is introduced in the document through a markup element

(Element that hasGeneralIdentifier value "speech")
SubClassOf
(sit:hasSetting only
  (la:LinguisticAct
    that sit:isSettingFor exactly 1 (Element and la:InformationEntity)
    and sit:isSettingFor exactly 1
      ((akomantoso:SpeechEvent and la:Reference)
        that akomantoso:hasSpeaker some
          (akomantoso:Person that la:isDenotedBy some Element))
    and sit:isSettingFor value akomantoso:Speech))
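One part of this constraint, that the speaker must be introduced through a markup element, can be approximated without a reasoner. The following is a simplified sketch in Python using the standard library. It only checks that each `speech` element’s `by` attribute points to a `TLCPerson` declared in the same document, so it is a plain structural stand-in for the OWL axiom and not a replacement for it. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Akoma Ntoso excerpt from the slide, without the elided parts.
DOC = """<akomaNtoso>
  <TLCPerson id="smith" href="/ontology/uk/person/JohnSmith" />
  <speech id="sp_1" by="#smith" as="#mineconomy">
    <p>Honorable Members of the Parliament...</p>
  </speech>
</akomaNtoso>"""


def speeches_without_declared_speaker(xml_text):
    """Return the ids of speech elements whose speaker is not declared
    by any TLCPerson element in the same document."""
    root = ET.fromstring(xml_text)
    declared = {person.get("id") for person in root.iter("TLCPerson")}
    offending = []
    for speech in root.iter("speech"):
        speaker = (speech.get("by") or "").lstrip("#")
        if speaker not in declared:
            offending.append(speech.get("id"))
    return offending


if __name__ == "__main__":
    print(speeches_without_declared_speaker(DOC))
    broken = DOC.replace('by="#smith"', 'by="#jones"')
    print(speeches_without_declared_speaker(broken))
```

A document can pass schema validation and still be reported here, which is the gap between structural and semantic validity that the axiom is meant to close.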

Page 20: Dealing with Markup Semantics

Conclusions

• The issue of markup semantics is still an interesting research field, with many possible applications in real-world scenarios

• We proposed our approach for addressing markup semantics through Semantic Web technologies, and we introduced EARMARK, a new document markup meta-language, and the Linguistic Act ontology design pattern for expressing the semantics of EARMARK document markup

• We showed how to use these models to address real scenarios in which markup semantics helps with particular tasks, such as querying heterogeneous document repositories, converting document markup across different vocabularies, and verifying the validity of markup elements at a semantic level

• Future development:
✦ a software assistant that helps users define the markup semantics of a given XML schema
✦ two applications for the semantic validation of markup documents and for the visualisation of document parts according to their semantics

Page 21: Dealing with Markup Semantics

Thanks for your attention