Ontology-based multi-domain metadata for research data management using triple stores

Post on 25-Jun-2015

DESCRIPTION

A presentation given at the IDEAS 2014 conference about database modelling using triple stores for research data management. IDEAS '14, July 07 - 09 2014, Porto, Portugal. Paper Abstract: Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. While these are easy to understand and use, their semantics are limited to general concepts, leaving out domain-specific metadata and representing values as sets of text values. While this enables retrieval through free-text search, faceted search and dataset interlinking become limited. From the point of view of the relational database schema modeler, designing a more flexible metadata model represents a non-trivial challenge because of the open nature of the model. This work demonstrates the approaches followed by current open-source platforms and proposes a graph-based model for achieving modular, ontology-based metadata for interlinked data assets in the Semantic Web. The proposed model was implemented in a collaborative research data management platform currently under development at the University of Porto.

TRANSCRIPT

Ontology-based multi-domain metadata for research data management using triple stores

João Rocha da Silva joaorosilva@gmail.com

Faculdade de Engenharia da Universidade do Porto / INESC TEC

Cristina Ribeiro mcr@fe.up.pt

DEI, Faculdade de Engenharia da Universidade do Porto / INESC TEC

João Correia Lopes jlopes@fe.up.pt

IDEAS '14, July 07 - 09 2014, Porto, Portugal

Contents

• Diverse metadata: relational modeling challenges

• Current approaches built on relational databases

• Dendro: graph-based research data management

• Live demo

• Conclusions

Problem: diverse metadata

Relational modeling challenges

Analytical Chemistry Dataset

• Generic: Author, Description, Creation date, …

• Domain specific: Sample Count, Analysed Substance

Mechanical Engineering Dataset

• Generic: Author, Description, Creation date, …

• Domain specific: Initial Crack Length, Specimen Type

…

Common challenges in RDB schema modeling

• Entities with unknown attributes at time of modeling

• Time-variant attribute values

• Inheritance / sub-class mapping

• Resource hierarchies (parents of parents…)

• Schemas rely on external documentation

Data management and description platforms

Study of relational models

DSpace

• Academic publications management platform

• Not targeted specifically at data

• More than 1000 active installations

• Mature open-source codebase

DSpace

• Designed for self-deposit by common users

• Good deposit workflow (validation, licensing…)

U.Porto Open Repository Homepage (http://repositorio-aberto.up.pt), powered by DSpace

A thesis record in the repository (http://repositorio-aberto.up.pt/handle/10216/58508), powered by DSpace

[Figure: DSpace metadata model relating Item, Bitstream, Metadata Schema, Metadata Descriptor and metadata value through one-to-many (1 to *) associations]

DSpace

New requirements

• Metadata profiles for objects other than Items

• Descriptor hierarchy for specialization

• Collaborative schema derivation

• Validation of metadata completeness against different schemas

• Restricting possible metadata for each type of resource

CKAN

• Open-source data publishing platform

• Deposit requires minimal metadata at first

• Flexible metadata model

• Open-Source

[Figure: CKAN relational schema (source: CKAN). Annotations: entity with variable, time-dependent attributes; fixed attributes; attribute name; value (always varchar); timestamps]
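The pattern annotated on the CKAN schema (fixed attributes on the entity, plus rows holding an attribute name, a text value and a timestamp) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table and column names below are hypothetical and simplified, not CKAN's actual schema, and the sample values are made up for the example.

```python
import sqlite3

# Hypothetical, simplified tables: not CKAN's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,          -- fixed attribute
        title TEXT                    -- fixed attribute
    );
    CREATE TABLE package_extra (
        package_id INTEGER NOT NULL REFERENCES package(id),
        key        VARCHAR NOT NULL,  -- attribute name
        value      VARCHAR,           -- value, always stored as text
        recorded   TEXT NOT NULL      -- timestamp of this value
    );
""")

conn.execute("INSERT INTO package VALUES (1, 'dcb-base-data', 'DCB Base Data')")
conn.executemany(
    "INSERT INTO package_extra VALUES (?, ?, ?, ?)",
    [
        (1, "initial_crack_length", "280", "2014-01-01T10:00:00"),
        (1, "specimen_width", "110", "2014-01-01T10:00:00"),
        (1, "specimen_width", "120", "2014-01-02T09:30:00"),  # later correction
    ],
)


def current_extras(connection, package_id):
    """Return the most recent value of each variable attribute."""
    rows = connection.execute(
        """
        SELECT e.key, e.value
        FROM package_extra AS e
        WHERE e.package_id = ?
          AND e.recorded = (
              SELECT MAX(recorded) FROM package_extra
              WHERE package_id = e.package_id AND key = e.key
          )
        ORDER BY e.key
        """,
        (package_id,),
    )
    return dict(rows.fetchall())


extras = current_extras(conn, 1)
# Values come back as strings, so a numeric comparison needs an explicit cast.
wide_enough = int(extras["specimen_width"]) >= 115
```

The sketch shows the trade-off: new attributes need no schema change, but every value is untyped text and even "give me the current description" requires a correlated subquery over timestamps.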

Invenio

• Software behind Zenodo, a data publishing portal

• Static metadata model

• Very complex relational schema generated by business logic code

• Tight coupling between DB and code

• Open-Source

[Figure: Invenio relational schema: 541 tables, no foreign keys (FKs)]

Ontologies

Semantic annotation for richer metadata

[Figure: RDF graph, built up step by step, describing the resource http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls with the following properties]

• nie:isLogicalPartOf: http://dendro.fe.up.pt/project/datanotes/data

• rdf:type: nie:File

• dc:title: “Base data of the DCB experiments”

• nie:title: base data.xls

• dcb:initialCrackLength: base data.xls
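The annotated resource above is a set of subject-predicate-object triples. A minimal sketch of that idea in plain Python follows; it is an in-memory illustration rather than a real triple store, the helper names `match` and `describe` are made up for the example, and the "280mm" value for dcb:initialCrackLength is borrowed from the change-recording figure later in the deck.

```python
# Illustrative only: a tiny in-memory triple set, not a real triple store.
BASE = "http://dendro.fe.up.pt/project/datanotes/data"
FILE = BASE + "/base%20data.xls"

triples = {
    (FILE, "nie:isLogicalPartOf", BASE),
    (FILE, "rdf:type", "nie:File"),
    (FILE, "dc:title", "Base data of the DCB experiments"),
    (FILE, "nie:title", "base data.xls"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


def describe(store, subject):
    """Collect every property of one resource into a dictionary."""
    return {p: o for (_, p, o) in match(store, subject=subject)}


# Adding a domain-specific descriptor is one more triple: no schema change.
triples.add((FILE, "dcb:initialCrackLength", "280mm"))

description = describe(triples, FILE)
```

Generic descriptors (dc:title) and domain-specific ones (dcb:initialCrackLength) live side by side; the prefix tells which ontology defines each.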

Semantic MediaWiki

• Semantic extension of MediaWiki, the code behind Wikipedia

• Semantic Links between pages

• Uses ontologies

• Strong emphasis on page versioning

• DB schema built around the time dimension

Loading an ontology

Describing a resource

Semantic Forms

From DataNotes + UPBox http://purl.pt/24107/1/iPres2013_PDF/UPBox%20and%20DataNotes%20a%20collaborative%20data%20management%20environment%20for%20the%20long%20tail%20of%20research%20data.pdf

Redundancy…

Relational Database (MySQL)

Mapping Logic

Triple Store (Apache Jena)

[Comparison table: platforms (CKAN, DSpace, Invenio, Semantic MediaWiki) against criteria (time, flexible attributes, wide use, DB-code coupling)]

Issues review

• Entities with unknown attributes at time of modeling

• Time-variant attribute values

• Inheritance / sub-classing

• Hierarchies (parents of parents of parents…)

• Need for external documentation

Dendro

a graph-based data management platform

Graph databases

• Represent entities (Users, Products, Places…) as vertexes (entity types are called classes)

• Connections between them are directed graph edges (edge types are called properties)

• The meaning of these connections is expressed in ontologies that can be shared and reused

Getting all my Projects

• Will fetch all the projects created by the user

• Will also return their attributes (“database columns”)

• Different projects may have different attributes
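The query behaviour described in these bullets can be sketched as follows. The SPARQL text is a hypothetical reconstruction shown only for comparison (it is not executed, prefix declarations are omitted, and the class ddr:Project and the example.org URIs are assumptions); the Python function mimics the same graph pattern over an in-memory list of triples with made-up sample data.

```python
from collections import defaultdict

# Hypothetical SPARQL text, shown only for comparison; it is not executed here.
# The class ddr:Project and the user URI are assumptions.
ALL_MY_PROJECTS = """
SELECT ?project ?property ?value
WHERE {
  ?project rdf:type ddr:Project .
  ?project dc:creator <http://example.org/user/demo> .
  ?project ?property ?value .
}
"""

USER = "http://example.org/user/demo"
triples = [
    ("ex:projectA", "rdf:type", "ddr:Project"),
    ("ex:projectA", "dc:creator", USER),
    ("ex:projectA", "dc:title", "DCB experiments"),
    ("ex:projectB", "rdf:type", "ddr:Project"),
    ("ex:projectB", "dc:creator", USER),
    ("ex:projectB", "dc:title", "Pollutant analysis"),
    ("ex:projectB", "dc:description", "Analytical chemistry runs"),
    ("ex:projectC", "rdf:type", "ddr:Project"),
    ("ex:projectC", "dc:creator", "http://example.org/user/other"),
]


def projects_created_by(store, user):
    """Mimic the graph pattern above: find the user's projects, then
    return every property/value pair attached to each of them."""
    typed = {s for (s, p, o) in store if p == "rdf:type" and o == "ddr:Project"}
    owned = {s for (s, p, o) in store if p == "dc:creator" and o == user}
    result = defaultdict(dict)
    for s, p, o in store:
        if s in typed and s in owned:
            result[s][p] = o
    return dict(result)


mine = projects_created_by(triples, USER)
```

Here projectB carries a dc:description while projectA does not, and the same pattern returns both without any change to the query.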

Inference

• Transitive Properties

• Subclasses

• Multiple Inheritance (a resource can be a Folder and a Dataset at the same time)
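A rough sketch of the three kinds of inference listed above, written in plain Python rather than with a reasoner. All ex: names and the tiny hierarchy are made up for the illustration; a real triple store would derive the same facts from ontology statements.

```python
def transitive_closure(edges, start):
    """Follow edges from `start` repeatedly and return every node reached."""
    reached, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached


# Transitive property: a file is part of every ancestor folder.
PART_OF = {
    "ex:file.xls": {"ex:data"},
    "ex:data": {"ex:datanotes"},
}

# Subclasses: every Folder and every Dataset is also a Resource.
SUBCLASS_OF = {
    "ex:Folder": {"ex:Resource"},
    "ex:Dataset": {"ex:Resource"},
}

# Multiple inheritance: one resource with two asserted types.
ASSERTED_TYPES = {"ex:data": {"ex:Folder", "ex:Dataset"}}


def inferred_types(resource):
    """Asserted types plus all of their superclasses."""
    types = set(ASSERTED_TYPES.get(resource, ()))
    for asserted in list(types):
        types |= transitive_closure(SUBCLASS_OF, asserted)
    return types


ancestors = transitive_closure(PART_OF, "ex:file.xls")
types = inferred_types("ex:data")
```

In a relational schema each of these would typically need recursive queries or extra join tables; here they all reduce to following edges.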

Loading an ontology

• Load ontology straight from the web

• No platform-specific syntax (like in SMW)

Nothing comes for free

• Aggregation operators slow

• No ACID properties

• Transactions are not supported in standard SPARQL

• (“SPARQL 1.1 Query/Update Services should be atomic but that they are not required to be atomic.”)

• Graph DBMS Solutions are in early stages (many bugs, many “beta”s, many mailing lists…)

Dendro

• Dropbox and File/Folder description platform

• Variable descriptions

• Time-dependent values

• Directory structures (hierarchy)

• Need for simple querying…

[Figure: Dendro data model with change recording. Nodes: Pn, Dn, Dn-1, Fn, Un and C. Regions: current dataset version, past revisions, change recording. Properties shown: nie:isLogicalPartOf, dc:title (“DCB Base Data”), dcb:initialCrackLength (280mm), dcb:specimenWidth (120), dc:isReferencedBy, dc:isVersionOf, dc:creator, dc:created and dc:modified (01/01/2014^^xsd:date), ddr:pertainsTo, ddr:changedDescriptor (ddr:initialCrackLength), ddr:operation (“add”). Annotations: added property instance, changed modification timestamp, revision creation timestamp]
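One way to read the change-recording figure is sketched below: adding a descriptor archives the previous state as a past revision, logs a change record, and updates the current version. This is an interpretation under stated assumptions, not Dendro's implementation: the property names come from the figure, but which node each link connects, the revision URI scheme and the function name `add_descriptor` are all hypothetical.

```python
import copy
from datetime import date


def add_descriptor(dataset, revisions, changes, descriptor, value, today):
    """Add a descriptor to the current version, archiving the previous
    state and logging what changed (hypothetical reading of the figure)."""
    # Freeze the state before the edit as a past revision.
    snapshot = copy.deepcopy(dataset)
    snapshot["uri"] = f'{dataset["uri"]}/revision/{len(revisions) + 1}'
    snapshot["dc:isVersionOf"] = dataset["uri"]
    snapshot["dc:created"] = today  # revision creation timestamp
    revisions.append(snapshot)

    # Record the change itself; linking it to the archived revision
    # is an assumption.
    changes.append({
        "ddr:pertainsTo": snapshot["uri"],
        "ddr:changedDescriptor": descriptor,
        "ddr:operation": "add",
    })

    # Apply the edit to the current version and bump its timestamp.
    dataset[descriptor] = value
    dataset["dc:modified"] = today


current = {
    "uri": "ex:Dn",
    "dc:title": "DCB Base Data",
    "dcb:specimenWidth": "120",
}
revisions, changes = [], []

add_descriptor(current, revisions, changes,
               "dcb:initialCrackLength", "280mm", date(2014, 1, 1))
```

After the call, the current version carries the new descriptor while the archived revision keeps only the title and specimen width, which matches the two states the figure labels show.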

Demo

Dendroβ

Conclusions

• Recording rich metadata requires data model flexibility

• Unknown attributes, time-variant information or hierarchies can be hard to model in a relational database

• Several current solutions make compromises due to their relational database layer

Conclusions (cont’d)

• Graph-based models are more flexible and easily extensible through ontology loading

• Ontologies are shareable on the web, and document the database “schema”

• Queries become simpler due to the graph model’s ability to easily model challenging scenarios for RDBs

• Dendro is a collaborative data management platform fully built on a graph model

João Rocha da Silva

Research Data Management and Semantic Web Researcher, Web & iPhone Developer

João Rocha da Silva is an Informatics Engineering PhD student at the Faculty of Engineering of the University of Porto. He specializes in research data management, applying the latest Semantic Web technologies to the adequate preservation and discovery of research data assets.

He is also an experienced freelance iOS developer with several apps published on the App Store, and a self-taught DIY mechanic with a special interest in classic cars, particularly his 1987 Toyota Corolla GT Twin Cam, also known as Hachi-Roku or AE86.

João Correia Lopes

Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC

João Correia Lopes is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. He graduated in Electrical Engineering from the University of Porto in 1984 and holds a PhD in Computing Science from Glasgow University (1997). His teaching includes undergraduate and graduate courses in databases and web applications, software engineering and object-oriented programming, markup languages and semantic web. He has been involved in research projects in the area of long-term preservation, service-oriented architectures and e-Science. Currently his main research interests are e-Science and the management of research data.

Cristina Ribeiro

Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC

Cristina Ribeiro is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. She graduated in Electrical Engineering and holds a Master's degree in Electrical and Computer Engineering and a Ph.D. in Informatics. Her teaching includes undergraduate and graduate courses in information retrieval, digital libraries, knowledge representation and markup languages. She has been involved in research projects in the areas of cultural heritage, multimedia databases and information retrieval. Currently her main research interests are information retrieval, digital preservation and the management of research data.
