2013-09-03 - Copenhagen - 3rd DARIAH-EU General VCC meeting
Matej Ďurčo, ICLTT, Vienna
Hennie Brugman, Meertens Institute, Amsterdam
Controlled Vocabularies
for Digital Humanities
Outline
• Overview of related activities in different contexts
• Controlled Vocabularies - potential usages and topic areas
• SKOS
• OpenSKOS - Vocabulary Repository
• CLAVAS schema/ontology
• Next steps – Where do we want to go from here?
„Vocabulary“ - disambiguation
• concept vs. term: semantic vs. lexical. A concept is referred to by one or more terms,
but is not identified by them (it “has a life of its own”)
• term list: flat
• concept list: also flat, but distinguishes between the semantic and the lexical level
• taxonomy: has (mostly hierarchical) relations between concepts/terms
• schema/ontology: both have concepts/entities with properties and different types of relations between them
schema: XML and database world
ontology: knowledge management, Semantic Web
„Vocabulary“ - disambiguation (scope)
The same list as on the previous slide, annotated with the labels “discussed here”, “can use” and “can grow into”.
Activities in DARIAH
• VCC1/Task 5: Data federation and interoperability
• VCC3/Task 3: Reference Data Registries (outcomes?)
• => joint task force (started in Vienna, November 2012)
goal: establish a service providing controlled vocabularies and reference data for the DARIAH community
• Schema Registry + Crosswalks: related, but does not seem to belong here, as it operates at the schema level
• M. Hoogerwerf, P. Gietz: VCC 1, Task 2: Core Infrastructure Services (?)
Activities in CLARIN
• ISOcat – Data Category Registry
- registry for defining (linguistic) concepts (“flat” = (almost) no relations)
- implementation of the ISO standard ISO 12620:2009
- a cornerstone of CMDI: semantic grounding for metadata schemas
- www.isocat.org
• Relation Registry
- companion to ISOcat for expressing relations between data categories
- early-stage service operational: lux13.mpi.nl/relcat/
• Task force on metadata curation
- within the SCCTC (Standing Committee for CLARIN Technical Centres)
• CLAVAS - Vocabulary Alignment Service for CLARIN
- initiative originating within CLARIN-NL
- goal: adapt the Vocabulary Repository OpenSKOS to CLARIN needs
• OpenSKOS and controlled vocabularies meeting in Utrecht, 2013-05-17
- www.clarin.eu/node/3780
Activities elsewhere
A number of datasets/vocabularies (and tools/services) already exist, especially in the library world:
• VIAF - Virtual International Authority File
- carried by national libraries + OCLC
- goal: harmonize/cluster (national) authority files
- provides services, search interface, data dumps
- vocabularies for: Personal Names, Corporate Names, Geographic Names, etc.
• The European Library (48 national libraries)
- vocabulary-based data enrichment
- MACS – Multilingual Access to Subjects (semi-automatic alignment)
- alignment of DDC and UDC via CERIF carried out
- alignment to other ontologies (Geonames, VIAF)
- search service: http://www.theeuropeanlibrary.org/tel4/apisearch
• Library of Congress - LCSH, MADS, ...
• Getty Thesauri
• Geonames - search interface, service, dumps
• LT-World @DFKI - full-blown ontology, rather a candidate for LOD-linking
• CoNE – Control of Named Entities @MPDL/eSciDoc
http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities
• EATS - Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC)
http://eats.readthedocs.org/en/latest/
• many more …
Potential usages for CV
• Metadata Authoring, Curation
• Data Enrichment, Annotation
• Search query expansion, autocomplete, facets etc.
• Data Analysis / Exploration
• indispensable building block for moving data to the Semantic Web, because it allows strings to be resolved to entities
• can provide equivalences between concepts/entities from different vocabularies,
cf. the authority links in Wikipedia (page for J. W. Goethe): GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065
=> Linked Data
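The last two points (resolving strings to entities, equivalences between vocabularies) can be sketched as follows. This is a minimal illustration, not part of any existing service: the lookup tables `LABEL_INDEX` and `EQUIVALENTS` and the helpers `normalize` and `resolve` are hypothetical, and the label variants are made up. Only the four Goethe identifiers are taken from the slide.

```python
# Hypothetical sketch: resolve a free-text string to an entity and return the
# equivalent identifiers in several authority files. Names and structures are
# assumptions for illustration; only the identifiers come from the slide.

# Equivalent identifiers for one entity across vocabularies.
EQUIVALENTS = {
    "goethe": {
        "GND": "118540238",
        "LCCN": "n79003362",
        "NDL": "00441109",
        "VIAF": "24602065",
    },
}

# Index from normalized label variants (terms) to the internal entity key.
LABEL_INDEX = {
    "goethe": "goethe",
    "j. w. goethe": "goethe",
    "johann wolfgang von goethe": "goethe",
}


def normalize(text):
    """Lower-case and collapse whitespace so label variants compare equal."""
    return " ".join(text.lower().split())


def resolve(text):
    """Return the identifiers of the entity a string refers to, or None."""
    key = LABEL_INDEX.get(normalize(text))
    return EQUIVALENTS[key] if key is not None else None


ids = resolve("  Johann Wolfgang  von Goethe ")
print(ids["VIAF"] if ids else "unresolved")
```

Several terms map to one entity key, which mirrors the concept/term distinction: the string is only a way to reach the entity, while the identifiers are what other datasets can link to.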
Topic areas + candidate vocabularies
• Data Categories / Concepts - ISOcat, (Dublin Core)
• Languages - ISO 639-*
• Countries - country codes
• Organizations - GND, VIAF, DBpedia?
• Persons - GND, VIAF, DBpedia?
• Subject headings (Schlagwörter) - GND, LCSH, DDC, UDC, MACS, …
• Resource Typology - many attempts
• many other, more specialized ones
Abbreviations:
AAT - Getty Art & Architecture Thesaurus
DDC - Dewey Decimal Classification
GND - Gemeinsame Normdatei (Deutsche Nationalbibliothek)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File
Requirements / Approach
• concept/entity identified by a PID (a cool URI would do)
• support multilinguality / localization
• plurality of conceptualizations:
allow multiple (conflicting) vocabularies for the same topic
• vocabulary management / curation as a collaborative, ongoing process
• base on Semantic Web compliant formats
• share created vocabularies
• reuse existing sources/services
• thematic sub-communities with selected profiles:
different sub-communities and groups need different vocabularies
• but with harmonized access
= a „one stop shop“ for controlled vocabularies,
good for providers, users and developers
Benefits (and risks) of a harmonized system
• for providers
- simplifies publication of vocabularies
- simplifies reuse of (own) vocabularies in somebody else's tools
- makes it easy to align concepts between vocabularies
• for users
- easy to discover, evaluate and use vocabularies
(less need to construct them yourself)
- new browsing and searching possibilities
- online vocabularies are always up to date
• for tool builders
- no customization for individual vocabularies needed
- reuse of existing tools and modules
• risks
- Babylon scenario: conceptual domains that are too different clash
- overloading the system: the system becomes a single point of failure
- overwhelming the user: too much information (too many vocabularies available)
SKOS – ultra short primer
Simple Knowledge Organization System
http://www.w3.org/TR/skos-reference/
• SKOS knowledge structures consist of Concepts grouped in ConceptSchemes
• Concepts are identified by a URI
• Concepts have labels in 1 or more languages
skos:prefLabel@lang, skos:altLabel@lang => multilinguality
• Concepts can be documented with ‘notes’
• Concepts have mutual semantic relations
broader, narrower, related => taxonomy construction
• Concepts in different ConceptSchemes can be linked by mapping relations
(e.g. skos:exactMatch, skos:closeMatch) => alignment
• Concepts can be part of multiple ConceptSchemes
SKOS – example
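A minimal sketch of the structures listed in the primer, not the example shown on the slide. The `Concept` class, the `to_turtle` helper and all URIs and labels are hypothetical; only the SKOS property names (`skos:prefLabel`, `skos:altLabel`, `skos:broader`, `skos:exactMatch`, `skos:inScheme`) come from the SKOS Reference. The serializer does no escaping, so it is only suitable for simple labels.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str                                          # identifies the concept
    pref_labels: dict                                 # language tag -> preferred label
    alt_labels: dict = field(default_factory=dict)    # language tag -> list of labels
    broader: list = field(default_factory=list)       # URIs of broader concepts
    exact_match: list = field(default_factory=list)   # URIs in other schemes
    in_schemes: list = field(default_factory=list)    # URIs of ConceptSchemes


def to_turtle(concept):
    """Serialize one concept as a Turtle snippet (labels are not escaped)."""
    parts = ["a skos:Concept"]
    parts += [f"skos:inScheme <{s}>" for s in concept.in_schemes]
    parts += [f'skos:prefLabel "{label}"@{lang}'
              for lang, label in sorted(concept.pref_labels.items())]
    for lang, labels in sorted(concept.alt_labels.items()):
        parts += [f'skos:altLabel "{label}"@{lang}' for label in labels]
    parts += [f"skos:broader <{u}>" for u in concept.broader]
    parts += [f"skos:exactMatch <{u}>" for u in concept.exact_match]
    body = " ;\n    ".join(parts)
    return f"<{concept.uri}>\n    {body} .\n"


PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"

# Made-up concept with labels in two languages and one mapping relation.
dialect = Concept(
    uri="http://example.org/vocab/dialect",
    pref_labels={"en": "dialect", "de": "Dialekt"},
    alt_labels={"en": ["regional variety"]},
    broader=["http://example.org/vocab/language-variety"],
    exact_match=["http://example.org/other-vocab/c42"],
    in_schemes=["http://example.org/vocab"],
)

print(PREFIX + "\n" + to_turtle(dialect))
```

The URI is the identity of the concept. The labels are attached to it per language, so adding a translation or an alternative term never changes what other datasets link to.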
OpenSKOS
Vocabulary repository and service openskos.org
• data in SKOS format
• Peer to peer architecture
• RESTful API
• Linked Data
• Publication with upload and OAI-PMH
• Management using Interactive Dashboard
• Support for alignment
• Promotion of open database licenses
• And lately, vocabulary curation with built-in editor
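Publication via OAI-PMH means that any standard harvester can pull the vocabularies page by page. The sketch below shows the protocol mechanics only: the base URL is a placeholder, the sample response is invented, and `oai_dc` is used because every OAI-PMH repository must support it (which SKOS/RDF metadata prefix a given OpenSKOS instance offers is not assumed here). The helper names `list_records_url` and `parse_page` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://vocab.example.org/oai-pmh"  # placeholder, not a real endpoint


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In the protocol, resumptionToken is an exclusive argument: when it is
    given, metadataPrefix must not be sent again.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)


# Invented response, shaped like an OAI-PMH ListRecords page.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:concept-1</identifier></header></record>
    <record><header><identifier>oai:example.org:concept-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
print(list_records_url(BASE_URL, metadata_prefix="oai_dc"))
print(ids, token)
if token:
    print(list_records_url(BASE_URL, resumption_token=token))
```

A harvester repeats the request with the returned token until no token comes back. The same loop would serve for synchronizing datasets between two repository instances.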
OpenSKOS
• developed within the Dutch cultural heritage project CATCHplus
• by a commercial company (Picturae), but open source
• currently 3 instances running: Meertens Institute, NISL, ICLTT (test phase)
(Picturae has another 7 instances running for their customers)
CLAVAS – Vocabulary Service for CLARIN
Adaptation of OpenSKOS for CLARIN purposes = mainly a separate instance with specific data sets
currently > 2.500 entries
bootstrap
ISOcat and CLAVAS
• automatically export all closed + simple data categories
- perhaps even better to select them manually
- not all data categories!
• third-party applications would use
- ISOcat for the explain() function
- CLAVAS for value (/entity) lists (autocomplete)
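The autocomplete use of a value list can be sketched as a prefix search over the labels of a vocabulary. This is an illustration under stated assumptions, not the CLAVAS or OpenSKOS API: the entries, the example.org URIs and the `autocomplete` helper are hypothetical, and a real client would query the vocabulary service instead of an in-memory list.

```python
from bisect import bisect_left

# Hypothetical value list: (label, concept URI) pairs. Two labels may point to
# the same concept (preferred and alternative label).
ENTRIES = [
    ("Meertens Institute", "http://example.org/org/1"),
    ("Meertens Instituut", "http://example.org/org/1"),  # alternative label
    ("ICLTT", "http://example.org/org/2"),
    ("Picturae", "http://example.org/org/3"),
    ("OCLC", "http://example.org/org/4"),
]

# Sort once by case-folded label so that prefix matches are contiguous.
INDEX = sorted((label.casefold(), label, uri) for label, uri in ENTRIES)
KEYS = [key for key, _, _ in INDEX]


def autocomplete(prefix, limit=10):
    """Return up to `limit` (label, uri) pairs whose label starts with prefix."""
    p = prefix.casefold()
    start = bisect_left(KEYS, p)  # first key that is >= the prefix
    hits = []
    for key, label, uri in INDEX[start:]:
        if not key.startswith(p) or len(hits) >= limit:
            break
        hits.append((label, uri))
    return hits


for label, uri in autocomplete("meer"):
    print(label, uri)
```

The application stores the returned URI, not the typed string, so the metadata record ends up pointing at the concept regardless of which label variant the user picked.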
Open issues – next steps
• CLARIN/CLAVAS
short term:
- update the OpenSKOS instance at the Austrian Centre
- test synchronization of datasets via OAI-PMH with the sister instance at Meertens
- continue curation work on Organization names
long term:
- use the Vocabulary Service with other infrastructure components (e.g. metadata editor)
- adopt further vocabularies
- especially work out how to integrate existing large vocabularies/services => proxy?
• DARIAH
- collect candidate vocabularies/topics and the people/groups in need of them
- decide whether to try to use/adopt OpenSKOS, perhaps as a pilot => be bold, step up!
- pin down concrete scenarios (+ outcomes) in which given vocabularies would be employed
- get rid of the NIH (not-invented-here) attitude
• Work out the relation to Semantic Web activities
- transforming data to Linked Data (RDF)
- interlinking vocabularies/ontologies (DBpedia as the LOD pivot)
What do we want/need? Who does what?
Topic / Area | Existing vocabularies | People
Language | |
Organization | CLAVAS-organizations, VIAF, GND, dbpedia |
Resource Type, Format | ? -> Taxonomy of DH Research Activities and Objects? |
Genre / Topic / Subject | LCSH, UDC, DDC, … |
Geographica | Geonames, Getty, dbpedia |
Persons | GND, Getty AAT, dbpedia |
Controlled Vocabularies - Outline
• Overview of related activities in different contexts (DARIAH, CLARIN, Digital Libraries)
• Controlled Vocabularies - potential usages and topic areas
• SKOS – a widely used W3C-standard for “vocabularies”
• OpenSKOS - Vocabulary Repository
• CLAVAS – Vocabulary System for CLARIN
• Next steps – Where do we want to go from here?
Summary discussion
• concentrate on content generation rather than technical development,
in adherence to the general DARIAH strategy
• but also don't reinvent the wheel (data)
=> reuse / mediate / proxy existing vocabularies
however, these are often too broad/general and never complete (VIAF, GND)
we need the possibility to add concepts or, more generally,
to edit/curate vocabularies
• hence a dedicated Vocabulary Repository
allowing collaborative curation of vocabularies
one such system would be OpenSKOS (tried out in CLARIN)
• try to feed back/contribute back to the authority bodies (National Libraries)
this is possible in principle, but it is a slow process
DARIAH could/should try to mediate / push, but this needs to be a separate/parallel track
Summary discussion
• many different topics/areas, but consider the fundamental distinction
between concepts/taxonomies and entities (organizations, persons)
- Vocabulary Repository only for SKOS data
- dedicated tools for entities -> e.g. PDR Persons Data Repository (@BBAW)
• very closely related to the Linked Data and Scholarly Methods Ontology tracks
two (interconnected) levels to work on:
1. Inventory + harmonization
- bring existing vocabularies onto common technical ground
2. Vocabulary (/ontology) alignment
- create links between concepts/entities in different vocabularies/ontologies
- proposal by IEG Mainz (using place types as an example): align vocabularies based on
features of concepts
• Vocabularies most asked for
- Names (Persons and Organizations)
- Places (Geographica)
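One way to read the feature-based alignment proposal mentioned above is as a set-similarity comparison. The sketch below is an assumption for illustration and not the IEG Mainz method: the place types, their features, the `jaccard` and `align` helpers and the threshold are all invented, and Jaccard similarity stands in for whatever measure was actually proposed.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(vocab_a, vocab_b, threshold=0.5):
    """Suggest candidate links between the concepts of two vocabularies.

    Both arguments map a concept id to a set of features; vocab_b must not
    be empty. Returns (id_a, id_b, score) for the best match above threshold.
    """
    links = []
    for id_a, feats_a in vocab_a.items():
        # Pick the most similar concept in the other vocabulary.
        best_id, score = max(
            ((id_b, jaccard(feats_a, feats_b)) for id_b, feats_b in vocab_b.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            links.append((id_a, best_id, round(score, 2)))
    return links


# Invented place types described by invented features.
VOCAB_A = {
    "a:town": {"settlement", "permanent", "market"},
    "a:monastery": {"religious", "building", "community"},
}
VOCAB_B = {
    "b:city": {"settlement", "permanent", "market", "walls"},
    "b:abbey": {"religious", "building", "community"},
    "b:river": {"water", "natural"},
}

for link in align(VOCAB_A, VOCAB_B):
    print(link)
```

The suggested links are candidates for a curator to confirm. In SKOS terms a confirmed link would become a mapping relation between the two concepts.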
Vision