Aligning Thesauri for an integrated Access to Cultural Heritage Collections. Antoine ISAAC (including slides by Frank van Harmelen), STITCH Project. UDC Conference, June 5th, 2007

Page 1: Aligning Thesauri for an integrated Access to Cultural Heritage Collections

Aligning Thesauri for an integrated Access to Cultural Heritage Collections

Antoine ISAAC (including slides by Frank van Harmelen)
STITCH Project

UDC Conference
June 5th, 2007

Page 2

Background

• CATCH
  • Continuous Access To Cultural Heritage
  • Funded by NWO
  • 10 computer science research projects applied to the Cultural Heritage field
• STITCH
  • SemanTic Interoperability To access Cultural Heritage
  • Exchanging and integrating metadata

Beware: this is research!

Page 3

Agenda

• The Semantic Interoperability problem
• Demo
• Semantic Web solutions for interoperability
  • Conceptual vocabulary alignment
  • Conceptual vocabulary representation

Page 4

KB Illustrated Manuscripts

Page 5

KB Illustrated Manuscripts

Page 6

BNF Mandragore

Page 7

BNF Mandragore

Page 8

The Semantic Interoperability Problem

• Trend: simultaneous access to different collections
• Problem: conceptual heterogeneity
  • No standard vocabulary/thesaurus
    • “classical ruins” vs. “landscape with ruins”
    • “the Virgin Mary” vs. “Saint Mary”
  • We don’t really want one: different vocabularies for different domains, traditions, tasks
• Practical consequence:
  • Searching for “the Virgin Mary” misses “Saint Mary”
  • Unless we know both vocabularies

Page 9

Old situation

Page 10

Vocabulary alignment

• Find semantic correspondences between vocabulary elements
  • “classical ruins” ≈ “landscape with ruins”
  • “the Virgin Mary” = “Saint Mary”
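To make the idea concrete, here is a minimal Python sketch of how such correspondences might be used at search time. The `Alignment` record, the relation names `exact` and `close`, and the `expand_query` helper are illustrative assumptions, not part of the STITCH tooling; only the two term pairs come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    source: str    # term from vocabulary 1
    target: str    # term from vocabulary 2
    relation: str  # "exact" for =, "close" for ≈ (assumed names)


# The two correspondences shown on the slide.
ALIGNMENTS = [
    Alignment("classical ruins", "landscape with ruins", "close"),
    Alignment("the Virgin Mary", "Saint Mary", "exact"),
]


def expand_query(term, alignments, include_close=True):
    """Return the query term plus every term aligned to it in the other vocabulary."""
    expanded = [term]
    for a in alignments:
        if a.relation == "close" and not include_close:
            continue
        if a.source == term and a.target not in expanded:
            expanded.append(a.target)
        elif a.target == term and a.source not in expanded:
            expanded.append(a.source)
    return expanded


print(expand_query("the Virgin Mary", ALIGNMENTS))
```

With the alignment in place, a search for “the Virgin Mary” also covers “Saint Mary” without the user knowing the second vocabulary.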

Page 11

New situation

Page 12

Demo

• http://stitch.cs.vu.nl/rp33333/MANDRA-SV-ICE-mandraNewNONE , amphibians

• Wheat

[Screenshots at the end of these slides]

Page 13

Agenda

• The Semantic Interoperability problem
• Demo
• Semantic Web solutions for interoperability
  • Conceptual vocabulary alignment
  • Conceptual vocabulary representation

Page 14

Vocabulary alignment

• Find correspondences between vocabulary elements
  • “klassieke ruïnes” (classical ruins) ≈ “landschap met ruïnes” (landscape with ruins)
  • “maagd Maria” (Virgin Mary) = “Heilige Moeder” (Holy Mother)
• STITCH aim: doing it (semi-)automatically
  • Vocabularies are big
  • They evolve over time
• Using techniques from the Semantic Web research domain
  • Problem comparable to ontology alignment
  • Techniques already investigated there
    • Linguistics, statistics

Page 15

Automatic alignment techniques

• Lexical
• Structural
• Statistical
• Background knowledge

Page 16

Lexical alignment

• Labels of entities, textual definitions

[Figure: example with the labels “tumor”, “brain”, “Long tumor” and “Long”, and the relation “More specific than”]
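A very rough Python sketch of a label-based matcher follows. The rule used here (a label whose words include all the words of another label is taken as more specific) is an assumed heuristic chosen for illustration, not the method used in STITCH; the function names are hypothetical.

```python
import re


def tokens(label):
    """Lower-case a label and split it into a set of word tokens."""
    return set(re.findall(r"\w+", label.lower()))


def lexical_relation(label_a, label_b):
    """Guess how label_a relates to label_b from their words alone."""
    a, b = tokens(label_a), tokens(label_b)
    if not a or not b:
        return None
    if a == b:
        return "equivalent"
    if b < a:
        return "more specific than"  # label_a adds words to label_b
    if a < b:
        return "more general than"
    return None                      # no purely lexical evidence


print(lexical_relation("Long tumor", "tumor"))
```

Real lexical aligners typically add stemming, stop-word handling and string-distance measures; this sketch compares word sets only.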

Page 17

Automatic Alignment Techniques

• Lexical
• Structural
• Statistical
• Background knowledge

Page 18

Statistical alignment

• Object information (e.g. book indexing)

Page 19

Statistical alignment: KB collections

(4951 1152 613) Nederlands (Dutch) - Nederlandse taalkunde (Dutch linguistics)
(280 714 243) Diabetes mellitus - suikerziekte (diabetes)
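The slide does not say what the three numbers mean. The sketch below assumes each triple reads as (books indexed with the first concept, books indexed with the second concept, books indexed with both) and scores the pair with a Jaccard overlap. Both the reading of the counts and the choice of measure are assumptions made for illustration.

```python
def jaccard(count_a, count_b, count_both):
    """Jaccard overlap of two concepts' book sets, computed from set sizes."""
    union = count_a + count_b - count_both
    return count_both / union if union else 0.0


# Counts copied from the slide; their interpretation is assumed.
pairs = {
    ("Nederlands", "Nederlandse taalkunde"): (4951, 1152, 613),
    ("Diabetes mellitus", "suikerziekte"): (280, 714, 243),
}

for (a, b), counts in pairs.items():
    print(f"{a} / {b}: {jaccard(*counts):.3f}")
```

Under that reading, the second pair overlaps far more strongly than the first, which is the kind of signal a statistical aligner would use to propose a correspondence.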

Page 20

Automatic Alignment Techniques

• Lexical
• Structural
• Statistical
• Background knowledge

Page 21

Alignment using shared background knowledge

• Using a shared conceptual reference to find links

[Figure: thesaurus 1 and thesaurus 2, linked through background knowledge]
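One simple way to picture this in Python: anchor the terms of each thesaurus to concepts of the shared reference, then link terms that land on the same concept. The function name, the anchoring dictionaries and the `bk:` identifiers are invented for illustration; how terms get anchored to the background resource is outside this sketch.

```python
from collections import defaultdict


def align_via_background(anchors_1, anchors_2):
    """Link terms of two thesauri that are anchored to the same background concept.

    Each argument maps a thesaurus term to the identifier of the
    background concept it was matched to.
    """
    by_concept = defaultdict(list)
    for term, concept in anchors_2.items():
        by_concept[concept].append(term)

    links = []
    for term, concept in anchors_1.items():
        for other in by_concept.get(concept, []):
            links.append((term, other, concept))
    return links


# Hypothetical anchorings; the "bk:" identifiers are made up.
thesaurus_1 = {"the Virgin Mary": "bk:mary", "classical ruins": "bk:ruins"}
thesaurus_2 = {"Saint Mary": "bk:mary", "wheat": "bk:wheat"}

print(align_via_background(thesaurus_1, thesaurus_2))
```

A richer version could also follow broader/narrower links inside the background resource to derive "more specific than" correspondences rather than only exact ones.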

Page 22

Alignment: no universal solution

• No single technique gives an ideal solution
• Different techniques have to be selected/combined, depending on the application case
  • Poor vs. rich semantic structure
  • Extensive vs. limited lexical coverage
  • Existence of collections described by several vocabularies
• Alignment is a difficult research problem

Page 23

Agenda

• The Semantic Interoperability problem
• Demo
• Semantic Web solutions for interoperability
  • Conceptual vocabulary alignment
  • Conceptual vocabulary representation

Page 24

Representing Vocabularies

Many different models and formats to represent vocabularies

• Need for standard formats to develop standardized tools and methods
  • Alignment process
  • Browsing/information retrieval tools using vocabularies
• Need to represent features commonly used by these tools
  • Especially lexical information and semantic links

Page 25

SKOS (Simple Knowledge Organization System)

• World Wide Web Consortium (W3C)
  • Model to represent simple conceptual vocabularies (thesauri, classification schemes) on the Semantic Web
  • Comparable to Dublin Core, for conceptual vocabularies
• SKOS offers building blocks to create RDF/XML data
  • Concepts and ConceptSchemes
  • Lexical properties (prefLabel, altLabel)
  • Semantic relations (broader, related)
  • Notes (scopeNote, definition)

Page 26

SKOS: Small UDC Example

http://www.udcc.org/udc/class_512
    rdf:type          skos:Concept
    skos:prefLabel    “512”@zxx
    skos:prefLabel    “Algebra”@en
    skos:broader      http://www.udcc.org/udc/class_51

• Beware: this is a standard, not everything can be represented!
  • E.g. for UDC, difficult to represent all types of auxiliaries
  • Is “-2 Evidence of religion” a standard concept?
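The same four statements can be written out as RDF triples. The sketch below builds them in plain Python and prints them in N-Triples syntax, without any RDF library. The SKOS and RDF namespace URIs are the standard W3C ones; the UDC URIs are those shown on the slide, and the helper names are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
UDC = "http://www.udcc.org/udc/"  # base URI as shown on the slide


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Language-tagged literal (no escaping: fine for these simple labels)."""
    return f'"{text}"@{lang}'


triples = [
    (uri(UDC + "class_512"), uri(RDF_TYPE), uri(SKOS + "Concept")),
    (uri(UDC + "class_512"), uri(SKOS + "prefLabel"), literal("512", "zxx")),
    (uri(UDC + "class_512"), uri(SKOS + "prefLabel"), literal("Algebra", "en")),
    (uri(UDC + "class_512"), uri(SKOS + "broader"), uri(UDC + "class_51")),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, object) rows, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

In practice an RDF toolkit would handle escaping, prefixes and other serialisations; the hand-rolled version only shows how little structure the example needs.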

Page 27

Agenda

• The Semantic Interoperability problem
• Demo
• Semantic Web solutions for interoperability
  • Conceptual vocabulary alignment
  • Conceptual vocabulary representation

Page 28

Conclusion: New opportunities for making knowledge accessible

• Integration of collections at the semantic level
  • Semantic integration and vocabulary alignment
• Representation and publication of conceptual vocabularies
  • SKOS is an open, web-compatible standard
• Semantic Web research can help Cultural Heritage
  • Vision: a global network of interconnected collections and vocabularies that can be exploited by standard tools?
  • Or somewhere in-between present situation and the vision

Page 29

Discussion: UDC and Semantic Interoperability?

• UDC as pivot language (spine) for multilingual access
  • Ideal for multilingual scenarios
  • Compatible with common information needs
• “Front-office” scenario
  • Aligning initial vocabularies to UDC
  • Using UDC in the access system
  • MSAC
    • Multilingual Subject Access to Catalogues of National Libraries
    • UDC as a searching/browsing means, with other vocabularies
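A minimal Python sketch of the front-office idea: each local vocabulary is aligned to UDC, and the access system searches by UDC class across all collections. The vocabulary names, terms, alignments and document identifiers below are hypothetical; only the class “512 Algebra” appears in the slides, and this is not a description of how MSAC is implemented.

```python
def search_via_pivot(udc_class, to_udc, index):
    """Return documents whose local subject term is aligned to the given UDC class.

    to_udc maps (vocabulary, term) to a UDC notation; index maps
    (vocabulary, term) to the documents indexed with that term.
    """
    hits = []
    for key, notation in to_udc.items():
        if notation == udc_class:
            hits.extend(index.get(key, []))
    return sorted(hits)


# Hypothetical alignments of two local vocabularies to UDC.
to_udc = {
    ("voc_nl", "algebra"): "512",
    ("voc_fr", "algèbre"): "512",
    ("voc_fr", "géométrie"): "514",
}

# Hypothetical document indexes of the two collections.
index = {
    ("voc_nl", "algebra"): ["nl-001"],
    ("voc_fr", "algèbre"): ["fr-007", "fr-012"],
    ("voc_fr", "géométrie"): ["fr-020"],
}

print(search_via_pivot("512", to_udc, index))
```

Each vocabulary needs only one alignment (to the pivot) instead of one per pair of vocabularies, which is the main appeal of the spine approach.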

Page 30

Discussion: UDC and Semantic Interoperability?

• “Back-office” scenario?
  • UDC as a background resource for automatic pairwise alignment between the initial vocabularies
  • Multilingual information, rich semantic structure
• Both scenarios require more accessible UDC
  • And experimentation…

Page 31

Thanks!

Page 32

Links

• STITCH http://stitch.cs.vu.nl
• Demo collections
  • BNF Mandragore http://mandragore.bnf.fr
  • KB illuminated manuscripts http://www.kb.nl/manuscripts/
• Library-originated integration projects:
  • MSAC search interface http://sigma.nkp.cz
  • MACS project http://macs.cenl.org
• Semantic web links
  • Semantic Web at W3C http://www.w3.org/2001/sw/
  • SKOS http://www.w3.org/2004/02/skos/
• Semantic Web projects dealing with Cultural Heritage
  • MuseumFinland http://www.museosuomi.fi/
  • eCulture http://e-culture.multimedian.nl/

Page 33

Demo (1)

Subject vocabulary, collection 1

Subjects

Page 34

Demo (2)

Hierarchical path from root to selected subject

Possible specialization for selected subject

Page 35

Demo (3)

Document from Collection 2

Semantic alignment of subjects activated

Page 36

Demo (4)

Subject from voc2 aligned to voc1: “amphibians”