Towards a Persistent Identifier Infrastructure for European e-Research Daan Broeder CLARIN / MPG 2008 CNRI Handle System Workshop


Page 1

Towards a Persistent Identifier Infrastructure for European e-Research

Daan Broeder
CLARIN / MPG

2008 CNRI Handle System Workshop

Page 2

Content

• Domain & Scope

• Organizational embedding

• Further requirements

• Services for e-research with PIDs

Page 3

Domain & Scope

• Reliable references & citations of web-accessible resources

• Language resource domain
– Audio & video recordings, pictures, primary texts, annotations
– Lexica, grammar descriptions, …
– Concepts in terminology registries and ontologies
– …

• Number of resources is very big, depending on how you approach the granularity issue

• References and citations
– Embedded in (web) documents
– In data structures
– In DBs
– …

Page 4

CLARIN Common Language Resources and Technology Infrastructure

The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable. As one of its goals CLARIN will create a federation of LR repositories and aims to create a unified resource registry using persistent identifiers.

Page 5

CLARIN Common Language Resources and Technology Infrastructure

• Preparatory phase 2008-2011 (construction phase 2011-2020)

• European dimension (ICT FP7)
– 112 members from 35 countries
– Preparatory phase funded with 4.2 M€

• National dimension
– Funding until now 6.5 M€, more to come
– …

Page 6

DAM-LR Distributed Access Management for Language Resources

• Small (4 partners) European project aimed at federation building in the LR repository domain, 2005-2007

• Unified metadata catalogue

• Identity federation using Shibboleth

• Single resource identifier system for all “published” resources using the Handle System

Note: FP6 construction study. Investigate the applicability of grid technology within the community of language resource repositories.

Page 7

Developed special tools

• Mover

– Updates Handle DB + catalogue

– Updates metadata XML files*

• Restore operations

– Recreate the Handle DB (and others) from scratch

Lessons learned

– Federation technology is not for all organizations

DAM-LR HS infrastructure

[Diagram: the Lund, MPI and INL archives, each holding resources (R). Handle server labels: primary 1839, primary 10050, primary 10032, sec. 10050, sec. 10032, sec. 1839.]

Page 8

User benefits

Page 9

MPG: Max Planck Society

• Proposal within the MPG to support an MPG-wide PID registration service based on the HS.

• Run by MPG computing center GWDG

• Will also give support for non-MPG German scientific organizations and (hopefully) CLARIN.

Page 10

Requirements

• (Political) Independence: European GHR mirror & proxy + no single point of failure

• Wide(r) acceptance of PID scheme

• Support for object part addressing, from ISO TC37/SC4 CITER work.

• Support for (secure) management of resource copies

Page 11

CLARIN PID Infrastructure

[Diagram labels: proxy; GHR mirror; PID registration service (MPG/CLARIN@GWDG); MPI archive (Class A); ".. Archive" (Class C); resources (R); primary 1839; primary 1111; sec. 1839; sec. …; sec. …; example handles 1839/R1 and 1111/R5.]

Page 12

PID Scheme

• Difficult to gain acceptance
– Without PID syntax being “official”
– W3C seems to have problems with anything else but HTTP (see recent XRI events)

• Can the HS user community help?

• Possibly only acceptance via urlified handles: http://hdl.handle.net/1039/R5

• Perhaps follow ARK for elegance:
– http://hdl.handle.net/hdl:/1039/R5
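
Both URL forms on this slide can be built mechanically from a bare handle. The following is a minimal Python sketch, assuming the proxy base URL as written on the slide. The function names are illustrative only, and the `hdl:/` variant is the ARK-inspired proposal made here, not a form the proxy is known to accept.

```python
from urllib.parse import quote

PROXY = "http://hdl.handle.net/"  # proxy base URL as written on the slide


def urlify(handle: str) -> str:
    """Return the plain proxy URL for a handle such as '1039/R5'."""
    # Keep '/' unescaped: it separates the prefix from the suffix.
    return PROXY + quote(handle, safe="/")


def urlify_ark_style(handle: str) -> str:
    """Return the proposed ARK-like form that keeps an 'hdl:' label in the URL."""
    # Assumption: this form is only a proposal from the slide.
    return PROXY + "hdl:/" + quote(handle, safe="/")


print(urlify("1039/R5"))
print(urlify_ark_style("1039/R5"))
```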

Page 13

PIDs & Resource Parts

• Wasteful to issue a PID for each part (think of 100k entries in a lexicon). So use part identifiers.

• The resolver can make an adequate translation “A#z” -> “objectA?part=z”. This requires enough flexibility from the resolver to accommodate the object server.

• The syntax of “z” should be standard for the specific data type. Borrow from existing fragment identifier syntax standards.

[Diagram: object A with parts x, y and z. One handle per part: 1839/A, 1839/x, 1839/y, 1839/z. With part identifiers: 1839/A plus 1839/A#x, 1839/A#y, 1839/A#z. The PID resolver maps 1839/A to http://oserver/objectA and 1839/A#z to http://oserver/objectA?part=z on the object server.]
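
The translation from “A#z” to “objectA?part=z” could look roughly like the Python sketch below. It assumes a hypothetical in-memory handle table and an object server that takes the part as a `part` query parameter, as in the slide's example URL. A real resolver would need a per-server translation rule.

```python
from urllib.parse import quote

# Hypothetical handle table: handle -> URL of the whole object.
HANDLE_TABLE = {"1839/A": "http://oserver/objectA"}


def resolve(pid: str) -> str:
    """Resolve 'handle' or 'handle#part' to a URL on the object server."""
    handle, sep, part = pid.partition("#")
    base_url = HANDLE_TABLE[handle]  # KeyError if the handle is unknown
    if not sep or not part:
        return base_url
    # Assumed object-server convention: the part is passed as '?part=<id>'.
    return f"{base_url}?part={quote(part, safe='')}"


print(resolve("1839/A"))
print(resolve("1839/A#z"))
```

Only 1839/A has to be registered. The part identifier never reaches the handle database, which is what keeps a 100k-entry lexicon down to one handle.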

Page 14

Resource duplicates

• What if MPI moves the resource copy?

• MPI would need write access to the Lund handle record

• This would enable MPI to change the Lund URL value too!

[Diagram: resource R in the Lund archive (primary 10050) with a copy in the MPI archive (primary 1839). The handle 10050/R points to http://lund/lund_url and to http://mpi/mpi_url. Labels on the update path: move, MPI manager, access monitor, LHS.]
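
One way to picture the access problem is a check that grants write access per value rather than per handle record. The Python sketch below is purely hypothetical: the class names, the `writers` field and the requester names are invented for illustration, and it does not describe the Handle System's actual permission model.

```python
from dataclasses import dataclass, field


@dataclass
class UrlValue:
    index: int
    url: str
    writers: set = field(default_factory=set)  # who may change this value


@dataclass
class HandleRecord:
    handle: str
    values: dict = field(default_factory=dict)  # index -> UrlValue

    def update_url(self, index: int, new_url: str, requester: str) -> None:
        """Change one value, but only if the requester may write that value."""
        value = self.values[index]
        if requester not in value.writers:
            raise PermissionError(f"{requester} may not change value {index}")
        value.url = new_url


record = HandleRecord("10050/R")
record.values[1] = UrlValue(1, "http://lund/lund_url", {"lund"})
record.values[2] = UrlValue(2, "http://mpi/mpi_url", {"lund", "mpi"})

# MPI moves its copy: allowed, because "mpi" may write value 2.
record.update_url(2, "http://mpi/new_mpi_url", requester="mpi")
```

With such a check MPI can keep its own copy's URL current without being able to touch the Lund URL.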

Page 15

Resource duplicates

Indirect handles*

• TYPE = URL
– IE-Plugin: ok
– HS proxy: not ok

• TYPE = HS_ALIAS (problem*)
– IE-Plugin: ok
– HS proxy: ok

• Status of the 1839/Rcpy handle?
– Use in documents?

[Diagram: resource R in the Lund archive (primary 10050) with a copy in the MPI archive (primary 1839). 10050/R -> http://lund/lund_url and -> hdl:1839/Rcpy; 1839/Rcpy -> http://mpi/mpi_url. Labels: move, MPI manager.]
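
The indirection in this slide can be sketched as a resolver that follows `hdl:` references until it reaches plain URLs. This Python example uses a hypothetical in-memory table that mirrors the diagram. It does not model the TYPE = URL versus TYPE = HS_ALIAS distinction or the behaviour of the IE plug-in and proxy.

```python
# Hypothetical handle table mirroring the slide: a value is either a URL
# or an 'hdl:' reference to another handle.
HANDLES = {
    "10050/R": ["http://lund/lund_url", "hdl:1839/Rcpy"],
    "1839/Rcpy": ["http://mpi/mpi_url"],
}


def resolve_all(handle, _seen=None):
    """Return every URL reachable from a handle, following 'hdl:' references."""
    seen = set() if _seen is None else _seen
    if handle in seen:  # guard against reference loops
        return []
    seen.add(handle)
    urls = []
    for value in HANDLES.get(handle, []):
        if value.startswith("hdl:"):
            urls.extend(resolve_all(value[len("hdl:"):], seen))
        else:
            urls.append(value)
    return urls


print(resolve_all("10050/R"))
```

When MPI moves its copy, only the 1839/Rcpy entry changes, so MPI needs no write access to the Lund record.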

Page 16

Possible Added PID Services

• Establishing resource authenticity

• Resource Collection Registration

• Resource Citation Information

• Lost Resource Detective

• …

Page 17

Collection Registration Service

• Much scientific work depends on seemingly “accidental” distributed collections of material that have no independent embodiment.

• Needs to be citable with one single PID
– Encode the collection’s resource URIs directly in a handle record
– Attach a link to a map of the collection’s URIs

• Compare recent “Aggregation Map” concept from ORE
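
The two options for a collection PID can be compared side by side in a small Python sketch. The record layout, the collection handles and the map URL are all hypothetical. The map is assumed to be a JSON list of URIs, and the caller supplies the function that fetches it.

```python
import json

# Option 1: the member URIs are stored directly in the collection's record.
embedded = {
    "handle": "1839/C1",  # hypothetical collection PID
    "members": ["hdl:1839/R1", "hdl:1111/R5"],
}

# Option 2: the record only links to an external map listing the members.
linked = {
    "handle": "1839/C2",  # hypothetical collection PID
    "map_url": "http://example.org/collections/C2.json",  # placeholder URL
}


def members(record, fetch_map):
    """Return the member URIs of a collection record of either kind.

    fetch_map is a caller-supplied function that returns the map document
    (a JSON list of URIs, by assumption) for a given URL.
    """
    if "members" in record:
        return list(record["members"])
    return json.loads(fetch_map(record["map_url"]))
```

Embedding keeps resolution to a single lookup but makes the handle record grow with the collection. Linking keeps the record small at the cost of a second fetch.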

Page 18

Citation Information Service

• (Collections of) resources need to be cited in documents. Acknowledgement & credit are also important for primary scientific data, e.g. “Dutch Spoken Corpus, © Institute for Dutch Lexicography, ….”

• Make this citation information part of the metadata associated with the PID.
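
If citation information travels with the PID, a client can assemble a citation string without consulting the repository. The Python sketch below is illustrative only: the field names and the handle are made up, and the output format is one possible choice, not the format proposed in the talk.

```python
# Hypothetical metadata stored alongside a PID; the field names are
# illustrative and the handle is made up.
record = {
    "handle": "1839/EXAMPLE",
    "title": "Dutch Spoken Corpus",
    "rights_holder": "Institute for Dutch Lexicography",
}


def citation(rec, proxy="http://hdl.handle.net/"):
    """Build a one-line citation from the metadata attached to a PID."""
    return f'{rec["title"]}, © {rec["rights_holder"]}, {proxy}{rec["handle"]}'


print(citation(record))
```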

Page 19

Lost Resource Detective

Establishing Provenance

• If by accident the handle <-> URI mapping was not properly maintained, special metadata could be available from the handle record to establish the resource’s location or find a copy.
– URI history, repository, depositor, …

• Labor intensive

• Only for a limited number of resources, unless there is a pattern
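
The “unless there is a pattern” case can be automated: if many stale URLs share a prefix that moved as a whole, one rewrite rule recovers them all. This Python sketch assumes the last known URL is available from the URI history. The function name, its parameters and the reachability check are hypothetical.

```python
def recover_locations(stale, old_base, new_base, exists):
    """Propose new URLs for stale handle values that share a URL pattern.

    stale:    dict of handle -> last known URL (from the URI history)
    old_base: URL prefix the lost resources had in common
    new_base: prefix they are assumed to have moved to
    exists:   caller-supplied check that a candidate URL is reachable
    """
    recovered = {}
    for handle, url in stale.items():
        if not url.startswith(old_base):
            continue  # no pattern: left for manual detective work
        candidate = new_base + url[len(old_base):]
        if exists(candidate):
            recovered[handle] = candidate
    return recovered
```

Handles that do not match the pattern, or whose candidate URL is not reachable, stay unresolved and fall back to the labor-intensive manual route.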

Page 20

The End

Page 21

DAM-LR HS infrastructure

Integration
• It should be an optional extension
• Make sure the HS is not a single point of failure (SPF)
• IMDI/LAT software also functions without the HS

Issue handles for objects
• Only for local resources

Need special tools
• Mover
– Updates Handle DB + catalogue
– Updates IMDI XML files*
• Restore operations
– Recreate the Handle DB (and others) from scratch

[Diagram labels: LHS, Handle DB, catalogue, mover, sync, LAT web apps, IMDI harvester; example entries “MPI1001# mpi_url” and “1839/087-D mpi_url”. The remaining labels (C, S) are not recoverable.]
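
The mover and restore tools can be sketched as a toy model in Python. It assumes the catalogue is the source of truth from which the Handle DB is rebuilt. The slides do not say what the restore operation actually reads, and the class and method names are invented for illustration.

```python
class Archive:
    """Toy model: a catalogue plus a handle DB that must stay in sync."""

    def __init__(self):
        self.catalogue = {}  # handle -> URL, assumed to be the source of truth
        self.handle_db = {}  # handle -> URL, served by the local handle server

    def register(self, handle, url):
        """Issue a handle for a local resource."""
        self.catalogue[handle] = url
        self.handle_db[handle] = url

    def move(self, handle, new_url):
        """Mover: update the catalogue and the handle DB in one step."""
        if handle not in self.catalogue:
            raise KeyError(handle)
        self.catalogue[handle] = new_url
        self.handle_db[handle] = new_url

    def restore_handle_db(self):
        """Restore: recreate the handle DB from scratch out of the catalogue."""
        self.handle_db = dict(self.catalogue)
```

Because the Handle DB can always be rebuilt, losing it does not lose information, which fits the requirement that the HS must not become a single point of failure.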