
Requirements for Long-Term Preservation

David Giaretta, 1st October 2009, Helsinki

Digital Preservation…

Easy to do… …as long as you can provide money forever

Easy to test claims about repositories… …as long as you live a long time

Digital Preservation activities

• Infrastructure
• Information about users and practices
• ISO standard: OAIS
• ISO standard: OAIS update
• ISO standards: Audit and Certification
• Tools
• Relationship to related work and community practices

Alliance for Permanent Access

• The Alliance aims to develop a shared vision and framework for a sustainable organisational infrastructure for permanent access to scientific information

The British Library
European Organization for Nuclear Research [CERN]
CSC - IT Center for Science
Delegation of the Finnish Academies of Science and Letters
Deutsche Nationalbibliothek
Digital Preservation Coalition
European Science Foundation [ESF]
European Space Agency [ESA]
Helmholtz-Gemeinschaft Deutscher Forschungszentren
International Association of Scientific, Technical & Medical Publishers
Joint Information Systems Committee [JISC]
Koninklijke Bibliotheek
Max-Planck-Gesellschaft
NESTOR Kompetenznetzwerk
Nationale Coalitie Digitale Duurzaamheid [NCDD]
Portico
Science & Technology Facilities Council [STFC]

http://www.alliancepermanentaccess.org/

PARSE.Insight

Preservation is a Social activity

Sometimes our activities are personal: “preserve for your future self” [Australia]

In the short term for re-use by colleagues and other people

In the long term for re-use by future generations

Neeri 2009, 1-2 Oct 2009, Helsinki

Definitions (OAIS)

Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term.

Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.


Not just BIT preservation

Not just rendering

Information not just DATA or Documents

Authenticity

Things change/disappear

• Software
• Hardware
• Environment, e.g. network links to related information
• People
• What is “common knowledge”

How can we ensure that the information trapped in the “bits” remains understandable despite all these changes?

Just Format?

sfqsftfoubujpo jogpsnbujpo svmft

representation information rules
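The scrambled string on this slide appears to be "representation information rules" with every letter moved one place forward in the alphabet. That reading is inferred from the two strings themselves; the slide does not state the rule. A minimal Python sketch, using a hypothetical helper `shift_letters`, shows the point: the characters are perfectly readable as text, but they only become information once the rule (the Representation Information) is supplied.

```python
import string


def shift_letters(text: str, offset: int) -> str:
    """Shift each lowercase ASCII letter by `offset` places, wrapping around.

    Characters outside a-z (here, the spaces) are left unchanged.
    """
    alphabet = string.ascii_lowercase
    k = offset % 26
    shifted = alphabet[k:] + alphabet[:k]
    return text.translate(str.maketrans(alphabet, shifted))


encoded = "sfqsftfoubujpo jogpsnbujpo svmft"

# Assumed rule: each letter was shifted forward by one, so shift back by one.
decoded = shift_letters(encoded, -1)
print(decoded)
```

Without the rule, a format identifier would still report "plain text" for both strings, which is why format alone is not enough.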

You have a file

JHOVE tells you it is WORD version 7

Format – necessary but not sufficient:

formats can be used for multiple purposes e.g. audio files used to store configuration parameters

XML enough?

<family>
  <father>John</father>
  <mother>Mary</mother>
  <son>Paul</son>
</family>
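A short Python sketch with the standard-library `xml.etree.ElementTree` parser illustrates the question this slide raises. The fragment parses cleanly, yet the parser only recovers tag names and text; what "father" or "son" mean, and how the three people relate, is not stated anywhere in the XML. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

fragment = (
    "<family>"
    "<father>John</father>"
    "<mother>Mary</mother>"
    "<son>Paul</son>"
    "</family>"
)

root = ET.fromstring(fragment)

# Well-formed XML gives us structure: a list of (tag, text) pairs.
members = [(child.tag, child.text) for child in root]
print(members)

# Nothing above says that <son> means "son of John and Mary", or in which
# language the tag names should be read. Those rules have to come from
# additional Representation Information outside the file.
```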

<VOTABLE version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
</FIELD>
<FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
<DATA><FITS><STREAM encoding='base64'>
U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBmb3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAgICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAvIE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
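The base64 stream inside the VOTable fragment is a good illustration of why "it is XML" does not settle how to read the data. As a hedged sketch, the Python below decodes only the opening characters of that stream with the standard `base64` module. The prefix is written out by hand and assumed to match the start of the stream on the slide; the result looks like the first card of a FITS header, so a second, non-XML format is nested inside the XML.

```python
import base64

# Opening characters of the embedded stream (assumed to match the slide).
# "ICAg" is base64 for three spaces; six repeats encode 18 spaces of padding.
prefix = "U0lNUExFICA9" + "ICAg" * 6 + "ICBUIC8gU3RhbmRhcmQgRklUUyBmb3JtYXQg"

header_start = base64.b64decode(prefix).decode("ascii")
print(repr(header_start))

# Split the decoded card into keyword, value and comment.
keyword, _, rest = header_start.partition("=")
value, _, comment = rest.partition("/")
print(keyword.strip(), "|", value.strip(), "|", comment.strip())
```

An XML parser stops at the `<STREAM>` element; interpreting what is inside needs Representation Information for base64 and then for FITS.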

Data…

Level 2 GOME Satellite instrument data

Complex container objects


Key OAIS Concepts

• Claiming “This is being preserved” is untestable
  – Essentially meaningless
  – Except “BIT PRESERVATION”
• How can we make it testable?
  – Claim to be able to continue to “do something” with it
  – Understand/use
  – Need Representation Information
• Still meaningless…
  – Things are too interrelated
  – Representation Information potentially unlimited
  – Designated Community
• Many other concepts identified
  – Finer grained taxonomy than simply saying “Metadata”
  – Allows one to ask if one has all the required types
• Available from: http://public.ccsds.org/publications/archive/650x0b1.pdf

Representation Information

The Information Model is key

Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY

(this knowledge will change over time and region)

OAIS Archival Information Package (AIP)


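[Figure: OAIS Archival Information Package diagram. Recoverable labels: Archival Package; Content; Package; Packaging; Data Object; Physical Object; Digital Object; Bit; Structure; Reference; Other; Provenance; Context; Fixity; Access Rights. Recoverable relationships: further described by; derived from; described by; delimited by; interpreted using; adds meaning to. Multiplicities shown: 1, 1...*.]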
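The parts of the AIP named on this slide can be sketched as a small data model. The Python below is an illustration only: class names follow OAIS terminology, but the field choices, the helper names `make_aip` and `fixity_ok`, and the use of a SHA-256 digest as the fixity mechanism are assumptions made for the example, not something the slide specifies.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepresentationInformation:
    description: str
    # RepInfo can itself be "interpreted using" further RepInfo.
    interpreted_using: List["RepresentationInformation"] = field(default_factory=list)


@dataclass
class ContentInformation:
    data_object: bytes
    representation_information: List[RepresentationInformation]


@dataclass
class PreservationDescriptionInformation:
    provenance: str
    context: str
    reference: str
    access_rights: str
    fixity_sha256: str  # assumed fixity mechanism: a SHA-256 digest


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation


def make_aip(data, rep_info, provenance, context, reference, access_rights):
    """Bundle a data object with its RepInfo and PDI, recording a digest."""
    pdi = PreservationDescriptionInformation(
        provenance=provenance,
        context=context,
        reference=reference,
        access_rights=access_rights,
        fixity_sha256=hashlib.sha256(data).hexdigest(),
    )
    return ArchivalInformationPackage(ContentInformation(data, rep_info), pdi)


def fixity_ok(aip):
    """Check that the stored bits still match the recorded digest."""
    return hashlib.sha256(aip.content.data_object).hexdigest() == aip.pdi.fixity_sha256


# Hypothetical example values.
fits_spec = RepresentationInformation("FITS standard")
aip = make_aip(
    data=b"example observation bytes",
    rep_info=[RepresentationInformation("FITS file", [fits_spec])],
    provenance="created by instrument X, processed by pipeline Y",
    context="part of survey Z",
    reference="pid:example/0001",
    access_rights="open",
)
print(fixity_ok(aip))
```

Fixity only covers bit preservation; the nested `interpreted_using` list is what carries the "not just bits" part of the model.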

Representation Information Network


Preservation and Re-use: Unfamiliar information

• Preservation
  – Digitally encoded information which must be usable and understandable
  – Unfamiliar because of separation in time
• E-Science/GRID/CyberInfrastructure for data
  – Digitally encoded information which must be usable and understandable
  – Unfamiliar because of separation in discipline or location, even if created yesterday
• Support automated usage where possible

[Figure labels: Rep Info; /DISCIPLINE; Virtualisation]

Insight: stakeholders

Research
• Research institutes (non-profit)
• Universities
• Academic libraries

Data management (preservation)
• Data centres (profit / non-profit)
• Libraries
• Archives

Funding/policy
• National funding organisations
• European funding
• Corporate funding

Publishing
• General (cross-community) publishers
• Specific (community) publishers

Surveys to stakeholders

Research: Elsevier mailing list (35,000 people), ESF, MCFA, Eurodoc, ALLEA, YEAR, Digital Humanities Observatory, etc.

Data management (preservation): LIBER, DPE, DPC, NCDD, DCC, D-lib Magazine, PADI, JISC mailing lists, CASPAR, Planets, etc.

Funding/policy: ESF, Alliance for Permanent Access, national funding agencies

Publishing: International Association of STM publishers, Directory of Open Access Journals (DOAJ)

Surveys to stakeholders

Research

1397 responses

Data management (preservation)

273 responses

Funding/policy

< responses

Publishing

186 responses

Threats to preservation

1. Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.

2. Lack of sustainable hardware, software or support of computer environment may make the information inaccessible.

3. Evidence may be lost because the origin and authenticity of the data may be uncertain.

4. Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future.

5. Loss of ability to identify the location of data.

6. The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future.

7. The ones we trust to look after the digital holdings may let us down.

Threats to preservation (R)

The ones we trust to look after the digital holdings may let us down

The current custodian of the data may cease to exist

Loss of ability to identify the location of data

Access and use restrictions may not be respected in the future

Evidence may be lost

Lack of sustainable hardware/software

Users may be unable to understand or use the data

Threats and requirements for solutions

Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
Requirement for solution: Ability to create and maintain adequate Representation Information

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment

Threat: Loss of ability to identify the location of data
Requirement for solution: An ID resolver which is really persistent

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation

Threat: The ones we trust to look after the digital holdings may let us down
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
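One of the requirements listed here, "an ID resolver which is really persistent", can be illustrated with a toy sketch. The idea is indirection: users cite a stable identifier, and only the resolver's record changes when data moves to a new custodian. The class `PersistentIdResolver`, the identifier string and the `.example` URLs below are all hypothetical; a real resolver would also need organisational guarantees that no code can provide.

```python
class PersistentIdResolver:
    """Maps stable identifiers to whatever location currently holds the data."""

    def __init__(self):
        # pid -> list of locations, oldest first; the last entry is current.
        self._locations = {}

    def register(self, pid, location):
        if pid in self._locations:
            raise ValueError(f"{pid} is already registered")
        self._locations[pid] = [location]

    def relocate(self, pid, new_location):
        """Record a hand-over to a new custodian without changing the pid."""
        self._locations[pid].append(new_location)

    def resolve(self, pid):
        return self._locations[pid][-1]

    def history(self, pid):
        return list(self._locations[pid])


resolver = PersistentIdResolver()
resolver.register("pid:example/dataset-1", "https://archive-a.example/dataset-1")

# The original custodian ceases to exist; another archive takes over the data.
resolver.relocate("pid:example/dataset-1", "https://archive-b.example/dataset-1")

print(resolver.resolve("pid:example/dataset-1"))
```

Keeping the location history, rather than overwriting it, also leaves a small trace of provenance for the hand-over.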

FUTURE

• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us down

Roadmap

PARSE.Insight produced draft Preservation Infrastructure Roadmap

Now a SCIENCE DATA INFRASTRUCTURE ROADMAP after consultation with EU

Infrastructures for preservation

• Social / Legal / Financial / Organisational
• Agreements / Trust / Standards
• Costs / Benefits / Rewards
• Technical components

Lessons from other Infrastructures

• Need to “grow”, “encourage”, “foster” rather than “build”
• Include organisational, financial, legal & marketing
• Provide services rather than specific technologies
• Tackle “choke points”
• Various phases of development

Encouraging Organisational and Social change

Policies: mandates for depositing research data and funding agencies' requirements:

• Robust and reliable deposit places, where researchers can be sure their data will not get lost, be corrupted or misused, with correct access-rights mechanisms.
• Elements that increase comfort levels so that new users will know how to use and interpret the available data.
• Communication and awareness around these issues.
• Have publication of data as valued and as referenceable as the publication of a paper in a journal.

Repository Audit and Certification

• Standard for certification in OAIS Roadmap
• Initial work produced TRAC
• Now an official CCSDS Working Group
• Open virtual meetings, notes and documents: http://www.digitalrepositoryauditandcertification.org
• Draft standard submitted to CCSDS/ISO to form the basis of an international audit and certification process


CASPAR Consortium

http://www.casparpreserves.eu

EU FP6 Integrated Project

Total spend approx. 16MEuro (8.8 MEuro from EU)

Started April 2006, for 42 months

http://developers.casparpreserves.eu:8080

Preservation Data Flows and Strategies

More strategies than just “emulate or transform”

Creating an OAIS Archival Information Package

Modules and Dependencies: defining the Designated Community

[Figure: dependency graph of modules. Recoverable labels: README.txt; TEXT EDITOR; ENGLISH LANGUAGE; WINDOWS XP; FITS FILE; FITS STANDARD; PDF STANDARD; FITS JAVA s/w; JAVA VM; PDF s/w; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; MULTIMEDIA PERFORMANCE DATA; C3D; DirectX; MAX/MSP; 3D motion data files; 3D scene data files; motion to music mapping strategy.]

Modules and Dependencies: Examples (Semantic Web data)

[Figure: RDF/S modules and dependencies. Recoverable labels: ns1; ns2; ns3; ns4.]

Scenario: Intelligibility-aware Packaging

[Figure: dependency graph with objects and profiles. Recoverable labels: FITS; FITS STANDARD; PDF STANDARD; FITS DICTIONARY; DICTIONARY SPECIFICATION; UNICODE SPECIFICATION; XML SPECIFICATION; C3D; DirectX; MAX/MSP; ZIP; o1; o2; o3; P1; P2; P3.]

• Gap(o2,P1) =
• Gap(o2,P2) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION}
• Gap(o2,P3) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION, PDF_STANDARD, XML_SPECIFICATION, UNICODE_SPECIFICATION}
• Gap(o3,P3) = {ZIP}
• Gap(o3, ) = {ZIP, C3D, DirectX, MAX/MSP}
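The Gap values on this slide can be reproduced with a small graph walk: start from the modules an object needs, follow dependencies, and stop wherever the community profile already knows a module. The Python sketch below is an assumption-heavy illustration. The dependency edges, the contents of profiles P1 to P3 and the needs of o2 and o3 are all hypothetical, chosen so that the results agree with the sets listed on the slide; the original diagram's edges did not survive in the transcript. The right-hand side of Gap(o2,P1) is missing from the transcript, so the empty set printed for it is only a consequence of the assumed profile.

```python
# Hypothetical dependency edges: module -> modules needed to understand it.
DEPENDS_ON = {
    "FITS": {"FITS_STANDARD", "FITS_DICTIONARY"},
    "FITS_STANDARD": {"PDF_STANDARD"},
    "FITS_DICTIONARY": {"DICTIONARY_SPECIFICATION"},
    "DICTIONARY_SPECIFICATION": {"XML_SPECIFICATION"},
    "XML_SPECIFICATION": {"UNICODE_SPECIFICATION"},
}

# Hypothetical needs of the objects and knowledge of the community profiles.
NEEDS = {
    "o2": {"FITS"},
    "o3": {"ZIP", "C3D", "DirectX", "MAX/MSP"},
}
PROFILES = {
    "P1": {"FITS"},
    "P2": {"PDF_STANDARD", "XML_SPECIFICATION", "UNICODE_SPECIFICATION"},
    "P3": {"C3D", "DirectX", "MAX/MSP"},
    "nobody": set(),
}


def gap(required, known):
    """Return the modules that must be supplied for `required` to be usable.

    The walk stops at any module in `known`: recursion ends at the
    knowledge base of the designated community.
    """
    missing = set()
    stack = list(required)
    while stack:
        module = stack.pop()
        if module in known or module in missing:
            continue
        missing.add(module)
        stack.extend(DEPENDS_ON.get(module, ()))
    return missing


for obj, profile in [("o2", "P1"), ("o2", "P2"), ("o2", "P3"), ("o3", "P3"), ("o3", "nobody")]:
    print(f"Gap({obj},{profile}) =", sorted(gap(NEEDS[obj], PROFILES[profile])))
```

An intelligibility-aware packager could then add exactly the modules in the gap to the package for a given community, rather than everything or nothing.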

Example: Identification of an Attribution Right

[Figure: rights example linking Work's Provenance, Legislation and Rights, drawn with CIDOC-CRM, FRBRoo and a Rights Ontology. Recoverable classes and instances: E39. Actor (Kia Ng); Activity of Improvisation on the Violin; Expression of the Improvisation on the Violin; CR20. Perform (Singleton); CR51. Attribution_Right (Singleton); LF1. Written_Norm (Art. X of Law Y); Kia's right to claim authorship; E72. Legal Object; F22. Self_contained_Expression; E7. Activity; F28. Expression_Creation; E30. Right; CR. Ownership Right; E7. Activity (Kia claiming authorship); CR. Activity_Type (To claim authorship). Recoverable properties: has_type; generates; is_documented_in; became_owner_of; is_on; created; carried_out; allows; performed_by; has_right_type. Annotations: Derived Property; 100% recall, <100% precision; 100% precision.]

Thanks to MetaWare

Provenance: Performing Arts

Thanks to ULeeds and CNRS

Authenticity


Threats and CASPAR components

Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
CASPAR component: RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
CASPAR component: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
CASPAR component: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
CASPAR component: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.

Threat: Loss of ability to identify the location of data
CASPAR component: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
CASPAR component: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down
CASPAR component: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.

Conclusions

• Preservation
  – Is a complex process
  – Involves more than just bits and formats
  – Metadata is too vague a term
• Transparency is vital
  – What is being preserved
  – For whom
  – For how long
• OAIS is a good basis for preservation
• Recursion is an important concept in preservation
• Preservation threats must be countered by specific tools and shared infrastructure components

Additional links

• CASPAR: www.casparpreserves.eu
• PARSE.Insight: www.parse-insight.eu
• Alliance for Permanent Access: www.alliancepermanentaccess.eu
• Digital Curation Centre: www.dcc.ac.uk
• Audit and certification: wiki.digitalrepositoryauditandcertification.org
• OAIS: http://public.ccsds.org/publications/archive/650x0b1.pdf and http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx

END

Summary

What is digital preservation?

Transparency

What is needed for digital preservation?

• Many strategies
  – Need to be clear about the scope of each
    • Document/rendered object?
    • Scientific data: processed/combined to produce new results?
    • Other?
  – How are all of the threats being addressed?
• What exactly is being preserved?
• For whom is it being preserved?
  – Designated Community must be specified
  – Testability through understandability/usability
• How will it be handed on to future custodians?

Umbrella framework

• Need to integrate in some sense many different
  – Systems
  – Disciplines
  – Funding
  – Requirements
• Projects producing preservation artefacts
  – Representation Information
  – Significant Properties
  – Provenance
  – etc.

About researchers

EU 44%, USA 33%, Other 23%

Per category

Data spectrum (R)

Cross-disciplinary use of research data

Sharing of data (R): Did you ever need digital research data gathered by other researchers that was not available?

Sharing of data (R): Do you presently make use of research data gathered by other researchers?

Sharing of data (R): Would you like to make use of research data gathered by other researchers?

Within discipline / Outside discipline

Sharing of data (R): How open is your data?

Sharing of data (R): Which constraints do you see in making data open?

Sharing of data (R): How do you locate and access digital research data?

Linking of data (R): As a researcher, do you think it is useful to link underlying research to formal literature?

Linking of data (P): Do you link references in your journals to underlying digital research data?

Linking of data (P): Do you as a publisher charge separate fees when users want to access data associated with publications?

Linking of data (P): Can authors submit their underlying digital research data with their publication to the publisher?

About funding

Who should pay for data preservation?

• Researchers say: Government (national funding)
• Data managers say: Government (national funding)
• Publishers say: Government (national funding)

Who should pay for preservation of publications?

• Researchers say: Government (national funding)
• Data managers say: Government (national funding)
• Publishers say: Government (national funding)

Who should pay? (P): For preservation of other research output
