Manfred Thaller Universität zu Köln [email protected] Preserving for 2016, 2106, 3006 Or Is there a life for an object outside a digital library? DELOS International Summer School 2006 San Miniato 4-9 June 2006


Manfred Thaller
Universität zu Köln

[email protected]

Preserving for 2016, 2106, 3006
Or
Is there a life for an object outside a digital library?

DELOS International Summer School 2006
San Miniato
4-9 June 2006

DELOS Int SS2006, San Miniato, © Manfred Thaller, Universität zu Köln 2


A persistent object

•Authenticity

•Integrity

•Metadata

•Context

•Easily usable

•Discussable

•No

•No

•No

•1799 - 1821


Persistent until: 2016

•No major breakdown of civil society.

•'Library' system continues to function without serious interruption.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 2016

Assumption therefore: Persistency is a function of the overall system.


Persistent until: 2106

•No major breakdown of civil society.

•'Library' system continues to function without serious interruption.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 2106

•No major breakdown of civil society.

•'Library' changes functional assumptions without serious interruption of service.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 2106

•No major breakdown of civil society.

•'Library' changes functional assumptions without serious interruption of service.

•Fundamental changes in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 2106

•No major breakdown of civil society.

•'Library' changes functional assumptions without serious interruption of service.

•Fundamental changes in underlying technology.

•Major 'holes' in the relevant WWW.


Persistent until: 2106

Assumption: Persistent storage media "around the corner". (Holographic storage, storage crystals.)

Question: Can a digital object be revived in 2106 if the library does not care for it after 2016?


Persistent until: 2106

Why not?

•Bit stream deterioration.

•Authenticity not guaranteed.

•Meta data get lost.

•Context gets lost.


Bit stream deterioration

An image file before ...


Bit stream deterioration

... and after one byte is changed.

Undetectable by software.


Bit stream deterioration

Sketch of a technical solution.

Underlying assumption: bit stream deterioration becomes less of a problem if "files" are designed for persistency.


Bit stream deterioration

Proposal 1: Measure File Robustness

Proposed metric: A file is m / n robust if you can change m arbitrarily selected bytes of the stored data without affecting more than n bytes of the payload of the file.

Background terminology: Any file format can be described as consisting of a processing dictionary (roughly: technical metadata) and a payload, which represents the information presented to the user.

Proposed implementation: Apply a random change to m randomly selected bytes, at least one thousand / one million times, and take the mean number of affected payload bytes.
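A minimal sketch of such a measurement, under stated assumptions: the run-length "format" is a toy invented for this illustration (it is not TIFF, PNG or any real format), the function names are hypothetical, and "affected" is counted as the number of payload bytes that differ after decoding, with missing or surplus bytes counted as affected. It compares raw storage with the toy compressed storage for m = 1.

```python
import random

def rle_encode(payload):
    """Toy format (an assumption of this sketch): (count, value) byte pairs."""
    out = bytearray()
    i = 0
    while i < len(payload):
        run = 1
        while run < 255 and i + run < len(payload) and payload[i + run] == payload[i]:
            run += 1
        out += bytes([run, payload[i]])
        i += run
    return bytes(out)

def rle_decode(stored):
    """Expand (count, value) pairs back into the payload."""
    out = bytearray()
    for k in range(0, len(stored) - 1, 2):
        out += bytes([stored[k + 1]]) * stored[k]
    return bytes(out)

def affected(original, decoded):
    """Payload bytes that differ; missing or surplus bytes count as affected."""
    differing = sum(a != b for a, b in zip(original, decoded))
    return differing + abs(len(original) - len(decoded))

def mean_affected(stored, decode, payload, m, trials, rng):
    """Damage m random stored bytes per trial; return mean affected payload bytes."""
    total = 0
    for _ in range(trials):
        damaged = bytearray(stored)
        for pos in rng.sample(range(len(stored)), m):
            damaged[pos] ^= rng.randrange(1, 256)  # XOR with a non-zero value always changes the byte
        total += affected(payload, decode(bytes(damaged)))
    return total / trials

payload = b"".join(bytes([v]) * 20 for v in range(50))  # 50 runs of 20 identical bytes
rng = random.Random(2006)
raw_mean = mean_affected(payload, lambda b: b, payload, 1, 1000, rng)
rle_mean = mean_affected(rle_encode(payload), rle_decode, payload, 1, 1000, rng)
print("raw storage:", raw_mean, "toy run-length storage:", rle_mean)
```

In raw storage one changed byte affects exactly one payload byte. In the toy run-length storage a changed value byte affects a whole run, and a changed count byte shifts everything after it, so the mean is considerably higher.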


Bit stream deterioration

Proposal 2: Measure Error awareness

Proposed metric: A file / file reader is n error aware if you can change at most n arbitrarily selected bytes of the stored data without the change being detected on every attempt at reading.

Background terminology: Any file format which has predicted lengths for each reading operation, plus some additional information on the result of the reading operation, has this property to some degree.

Proposed implementation: Experiment to understand the situation better.
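One way such an experiment could look, as a sketch only: the record layout (4-byte length, data, CRC-32 of the data) is an assumption made for this illustration, loosely modelled on length-plus-checksum designs, and all function names are hypothetical. The reader checks every predicted length and every checksum, and the experiment counts how often a single random byte change is noticed.

```python
import random
import struct
import zlib

def write_records(chunks):
    """Each record: big-endian length, data, CRC-32 of the data."""
    out = bytearray()
    for data in chunks:
        out += struct.pack(">I", len(data)) + data + struct.pack(">I", zlib.crc32(data))
    return bytes(out)

def read_records(stored):
    """Return the records, or raise ValueError as soon as damage is noticed."""
    pos, records = 0, []
    while pos < len(stored):
        if pos + 4 > len(stored):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">I", stored, pos)
        end = pos + 4 + length
        if end + 4 > len(stored):
            raise ValueError("record runs past the end of the file")
        data = stored[pos + 4:end]
        (crc,) = struct.unpack_from(">I", stored, end)
        if zlib.crc32(data) != crc:
            raise ValueError("checksum mismatch")
        records.append(data)
        pos = end + 4
    return records

def detection_rate(stored, n, trials, rng):
    """Fraction of trials in which n random byte changes are detected on reading."""
    detected = 0
    for _ in range(trials):
        damaged = bytearray(stored)
        for pos in rng.sample(range(len(stored)), n):
            damaged[pos] ^= rng.randrange(1, 256)  # always a real change
        try:
            read_records(bytes(damaged))
        except ValueError:
            detected += 1
    return detected / trials

chunks = [bytes([i]) * 64 for i in range(16)]
stored = write_records(chunks)
rate = detection_rate(stored, 1, 1000, random.Random(2006))
print("share of single-byte changes detected:", rate)
```

A format without predicted lengths or checksums would let every one of these changes pass silently, which is the situation described on the earlier slide as "undetectable by software".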


Bit stream deterioration

Proposal 3: Improve relevant file qualities - Hardening

Proposed metric: A file is n hardened if it contains n synchronized redundant copies of the processing dictionary.

Background terminology: Two chunks of data are synchronized if a processing environment guarantees that they are always changed in parallel.

Proposed implementation: Create TIFF / PNG writers / readers which signal, by an additional tag / chunk, the further copies of the processing dictionary.
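A sketch of how a reader might exploit such redundant copies, assuming it has already located n equally long copies of the processing dictionary. The byte-wise majority vote is a technique swapped in for illustration (the slides do not prescribe a recovery method), and the dictionary content and function name are hypothetical.

```python
from collections import Counter

def recover_dictionary(copies):
    """Byte-wise majority vote across equally long copies of the dictionary."""
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*copies))

dictionary = b"width=640;height=480;bits=8"   # hypothetical processing dictionary
copies = [bytearray(dictionary) for _ in range(3)]

# Damage two different copies at two different positions.
copies[1][6] ^= 0xFF
copies[2][14] ^= 0xFF

recovered = recover_dictionary([bytes(c) for c in copies])
print(recovered == dictionary)
```

Voting only helps from three copies upwards, and only as long as damage does not hit the same position in a majority of the copies.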


Bit stream deterioration

Proposal 4: Improve processing capabilities - Self repairing

Definition: A file is self repairing if a reader is able to recover after discovering that internal data are missing.

Example: PDF files tolerate modest distortions, as they are able to identify the beginning of major sections within the file.
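The idea of finding the beginning of the next major section can be sketched with a toy container. Everything here is an assumption for illustration: the sync marker, the section layout (marker, length, CRC-32, data) and the function names are invented and have nothing to do with the actual PDF structure. The sketch also assumes the marker never occurs inside section data.

```python
import struct
import zlib

SYNC = b"\xffSYNC\xff"   # hypothetical marker announcing the start of a section

def write_sections(sections):
    out = bytearray()
    for data in sections:
        out += SYNC + struct.pack(">II", len(data), zlib.crc32(data)) + data
    return bytes(out)

def read_sections(stored):
    """Return every section that still verifies, skipping damaged ones."""
    good = []
    pos = stored.find(SYNC)
    while pos != -1:
        start = pos + len(SYNC)
        if start + 8 <= len(stored):
            length, crc = struct.unpack_from(">II", stored, start)
            data = stored[start + 8:start + 8 + length]
            if len(data) == length and zlib.crc32(data) == crc:
                good.append(data)
                pos = stored.find(SYNC, start + 8 + length)
                continue
        # Damaged section: resynchronise on the next marker.
        pos = stored.find(SYNC, pos + 1)
    return good

stored = write_sections([b"alpha", b"bravo", b"charlie"])
damaged = bytearray(stored)
damaged[stored.index(b"bravo") + 2] ^= 0xFF   # corrupt one byte of the second section
survivors = read_sections(bytes(damaged))
print(survivors)
```

The reader loses the damaged section but recovers everything after it, instead of failing at the first inconsistency.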


Authenticity not guaranteed.

Problem: While paper has physical properties which can be evaluated, digital documents do not.

Solution: Add digital signatures, recognisable by dedicated software, registered with a suitable authority.


Violates assumption of change of framework.


Authenticity not guaranteed.

Problem: While paper has physical properties which can be evaluated, digital documents do not.

Proposal: Insert a fingerprint of the institution (potentially of the individual PC) implicitly into every file generated.


Authenticity not guaranteed.

Binary file sealing:

1) Modify payload to provide parity within small form in byte stream.

2) Select arbitrary start address within payload.

3) Build path of parity forms.


Problem: Incompatible with logic of storing text as XML.


Meta data get lost.

"Metadata" and data are stored separately in current information system designs.

Take an image database as an example.


Meta data get lost.

"thumbs.db, but more so"


Meta data get lost.

MA thesis Jan Schnasse: http://lehre.hki.uni-koeln.de/~schnasse/ediod/; [email protected]


Context gets lost ...

More on properties of preservation aware files: Localized

Definition: A file is localized if a reader can process it without accessing a remote server.

Counterexample: Virtually all XML-based standards of the DL community assume that a program processing the file has access to a fully operational web, preferably in the structure of 2005, and / or to functioning authorities such as a URN resolution mechanism.

Solution: Snapshot of referred components.


Context gets lost ...

More on properties of preservation aware files: Autonomous

Definition: A file is autonomous if a reader can process it without accessing another file.

Counterexample: A PDF is usually not autonomous in the strict sense, as it assumes that font information is available. This could be discussed at length, as it comes down to the question of which resources are defined as part of the processing environment.

Solution: "Discuss at length".


Context gets lost ...

More on properties of preservation aware files: Self documenting

Definition: A file is self documenting if it contains, as part of the processing dictionary, a complete set of metadata.

Solution: Register appropriate tags / chunks with the TIFF / PNG authorities.
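To illustrate what metadata inside the file can look like, here is a sketch that uses the existing PNG tEXt chunk as a stand-in for the dedicated, registered chunks the slide asks for. The chunk layout (length, type, data, CRC-32 over type and data) follows the PNG specification; the tiny test image, the function names and the metadata text are assumptions made for the example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def chunk(kind, data):
    """PNG chunk layout: length, type, data, CRC-32 over type and data."""
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", zlib.crc32(kind + data))

def tiny_png():
    """A 1x1 8-bit greyscale image, built by hand for this example."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")   # filter byte plus one black pixel
    return PNG_SIGNATURE + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")

def iter_chunks(png):
    """Yield (type, data) for every chunk after the signature."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack_from(">I", png, pos)
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length

def add_text(png, keyword, text):
    """Insert a tEXt chunk (keyword, NUL, Latin-1 text) just before IEND."""
    entry = chunk(b"tEXt", keyword.encode("latin-1") + b"\x00" + text.encode("latin-1"))
    iend_at = len(png) - 12   # IEND is an empty 12-byte chunk at the very end
    return png[:iend_at] + entry + png[iend_at:]

def read_text(png):
    """Collect all tEXt entries as a keyword -> text dictionary."""
    result = {}
    for kind, data in iter_chunks(png):
        if kind == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            result[keyword.decode("latin-1")] = text.decode("latin-1")
    return result

png = add_text(tiny_png(), "Source", "Example archive, shelfmark 42")
print(read_text(png))
```

The metadata now travels with the image itself, so it cannot be separated from it by losing a database or a sidecar file.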


Context gets lost ...

More on properties of preservation aware files: Preservation encapsulated

Definition: A file is preservation encapsulated if it starts with a preservation header, acting as processing dictionary for a subset of the capabilities defined from "hardening" to "self documenting", and continues with a standard file of a recognized standard format.

Solution: Well, if we can register URNs, why should we not be able to maintain an encapsulation format?


Registry violates autonomy, however.
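A sketch of what preservation encapsulation could look like in its simplest form. The marker string, the JSON header and the function names are all hypothetical choices for this illustration; no such encapsulation format is defined in the slides. The wrapped file is stored unchanged after the header, so a reader that knows the header layout can hand the remainder to any standard reader of the original format.

```python
import json

MAGIC = b"PRESERVE1\n"   # hypothetical marker for the preservation header

def encapsulate(original, metadata):
    """Prefix the unchanged original file with a length-tagged metadata header."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return MAGIC + len(header).to_bytes(4, "big") + header + original

def unwrap(stored):
    """Return (metadata, original file bytes) from an encapsulated file."""
    if not stored.startswith(MAGIC):
        raise ValueError("no preservation header")
    start = len(MAGIC)
    size = int.from_bytes(stored[start:start + 4], "big")
    header = json.loads(stored[start + 4:start + 4 + size].decode("utf-8"))
    return header, stored[start + 4 + size:]

original = b"\x89PNG\r\n\x1a\n(rest of a standard file)"
metadata = {"format": "PNG", "holder": "example archive", "hardened_copies": 3}
header, body = unwrap(encapsulate(original, metadata))
print(header, body == original)
```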


Group Exercise

Background:

(1) Assume that sometime between now and 2106 your institution gets into a serious financial crisis. ("Serious" are e.g. budget cuts to, not by, 5 % of the previous year.)

(2) Your digital data have to survive for 30 years on their own under extremely bad conditions, where random deletions are a certainty.


Group Exercise

Planning for that now:

(1) What would be the most sensible autonomous unit to divide your holdings into, making each unit able to survive on its own?

(2) Units which are too small are disastrously redundant; units which are too big are disastrously vulnerable.


Group Exercise

Notice to people with IT training:

The usual warnings against redundancy in database design relate to "living" databases and aim at avoiding "anomalies", which cannot occur in long-term storage.

(If you do not understand the above, you may safely ignore it.)


Persistent until: 2016

•No major breakdown of civil society.

•'Library' system continues to function without serious interruption.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 3006

•Major breakdown of civil society.

•'Library' system continues to function without serious interruption.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 3006

•Major breakdown of civil society.

•'Library' system seriously interrupted.

•No fundamental change in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 3006

•Major breakdown of civil society.

•'Library' system seriously interrupted.

•'n' fundamental changes in underlying technology.

•No major 'holes' in the relevant WWW.


Persistent until: 3006

•Major breakdown of civil society.

•'Library' system seriously interrupted.

•'n' fundamental changes in underlying technology.

•WWW completely replaced by another type of connectivity


Persistent until: 3006

Any chance at all?


No real answer, but some stuff for thinking.


Persistent until: 3006

Is this information?


Persistent until: 3006

1. Recognizing information

2. Technological assumptions

3. Cultural assumptions

4. Processing assumptions


Persistent until: 3006

1. Announcement headers that introduce the assumptions hierarchically?

2. Preservation encapsulation with different horizons?


Group Exercise

(1) List all the assumptions which the material stored in printed form in your institution makes about the background knowledge of the reader.

(2) Design an "announcement header" for such information.