Manfred Thaller
Universität zu Köln
Preserving for 2016, 2106, 3006
Or
Is there a life for an object outside a digital library?
DELOS International Summer School 2006
San Miniato
4-9 June 2006
DELOS Int SS2006, San Miniato, © Manfred Thaller, Universität zu Köln 3
A persistent object
•Authenticity
•Integrity
•Metadata
•Context
•Easily usable
•Discussable
•No
•No
•No
•1799 - 1821
Persistent until: 2016
•No major breakdown of civil society.
•'Library' system continues to function without serious interruption.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 2016
Assumption therefore: Persistency is a function of the overall system.
Persistent until: 2106
•No major breakdown of civil society.
•'Library' system continues to function without serious interruption.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 2106
•No major breakdown of civil society.
•'Library' changes functional assumptions without serious interruption of service.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 2106
•No major breakdown of civil society.
•'Library' changes functional assumptions without serious interruption of service.
•Fundamental changes in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 2106
•No major breakdown of civil society.
•'Library' changes functional assumptions without serious interruption of service.
•Fundamental changes in underlying technology.
•Major 'holes' in the relevant WWW.
Persistent until: 2106
Assumption: Persistent storage media "around the corner". (Holographic storage, storage crystals.)
Question: Can a digital object be revived in 2106, if the library does not care for it after 2016?
Persistent until: 2106
Why not?
•Bit stream deterioration.
•Authenticity not guaranteed.
•Meta data get lost.
•Context gets lost.
Bit stream deterioration
An image file before ...
Bit stream deterioration
... and after one byte is changed.
Undetectable by software.
Bit stream deterioration
Sketch of a technical solution.
Underlying assumption: bit stream deterioration becomes less of a problem, if "files" are designed for persistency.
Bit stream deterioration
Proposal 1: Measure File Robustness
Proposed metric: A file is m / n robust, if you can change m arbitrarily selected bytes of the stored data without affecting more than n bytes of the payload of the file.
Background terminology: Any file format can be described as consisting of a processing dictionary (roughly: technical metadata) and a payload, which represents the information presented to the user.
Proposed implementation: Apply a random change to m randomly selected bytes at least one thousand / one million times, and take the mean number of affected payload bytes.
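The proposed implementation could be sketched roughly as follows. This is a minimal illustration, not the author's tool: the function names (mean_affected, affected_bytes), the choice to count an unreadable file as a complete loss of the payload, and the use of zlib as a stand-in for a fragile storage format are all assumptions made for the example.

```python
import random
import zlib
from typing import Callable, Optional


def affected_bytes(payload: bytes, decoded: Optional[bytes]) -> int:
    """Count payload bytes that differ after decoding the damaged file.

    Assumption: a file that can no longer be decoded loses its whole payload.
    """
    if decoded is None:
        return len(payload)
    differing = sum(a != b for a, b in zip(payload, decoded))
    return differing + abs(len(payload) - len(decoded))


def mean_affected(stored: bytes,
                  decode: Callable[[bytes], bytes],
                  payload: bytes,
                  m: int,
                  trials: int = 1000,
                  seed: int = 0) -> float:
    """Change m randomly selected stored bytes per trial; return the mean damage."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        damaged = bytearray(stored)
        for pos in rng.sample(range(len(stored)), m):
            # XOR with a non-zero value guarantees the byte really changes.
            damaged[pos] ^= rng.randrange(1, 256)
        try:
            decoded = decode(bytes(damaged))
        except Exception:
            decoded = None  # the reader gave up on the file
        total += affected_bytes(payload, decoded)
    return total / trials


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    # Uncompressed storage: the stored bytes are the payload.
    print(mean_affected(payload, lambda b: b, payload, m=1))
    # Compressed storage: one changed byte usually damages far more.
    print(mean_affected(zlib.compress(payload), zlib.decompress, payload, m=1))
```

With storage that is identical to the payload, every changed byte affects exactly one payload byte; with a compressed stream the mean is expected to be much higher, which is the kind of difference the metric is meant to expose.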
Bit stream deterioration
Proposal 2: Measure Error awareness
Proposed metric: A file / file reader is n error aware, if at most n arbitrarily selected bytes of the stored data can be changed without the change being detected at every attempt at reading.
Background terminology: Any file format which has predicted lengths for each reading operation plus some additional info on the result of the reading operation has this property to some degree.
Proposed implementation: Experiment to understand the situation better.
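One possible starting point for such an experiment is a toy chunk format with a predicted length and a checksum on the result of the read, loosely modelled on how PNG chunks carry a length and a CRC. The layout and the names write_chunk and read_chunk are assumptions for illustration only, not a format proposed in the lecture.

```python
import struct
import zlib


def write_chunk(data: bytes) -> bytes:
    """Store data as: 4-byte length, the data, 4-byte CRC-32 of the data."""
    return struct.pack(">I", len(data)) + data + struct.pack(">I", zlib.crc32(data))


def read_chunk(stored: bytes) -> bytes:
    """Read a chunk and raise ValueError if length or checksum do not match."""
    if len(stored) < 8:
        raise ValueError("chunk too short")
    (length,) = struct.unpack(">I", stored[:4])
    if len(stored) != length + 8:
        raise ValueError("predicted length does not match stored data")
    data = stored[4:4 + length]
    (crc,) = struct.unpack(">I", stored[4 + length:])
    if zlib.crc32(data) != crc:
        raise ValueError("checksum mismatch")
    return data


def undetected_single_byte_changes(stored: bytes) -> int:
    """Flip one bit in every position in turn and count changes the reader misses."""
    missed = 0
    for pos in range(len(stored)):
        damaged = bytearray(stored)
        damaged[pos] ^= 0x01
        try:
            read_chunk(bytes(damaged))
        except ValueError:
            continue
        missed += 1
    return missed
```

Running undetected_single_byte_changes over a chunk gives a first, very small experiment: for this toy format every single-byte change is noticed, whereas a format without length prediction or checksum would let many of them pass.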
Bit stream deterioration
Proposal 3: Improve relevant file qualities - Hardening
Proposed metric: A file is n hardened, if it contains n synchronized redundant copies of the processing dictionary.
Background terminology: Two chunks of data are synchronized, if a processing environment guarantees that they are always changed in parallel.
Proposed implementation: Create TIFF / PNG writers / readers, which signal by an additional tag / chunk the further copies of the processing dictionary.
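The idea of hardening can be illustrated with a toy container rather than real TIFF / PNG code. Everything here is an assumption for the sketch: a fixed-size processing dictionary (DICT_SIZE), a leading count byte, and a reader that takes the most frequent copy. Note that the count byte itself is unprotected in this simplified layout.

```python
from collections import Counter
from typing import Tuple

DICT_SIZE = 8  # hypothetical fixed size of the processing dictionary


def harden(dictionary: bytes, payload: bytes, n: int) -> bytes:
    """Write one count byte, n identical copies of the dictionary, then the payload."""
    if len(dictionary) != DICT_SIZE:
        raise ValueError("dictionary must be exactly DICT_SIZE bytes")
    if not 1 <= n <= 255:
        raise ValueError("n must fit into one byte")
    return bytes([n]) + dictionary * n + payload


def read_hardened(stored: bytes) -> Tuple[bytes, bytes]:
    """Return (dictionary, payload), choosing the dictionary copy seen most often."""
    n = stored[0]
    copies = [stored[1 + i * DICT_SIZE: 1 + (i + 1) * DICT_SIZE] for i in range(n)]
    dictionary, _votes = Counter(copies).most_common(1)[0]
    payload = stored[1 + n * DICT_SIZE:]
    return dictionary, payload
```

With n = 3, a single damaged copy is outvoted by the two intact ones, so the payload remains interpretable even though part of the technical metadata was hit.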
Bit stream deterioration
Proposal 4: Improve processing capabilities – Self repairing
Definition: A file is self repairing, if a reader is able to recover after discovering that internal data are missing.
Example: PDF files tolerate modest distortions, as they are able to identify the beginning of major sections within the file.
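A reader that recovers by finding the beginning of the next section might look like the sketch below. It is not how PDF readers work internally; the SYNC marker, the section layout (marker, 2-byte length, 4-byte CRC-32, data) and the function names are assumptions chosen to keep the example small.

```python
import struct
import zlib
from typing import Iterable, List

SYNC = b"\xff\x00SYNC"  # hypothetical marker announcing the start of a section


def write_sections(sections: Iterable[bytes]) -> bytes:
    """Write each section as: SYNC, 2-byte length, 4-byte CRC-32, data."""
    out = bytearray()
    for data in sections:
        out += SYNC + struct.pack(">HI", len(data), zlib.crc32(data)) + data
    return bytes(out)


def read_sections(stored: bytes) -> List[bytes]:
    """Return all sections that still verify, resynchronizing after damage."""
    recovered = []
    pos = 0
    while True:
        pos = stored.find(SYNC, pos)
        if pos < 0:
            break
        head = pos + len(SYNC)
        if head + 6 <= len(stored):
            length, crc = struct.unpack(">HI", stored[head:head + 6])
            data = stored[head + 6: head + 6 + length]
            if len(data) == length and zlib.crc32(data) == crc:
                recovered.append(data)
                pos = head + 6 + length
                continue
        # Damaged section: step forward and look for the next marker.
        pos += 1
    return recovered
```

A damaged section is lost, but the sections before and after it remain readable, because the reader does not depend on the damaged length field to find where the next section begins.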
Authenticity not guaranteed.
Problem: While paper has physical properties which can be evaluated, digital documents do not.
Solution: Add digital signatures, recognisable by dedicated software, registered with a suitable authority.
Violates assumption of change of framework.
Authenticity not guaranteed.
Problem: While paper has physical properties which can be evaluated, digital documents do not.
Proposal: Insert a fingerprint of the institution (potentially of the individual PC) implicitly into every file generated.
Authenticity not guaranteed.
Binary file sealing:
1) Modify payload to provide parity within a small form in the byte stream.
2) Select arbitrary start address within payload.
3) Build path of parity forms.
Authenticity not guaranteed.
Problem: While paper has physical properties which can be evaluated, digital documents do not.
Proposal: Insert a fingerprint of the institution (potentially of the individual PC) implicitly into every file generated.
Problem: Incompatible with logic of storing text as XML.
Meta data get lost.
"Metadata" and data are stored separately in current information system designs.
Take an image data base as example.
Meta data get lost.
"thumbs.db, but more so"
Meta data get lost.
MA thesis Jan Schnasse: http://lehre.hki.uni-koeln.de/~schnasse/ediod/; [email protected]
Context gets lost ...
More on properties of preservation aware files: Localized
Definition: A file is localized, if a reader can process it without accessing a remote server.
Counterexample: Virtually all XML-based standards of the DL community assume that a program processing the file has access to a fully operational web, preferably in the structure of 2005, and / or to functioning authorities like a URN resolution mechanism.
Solution: Snapshot of referred components.
Context gets lost ...
More on properties of preservation aware files: Autonomous
Definition: A file is autonomous, if a reader can process it without accessing another file.
Counterexample: A PDF is usually not autonomous in the strict sense, as it assumes that font information is available. This could be discussed at length, as it comes down to the question of which resources are defined as part of the processing environment.
Solution: "Discuss at length".
Context gets lost ...
More on properties of preservation aware files: Self-documenting
Definition: A file is self-documenting, if it contains as part of the processing dictionary a complete set of metadata.
Solution: Register appropriate tags / chunks with the TIFF / PNG authorities.
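As a rough illustration of metadata travelling inside the file, the sketch below uses the tEXt chunk that the PNG format already defines (keyword, zero byte, Latin-1 text) rather than a newly registered chunk. The helper names (make_chunk, iter_chunks, embed_text, read_text, tiny_png) and the one-pixel test image are assumptions for the example; a real self-documenting profile would need an agreed set of keywords.

```python
import struct
import zlib
from typing import Dict, Iterator, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build a PNG chunk: length, type, data, CRC-32 over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (type, data) for every chunk; CRCs are not verified in this sketch."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length


def embed_text(png: bytes, keyword: str, text: str) -> bytes:
    """Insert a tEXt chunk carrying one metadata field just before IEND."""
    body = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    out = bytearray(PNG_SIGNATURE)
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out += make_chunk(b"tEXt", body)
        out += make_chunk(chunk_type, data)
    return bytes(out)


def read_text(png: bytes) -> Dict[str, str]:
    """Collect all tEXt metadata fields stored in the file itself."""
    fields = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            fields[key.decode("latin-1")] = value.decode("latin-1")
    return fields


def tiny_png() -> bytes:
    """Build a 1x1 8-bit greyscale PNG to experiment with."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte plus one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))
```

Calling read_text(embed_text(tiny_png(), "Title", "Test image")) returns the field from the image file alone, without any external database.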
Context gets lost ...
More on properties of preservation aware files: Preservation encapsulated
Definition: A file is preservation encapsulated, if it starts with a preservation header, acting as processing dictionary for a subset of the capabilities defined from "hardening" to "self-documenting", and continues with a standard file of a recognized standard format.
Solution: Well, if we can register URNs, why should we not be able to maintain an encapsulation format?
Registry violates autonomy, however.
Group Exercise
Background:
(1) Assume that sometime between now and 2106 your institution gets into a serious financial crisis. ("Serious" means e.g. budget cuts to, not by, 5 % of the previous year.)
(2) Your digital data have to survive for 30 years on their own under extremely bad conditions, where random deletions are a certainty.
Group Exercise
Planning for that now:
(1) What would be the most sensible autonomous unit to divide your holdings into, making each unit able to survive on its own?
(2) Units which are too small are disastrously redundant; units which are too big are disastrously vulnerable.
Group Exercise
Notice to people with IT training:
The usual warnings against redundancy in database design relate to "living" databases and aim at avoiding "anomalies", which cannot occur in long-term storage.
(If you do not understand the above, you may safely ignore it.)
Persistent until: 2016
•No major breakdown of civil society.
•'Library' system continues to function without serious interruption.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 3006
•Major breakdown of civil society.
•'Library' system continues to function without serious interruption.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 3006
•Major breakdown of civil society.
•'Library' system seriously interrupted.
•No fundamental change in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 3006
•Major breakdown of civil society.
•'Library' system seriously interrupted.
•'n' fundamental changes in underlying technology.
•No major 'holes' in the relevant WWW.
Persistent until: 3006
•Major breakdown of civil society.
•'Library' system seriously interrupted.
•'n' fundamental changes in underlying technology.
•WWW completely replaced by another type of connectivity
Persistent until: 3006
Any chance at all?
No real answer, but some stuff for thinking.
Persistent until: 3006
Is this information?
Persistent until: 3006
1. Recognizing information
2. Technological assumptions
3. Cultural assumptions
4. Processing assumptions
Persistent until: 3006
1. Announcement headers that introduce the assumptions hierarchically?
2. Preservation encapsulation with different horizons?