Analysing the Impact of File Formats on Data Integrity
Volker Heydegger
University of Cologne
Archiving 2008
Bern, 23rd – 27th June 2008
Overview
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
• Introduction
• File format data and information loss
- What happens if data is corrupted in files?
- Categories of file format data
• Measuring information loss
- Robustness indicators
- Study results for different file formats
Background
• EU-funded project “Planets”
characterisation of file format content
www.planets-project.eu
University of Cologne, Computer Science for the Humanities
(Historisch-Kulturwissenschaftliche Informationsverarbeitung (HKI))
Planets partner
www.hki.uni-koeln.de
Context
• Long-term preservation of digital information
Which file format to choose?
Criteria, e.g.:
Open standard
Spread of usage
Hard-/Software-Dependencies
Authenticity
…
Robustness
Robustness::= Error resilience of file formats against bit-stream corruption
Issues / Research topics
• Is there any correlation between file format and data integrity?
• If so, are there any differences among file formats concerning the degree of robustness?
• Which file-format-based factors are responsible for varying degrees of robustness?
• How can we improve the robustness of file formats?
Benefits
• Digital preservation: decision support for choosing a file format for long-term preservation
• Contribution to file format research
• Improvement of existing file formats
• Design of future file formats
File Format Data and Information loss
What is “File Format” in our context?
• Set of rules constituting the logical organisation of data
• Set of rules indicating how to interpret data
• This set of rules is laid down in the file format specification
• File Format Data ::= binary data, formatted according to the rules of a file format
What happens if data is corrupted in files?
Test image: TIFF, greyscale, 32x32 pixels, 8 bits per pixel
[Figure: hex dump of the first 224 bytes of the test file, with one byte highlighted (FF)]
Plain information loss: 1 byte of data == 1 pixel
[Figure: part of the TIFF Image File Directory, tag: PhotometricInterpretation, with one byte highlighted (00)]
Conditional information loss: 1 bit changed == 100% of the information changed
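The two cases above can be illustrated with a minimal sketch. It does not parse a real TIFF file: the image is a hypothetical 32x32 greyscale gradient, and `render` is an assumed stand-in for a viewer that honours the PhotometricInterpretation value (1 = BlackIsZero, 0 = WhiteIsZero, as defined in the TIFF specification).

```python
# Hypothetical 32x32 greyscale test image, 8 bits per pixel (not the original test file).
WIDTH, HEIGHT = 32, 32
pixels = bytes(x * 8 for _ in range(HEIGHT) for x in range(WIDTH))


def render(payload: bytes, photometric: int) -> bytes:
    """Interpret stored samples: 1 = BlackIsZero (as stored), 0 = WhiteIsZero (inverted)."""
    if photometric == 1:
        return payload
    return bytes(255 - v for v in payload)


def changed_pixels(a: bytes, b: bytes) -> int:
    """Count the pixels that differ between two rendered images."""
    return sum(x != y for x, y in zip(a, b))


original = render(pixels, photometric=1)

# Plain information loss: one payload byte is overwritten with 0xFF.
corrupted_payload = bytearray(pixels)
corrupted_payload[100] = 0xFF
plain = changed_pixels(original, render(bytes(corrupted_payload), photometric=1))

# Conditional information loss: one bit of the tag value flips (1 -> 0),
# the payload itself stays untouched.
conditional = changed_pixels(original, render(pixels, photometric=1 ^ 1))

print(plain, "pixel changed by the payload corruption")
print(conditional, "of", WIDTH * HEIGHT, "pixels changed by the tag corruption")
```

Under these assumptions, a corrupted payload byte damages exactly one pixel, while a single flipped bit in the technical data changes how every pixel is interpreted.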
Categories of File Format Data
• Technical data (data for processing):
Image width: 277
Image length: 339
Compression: uncompressed
• “Payload” data (basic data of usage):
Pixel data, starting from byte #0x008
Robustness Indicators
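(1) RB = Σ Δ(b0, b1) / m
where
i. b0 is the basic data of usage before being corrupted,
ii. b1 is the basic data of usage after being corrupted,
iii. Δ(b0, b1) is the amount of basic data of usage changed by one corruption procedure,
iv. m is the number of corruption procedures; the sum runs over all m procedures.
RB indicates the average information loss per corruption procedure.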
Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of payload bytes changed by each corruption procedure is:
1. Δ(b0, b1) = 200 bytes
2. Δ(b0, b1) = 150 bytes
3. Δ(b0, b1) = 250 bytes
The average information loss for file X based on 3 corruption procedures is then
RB = (200 + 150 + 250) / 3 = 600 / 3 = 200 bytes
RB related to the total amount of payload data:
(2) RBt = RB / n
where n is the total amount of basic data of usage (payload data).
(3) RBt [%] = RB / n * 100
i.e. RBt expressed as a percentage
Interpretation: RBt = 0 %: max. robustness (min. information loss)
Example (continued)
(2) RBt = 200 / 2000 = 0.1
(3) RBt = 200 / 2000 * 100 = 10 (%)
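The worked example can be reproduced with a short sketch. The function names `delta`, `rb` and `rbt` are illustrative only, and counting a length difference as changed bytes in `delta` is an assumption, not something the slides specify.

```python
def delta(b0: bytes, b1: bytes) -> int:
    """Number of payload bytes that differ between b0 (before) and b1 (after).

    Assumption: bytes missing or added at the end count as changed.
    """
    return sum(x != y for x, y in zip(b0, b1)) + abs(len(b0) - len(b1))


def rb(deltas: list[int]) -> float:
    """Formula (1): average information loss over m corruption procedures."""
    return sum(deltas) / len(deltas)


def rbt(rb_value: float, n: int, percent: bool = False) -> float:
    """Formulas (2) and (3): RB relative to the n bytes of payload data."""
    return rb_value * 100 / n if percent else rb_value / n


# Values from the example: file X with 2000 bytes of payload, corrupted 3 times.
deltas = [200, 150, 250]
n = 2000

rb_x = rb(deltas)
print(rb_x)                        # average loss in bytes
print(rbt(rb_x, n))                # as a fraction of the payload
print(rbt(rb_x, n, percent=True))  # as a percentage
```

In a real measurement, each entry of `deltas` would come from `delta(b0, b1)` applied to the payload before and after one corruption procedure.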
Study on Robustness for various file formats: Example Results
TIF
- uncompressed
- LZW
- JPEG (2 different compression levels)
- ZIP
PNG (filtered, unfiltered)
JPEG2000 (lossless, lossy)
BMP (uncompressed)
Method
- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%
- compressed payload data is decompressed
- original and corrupted payload data are compared
- Robustness Indicator values are computed
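The method steps can be sketched as a small simulation. This is a simplified illustration, not the tool used in the study: it corrupts only the encoded payload rather than a whole file, uses `zlib` as a stand-in for a compressed format, and assumes that a payload which can no longer be decoded counts as completely lost. The names `corrupt` and `rbt_percent` are hypothetical.

```python
import random
import zlib


def delta(b0: bytes, b1: bytes) -> int:
    """Number of differing bytes; a length difference counts as changed bytes."""
    return sum(x != y for x, y in zip(b0, b1)) + abs(len(b0) - len(b1))


def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Overwrite a share `ratio` of the bytes (at least one) with different values."""
    k = max(1, round(ratio * len(data)))
    out = bytearray(data)
    for pos in rng.sample(range(len(out)), k):
        out[pos] = (out[pos] + rng.randrange(1, 256)) % 256  # always a new value
    return bytes(out)


def rbt_percent(payload, encode, decode, ratio, m=3000, seed=0) -> float:
    """Run m corruption procedures and return RBt as a percentage."""
    rng = random.Random(seed)
    stored = encode(payload)
    total = 0
    for _ in range(m):
        try:
            recovered = decode(corrupt(stored, ratio, rng))
            total += min(delta(payload, recovered), len(payload))
        except Exception:
            # Assumption: an undecodable payload is treated as 100% loss.
            total += len(payload)
    rb = total / m
    return rb * 100 / len(payload)


payload = bytes(i % 251 for i in range(1000))  # hypothetical payload data

uncompressed = rbt_percent(payload, lambda b: b, lambda b: b, ratio=0.01, m=200)
compressed = rbt_percent(payload, zlib.compress, zlib.decompress, ratio=0.01, m=200)
print(f"uncompressed: RBt = {uncompressed:.2f} %")
print(f"zlib:         RBt = {compressed:.2f} %")
```

The fixed `seed` keeps runs repeatable. Passing the encoder and decoder as parameters allows the same loop to be reused for other encodings.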
Example: JP2-formatted image, corruption of 1 byte, “bad case”
Example: JP2-formatted image, corruption of 1 byte, “good case”
Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data
Thank you very much!