Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008

Analysing the Impact of File Formats on Data Integrity

Volker Heydegger

University of Cologne

Archiving 2008

Bern, 23rd – 27th June 2008

Overview

Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland

• Introduction

• File format data and information loss
- What happens if data is corrupted in files?
- Categories of file format data

• Measuring information loss
- Robustness indicators
- Study results for different file formats

Background

• EU-funded project “Planets”

characterisation of file format content

www.planets-project.eu

University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung, HKI)

Planets partner

www.hki.uni-koeln.de

Context

• Long-term preservation of digital information

Which file format to choose?

Criteria, e.g.:

Open standard

Spread of usage

Hard-/Software-Dependencies

Authenticity

Robustness

Robustness ::= error resilience of file formats against bit-stream corruption

Issues / Research topics

• Is there any correlation between file format and data integrity?

• If so, are there any differences among file formats concerning the degree of robustness?

• Which file-format-based factors are responsible for varying degrees of robustness?

• How can we improve the robustness of file formats?

Benefits

• Digital preservation: decision support for choosing a file format for long-term preservation

• Contribution to file format research

• Improvement of existing file formats

• Design of future file formats

File Format Data and Information loss

What is “File Format” in our context?

• Set of rules constituting the logical organisation of data

• Set of rules indicating how to interpret data

• Set of rules: the file format specification

• File Format Data ::= binary data, formatted according to the rules of a file format

What happens if data is corrupted in files?

Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

First 224 bytes of the test file (highlighted byte: FF)

Plain information loss: 1 byte of data == 1 pixel
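The plain case can be reproduced on a synthetic buffer. The following is only a sketch, assuming an uncompressed 32x32 payload with 8 bits per pixel, so that every payload byte encodes exactly one pixel. The pixel values, the corrupted offset and the helper name `count_changed` are made up for illustration.

```python
# Illustrative sketch: plain information loss in uncompressed 8-bit pixel data.
# Assumption: one payload byte encodes exactly one pixel (32x32, 8 bits per pixel).

WIDTH, HEIGHT = 32, 32

def count_changed(before: bytes, after: bytes) -> int:
    """Return the number of byte positions whose values differ."""
    return sum(1 for a, b in zip(before, after) if a != b)

# Hypothetical payload: a uniform mid-grey image.
original = bytes([0x80] * (WIDTH * HEIGHT))

# Corrupt one payload byte (offset chosen arbitrarily) by setting it to 0xFF.
corrupted = bytearray(original)
corrupted[100] = 0xFF

changed = count_changed(original, bytes(corrupted))
print(changed, "of", WIDTH * HEIGHT, "pixels changed")
```

Under these assumptions the damage stays local: one corrupted byte affects one pixel out of 1024.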

Part of the TIF Image File Directory, tag: PhotometricInterpretation (highlighted value: 00)

Conditional information loss: 1 bit changed == 100% of the information changed
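The conditional case can be sketched without touching any pixel data at all. In TIFF, a PhotometricInterpretation value of 0 means WhiteIsZero and 1 means BlackIsZero. The example assumes the tag value flips from 1 to 0 (the direction on the slide is not stated) and uses a hypothetical 8-bit greyscale payload; `displayed_intensity` is an illustrative helper, not part of any library.

```python
# Illustrative sketch: conditional information loss via one technical field.
# TIFF PhotometricInterpretation: 0 = WhiteIsZero, 1 = BlackIsZero.

def displayed_intensity(stored: int, photometric: int) -> int:
    """Map a stored 8-bit value to displayed brightness (0 = black, 255 = white)."""
    if photometric == 1:       # BlackIsZero: the stored value is the brightness
        return stored
    if photometric == 0:       # WhiteIsZero: the stored value is inverted
        return 255 - stored
    raise ValueError("only interpretations 0 and 1 are handled in this sketch")

pixels = bytes(range(256)) * 4   # hypothetical 32x32 payload, left untouched

tag_before = 1
tag_after = tag_before ^ 1       # a single flipped bit in the tag value

before = [displayed_intensity(p, tag_before) for p in pixels]
after = [displayed_intensity(p, tag_after) for p in pixels]

changed = sum(1 for a, b in zip(before, after) if a != b)
print(f"{changed} of {len(pixels)} pixels rendered differently")
```

Because no 8-bit value equals its own inverse, every pixel is rendered differently although the payload bytes themselves are intact.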

Categories of File Format Data

• Technical data (data for processing):

Image width: 277

Image length: 339

Compression: uncompressed

• “Payload” data (basic data of usage):

Pixel data, starting from byte #0x008
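To make the two categories concrete, the sketch below builds a tiny little-endian TIFF in memory and splits it into technical data (header and Image File Directory) and payload data (pixels). The layout, with pixel data starting at byte 8 and the IFD placed after it, is an assumption chosen to resemble the example above; `build_tiff` and `read_technical_data` are hypothetical helpers, and the file is not meant to be a complete baseline TIFF.

```python
import struct

# Illustrative sketch only: technical data vs payload data in a minimal TIFF.
# Assumed layout: 8-byte header, pixel data from offset 8, then the IFD.

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 258: "BitsPerSample",
             259: "Compression", 262: "PhotometricInterpretation",
             273: "StripOffsets", 279: "StripByteCounts"}
SHORT, LONG = 3, 4   # TIFF field types

def build_tiff(width: int, height: int, pixels: bytes) -> bytes:
    """Build a minimal uncompressed 8-bit greyscale TIFF (hypothetical helper)."""
    ifd_offset = 8 + len(pixels)
    header = struct.pack("<2sHI", b"II", 42, ifd_offset)
    entries = [(256, LONG, width), (257, LONG, height), (258, SHORT, 8),
               (259, SHORT, 1),          # 1 = uncompressed
               (262, SHORT, 1),          # 1 = BlackIsZero
               (273, LONG, 8), (279, LONG, len(pixels))]
    ifd = struct.pack("<H", len(entries))
    for tag, ftype, value in entries:
        # 12-byte entry: tag, field type, value count, 4-byte value field.
        ifd += struct.pack("<HHII", tag, ftype, 1, value)
    ifd += struct.pack("<I", 0)          # offset of the next IFD: none
    return header + pixels + ifd

def read_technical_data(data: bytes) -> dict:
    """Read the first IFD of a little-endian TIFF into {tag name: value}."""
    byte_order, magic, ifd_offset = struct.unpack_from("<2sHI", data, 0)
    if byte_order != b"II" or magic != 42:
        raise ValueError("not a little-endian TIFF")
    (n_entries,) = struct.unpack_from("<H", data, ifd_offset)
    fields = {}
    for i in range(n_entries):
        pos = ifd_offset + 2 + 12 * i
        tag, ftype, _count = struct.unpack_from("<HHI", data, pos)
        # SHORT values occupy the first two bytes of the 4-byte value field.
        fmt = "<H" if ftype == SHORT else "<I"
        (value,) = struct.unpack_from(fmt, data, pos + 8)
        fields[TAG_NAMES.get(tag, tag)] = value
    return fields

width, height = 4, 2
pixels = bytes(range(width * height))
tiff = build_tiff(width, height, pixels)

technical = read_technical_data(tiff)
start = technical["StripOffsets"]
payload = tiff[start:start + technical["StripByteCounts"]]
print(technical)
print("payload recovered:", payload == pixels)
```

The reader needs the technical data to locate and interpret the payload at all, which is why damage to the two categories has such different consequences.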

Robustness Indicators

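(1) RB = Δ(b0, b1) / m

where

i. b0 is the basic data of usage before being corrupted,

ii. b1 is the basic data of usage after being corrupted,

iii. m is the number of corruption procedures,

iv. Δ(b0, b1) is the amount of data that differs between b0 and b1, summed over all m corruption procedures.

RB indicates the average information loss per corruption procedure.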

Example

A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed in each corruption procedure is:

1. Δ(b0, b1) = 200 bytes

2. Δ(b0, b1) = 150 bytes

3. Δ(b0, b1) = 250 bytes

The average information loss for file X based on 3 corruption procedures is then

RB = 600 / 3 = 200
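The calculation can be written down in a few lines. This is a minimal sketch that assumes Δ is counted as the number of differing bytes between two payloads of equal length; the function names `delta` and `rb` are illustrative only.

```python
def delta(b0: bytes, b1: bytes) -> int:
    """Amount of changed payload data, counted here as differing bytes."""
    # Assumption: b0 and b1 have equal length; how length changes should be
    # scored is not covered by this sketch.
    return sum(1 for x, y in zip(b0, b1) if x != y)

def rb(deltas: list[int]) -> float:
    """Average information loss over m corruption procedures (equation 1)."""
    if not deltas:
        raise ValueError("at least one corruption procedure is required")
    return sum(deltas) / len(deltas)

# The three corruption procedures from the example.
print(rb([200, 150, 250]))
```

For the three values of the example this gives an RB of 200 bytes.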

RB related to the total amount of payload data:

(2) RBt = RB / n

where n is the total amount of basic data of usage (payload data).

(3) RBt = RB / n * 100

= RBt expressed as a percentage

Interpretation: RBt = 0 %: maximum robustness (minimum information loss)

Example (continued)

(2) RBt = 200 / 2000 = 0.1

(3) RBt = 200 / 2000 * 100 = 10 (%)
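As a sketch, the normalisation step can be expressed as a small function. The name `rbt` and the `percent` switch are illustrative choices, not part of the study's tooling.

```python
def rbt(rb_value: float, n: int, percent: bool = False) -> float:
    """RB relative to n units of payload data (equations 2 and 3)."""
    if n <= 0:
        raise ValueError("n must be positive")
    ratio = rb_value / n
    return ratio * 100 if percent else ratio

# Values from the example: RB = 200, n = 2000 bytes of payload data.
print(rbt(200, 2000))
print(rbt(200, 2000, percent=True))
```

Normalising by n makes files of different sizes comparable, which a raw byte count such as RB = 200 does not allow.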

Study on Robustness for various file formats: Example Results

TIF

- uncompressed

- LZW

- JPEG (2 different compression levels)

- ZIP

PNG (filtered, unfiltered)

JPEG2000 (lossless, lossy)

BMP (uncompressed)

Method

- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)

- applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%
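One possible way to simulate a corruption procedure at a given ratio is sketched below. It assumes that the corruption ratio is the share of bytes hit and that at least one byte is always altered; the study's actual corruption tool may work differently, and `corrupt` is a hypothetical helper.

```python
import random

def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Return a copy of data with roughly ratio * len(data) bytes altered."""
    # Assumption: the ratio refers to bytes, and at least one byte is hit so
    # that very small ratios still have an effect.
    n_hits = min(len(data), max(1, round(len(data) * ratio)))
    out = bytearray(data)
    for pos in rng.sample(range(len(data)), n_hits):
        out[pos] ^= rng.randrange(1, 256)   # non-zero mask guarantees a change
    return bytes(out)

rng = random.Random(2008)                    # fixed seed for repeatability
data = bytes(10_000)                         # hypothetical 10 000-byte file
for ratio in (0.0001, 0.001, 0.01):          # 0.01 %, 0.1 %, 1.0 %
    damaged = corrupt(data, ratio, rng)
    hits = sum(1 for a, b in zip(data, damaged) if a != b)
    print(f"{ratio:.2%}: {hits} bytes altered")
```

Repeating such a call many times per file and ratio yields the series of corruption procedures over which the indicators are averaged.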

- compressed payload data is decompressed

- the original and the corrupted payload data are compared

- Robustness Indicator values are computed
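The steps above can be sketched end to end for a bare payload. The example compares an uncompressed payload with the same payload compressed by Python's `zlib` module, which stands in here for a compression scheme and is not the tooling used in the study. Two further assumptions: each procedure corrupts exactly one stored byte, and a stream that can no longer be decoded counts as total loss.

```python
import random
import zlib

def changed_bytes(original: bytes, recovered: bytes) -> int:
    """Count differing bytes; missing or surplus bytes count as changed."""
    diff = sum(1 for a, b in zip(original, recovered) if a != b)
    return diff + abs(len(original) - len(recovered))

def rbt_percent(payload: bytes, encode, decode, runs: int, rng: random.Random) -> float:
    """Average payload loss in percent over `runs` single-byte corruptions."""
    stored = encode(payload)
    total = 0
    for _ in range(runs):
        damaged = bytearray(stored)
        damaged[rng.randrange(len(damaged))] ^= rng.randrange(1, 256)
        try:
            recovered = decode(bytes(damaged))
        except zlib.error:
            # Assumption: an undecodable stream is scored as total loss.
            recovered = b""
        total += changed_bytes(payload, recovered)
    return total / runs / len(payload) * 100

rng = random.Random(2008)                        # fixed seed for repeatability
payload = bytes(range(256)) * 8                  # hypothetical 2048-byte payload
raw = rbt_percent(payload, bytes, bytes, 300, rng)
deflated = rbt_percent(payload, zlib.compress, zlib.decompress, 300, rng)
print(f"uncompressed: {raw:.3f} %   zlib-compressed: {deflated:.3f} %")
```

In this setup the uncompressed payload loses exactly one byte per procedure, whereas a damaged zlib stream usually fails its integrity check and is scored as lost entirely. A decoder that salvages partial data would score better, so the numbers only illustrate the procedure and are not study results.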

Example: JP2-formatted image, corruption of 1 byte, “bad case”

Example: JP2-formatted image, corruption of 1 byte, “good case”

Example: JP2-formatted image, corruption of 1 byte, “good case” with visualised differences in pixel data

Thank you very much!

Volker Heydegger

University of Cologne

Archiving 2008

Bern, 23rd – 27th June 2008