The XCL Languages Digital Preservation – The Planets Way Dresden, April 23 rd 2010 Manfred Thaller, Universität zu Köln


The XCL Languages

Digital Preservation – The Planets Way

Dresden, April 23rd 2010

Manfred Thaller, Universität zu Köln

1. The vision

Observation

► Photoshop ►

► Photoshop ►

Works only if you are examining the actual image data …

Vision stage 1

[Diagram labels: png, tiff, Format conversion, Extractor, image info 1, image info 2, Comparator, "the same?"]

Vision stage 2

[Diagram labels: png, tiff, Format conversion, Extractor, png rules, tiff rules, image info 1, image info 2, Comparator, "the same?"]

Vision stage 3

[Diagram labels: Obj 1, Obj 2, Format conversion, Extractor, rule set 1, rule set 2, object info 1, object info 2, Comparator, "the same?"]

Vision stage 4

[Diagram labels: Obj 1, Obj 2, Format conversion, Extractor, XCEL 1, XCEL 2, XCDL 1, XCDL 2, Comparator, "the same?"]

Machine-readable form of a file format specification: "eXtensible Characterisation Extraction Language" (XCEL), able to describe any machine-readable format in a formal language, processible by a software tool for extraction of content as XCDL.

Abstract description of file content: "eXtensible Characterisation Definition Language" (XCDL), able to describe the content of digital objects (= 1 + n more files), processible by a software tool for further analysis.

Specification of the "similarity" to be used: "comparator comparison [Language]" (coco).

Specification of the "similarity" observed: "comparator results [Language]" (copra).

2. Examples I

Image width: 277

Image length: 339

XCL by Example

XCEL representation

<!-- Tag 256: ImageWidth (XCL: imageWidth) -->
<item xsi:type="structuringItem" identifier="IFDE_256" optional="true">
  <symbol interpretation="uint16" length="2" value="256"/>
  <item xsi:type="structuringItem" order="choice">
    <item xsi:type="structuringItem" order="sequence">
      <!-- Data type (value '3' means uint16) -->
      <symbol interpretation="uint16" length="2" value="3"/>
      <!-- number of values (N) -->
      <symbol interpretation="uint32" length="4" value="1"/>
      <!-- the value and name of property -->
      <symbol interpretation="uint16" length="2" name="imageWidth"/>
      <!-- wasted space -->
      <symbol interpretation="uint16" length="2"/>
      […]
    </item>
  </item>
</item>
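The XCEL fragment describes, symbol by symbol, the byte layout of one TIFF directory entry. As a rough illustration of what an extractor has to do with that description, the following Python sketch decodes such a 12-byte entry by hand. It is not the Planets extractor: the function name `read_image_width` and the sample bytes are assumptions made for this example, little-endian byte order is assumed by default, and only the uint16 branch of the "choice" is handled.

```python
import struct

IMAGE_WIDTH_TAG = 256   # TIFF tag number for ImageWidth
TYPE_SHORT = 3          # TIFF field type 3 = 16-bit unsigned integer


def read_image_width(entry: bytes, byte_order: str = "<"):
    """Return the width stored in one 12-byte IFD entry, or None if it does not match.

    Hypothetical helper: it mirrors the symbol sequence of the XCEL fragment
    (tag, data type, number of values, value, unused padding).
    """
    if len(entry) != 12:
        raise ValueError("a TIFF IFD entry is exactly 12 bytes long")
    # uint16 tag, uint16 data type, uint32 number of values
    tag, field_type, count = struct.unpack(byte_order + "HHI", entry[:8])
    if tag != IMAGE_WIDTH_TAG or field_type != TYPE_SHORT or count != 1:
        return None
    # uint16 value followed by two unused bytes ("wasted space")
    value, _padding = struct.unpack(byte_order + "HH", entry[8:])
    return value


# Sample entry built for this example: tag 256, type 3, one value, width 277.
sample_entry = struct.pack("<HHIHH", 256, 3, 1, 277, 0)
print(read_image_width(sample_entry))
```

The point of XCEL is that this layout knowledge lives in a declarative description rather than in hand-written code like the above.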

XCDL representation

…
<property id="p5">
  <name id="id30">imageWidth</name>
  <valueSet id="i_i1_s4">
    <labValue>
      <val>277</val>
      <type>int</type>
    </labValue>
  </valueSet>
</property>
...

XCEL entry:

<!-- the value and name of property -->
<symbol interpretation="uint16" length="2" name="imageWidth"/>
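To show how such a property can be consumed by a downstream tool, here is a small Python sketch that reads the name and value out of the XCDL fragment. It assumes the fragment exactly as shown on the slide, without namespaces or the surrounding document; `read_property` is a hypothetical helper, not part of any XCL tool.

```python
import xml.etree.ElementTree as ET

# The property fragment as shown on the slide (no namespace, no enclosing object).
XCDL_FRAGMENT = """
<property id="p5">
  <name id="id30">imageWidth</name>
  <valueSet id="i_i1_s4">
    <labValue>
      <val>277</val>
      <type>int</type>
    </labValue>
  </valueSet>
</property>
"""


def read_property(fragment: str):
    """Return (name, value) for one <property> element."""
    prop = ET.fromstring(fragment)
    name = prop.findtext("name")
    raw_value = prop.findtext("valueSet/labValue/val")
    type_name = prop.findtext("valueSet/labValue/type")
    # Only the "int" type from the example is converted; other types stay as text.
    value = int(raw_value) if type_name == "int" else raw_value
    return name, value


print(read_property(XCDL_FRAGMENT))
```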

XCDL representation (the same imageWidth property)

XCEL entry:

<!-- Data type (value '3' means uint16) -->
<symbol interpretation="uint16" length="2" value="3"/>

XCDL representations can now be compared…
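A minimal sketch of the comparison step, assuming each XCDL has already been reduced to a plain mapping from property name to value. The function name, the result labels and the property name `imageLength` are illustrative assumptions, not the actual comparator interface or XCL vocabulary; the values 277 and 339 are the ones from the example image.

```python
def compare_properties(first: dict, second: dict, requested: list) -> dict:
    """Compare the requested properties of two extracted descriptions."""
    results = {}
    for name in requested:
        if name not in first or name not in second:
            results[name] = "missing"
        elif first[name] == second[name]:
            results[name] = "equal"
        else:
            results[name] = "different"
    return results


# Hypothetical extraction results for a file before and after format conversion.
png_info = {"imageWidth": 277, "imageLength": 339}
tiff_info = {"imageWidth": 277, "imageLength": 339}

print(compare_properties(png_info, tiff_info, ["imageWidth", "imageLength"]))
```

The list of requested properties plays the role of the specification of which similarity to measure; the returned mapping plays the role of the observed similarity.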

3. Syntactical aspects of XCL processing

The XCEL tree

The XCEL tree describes a format.

The result tree

Parsing a file produces a result tree.

XCDL: models

All file content is understood as instances of "higher order data types":

image text [ sound ] [[ vector graphics ]]

XCDL: text model

A text (= <object>) is composed of data (= <normData>) plus interpretations / properties of data according to the underlying format specification (= <property>).

Representing a text in XCDL

This is a text

<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
  <name>fontsize</name>
  <rawVal>
    <val>48</val>
    <type>unsignedInt8</type>
  </rawVal>
  <dataRef>
    <!-- property refers to discrete part of reference data -->
    <ref id="1" start="0" end="3"/>
    <ref id="1" start="10" end="12"/>
  </dataRef>
</property>
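The reference data in this example is simply the byte sequence of the text, and each `<ref>` selects a part of it. The Python sketch below decodes the hex string and resolves the two references. It assumes ASCII encoding and that `start` and `end` are zero-based, inclusive offsets; the slide does not state this, and `referenced_parts` is a hypothetical helper.

```python
REF_DATA = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"

# Decode the space-separated hex bytes into the text they represent.
text = bytes.fromhex(REF_DATA).decode("ascii")


def referenced_parts(data: str, refs):
    """Return the substrings selected by (start, end) pairs, end inclusive (assumed)."""
    return [data[start:end + 1] for start, end in refs]


# The two <ref> elements of the fontsize property.
fontsize_refs = [(0, 3), (10, 12)]

print(text)
print(referenced_parts(text, fontsize_refs))
```

Under these assumptions the fontsize property applies to the parts "This" and "tex" of the reference data.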

XCDL: recursiveness

XCDL is fully recursive

An arbitrarily complex image can be a property of a textual position.

Aka: Illustrations in a text file

An arbitrarily complex text can be a property of a textual position.

Aka: footnotes

An arbitrarily complex text can be a property of an image segment.

Aka: embedded image descriptions
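The recursion described above can be pictured as a data structure in which the value of a property may itself be a complete object. The following Python sketch is only an illustration of that idea; the class names `XcdlObject` and `XcdlProperty` and their fields are assumptions, not the XCDL schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class XcdlObject:
    kind: str                      # "text" or "image"
    data: str                      # stand-in for the normalised data
    properties: List["XcdlProperty"] = field(default_factory=list)


@dataclass
class XcdlProperty:
    name: str
    value: object                  # a plain value or a nested XcdlObject


def depth(obj: XcdlObject) -> int:
    """Nesting depth of an object: 1 for a flat object, more for embedded ones."""
    nested = [depth(p.value) for p in obj.properties if isinstance(p.value, XcdlObject)]
    return 1 + max(nested, default=0)


# A text with an illustration, which in turn carries an embedded description.
description = XcdlObject("text", "An embedded image description")
illustration = XcdlObject("image", "<pixel data>", [XcdlProperty("description", description)])
document = XcdlObject("text", "Main text", [
    XcdlProperty("fontsize", 48),
    XcdlProperty("illustration", illustration),
])

print(depth(document))
```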

3. Semantic aspects of processing

Are the following two items equal:

VIII 8

How do humans do it?

eight eight

otto otto

acht acht

8.0

Replicating the approach in a machine:

Information model: "an image" / "a text"

Format ontology: "what terms are used in formats to describe image / textual properties".

Extraction language: "how to get the terms describing an image / a text out of a file encoded in a specific format".
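The "VIII" versus "8" example can be mimicked in a few lines of Python: different notations are mapped onto one common value before they are compared. This is a toy sketch under stated assumptions, not the XCL ontology; the word table, the function names and the choice of a float as the common representation are all made up for illustration.

```python
# Tiny stand-in for an ontology: number words in several languages (illustrative only).
NUMBER_WORDS = {"eight": 8, "otto": 8, "acht": 8}
ROMAN_DIGITS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral; a smaller digit before a larger one is subtracted."""
    total = 0
    for index, char in enumerate(numeral):
        value = ROMAN_DIGITS[char]
        following = ROMAN_DIGITS[numeral[index + 1]] if index + 1 < len(numeral) else 0
        total += -value if value < following else value
    return total


def normalise(token: str) -> float:
    """Map one notation of a number onto a common value."""
    cleaned = token.strip()
    if cleaned.lower() in NUMBER_WORDS:
        return float(NUMBER_WORDS[cleaned.lower()])
    try:
        return float(cleaned)
    except ValueError:
        pass
    if cleaned and all(char in ROMAN_DIGITS for char in cleaned.upper()):
        return float(roman_to_int(cleaned.upper()))
    raise ValueError(f"unknown notation: {token!r}")


notations = ["VIII", "8", "eight", "otto", "acht", "8.0"]
print({token: normalise(token) for token in notations})
```

Once both items are expressed in the common form, comparing them is trivial; the hard part, as on the slides, is knowing which terms in which format mean the same thing.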

The Planets XCL Approach – The Ontology

4. Conceptual aspects of processing

Data which represent stored information do so in two forms:

1. As a set of tokens, which describe atomic items of information.

2. By a set of independent parameters, which describe, in a formalized way, the semantic interpretation of these items of information.

Assumption I

1. Most algorithms today are based on "data types" that reflect hardware characteristics (char, int, float ...).

2. “Objects”, which are constructed from these data types, are transient concepts, which are meaningful only within a specific implementation / environment.

3. What we would need are considerably higher order objects, which are persistent by themselves and independent of a specific implementation / environment.

Assumption II

The need formulated as assumption II can be fulfilled using assumption I.

Assumption III

(1) $I = i(D, S, t)$

(2) $I_2 = i(I_1, S_2, t)$

(3) $I_x = i(I_{x-1}, S_x, t)$

(4) $S_x = s(I_{x-1}, t)$

(5) $I_x = i(I_{x-\alpha}, S_{x-\beta}, t)$

(6) $I_x = i(I_{x-\alpha}, s(I_{x-\beta}, t), t)$

Generalisation of Langefors' "Infological Equation"

$I$ = information; $i(\ldots)$ = interpretative process; $D$ = data; $S$ = previous knowledge; $t$ = time

5. Inclusion of rendering results

Observation: A file in Word 2003

Observation: A file in Word 2007

Observation: A file in Open Office

Observation: A file in Acrobat

Proposal to measure layout

Cut out page from rendering surface.

Scale to common dimensions: 371 +/- 1 x 521 +/- 1

Measure:

1. The leftmost and lowest completely black pixel in the letter "A" starting the first line of the main text.

2. The leftmost and highest completely black pixel in the letter "E" starting the first line of the text in the footnote.

3. The geometrical centre of the period at the end of the main sentence.

4. The geometrical centre of the period at the end of the footnote text.
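The two kinds of measurement in this list can be sketched on a tiny greyscale grid. The Python code below is an illustration only: it assumes the glyph has already been cut out of the scaled page, that 0 means completely black, that rows run from top to bottom, and that "leftmost and lowest" means the lowest black pixel within the leftmost black column. The function names and the sample grids are made up for this example.

```python
def black_pixels(pixels):
    """Return (x, y) for every completely black pixel; row 0 is the top row."""
    return [(x, y) for y, row in enumerate(pixels) for x, value in enumerate(row) if value == 0]


def leftmost_lowest_black(pixels):
    """Leftmost black column first, then the lowest pixel (largest y) in that column."""
    found = black_pixels(pixels)
    if not found:
        return None
    return min(found, key=lambda point: (point[0], -point[1]))


def geometric_centre(pixels):
    """Mean position of all black pixels, for example of a period."""
    found = black_pixels(pixels)
    if not found:
        return None
    xs = [x for x, _ in found]
    ys = [y for _, y in found]
    return (sum(xs) / len(found), sum(ys) / len(found))


# Crude 3 x 4 stand-in for a letter "A" and a 4 x 4 stand-in for a period.
letter_a = [
    [255,   0, 255],
    [  0, 255,   0],
    [  0,   0,   0],
    [  0, 255,   0],
]
period = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]

print(leftmost_lowest_black(letter_a))
print(geometric_centre(period))
```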

Proposal to measure layout

<significantPoints>
  <point name="i" x="45" y="134"/>
  <point name="ii" x="57" y="470"/>
  <point name="iii" x="215" y="322"/>
  <point name="iv" x="254" y="483"/>
</significantPoints>

Could (will?) be done algorithmically, by the way.
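Reading such a point list back into a program is straightforward. The sketch below parses the `<significantPoints>` fragment into a mapping from point name to coordinates; `read_points` is a hypothetical helper, and the fragment is assumed to be well-formed XML with straight quotes.

```python
import xml.etree.ElementTree as ET

SIGNIFICANT_POINTS = """
<significantPoints>
  <point name="i" x="45" y="134"/>
  <point name="ii" x="57" y="470"/>
  <point name="iii" x="215" y="322"/>
  <point name="iv" x="254" y="483"/>
</significantPoints>
"""


def read_points(xml_text: str) -> dict:
    """Return {name: (x, y)} for every <point> element."""
    root = ET.fromstring(xml_text)
    return {
        point.get("name"): (int(point.get("x")), int(point.get("y")))
        for point in root.findall("point")
    }


print(read_points(SIGNIFICANT_POINTS))
```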

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Measuring Word 2003

Measuring Word 2007

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Measuring Open Office

(i) = 45 / 134;

(ii) = 52 / 470;

(iii) = 215 / 322 ;

(iv) = 247 / 483

Measuring Open Office

(i) = 45 / 134;

(ii) = 52 / 470; 57

(iii) = 215 / 322;

(iv) = 247 / 483 254
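The measurements for Word 2003, Word 2007 and Open Office can be compared point by point. The following Python sketch computes the offsets of each point relative to a reference rendering and checks them against a tolerance. The function names are made up for this example, and the tolerance of one pixel is an assumption borrowed from the "+/- 1" of the scaling step, not a threshold given in the talk.

```python
# Point coordinates (x, y) as listed on the measurement slides.
WORD_2003 = {"i": (45, 134), "ii": (57, 470), "iii": (215, 322), "iv": (254, 483)}
WORD_2007 = {"i": (45, 134), "ii": (57, 470), "iii": (215, 322), "iv": (254, 483)}
OPEN_OFFICE = {"i": (45, 134), "ii": (52, 470), "iii": (215, 322), "iv": (247, 483)}


def point_offsets(reference: dict, candidate: dict) -> dict:
    """Offset (dx, dy) of every candidate point relative to the reference."""
    return {
        name: (candidate[name][0] - x, candidate[name][1] - y)
        for name, (x, y) in reference.items()
    }


def within_tolerance(offsets: dict, tolerance: int = 1) -> bool:
    """True if no point moved by more than the tolerance in either direction."""
    return all(abs(dx) <= tolerance and abs(dy) <= tolerance for dx, dy in offsets.values())


print(point_offsets(WORD_2003, OPEN_OFFICE))
print(within_tolerance(point_offsets(WORD_2003, WORD_2007)))
print(within_tolerance(point_offsets(WORD_2003, OPEN_OFFICE)))
```

With these values the two Word renderings coincide, while in Open Office points (ii) and (iv) are shifted horizontally by 5 and 7 pixels.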

(i) = 45 / 132; 45 / 133;

(ii) = 59 / 469; 57 / 470;

(iii) = 215 / 321; 215 / 322;

(iv) = 254 / 481 254 / 483

Measuring Acrobat Reader

Automated by image segmentation

Usage

Used within the comparison logic as described before.

The layout characteristics will presumably become part of the AIP in a distributed long-term preservation system we may become responsible for.

Proof-of-concept implementation for static content will become part of Planets final deliverables.

Proof-of-concept implementation for dynamic content may become part of Planets final deliverables.

Thank you!

[email protected]

planetarium.hki.uni-koeln.de

4. Some Examples

Extraction DOCX and PDF (text only)

[Screenshots: DOCX-extraction and PDF-extraction; comparison results]

Font-Changes in DOCX and PDF

[Screenshots: Word 2007 and Adobe Acrobat Reader; DOCX-extraction and PDF-extraction; document comparison]

… more fonts …

Word 2007 Adobe Acrobat Reader

symbol-fonts (images or text)

[Screenshots: DOCX-extraction and PDF-extraction; document comparison]

Extraction DOCX and PDF (text AND image)

[Screenshots: Word 2007 DOCX with embedded image and Adobe Acrobat Reader PDF with embedded image; DOCX-extraction and PDF-extraction; main document comparison; recursive (image) document comparison]

Audio

XCDLs extracted from audio

[Screenshots: WAV-extraction and MP3-extraction]