
The XCL Languages

Digital Preservation – The Planets Way

Dresden, April 23rd 2010

Manfred Thaller, Universität zu Köln

1. The vision

Observation

► Photoshop ►

Works only if you are examining the actual image data …

Vision stage 1

[Diagram: Format conversion; Extractor; image info 1, image info 2; Comparator: the same?]

Vision stage 2

[Diagram: png rules, tiff rules; Format conversion; Extractor; image info 1, image info 2; Comparator: the same?]

Vision stage 3

[Diagram: rule set 1, rule set 2; Format conversion; Extractor; object info 1, object info 2; Comparator: the same?]

Vision stage 4

[Diagram: XCEL 1, XCEL 2; Format conversion; Extractor; XCDL 1, XCDL 2; Comparator: the same?]

Machine readable form of a file format specification: “eXtensible Characterisation Extraction Language” (XCEL), able to describe any machine readable format in a formal language, processible by a software tool for extraction of content as XCDL.

Abstract description of file content: “eXtensible Characterisation Definition Language” (XCDL), able to describe the content of digital objects (= 1 + n more files), processible by a software tool for further analysis.

Specification of “similarity” to be used: “comparator comparison [Language]” (coco).

Specification of “similarity” observed: “comparator results [Language]” (copra).

2. Examples I

Image width: 277

Image length: 339

XCL by Example

XCEL representation

<item xsi:type="structuringItem" identifier="IFDE_256" optional="true">
  <symbol interpretation="uint16" length="2" value="256"/>
  <item xsi:type="structuringItem" order="choice">
    <item xsi:type="structuringItem" order="sequence">
      <!-- Data type (value '3' means uint16) -->
      <symbol interpretation="uint16" length="2" value="3"/>
      <!-- number of values (N) -->
      <symbol interpretation="uint32" length="4" value="1"/>
      <symbol interpretation="uint16" length="2" name="imageWidth"/>
      <symbol interpretation="uint16" length="2"/>
      […]
    </item>
  </item>
</item>
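To make the XCEL entry above concrete, here is a minimal Python sketch of what an extractor would read for it. It assumes a little-endian TIFF and the standard 12-byte IFD entry layout (tag, type, count, value). The name `read_image_width` and the sample bytes are illustrative only and are not part of the XCL tools.

```python
import struct
from typing import Optional

# Assumption: little-endian ("II") TIFF; a big-endian file would use ">HHIHH".
# Fields mirror the XCEL symbols: uint16 tag, uint16 type, uint32 count,
# uint16 value, uint16 unused remainder of the 4-byte value field.
IFD_ENTRY = struct.Struct("<HHIHH")

def read_image_width(entry: bytes) -> Optional[int]:
    """Return imageWidth if the entry matches the pattern above, else None."""
    tag, dtype, count, value, _unused = IFD_ENTRY.unpack(entry)
    if tag == 256 and dtype == 3 and count == 1:
        return value
    return None

# Hypothetical sample entry carrying the width from the example slide.
sample = IFD_ENTRY.pack(256, 3, 1, 277, 0)
print(read_image_width(sample))
```

The fixed values in the XCEL entry (256, 3, 1) act as the match condition, and the named symbol is the value that ends up in the XCDL output.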

…
<property id="p5">
  <name id="id30">imageWidth</name>
  <valueSet id="i_i1_s4">
    <labValue>
      <val>277</val>
      <type>int</type>
    </labValue>
  </valueSet>
</property>

XCDL representation

XCEL entry: <symbol interpretation="uint16" length="2" name="imageWidth"/>

…
<property id="p5">
  <name id="id30">imageWidth</name>
  <valueSet id="i_i1_s4">
    <labValue>
      <val>277</val>
      <type>int</type>
    </labValue>
  </valueSet>
</property>

XCDL representation

XCEL entry: <!-- Data type (value '3' means uint16) --> <symbol interpretation="uint16" length="2" value="3"/>

XCDL representations can now be compared…
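A minimal Python sketch of such a comparison, assuming simplified XCDL fragments without XML namespaces and wrapped in a hypothetical `<xcdl>` root element. The helpers `properties` and `compare` are illustrative and do not reproduce the actual Planets comparator.

```python
import xml.etree.ElementTree as ET

def properties(xcdl_text):
    """Map each property name to the text of its <val> element."""
    root = ET.fromstring(xcdl_text)
    return {
        prop.findtext("name"): prop.findtext("valueSet/labValue/val")
        for prop in root.iter("property")
    }

def compare(a, b):
    """Return {name: (value_a, value_b, equal?)} over all names in either input."""
    return {
        name: (a.get(name), b.get(name), a.get(name) == b.get(name))
        for name in sorted(set(a) | set(b))
    }

# Hypothetical wrapper element <xcdl>; the property is the one from the slides.
XCDL_1 = """<xcdl><property id="p5"><name id="id30">imageWidth</name>
<valueSet id="i_i1_s4"><labValue><val>277</val><type>int</type></labValue></valueSet>
</property></xcdl>"""
XCDL_2 = XCDL_1.replace("277", "276")  # stand-in for a conversion that changed the width

print(compare(properties(XCDL_1), properties(XCDL_1)))
print(compare(properties(XCDL_1), properties(XCDL_2)))
```

Because both files are reduced to the same property vocabulary first, the comparison itself never needs to know which formats were involved.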

3. Syntactical aspects of XCL processing

The XCEL tree

The XCEL tree describes a format.

The result tree

Parsing a file produces a result tree.

XCDL: models

All file content is understood as instances of “higher order data types”:

image text [ sound ] [[ vector graphics ]]

XCDL: text model

A text (= <object>) is composed of data (= <normData>) plus interpretations / properties of data according to the underlying format specification (= <property>).

This is a text

<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
  <name>fontsize</name>
  <rawVal>
    <val>48</val>
    <type>unsignedInt8</type>
  </rawVal>
  <dataRef>
    <ref id="1" start="0" end="3"/>
    <ref id="1" start="10" end="12"/>
  </dataRef>
</property>

Representing a text in XCDL
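The relation between the data and the property in the example above can be sketched in a few lines of Python. The sketch assumes that `start` and `end` are inclusive, zero-based byte offsets into the referenced data; the slides do not state this, so treat it as a guess. The helper name `resolve` is hypothetical.

```python
# Hex bytes copied from the <refData id="1"> element in the example.
ref_data = bytes.fromhex("54 68 69 73 20 69 73 20 61 20 74 65 78 74")

# (start, end) pairs copied from the <ref> elements of the fontsize property.
refs = [(0, 3), (10, 12)]

def resolve(data: bytes, start: int, end: int) -> str:
    """Return the referenced span, assuming inclusive byte offsets."""
    return data[start:end + 1].decode("ascii")

print(ref_data.decode("ascii"))
for start, end in refs:
    print(start, end, repr(resolve(ref_data, start, end)))
```

The property (fontsize 48) is stored once and points at the spans it applies to, so the text data itself stays free of format-specific markup.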

XCDL: recursiveness

XCDL is fully recursive

An arbitrarily complex image can be a property of a textual position.

Aka: Illustrations in a text file

XCDL: recursiveness

An arbitrarily complex text can be a property of a textual position.

Aka: footnotes

XCDL: recursiveness

An arbitrarily complex text can be a property of an image segment.

Aka: embedded image descriptions
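The three cases above share one shape: a property value may itself be a complete object. A minimal Python sketch of that data structure follows; the class and field names (`XcdlObject`, `Property`, `kind`) and the sample property names are hypothetical and are not taken from the XCDL schema.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Property:
    name: str
    value: Union[str, "XcdlObject"]  # a plain value or a nested object

@dataclass
class XcdlObject:
    kind: str  # e.g. "text" or "image"
    properties: List[Property] = field(default_factory=list)

def depth(obj: XcdlObject) -> int:
    """Nesting depth of an object: 1 for a flat object, +1 per nested level."""
    nested = [depth(p.value) for p in obj.properties if isinstance(p.value, XcdlObject)]
    return 1 + max(nested, default=0)

# A text containing an illustration that carries an embedded description.
description = XcdlObject("text", [Property("fontsize", "48")])
illustration = XcdlObject("image", [Property("description", description)])
document = XcdlObject("text", [Property("illustration", illustration)])
print(depth(document))
```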

3. Semantic aspects of processing

Are the following two items equal:

VIII 8

How do Humans do it?

VIII 8

eight eight


Information model: „an image“ / „a text“

Replicating the approach in a machine:

VIII 8

Format ontology: „what terms are used in formats to describe image / textual properties“.

Extraction language: “how to get the terms describing an image / a text out of a file encoded in a specific format”.
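The “VIII” versus “8” question can be replayed in a toy Python sketch: each token is first mapped to a common abstract value (the role of the information model), and only those values are compared. The tables and function names are illustrative assumptions, and the Roman numeral parser does not validate malformed input.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral using the subtractive rule."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller symbol before a larger one is subtracted (IV, IX, ...).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def to_number(token: str) -> int:
    """Map one 'encoding' of a number to its abstract value."""
    t = token.strip()
    if t.isdigit():
        return int(t)
    if t.lower() in WORDS:
        return WORDS[t.lower()]
    if t and all(c in ROMAN for c in t.upper()):
        return roman_to_int(t.upper())
    raise ValueError(f"unknown representation: {token!r}")

def same(a: str, b: str) -> bool:
    return to_number(a) == to_number(b)

print(same("VIII", "8"), same("eight", "VIII"), same("VIII", "9"))
```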


The Planets XCL Approach – The Ontology

4. Conceptual aspects of processing

Data which represent stored information do so in two forms:

1. As a set of tokens, which describe atomic items of information.

2. By a set of independent parameters, which describe, in a formalized way, the semantic interpretation of these items of information.

Assumption I

1. Most algorithms today are based on “data types” that reflect hardware characteristics (char, int, float ...).

2. “Objects” constructed from these data types are transient concepts, meaningful only within a specific implementation / environment.

3. What we need are considerably higher order objects, which are persistent in themselves and independent of a specific implementation / environment.

Assumption II

The need formulated as assumption II can be fulfilled using assumption I.

Assumption III

(1) I = i(D, S, t)

(2) I_2 = i(I_1, S_2, t)

(3) I_x = i(I_{x-1}, S_x, t)

(4) S_x = s(I_{x-1}, t)

(5) I_x = i(I_{x-α}, S_{x-β}, t)

(6) I_x = i(I_{x-α}, s(I_{x-β}, t), t)

Generalisation of Langefors “Infological Equation”

I = information; i(…) = interpretative process; D = data; S = previous knowledge; t = time
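As a worked step, substituting (4) into (3) eliminates the explicit knowledge term. The result is the special case α = β = 1 of (6). This is only a consistency check on the equations as listed, not an addition to the model.

```latex
\begin{align*}
I_x &= i(I_{x-1},\, S_x,\, t) && \text{(3)} \\
S_x &= s(I_{x-1},\, t) && \text{(4)} \\
\Rightarrow\quad I_x &= i\bigl(I_{x-1},\, s(I_{x-1}, t),\, t\bigr) && \text{(6) with } \alpha = \beta = 1
\end{align*}
```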

5. Inclusion of rendering results

Observation: A file in Word 2003

Observation: A file in Word 2007

Observation: A file in Open Office

Observation: A file in Acrobat

Proposal to measure layout

Cut out page from rendering surface.

Scale to common dimensions: 371 +/- 1 x 521 +/- 1

Measure:

1. The leftmost and lowest completely black pixel in the letter “A” starting the first line of the main text.

2. The leftmost and highest completely black pixel in the letter “E” starting the first line of the text in the footnote.

3. The geometrical centre of the period at the end of the main sentence.

4. The geometrical centre of the period at the end of the footnote text.
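Measurements 1 and 2 can be sketched as a simple pixel scan. The Python below assumes the glyph has already been cut out as a grid of rows (top to bottom) in which 0 means completely black, and it reads “leftmost and lowest” as: take the leftmost column containing a black pixel, then the lowest black pixel in that column. Both the representation and that reading are assumptions; the function name is hypothetical.

```python
def leftmost_black(image, lowest=True):
    """Return (x, y) of the leftmost black pixel, or None if there is none.

    image: list of rows (top to bottom), each a list of ints; 0 = black.
    lowest=True picks the lowest black pixel in the leftmost black column,
    lowest=False picks the highest. y is counted from the top row.
    """
    if not image:
        return None
    height, width = len(image), len(image[0])
    rows = range(height - 1, -1, -1) if lowest else range(height)
    for x in range(width):
        for y in rows:
            if image[y][x] == 0:
                return (x, y)
    return None

# Tiny hypothetical glyph: black pixels at (0, 0), (0, 2) and (1, 1).
glyph = [
    [0, 1],
    [1, 0],
    [0, 1],
]
print(leftmost_black(glyph, lowest=True))
print(leftmost_black(glyph, lowest=False))
```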

Proposal to measure layout

Could (will ?) be done algorithmically by the way.

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Measuring Word 2003

Measuring Word 2007

(i) = 45 / 134;

(ii) = 57 / 470;

(iii) = 215 / 322 ;

(iv) = 254 / 483

Measuring Open Office

(i) = 45 / 134;

(ii) = 52 / 470;

(iii) = 215 / 322 ;

(iv) = 247 / 483

Measuring Open Office

(i) = 45 / 134;

(ii) = 52 / 470; 57

(iii) = 215 / 322 ;

(iv) = 247 / 483 254

(i) = 45 / 132; 45 / 133;

(ii) = 59 / 469; 57 / 470;

(iii) = 215 / 321 ; 215 / 322 ;

(iv) = 254 / 481; 254 / 483

Measuring Acrobat Reader
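The measured points can be compared mechanically. The Python sketch below reads each “a / b” pair as pixel coordinates and takes the first pair on each Acrobat Reader line as the Acrobat measurement; both readings are assumptions. The tolerance of 1 pixel is a hypothetical choice borrowed from the “+/- 1” of the scaling step.

```python
# Points (i) to (iv) as listed on the measurement slides.
MEASUREMENTS = {
    "Word": [(45, 134), (57, 470), (215, 322), (254, 483)],
    "Open Office": [(45, 134), (52, 470), (215, 322), (247, 483)],
    "Acrobat Reader": [(45, 132), (59, 469), (215, 321), (254, 481)],
}

def deviations(reference, candidate):
    """Absolute per-point differences as (d_first, d_second) pairs."""
    return [
        (abs(r1 - c1), abs(r2 - c2))
        for (r1, r2), (c1, c2) in zip(reference, candidate)
    ]

def within_tolerance(reference, candidate, tol=1):
    """True if every coordinate differs by at most tol (hypothetical threshold)."""
    return all(d1 <= tol and d2 <= tol for d1, d2 in deviations(reference, candidate))

word = MEASUREMENTS["Word"]
for name, points in MEASUREMENTS.items():
    print(name, deviations(word, points), within_tolerance(word, points))
```

With these readings, Open Office differs from Word only in the first coordinate of points (ii) and (iv), while Acrobat Reader shows small differences in all four points.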

Automated by image segmentation

Used within the comparison logic as described before.

The layout characteristics will presumably become part of the AIP in a distributed long term preservation system we may become responsible for.

Proof of concept implementation for static content will become part of Planets final deliverables.

Proof of concept implementation for dynamic content may become part of Planets final deliverables.

Thank you!

manfred.thaller@uni-koeln.de

http://planetarium.hki.uni-koeln.de

4. Some Examples

Extraction DOCX and PDF (text only)

DOCX-extraction PDF-extraction

PDF-extraction DOCX-extraction

comparison results

Font-Changes in DOCX and PDF

Word 2007 Adobe Acrobat Reader

document comparison

… more fonts …

Word 2007 Adobe Acrobat Reader

symbol-fonts (images or text)

document comparison

Extraction DOCX and PDF (text AND image)

Word 2007: DOCX with embedded image

Adobe Acrobat Reader: PDF with embedded image

main document comparison

recursive (image) document comparison

XCDLs extracted from audio

WAV-extraction MP3-extraction

