XCL-Tools in relation to Significant characteristics in Planets Manfred Thaller Universität zu* Köln *University at not of Cologne

Upload: leon-oliver

Post on 19-Jan-2016


TRANSCRIPT

Page 1: XCL-Tools in relation to Significant characteristics in Planets Manfred Thaller Universität zu* Köln *University at not of Cologne

XCL-Tools in relation to Significant characteristics in Planets

Manfred Thaller

Universität zu* Köln

*University at, not of, Cologne

Page 2:

What are “significant characteristics”?

Those properties of a digital file which have to be known to enable the processing of the file within a specific setup.

Page 3:

Why extract them by software?

To create technical metadata as required by organizational models for long-term preservation. (NLNZ)

Page 4:

Within Planets …

… served by solutions to identify formats: formats registry / PRONOM / DROID.

… and a solution for extracting and processing such characteristics: XCL.

Page 5:

[Diagram: Migrator (tiff, png); Extractor (tiff XCEL, png XCEL); Comparator (tiff XCDL, png XCDL); result: 93%]

A Vision

Page 6:

[Diagram: Extractor (appropriate XCELs); Comparator; C-Set]

A Vision

Page 7:

1 million objects: use one second for each.

== 16666.7 minutes == 277.8 hours

== 11.57 working days of a computer

== 34.7 8-hour days for a Human

== 7 working weeks

Why automate?
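The conversion on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `workload` and the five-day working week are assumptions made for the illustration, not part of the original slides.

```python
def workload(objects: int, seconds_per_object: float) -> dict:
    """Convert a per-object processing time into total minutes, hours, days and weeks."""
    total_seconds = objects * seconds_per_object
    hours = total_seconds / 3600
    return {
        "minutes": total_seconds / 60,
        "hours": hours,
        "machine_days": hours / 24,    # a computer can run around the clock
        "human_days": hours / 8,       # 8-hour working days
        "human_weeks": hours / 8 / 5,  # assumption: 5 working days per week
    }


if __name__ == "__main__":
    # One million objects at one second each, as on the slide.
    for unit, value in workload(1_000_000, 1).items():
        print(f"{unit}: {value:,.2f}")
```

Changing `seconds_per_object` shows how quickly manual inspection becomes unaffordable.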

Page 8:

1 million objects: use five minutes for each.

== 83 333.3 hours

== 10 416.7 8-hour days for a Human

Why automate?

Page 9:

Assumption: Preservation is feasible only if the content of two digital objects can be compared without human intervention, yielding a numerical estimate of their degree of similarity.

Why automate?

Page 10:

(1) Language to represent the complete content of a digital object. XCDL

(2) Language to describe any machine readable format in a formal language. XCEL

(3) Software to extract the content of a file based upon a description as under (2) and express it in the language as specified under (1). “extractor”

(4) Software to compare two such content descriptions. “comparator”

Abstract solution I

Page 11:

<XCELDocument...>
  ...
  <formatDescription>
    ....
    <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
      <range>
        <startposition xsi:type="sequential"> </startposition>
        <length xsi:type="fixed">4</length>
      </range>
      <name>height</name>
    </symbol>
    <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
      <range>
        <startposition xsi:type="sequential"> </startposition>
        <length xsi:type="fixed">1</length>
      </range>
      <valueInterpretation>
        <valueLabel>greyscale</valueLabel>
        <value>0</value>
      </valueInterpretation>
      <name>imageType</name>
    </symbol>
    <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
      <range>
        <startposition xsi:type="sequential"> </startposition>
        <length xsi:type="fixed">1</length>
      </range>
      <valueInterpretation>
        <valueLabel>zlibDeflateInflate</valueLabel>
        <value>0</value>
      </valueInterpretation>
      <name>compression</name>
    </symbol>
    ...

<xcdl>
  <object id="o1" >
    <normData id="nd1" > ... </normData>
    <property id="p1" source="raw" cat="descr" >
      <name>compression</name>
      <valueSet id="i_i1_s6" >
        <rawValue>0 </rawValue>
        <labValue>...</labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p2" source="raw" cat="descr" >
      <name>height</name>
      <valueSet id="i_i1_s3" >
        <rawValue>0 0 1 ad </rawValue>
        <labValue>
          <val>429</val>
          <type>uint32</type>
        </labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p3" source="raw" cat="descr" >
      <name>imageType</name>
      .....
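To make the step from a format description to extracted properties more concrete, here is a minimal Python sketch. It is not the XCL extractor: the `Symbol` class, the `SYMBOLS` table, the `extract` function and the sample header bytes are all hypothetical stand-ins. It only assumes what the excerpt shows, namely sequential fields of fixed length, an integer interpretation, and optional value labels. The field order follows the PNG image header, which stores integers in big-endian order.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str            # property name written to the output
    length: int          # fixed length in bytes
    interpretation: str  # e.g. "uint32" or "uint8"


# Hypothetical, heavily simplified stand-in for an XCEL format description:
# every field starts where the previous one ended ("sequential").
SYMBOLS = [
    Symbol("width", 4, "uint32"),
    Symbol("height", 4, "uint32"),
    Symbol("bitDepth", 1, "uint8"),
    Symbol("imageType", 1, "uint8"),
    Symbol("compression", 1, "uint8"),
]

# Value labels as they appear in the excerpt above.
LABELS = {
    "imageType": {0: "greyscale"},
    "compression": {0: "zlibDeflateInflate"},
}


def extract(data: bytes, symbols=SYMBOLS) -> list:
    """Read the fields one after another and return raw and interpreted values."""
    position = 0
    properties = []
    for symbol in symbols:
        raw = data[position:position + symbol.length]
        if len(raw) < symbol.length:
            raise ValueError(f"input too short for field {symbol.name!r}")
        value = int.from_bytes(raw, "big")
        properties.append({
            "name": symbol.name,
            "rawValue": " ".join(f"{byte:x}" for byte in raw),
            "value": value,
            "type": symbol.interpretation,
            "label": LABELS.get(symbol.name, {}).get(value),
        })
        position += symbol.length
    return properties


if __name__ == "__main__":
    # Made-up header bytes; only the height (0x000001ad) comes from the slide.
    header = bytes.fromhex("00000200" "000001ad" "08" "00" "00")
    for prop in extract(header):
        print(prop)
```

With these bytes the height field yields the raw value `0 0 1 ad` and the interpreted value 429, matching the XCDL excerpt.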

Page 12:

<request2>
  <measurementRequest>
    <source name="XCDL1.xml"/>
    <target name="XCDL2.xml"/>
    <property id="45" name="rgbPalette">
      <metric id="10" name="hammingDistance"/>
    </property>
    <property id="300" name="normData">
      <metric id="10" name="hammingDistance"/>
      <metric id="50" name="RMSE"/>
    </property>
    <property id="2" name="imageHeight" unit="pixel">
      <metric id="200" name="equal"/>
      <metric id="201" name="intDiff"/>
      <metric id="210" name="percDev"/>
    </property>
    <property id="30" name="imageWidth" unit="pixel">
      <metric id="200" name="equal"/>
      <metric id="201" name="intDiff"/>
      <metric id="210" name="percDev"/>
    </property>

<property id="2" name="imageHeight" unit="pixel" compStatus="complete">
  <values type="int">
    <src>32</src>
    <tar>32</tar>
  </values>
  <metric id="200" name="equal" result="true"/>
  <metric id="201" name="intDiff" result="0"/>
  <metric id="210" name="percDev" result="0.000000"/>
</property>
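The metric names in the request suggest simple numerical comparisons. The following Python sketch shows one plausible reading of each. The exact definitions used by the XCL comparator are not given on the slide, so the formulas (absolute difference for `intDiff`, deviation relative to the source value for `percDev`) are assumptions, as are the function names.

```python
import math


def equal(src, tar) -> bool:
    """Metric "equal": are the two values identical?"""
    return src == tar


def int_diff(src: int, tar: int) -> int:
    """Metric "intDiff", assumed here to be the absolute integer difference."""
    return abs(src - tar)


def perc_dev(src: float, tar: float) -> float:
    """Metric "percDev", assumed here to be the deviation in percent of the source value."""
    if src == 0:
        return 0.0 if tar == 0 else math.inf
    return abs(src - tar) / abs(src) * 100


def hamming_distance(a, b) -> int:
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rmse(a, b) -> float:
    """Root mean square error between two equally long numeric sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # The imageHeight values from the result shown above.
    src, tar = 32, 32
    print(equal(src, tar), int_diff(src, tar), f"{perc_dev(src, tar):.6f}")
```

For the values shown in the result (source 32, target 32) these definitions reproduce `true`, `0` and `0.000000`.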


Page 14:

Are the following two items equal:

VIII 8

Page 15:

VIII 8

eight eight

Page 16:

VIII 8

eight eight

otto

otto

Page 17:

VIII 8

eight eight

otto

otto

acht

acht

Page 18:

VIII 8

eight eight

otto

otto

acht

acht

8.0

Page 19:

VIII 8

eight eight

otto

otto

acht

acht

Information model: „an image“
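The question whether "VIII" and "8" are equal has no answer at the level of characters; it depends on the model under which both are interpreted. A small Python sketch of that idea follows. The lookup table `WORDS` and the function names are invented for the illustration, and the Roman numeral parser handles only simple, well-formed numerals.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Illustrative lookup covering only the words shown on the slides.
WORDS = {"eight": 8, "otto": 8, "acht": 8}


def roman_to_int(text: str) -> int:
    """Interpret a well-formed Roman numeral such as "VIII" or "IX"."""
    total = 0
    for char, following in zip(text, text[1:] + " "):
        value = ROMAN[char]
        # A smaller digit in front of a larger one is subtracted (IV, IX, ...).
        if following in ROMAN and ROMAN[following] > value:
            total -= value
        else:
            total += value
    return total


def as_number(token: str) -> float:
    """Map a token onto the abstract model "number"."""
    cleaned = token.strip()
    if cleaned.lower() in WORDS:
        return float(WORDS[cleaned.lower()])
    try:
        return float(cleaned)
    except ValueError:
        pass
    if cleaned and all(char in ROMAN for char in cleaned.upper()):
        return float(roman_to_int(cleaned.upper()))
    raise ValueError(f"cannot interpret {token!r} as a number")


if __name__ == "__main__":
    tokens = ["VIII", "8", "eight", "otto", "acht", "8.0"]
    print("equal as strings:", len(set(tokens)) == 1)
    print("equal as numbers:", len({as_number(t) for t in tokens}) == 1)
```

As strings the six tokens are all different; under the model "number" they denote the same value.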

Page 20:

VIII 8

Information model: “an image”

Format ontology: “what terms are used in formats to describe image properties”

Page 21:

Extraction language: “how to get the terms describing an image out of a file”

Information model: “what is an image”

Format ontology: “what terms are used in formats to describe image properties”

Page 22:

(1) A theoretical model of information (not: data) types – “image”, “text”, “audio” ...

(2) Ontologies, which map existing file format terminologies onto this model.

(3) A language, XCDL, which allows the content of files in different formats to be expressed using the vocabulary of the ontologies and the “grammar” of the information model.

Abstract solution II
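Point (2) can be pictured as a lookup table per format. The Python sketch below is purely illustrative: the PNG entries follow the `originalName` / `name` pairs from the XCEL excerpt earlier in the deck, the TIFF entries use standard TIFF tag names, and the table itself, like the function `normalise`, is an assumption rather than the actual Planets ontology.

```python
# Format-specific term -> term of the common vocabulary (illustrative only).
ONTOLOGY = {
    "png": {
        "height": "height",
        "colourType": "imageType",
        "compressionMethod": "compression",
    },
    "tiff": {
        "ImageLength": "height",
        "PhotometricInterpretation": "imageType",
        "Compression": "compression",
    },
}


def normalise(file_format: str, fields: dict) -> dict:
    """Rename format-specific fields to the common vocabulary; unknown fields are dropped."""
    mapping = ONTOLOGY[file_format]
    return {mapping[name]: value for name, value in fields.items() if name in mapping}


if __name__ == "__main__":
    png = normalise("png", {"height": 429})
    tiff = normalise("tiff", {"ImageLength": 429, "Software": "unknown"})
    print(png, tiff, png == tiff)
```

Only the names are mapped here. Coded values such as compression schemes differ between formats and would need their own mapping before they could be compared.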

Page 23:

eXtensible Characterisation Definition Language

Purpose: Describe the contents of a file in terms of an abstract model.

XCDL

Page 24:

XCDL: text model (1)

A text (= <object>) is composed of data (= <normData>) plus interpretations of data according to the underlying format specification (= <property>).

Page 25:

XCDL: text model (2)

Or, one level of abstraction higher, a text is composed of content-carrying tokens, accompanied by rendering info plus deployment info plus historical info.

Page 26:

This is a text

<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
  <name>fontsize</name>
  <rawVal>
    <val>48</val>
    <type>unsignedInt8</type>
  </rawVal>
  <dataRef>
    <!-- property refers to discrete part of reference data -->
    <ref id="1" start="0" end="3"/>
    <ref id="1" start="10" end="12"/>
  </dataRef>
</property>
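The relation between the reference data and a property can be replayed in a few lines of Python. The sketch decodes the hexadecimal bytes of the refData element and cuts out the ranges named by the two ref elements. Reading `end` as an inclusive index is an assumption, suggested by the first range (0 to 3) covering the four letters of "This"; the function names are hypothetical.

```python
REF_DATA = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"  # from the refData element
REFS = [(0, 3), (10, 12)]                                # start/end of the two ref elements


def decode(hex_bytes: str) -> str:
    """Turn the space-separated hexadecimal bytes into text (ASCII assumed)."""
    return bytes.fromhex(hex_bytes).decode("ascii")


def covered(text: str, refs) -> list:
    """Return the parts of the text a property refers to (end index taken as inclusive)."""
    return [text[start:end + 1] for start, end in refs]


if __name__ == "__main__":
    text = decode(REF_DATA)
    print(text)
    print(covered(text, REFS))
```

Under that reading, the fontsize property applies to "This" and to the characters at positions 10 to 12.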


Page 28:

Thank you!

Questions?


Page 29:

XC(E/D)L - & related issues

(originally from Sebastian Beyl)

Page 30:

Already known

[Diagram: ORIGINAL FILE; XCEL (machine-readable format description); Extractor; XCDL (normdata and properties from the original file)]

Page 31:

Problem: propertySets and relation to normdatas

[Diagram: original file; normdata; property 1; property 2]

Page 32:

Problem: propertySets and relation to normdatas

[Diagram: XCDL normdata with property 1 and property 2; sections of the normdata are labelled propertySet1, propertySet 2, propertySet1 again, propertySet2 again, pSet. 3]

Page 33:

Problem: propertySets and relation to normdatas

[Diagram: XCDL normdata with property 1 and property 2; sections of the normdata are labelled propertySet1, propertySet 2, propertySet1 again, propertySet2 again, pSet. 3]

Rules:
- Relation to normdata ONLY with propertySet
- No overlapping relations
- Every propertySet definition (in one object) only once
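The second and third rule lend themselves to a mechanical check. The Python sketch below is one possible way to do it. The class `PropertySetRef` and both functions are hypothetical and not part of the XCL tools; ranges are taken to be inclusive byte or character positions in the normdata.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertySetRef:
    set_id: str  # which propertySet the range belongs to
    start: int   # first position in the normdata
    end: int     # last position in the normdata (inclusive)


def overlaps(refs) -> list:
    """Report every reference that overlaps an earlier one ("no overlapping relations")."""
    problems = []
    widest = None  # the reference reaching furthest to the right so far
    for current in sorted(refs, key=lambda ref: (ref.start, ref.end)):
        if widest is not None and current.start <= widest.end:
            problems.append(f"{current.set_id} overlaps {widest.set_id}")
        if widest is None or current.end > widest.end:
            widest = current
    return problems


def duplicate_definitions(defined_ids) -> set:
    """Return the ids that are defined more than once within one object."""
    return {set_id for set_id, count in Counter(defined_ids).items() if count > 1}


if __name__ == "__main__":
    refs = [
        PropertySetRef("propertySet1", 0, 9),
        PropertySetRef("propertySet2", 10, 19),
        PropertySetRef("propertySet1", 20, 29),  # "propertySet1 again!"
    ]
    print(overlaps(refs))
    print(duplicate_definitions(["propertySet1", "propertySet2"]))
```

A propertySet may be referenced from several places in the normdata, as in the "again!" labels of the diagram; only its definition has to be unique.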

Page 34:

Problem: recursive structures

Footnote example from koffice.org

Page 35:

Problem: recursive structures

Footnote example from koffice.org

[Diagram labels: normdata]

Page 36:

Problem: recursive structures

Footnote example from koffice.org

[Diagram labels: normdata; Property fontSize]

Page 37:

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

Page 38:

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

normdata of property?

Page 39:

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

Property of normdata of property? How to express it in XCDL?

Page 40:

Problem: recursive structures

Footnote example from koffice.org

Property “Object B” as footnote

Rules:
- Properties and propertySets only for ONE object
- Upper object always points to lower object, so the lower object can exist on its own

[Diagram: Object A (normdata, Property fontSize); Object B (normdata, Property fontSize)]
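The two rules describe a tree of objects in which references only run downwards. A minimal Python sketch of such a structure follows. The class `XcdlObject`, its field names and the sample values are invented for the illustration; real XCDL expresses this in XML.

```python
from dataclasses import dataclass, field


@dataclass
class XcdlObject:
    object_id: str
    norm_data: str
    properties: dict = field(default_factory=dict)  # belong to this object only
    children: dict = field(default_factory=dict)    # property name -> lower object


def walk(obj: XcdlObject, depth: int = 0):
    """Visit an object and, recursively, every lower object it points to."""
    yield depth, obj
    for child in obj.children.values():
        yield from walk(child, depth + 1)


if __name__ == "__main__":
    # Made-up content: a body text (A) with one footnote (B).
    footnote = XcdlObject("B", "Text of the footnote", {"fontSize": 8})
    body = XcdlObject("A", "Main text", {"fontSize": 12}, {"footnote": footnote})

    for depth, obj in walk(body):
        print("  " * depth + obj.object_id, obj.properties)

    # The lower object holds no reference to the upper one and can be used alone.
    print([obj.object_id for _, obj in walk(footnote)])
```

Because Object B knows nothing about Object A, a footnote that itself contains footnotes is simply a deeper tree, which is how the recursion is resolved.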

Page 41:

Problem: embedded objects

Example from wikipedia.de

Page 42:

Problem: embedded objects

Example from wikipedia.de

[Diagram: original (container) file; text data; picture data as embedded file]

Page 43:

Problem: embedded objects

Example from wikipedia.de

[Diagram: original (container) file with text data and picture data as embedded file; extraction; XCDL-Object A (text data); XCDL-Object B (image data)]

Object A handles object B as an “image property”

Page 44:

Problem: embedded objects

Example from wikipedia.de

[Diagram: XCDL-Object A (text data); XCDL-Object B (image data); Standalone Image-XCDL]

Object A handles object B as an “image property”

Rules: If the upper object (A) is not readable or cannot be used for comparison, the embedded object can be handled as a “standalone” XCDL.

Page 45:

Problem: embedded objects

Example from wikipedia.de

[Diagram: XCDL-Object A (text data); XCDL-Object B: UNKNOWN IMAGE FORMAT; second parsing, if known image format]

Rules: If the lower object (B) cannot be parsed, the raw data can be stored for later parsing, without data loss or comparison problems for the upper object (A).
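The rule for unparseable embedded objects amounts to keeping the raw bytes and retrying once a suitable parser exists. The Python sketch below illustrates this under stated assumptions: `EmbeddedObject`, the `PARSERS` registry and the toy PNG parser are hypothetical, and formats are recognised here simply by their leading signature bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EmbeddedObject:
    raw: bytes                     # always kept, so nothing is lost
    parsed: Optional[dict] = None  # filled in once a parser is available


# Hypothetical registry: leading signature bytes -> parser function.
PARSERS: Dict[bytes, Callable[[bytes], dict]] = {}


def try_parse(obj: EmbeddedObject) -> bool:
    """Parse the embedded object if its format is known; otherwise leave the raw data as is."""
    for signature, parser in PARSERS.items():
        if obj.raw.startswith(signature):
            obj.parsed = parser(obj.raw)
            return True
    return False


if __name__ == "__main__":
    png_signature = b"\x89PNG\r\n\x1a\n"
    image = EmbeddedObject(png_signature + b"made-up image payload")

    print(try_parse(image))  # first pass: format still unknown, raw data stored

    # Later, a parser for the format becomes available (toy parser for the demo).
    PARSERS[png_signature] = lambda raw: {"format": "png", "size": len(raw)}
    print(try_parse(image), image.parsed)  # second parsing succeeds
```

The upper object (A) only needs the reference to B, so its own extraction and comparison are unaffected by whether B has been parsed yet.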