xcl-tools in relation to significant characteristics in planets manfred thaller universität zu*...

Post on 19-Jan-2016

222 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

XCL-Tools in relation to Significant characteristics in Planets

Manfred Thaller

Universität zu* Köln

*University at not of Cologne

What are “significant characteristics”?

Those properties of a digital file which have to be known to enable the processing of the file within a specific setup.

Why extract them by software?

To create technical metadata as required by organizational models for long term preservation. (NLNZ)

Within Planets …

… served by solutions to identify formats: formats registry / PRONOM / DROID.

… and a solution for extracting and processing such characteristics: XCL.

Migrator

tiff

png

Extractor

tiff XCEL png XCEL

Comparator

png XCDL

tiff XCDL

93%

A Vision

Extractor

Appropriate XCELsComparator

C-Set

A Vision

1 million objects: use one second for each.

== 16666.7 minutes == 277.8 hours

== 11.57 working days of a computer

== 34.7 8-hour days for a Human

== 7 working weeks

Why automate?

1 million objects: use five minutes for each.

== 416 666.7 hours

== 52 803.4 8-hour days for a Human

Why automate?

Assumption: Preservation is only feasible, if the content of two digital objects can be compared without human intervention, giving a numerical estimate of their degree of similarity.

Why automate?

(1) Language to represent the complete content of a digital object. XCDL(2) Language to describe any machine readable format in a formal

language. XCEL(3) Software to extract the content of a file based upon a description

as under (2) and express it in the language as specified under (1). “extractor”(4) Software to compare two such content descriptions. “comparator”

Abstract solution I

<XCELDocument...> ...<formatDescription>....<symbol identifier="ID01_I01_I01_S02"

originalName="height“ interpretation="uint32">

<range><startposition xsi:type="sequential“> </startposition>

<length xsi:type="fixed">4</length></range>

<name>height</name></symbol><symbol identifier="ID01_I01_I01_S04"

originalName="colourType"> <range> <startposition xsi:type="sequential">

</startposition> <length xsi:type="fixed">1</length></range> <valueInterpretation> <valueLabel>greyscale</valueLabel> <value>0</value></valueinterpretation>

<name>imageType</name></symbol><symbol identifier="ID01_I01_I01_S05"

originalName="compressionMethod"> <range> <startposition xsi:type="sequential“>

</startposition> <length

xsi:type="fixed">1</length></range> <valueInterpretation> <valueLabel>zlibDeflateInflate</valueLabel> <value>0</value></valueInterpretation>

<name>compression</name></symbol>...

<xcdl> <object id="o1" > <normData id="nd1" > ... </normData> <property id="p1" source="raw"

cat="descr" >

<name>compression</name>

<valueSet id="i_i1_s6" > <rawValue>0 </rawValue> <labValue>...</labValue> <dataRef ind="normAll" /> <propRel/> </valueSet> </property> <property id="p2" source="raw"

cat="descr" >

<name>height</name> <valueSet id="i_i1_s3" > <rawValue>0 0 1 ad </rawValue> <labValue> <val>429</val> <type>uint32</type> </labValue> <dataRef ind="normAll" /> <propRel/> </valueSet> </property> <property id="p3" source="raw"

cat="descr" >

<name>imageType</name> .....

<request2>

<measurementRequest>

<source name="XCDL1.xml"/> <target name="XCDL2.xml"/> <property id="45" name="rgbPalette">

<metric id="10" name="hammingDistance"/> </property>

<property id="300" name="normData">

<metric id="10" name="hammingDistance"/>

<metric id="50" name="RMSE"/>

</property>

<property id="2" name="imageHeight" unit="pixel">

<metric id="200" name="equal"/> <metric id="201" name="intDiff"/> <metric id="210" name="percDev"/> </property> <property id="30" name="imageWidth"

unit="pixel">

<metric id="200" name="equal"/>

<metric id="201" name="intDiff"/>

<metric id="210" name="percDev"/>

</property>

<property id="2" name="imageHeight" unit="pixel" compStatus="complete">

<values type="int"> <src>32</src> <tar>32</tar> </values> <metric id="200"

name="equal" result="true"/> <metric id="201"

name="intDiff" result="0"/> <metric id="210"

name="percDev" result="0.000000"/>

</property>

(1) Language to represent the complete content of a digital object. XCDL(2) Language to describe any machine readable format in a formal

language. XCEL(3) Software to extract the content of a file based upon a description

as under (2) and express it in the language as specified under (1). “extractor”(4) Software to compare two such content descriptions. “comparator”

Abstract solution I

Are the following two items equal:

VIII 8

VIII 8

eight eight

VIII 8

eight eight

otto

otto

VIII 8

eight eight

otto

otto

acht

acht

VIII 8

eight eight

otto

otto

acht

acht

8.0

VIII 8

eight eight

otto

otto

acht

acht

Information model: „an image“

VIII 8

information model: „an image“

format ontology: „what terms are used in formats to describe image

properties“

Extraction language: “how to get the terms describing an image out of a file”

Information model: „what is an image“

Format ontology: „what terms are used in formats to describe image

properties“

(1) A theoretical model of information (not: data) types – “image”, “text”, “audio” ...

(2) Ontologies, which map existing file format terminologies onto these model.

(3) A language – XCDL – which allows to express the content of files in different formats using the vocabulary of the ontologies and the “grammar” of the information model.

Abstract solution II

eXtensible Characterisation Definition Language

Purpose: Describe the contents of a file in terms of an abstract model.

XCDL

XCDL: text model (1)

A text (= <object>) is composed ofdata (= <normData>) plusinterpretations of data according to the underlying format specification (= <property>).

XCDL: text model (2)

Or, one level of abstraction higher, a text is composed of content carrying tokens,accompanied by rendering info plusdeployment info plus historical info.

This is a text<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>…<property><name>fontsize</name><rawVal><val>48</val><type>unsignedInt8</type></rawVal><dataRef> <!-- property refers to discrete part of reference data--><ref id="1" start="0" end="3"/><ref id="1" start=“10" end="12"/></dataRef></property>

This is a text<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>…<property><name>fontsize</name><rawVal><val>48</val><type>unsignedInt8</type></rawVal><dataRef> <!-- property refers to discrete part of reference data--><ref id="1" start="0" end="3"/><ref id="1" start=“10" end="12"/></dataRef></property>

Thank you!

Questions?

Manfred.thaller@uni-koeln.de

XC(E/D)L - & related issues

(originally from Sebastian Beyl)

Already known

XCELMachine readable format description

XCDLNormdatas and

propertiesfrom original file

ORIGINAL FILE

Extractor

Problem: propertySets and relation to normdatas

normdatasoriginal

file property 1 property 1

property 2 property 2

Problem: propertySets and relation to normdatas

pSet.

3pSet.

3propertySet2 again!

propertySet2 again!

propertySet 2

propertySet 2

propertySet1 again!

propertySet1 again!

propertySet1

propertySet1

normdatasXCDL property 1 property 1

property 2 property 2

Problem: propertySets and relation to normdatas

pSet.

3

pSet.

3propertySet2 again!

propertySet2 again!

propertySet 2

propertySet 2

propertySet1 again!

propertySet1 again!

propertySet1

propertySet1

normdatasXCDL property 1 property 1

property 2 property 2

Rules:- Relation to normdata ONLY with propertySet- No overlapping relations- every propertySet-definition (in one object) only once

Problem: recursive structures

Footnote example from koffice.org

Problem: recursive structures

Footnote example from koffice.org

normdatanormdata

normdata

Problem: recursive structures

Footnote example from koffice.org

Property fontsize

Property fontsize

normdata

Property fontSize

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

normdata of property?

Problem: recursive structures

Footnote example from koffice.org

normdata

Property fontSize

Property footnote

property of normdata of property?How to bring it in XCDL?

Problem: recursive structures

Property „Object B“ as footnote

Footnote example from koffice.org

Rules:properties and propertySets only for ONE objectUpper object always points to lower object, so lower object can exists itself

Object AObject A

normdata

Property fontSize

Object BObject B

normdata

Property fontSize

Object AObject A Object AObject A

Object BObject B

Problem: embedded objects

Example from wikipedia.de

Problem: embedded objects

Example from wikipedia.de

Original(container)file

Text datas

Picture datas as

embedded file

Problem: embedded objects

Example from wikipedia.de

extraction

XCDL-Object A (text datas)XCDL-Object A (text datas)

XCDL-Object B (image datas)XCDL-Object B (image datas)

Object A handles object B as an „image property“

Original(container)file

Text datas

Picture datas as

embedded file

XCDL-Object A (text datas)XCDL-Object A (text datas)

XCDL-Object B (image datas)XCDL-Object B (image datas)

Object A handles object B as an „image property“

Problem: embedded objects

Example from wikipedia.de

Standalone Image-XCDLStandalone Image-XCDL

Rules:If upper object (A) is not readableor cannot use for comparison,the embedded object can beHandled as a „Standalone“-XCDL

Problem: embedded objects

Example from wikipedia.de

XCDL-Object A (text datas)XCDL-Object A (text datas)

XCDL-Object BUNKNOWN IMAGE FORMAT

XCDL-Object BUNKNOWN IMAGE FORMAT

Second Parsing,if known Image format

Second Parsing,if known Image format

Rules:If lower object (B) cannot be parsed, raw datas can be stored for later parsing, without data-loss or comparison problems for upper object (A)

top related