XCL-Tools in relation to Significant characteristics in Planets
Manfred Thaller
Universität zu* Köln
*University at, not of, Cologne
What are “significant characteristics”?
Those properties of a digital file which have to be known to enable the processing of the file within a specific setup.
Why extract them by software?
To create technical metadata as required by organizational models for long term preservation. (NLNZ)
Within Planets …
… served by solutions to identify formats: formats registry / PRONOM / DROID.
… and a solution for extracting and processing such characteristics: XCL.
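[Diagram: a Migrator converts a tiff into a png. The Extractor, using the tiff XCEL and the png XCEL, produces a tiff XCDL and a png XCDL. The Comparator compares the two XCDLs and reports 93%.]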
A Vision
[Diagram: Extractor, appropriate XCELs, Comparator, C-Set.]
A Vision
1 million objects: use one second for each.
== 16666.7 minutes == 277.8 hours
== 11.57 working days of a computer
== 34.7 8-hour days for a Human
== 7 working weeks
Why automate?
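The conversion on this slide can be checked with a few lines of Python. This is only a sketch: the function name and the five-day working week are assumptions made for the illustration, not part of the talk.

```python
def processing_time(objects: int, seconds_per_object: float) -> dict:
    """Convert a per-object processing time into larger units."""
    total_seconds = objects * seconds_per_object
    hours = total_seconds / 3600
    return {
        "minutes": total_seconds / 60,
        "hours": hours,
        "computer_days": hours / 24,      # a computer works around the clock
        "human_days": hours / 8,          # 8-hour working days
        "working_weeks": hours / 8 / 5,   # assumption: 5 working days per week
    }


one_second = processing_time(1_000_000, 1)
for unit, value in one_second.items():
    print(f"{unit}: {value:.2f}")
```

The same function can be called with any other per-object time to see how quickly manual checking becomes infeasible.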
1 million objects: use five minutes for each.
== 83 333.3 hours
== 10 416.7 8-hour days for a Human
Why automate?
Assumption: Preservation is only feasible if the content of two digital objects can be compared without human intervention, yielding a numerical estimate of their degree of similarity.
Why automate?
(1) Language to represent the complete content of a digital object. XCDL
(2) Language to describe any machine readable format in a formal language. XCEL
(3) Software to extract the content of a file based upon a description as under (2) and express it in the language as specified under (1). “extractor”
(4) Software to compare two such content descriptions. “comparator”
Abstract solution I
<XCELDocument...>
  ...
  <formatDescription>
    ....
    <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
      <range>
        <startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">4</length>
      </range>
      <name>height</name>
    </symbol>
    <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
      <range>
        <startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">1</length>
      </range>
      <valueInterpretation>
        <valueLabel>greyscale</valueLabel>
        <value>0</value>
      </valueInterpretation>
      <name>imageType</name>
    </symbol>
    <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
      <range>
        <startposition xsi:type="sequential"></startposition>
        <length xsi:type="fixed">1</length>
      </range>
      <valueInterpretation>
        <valueLabel>zlibDeflateInflate</valueLabel>
        <value>0</value>
      </valueInterpretation>
      <name>compression</name>
    </symbol>
    ...
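To make the idea of a description-driven extractor concrete, here is a minimal Python sketch. It is not the XCL extractor and does not read real XCEL: the `SYMBOLS` list is a simplified stand-in for the three symbols shown above, the big-endian interpretation is an assumption, and the input bytes are made up so that they contain only these three fields.

```python
# Simplified stand-in for the XCEL symbols shown above (not real XCEL syntax).
SYMBOLS = [
    {"name": "height", "length": 4},
    {"name": "imageType", "length": 1, "labels": {0: "greyscale"}},
    {"name": "compression", "length": 1, "labels": {0: "zlibDeflateInflate"}},
]


def extract(data: bytes, symbols: list) -> dict:
    """Read fixed-length symbols one after another ("sequential" start positions)."""
    position = 0
    properties = {}
    for symbol in symbols:
        raw = data[position:position + symbol["length"]]
        if len(raw) != symbol["length"]:
            raise ValueError(f"input too short for symbol {symbol['name']!r}")
        value = int.from_bytes(raw, "big")  # assumption: big-endian unsigned integers
        label = symbol.get("labels", {}).get(value)
        properties[symbol["name"]] = {"raw": raw.hex(" "), "value": value, "label": label}
        position += symbol["length"]
    return properties


sample = bytes([0x00, 0x00, 0x01, 0xAD, 0x00, 0x00])  # made-up input, not a real file
result = extract(sample, SYMBOLS)
for name, entry in result.items():
    print(name, entry)
```

The point of the design is that the reading loop knows nothing about the format; everything format-specific lives in the description.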
<xcdl>
  <object id="o1">
    <normData id="nd1"> ... </normData>
    <property id="p1" source="raw" cat="descr">
      <name>compression</name>
      <valueSet id="i_i1_s6">
        <rawValue>0 </rawValue>
        <labValue>...</labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p2" source="raw" cat="descr">
      <name>height</name>
      <valueSet id="i_i1_s3">
        <rawValue>0 0 1 ad </rawValue>
        <labValue>
          <val>429</val>
          <type>uint32</type>
        </labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p3" source="raw" cat="descr">
      <name>imageType</name>
      .....
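The relation between the raw and the labelled value of the height property can be checked with the Python standard library. The sketch below uses a shortened copy of the property from the slide; reading the raw value as space-separated hexadecimal bytes in big-endian order is an assumption, and `raw_to_int` is a helper invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the "height" property shown above.
XCDL_SNIPPET = """
<property id="p2" source="raw" cat="descr">
  <name>height</name>
  <valueSet id="i_i1_s3">
    <rawValue>0 0 1 ad </rawValue>
    <labValue><val>429</val><type>uint32</type></labValue>
  </valueSet>
</property>
"""


def raw_to_int(raw_value: str) -> int:
    """Interpret space-separated hex bytes as one big-endian unsigned integer."""
    data = bytes(int(token, 16) for token in raw_value.split())
    return int.from_bytes(data, "big")


prop = ET.fromstring(XCDL_SNIPPET)
decoded = raw_to_int(prop.findtext("valueSet/rawValue"))
labelled = int(prop.findtext("valueSet/labValue/val"))
print(prop.findtext("name"), decoded, decoded == labelled)
```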
<request2>
  <measurementRequest>
    <source name="XCDL1.xml"/>
    <target name="XCDL2.xml"/>
    <property id="45" name="rgbPalette">
      <metric id="10" name="hammingDistance"/>
    </property>
    <property id="300" name="normData">
      <metric id="10" name="hammingDistance"/>
      <metric id="50" name="RMSE"/>
    </property>
    <property id="2" name="imageHeight" unit="pixel">
      <metric id="200" name="equal"/>
      <metric id="201" name="intDiff"/>
      <metric id="210" name="percDev"/>
    </property>
    <property id="30" name="imageWidth" unit="pixel">
      <metric id="200" name="equal"/>
      <metric id="201" name="intDiff"/>
      <metric id="210" name="percDev"/>
    </property>
<property id="2" name="imageHeight" unit="pixel" compStatus="complete">
  <values type="int">
    <src>32</src>
    <tar>32</tar>
  </values>
  <metric id="200" name="equal" result="true"/>
  <metric id="201" name="intDiff" result="0"/>
  <metric id="210" name="percDev" result="0.000000"/>
</property>
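The metrics named in the request can be sketched in Python as follows. The metric names are taken from the slides, but their exact definitions in the XCL comparator are not given there, so the formulas below are assumptions: `intDiff` is taken as the absolute difference and `percDev` as the deviation relative to the source value, in percent.

```python
import math


def hamming_distance(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rmse(a, b):
    """Root mean square error between two equally long numeric sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def perc_dev(src, tar):
    """Assumed definition: deviation of tar from src, in percent of src."""
    if src == tar:
        return 0.0
    if src == 0:
        return math.inf
    return abs(tar - src) / abs(src) * 100.0


METRICS = {
    "equal": lambda src, tar: src == tar,
    "intDiff": lambda src, tar: abs(src - tar),  # assumed to be unsigned
    "percDev": perc_dev,
    "hammingDistance": hamming_distance,
    "RMSE": rmse,
}


def measure(src, tar, metric_names):
    """Apply the requested metrics to one pair of property values."""
    return {name: METRICS[name](src, tar) for name in metric_names}


height = measure(32, 32, ["equal", "intDiff", "percDev"])
pixels = measure([0, 10, 20, 30], [0, 10, 25, 30], ["hammingDistance", "RMSE"])
print(height)
print(pixels)
```

For the image height of 32 in both source and target, the sketch gives the same three results as the comparator output above; the pixel lists are made-up data.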
Are the following two items equal:
VIII 8
eight eight
otto otto
acht acht
8.0
Extraction language: “how to get the terms describing an image out of a file”
Information model: “what is an image”
Format ontology: “what terms are used in formats to describe image properties”
(1) A theoretical model of information (not: data) types – “image”, “text”, “audio” ...
(2) Ontologies, which map existing file format terminologies onto this model.
(3) A language – XCDL – which allows the content of files in different formats to be expressed using the vocabulary of the ontologies and the “grammar” of the information model.
Abstract solution II
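As a toy analogy for this three-part solution, the sketch below treats “a number” as the information model (an assumption made for the illustration; the slides use “an image”), small lookup tables as the ontologies, and a float as the common representation. All names are hypothetical.

```python
# "Ontologies": map the terms of different notations onto the abstract model.
NUMBER_WORDS = {"eight": 8, "otto": 8, "acht": 8}  # English, Italian, German
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral such as VIII into an integer."""
    total = 0
    for index, char in enumerate(text):
        value = ROMAN[char]
        following = ROMAN[text[index + 1]] if index + 1 < len(text) else 0
        total += -value if value < following else value
    return total


def normalise(token: str) -> float:
    """Express a token in the common representation of the model."""
    token = token.strip()
    if token.lower() in NUMBER_WORDS:
        return float(NUMBER_WORDS[token.lower()])
    if token and all(char in ROMAN for char in token):
        return float(roman_to_int(token))
    return float(token)  # raises ValueError for notations the sketch does not know


items = ["VIII", "8", "eight", "otto", "acht", "8.0"]
values = [normalise(item) for item in items]
print(values, len(set(values)) == 1)
```

Whether two items are “equal” is only answered after both have been expressed in terms of the same model; the strings themselves never match.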
eXtensible Characterisation Definition Language
Purpose: Describe the contents of a file in terms of an abstract model.
XCDL
XCDL: text model (1)
A text (= <object>) is composed of data (= <normData>) plus interpretations of data according to the underlying format specification (= <property>).
XCDL: text model (2)
Or, one level of abstraction higher, a text is composed of content-carrying tokens, accompanied by rendering info plus deployment info plus historical info.
This is a text
<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
  <name>fontsize</name>
  <rawVal>
    <val>48</val>
    <type>unsignedInt8</type>
  </rawVal>
  <dataRef>
    <!-- property refers to discrete part of reference data -->
    <ref id="1" start="0" end="3"/>
    <ref id="1" start="10" end="12"/>
  </dataRef>
</property>
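The reference data on this slide can be decoded in a few lines of Python. The sketch assumes that the bytes are ASCII and that `start` and `end` are zero-based character offsets with an inclusive end; the slide does not state either, so the printed segments are only valid under that reading.

```python
REF_DATA = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"
REFS = [(0, 3), (10, 12)]  # (start, end) pairs from the <ref> elements


def decode_ref_data(hex_bytes: str) -> str:
    """Turn space-separated hex bytes into text (assumption: ASCII)."""
    return bytes.fromhex(hex_bytes).decode("ascii")


def referenced_parts(text: str, refs) -> list:
    """Return the parts of the text a property refers to (assumption: inclusive end)."""
    return [text[start:end + 1] for start, end in refs]


text = decode_ref_data(REF_DATA)
parts = referenced_parts(text, REFS)
print(repr(text))
print(parts)
```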
XC(E/D)L - & related issues
(originally from Sebastian Beyl)
Already known
XCEL: machine readable format description
XCDL: normdata and properties from original file
ORIGINAL FILE
Extractor
Problem: propertySets and relation to normdata
[Diagram: the normdata of the original file and of the XCDL, annotated with property 1 and property 2. Segments are labelled propertySet1, propertySet 2 and pSet. 3, then “propertySet2 again!” and “propertySet1 again!”.]
Rules:
- Relation to normdata ONLY with propertySet
- No overlapping relations
- Every propertySet definition (in one object) only once
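The second rule, no overlapping relations, can be checked mechanically. The following Python sketch assumes that each relation is a tuple of propertySet id, start offset and inclusive end offset into the normdata; this representation is invented for the illustration and is not the XCDL syntax. A propertySet may be referenced for several stretches (“propertySet1 again!”) without being defined again.

```python
def find_overlaps(relations):
    """Return pairs of propertySet ids whose normdata ranges overlap.

    relations: list of (set_id, start, end) with inclusive end offsets.
    """
    overlaps = []
    furthest = None  # relation reaching furthest to the right so far
    for relation in sorted(relations, key=lambda r: (r[1], r[2])):
        if furthest is not None and relation[1] <= furthest[2]:
            overlaps.append((furthest[0], relation[0]))
        if furthest is None or relation[2] > furthest[2]:
            furthest = relation
    return overlaps


valid = [("pSet1", 0, 9), ("pSet2", 10, 19), ("pSet3", 20, 24), ("pSet1", 25, 40)]
invalid = [("pSet1", 0, 9), ("pSet2", 5, 14)]

print(find_overlaps(valid))
print(find_overlaps(invalid))
```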
Problem: recursive structures
Footnote example from koffice.org
normdata
Property fontSize
Property footnote
normdata of property?
property of normdata of property?
How to bring it into XCDL?
Problem: recursive structures
Footnote example from koffice.org
Object A
normdata
Property fontSize
Property “Object B” as footnote
Object B
normdata
Property fontSize
Rules:
- Properties and propertySets only for ONE object
- Upper object always points to lower object, so the lower object can exist on its own
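A minimal Python sketch of these two rules, with hypothetical class and field names: every object carries only its own normdata and properties, and only the upper object holds a reference to the lower one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class XcdlObject:
    object_id: str
    norm_data: str
    properties: Dict[str, str] = field(default_factory=dict)
    # property name -> id of a lower object (for example a footnote)
    lower_objects: Dict[str, str] = field(default_factory=dict)


def reachable(obj: XcdlObject, registry: Dict[str, XcdlObject]) -> List[str]:
    """List the ids reachable from obj, following upper-to-lower references only."""
    ids = [obj.object_id]
    for target_id in obj.lower_objects.values():
        ids.extend(reachable(registry[target_id], registry))
    return ids


object_b = XcdlObject("B", "Text of the footnote", {"fontSize": "8"})
object_a = XcdlObject("A", "Main text", {"fontSize": "12"}, {"footnote": "B"})
registry = {"A": object_a, "B": object_b}

print(reachable(object_a, registry))  # A points down to B
print(reachable(object_b, registry))  # B knows nothing about A
```

Because references only point downwards, object B can be extracted, stored and compared without object A being present.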
Problem: embedded objects
Example from wikipedia.de
Original (container) file
Text data
Picture data as embedded file
extraction
XCDL-Object A (text data)
XCDL-Object B (image data)
Object A handles object B as an “image property”
Problem: embedded objects
Example from wikipedia.de
Standalone Image-XCDL
Rules: If the upper object (A) is not readable or cannot be used for comparison, the embedded object can be handled as a “standalone” XCDL.
Problem: embedded objects
Example from wikipedia.de
XCDL-Object A (text data)
XCDL-Object B: UNKNOWN IMAGE FORMAT
Second parsing, if known image format
Rules: If the lower object (B) cannot be parsed, the raw data can be stored for later parsing, without data loss or comparison problems for the upper object (A).
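The “second parsing” rule can be sketched in Python as follows. The class, the parser registry and the stand-in parser are hypothetical; only the PNG signature bytes are real. The embedded object always keeps its raw bytes, so a parser that becomes available later can still be applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EmbeddedObject:
    raw: bytes                      # always kept, so no data is lost
    parsed: Optional[dict] = None   # filled in once a parser is available


# Hypothetical registry: magic-number prefix -> parser function.
PARSERS: Dict[bytes, Callable[[bytes], dict]] = {}


def try_parse(obj: EmbeddedObject) -> bool:
    """Parse the raw data if a parser for its format is registered."""
    for magic, parser in PARSERS.items():
        if obj.raw.startswith(magic):
            obj.parsed = parser(obj.raw)
            return True
    return False


image = EmbeddedObject(raw=b"\x89PNG\r\n\x1a\n" + b"...")  # made-up payload
first_attempt = try_parse(image)    # format still unknown

# Later, a parser for the format becomes available (stand-in implementation).
PARSERS[b"\x89PNG"] = lambda data: {"format": "png", "size": len(data)}
second_attempt = try_parse(image)   # the "second parsing"

print(first_attempt, second_attempt, image.parsed)
```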