eXtensible Characterisation Languages (XCL)

Manfred Thaller, (University at Cologne)

DPP meeting, Glasgow, Nov. 23rd 2006

M. Thaller DPP meeting, Glasgow, Nov. 23rd 2006

Vision:


Questions …

1. Is all information contained within oldFormat also contained within newFormat?

2. Is all information within oldFormat that is relevant for its use also contained within newFormat?

3. Is the conversion process a(oldFormat, newFormat) better than b(oldFormat, newFormat), i.e. does it preserve more of the information contained within oldFormat?
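
The three questions can be read as comparisons between sets of extracted properties. The sketch below is only an illustration under that assumption: the property names, the dictionaries and the helper functions `lost_properties` and `preserves_more` are hypothetical, and counting lost properties is the crudest possible way to rank two conversions.

```python
def lost_properties(old, new, relevant=None):
    """Return the names of properties of `old` that `new` does not preserve.

    If `relevant` is given, only those property names are checked
    (question 2); otherwise every property of `old` is checked (question 1).
    """
    names = set(old) if relevant is None else set(old) & set(relevant)
    return {name for name in names if name not in new or new[name] != old[name]}


def preserves_more(old, new_a, new_b, relevant=None):
    """Question 3: does conversion a lose fewer properties than conversion b?"""
    return len(lost_properties(old, new_a, relevant)) < len(
        lost_properties(old, new_b, relevant)
    )


# Hypothetical characterisations of one file and of two conversion results.
old_format = {"width": 640, "height": 480, "colorDepth": 24, "creator": "scanner-01"}
result_a = {"width": 640, "height": 480, "colorDepth": 24}
result_b = {"width": 640, "height": 480, "colorDepth": 8}

all_lost_a = lost_properties(old_format, result_a)
relevant_lost_a = lost_properties(
    old_format, result_a, relevant={"width", "height", "colorDepth"}
)
a_is_better = preserves_more(old_format, result_a, result_b)
```

Here conversion a loses only the `creator` entry, so it fails question 1, passes question 2 for the chosen relevant set, and ranks above conversion b on question 3.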

Building Block I: XCEL

A language which allows a program to read "any file specification":

==> "eXtensible Characterisation Extraction Language"

Formulate the human-readable specifications of TIFF, RTF, WAV … in a language which a general-purpose program can read.

General enough that any existing format specification can be expressed in it (LaTeX, MAX, VRML …).

XCEL – Structuring Elements

(Diagram: range, item, subitem, <startposition><length>, item, symbol, property)

XCEL – Structuring Elements

<startposition><length>

Byte offsets: 1000, 1248

Truly binary files: most sound and image formats

Binary addressable files: PDF, Max
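
For truly binary files, a description by start position and length can drive a generic reader. The sketch below is an assumption-laden illustration, not XCEL syntax: the list `TIFF_HEADER_SPEC`, its keys and the function `extract` are hypothetical. It relies only on the layout of the 8-byte TIFF header (byte-order mark, magic number 42, offset of the first image file directory).

```python
import struct

# Hypothetical declarative description of the 8-byte TIFF header.
TIFF_HEADER_SPEC = [
    {"name": "byteOrder", "start": 0, "length": 2, "type": "byteOrder"},
    {"name": "magic", "start": 2, "length": 2, "type": "uint"},
    {"name": "firstIfdOffset", "start": 4, "length": 4, "type": "uint"},
]


def extract(data, spec):
    """Read every item of `spec` from `data` by start position and length."""
    endian = "<"  # little-endian until a byteOrder item says otherwise
    values = {}
    for item in spec:
        raw = data[item["start"]:item["start"] + item["length"]]
        if len(raw) != item["length"]:
            raise ValueError(f"file too short for item {item['name']!r}")
        if item["type"] == "byteOrder":
            value = raw.decode("ascii")
            endian = {"II": "<", "MM": ">"}[value]
        else:
            # Unsigned integers of 2 or 4 bytes in the current byte order.
            code = {2: "H", 4: "I"}[item["length"]]
            value = struct.unpack(endian + code, raw)[0]
        values[item["name"]] = value
    return values


little = b"II" + struct.pack("<H", 42) + struct.pack("<I", 8)
big = b"MM" + struct.pack(">H", 42) + struct.pack(">I", 8)
little_header = extract(little, TIFF_HEADER_SPEC)
big_header = extract(big, TIFF_HEADER_SPEC)
```

The reader itself knows nothing about TIFF; everything format-specific sits in the description, which is the division of labour the slides aim at.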

XCEL – Structuring Elements

<startposition><length>

Procedures: p(begin, trigger) q(trigger, filter, implication)

Encoded / mark up files: RTF, TeX, SVG, VRML …

XCEL – Structuring Elements

<startposition><length>

Procedures: p(current_Position, "<someTag>").q("</someTag>", pair("<[a-zA-Z0-9]*>", "</&>"), implyBy("</someOtherTag>"))

Encoded / mark up files: RTF, TeX, SVG, VRML …
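
For mark-up files the start position and length are not fixed offsets but have to be found by searching. The slides do not spell out the semantics of p and q, so the sketch below assumes the simplest reading: one trigger string marks the start and another marks the end. The function `locate` is hypothetical, and the `pair` filter and `implyBy` rule are deliberately left out.

```python
def locate(data, begin, start_trigger, end_trigger):
    """Return (start, length) of the first span opened by `start_trigger`
    at or after `begin` and closed by the next `end_trigger`, or None."""
    start = data.find(start_trigger, begin)
    if start == -1:
        return None
    end = data.find(end_trigger, start + len(start_trigger))
    if end == -1:
        return None
    return start, end + len(end_trigger) - start


document = b"<a><someTag>hello</someTag></a>"
span = locate(document, 0, b"<someTag>", b"</someTag>")
start, length = span
element = document[start:start + length]
```

The result has the same shape as a byte-offset description, so later processing does not need to know whether a span came from fixed offsets or from a search.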

Building Block II: XCDL

A language which allows a program to describe "any file content":

==> "eXtensible Characterisation Definition Language"

Formulate the content of any file in an abstract language, which captures the complete information contained in it.

General enough that any existing content can be expressed in it.

XCDL: Basic Architecture

1. Sequences of bytes

2. With properties applicable to subsequences

XCDL: Basic Architecture

Ashes to Ashes once more

<data id="1"> {\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs20 \b Ashes\b0 to \b Ashes\b0 once \b more\b0.\par} </data>

XCDL: Basic Architecture

<normData id="1" type="text">Ashes to Ashes once more.</normData>

<property id="5" source="raw" cat="descr">
  <name>boldFace</name>
  <valueSet id="1">
    <rawVal>Ashes</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="0" end="4"/>
      <ref id="1" start="9" end="13"/>
    </dataRef>
  </valueSet>
  <valueSet id="2">
    <rawVal>more</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="20" end="23"/>
    </dataRef>
  </valueSet>
</property>
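
The references in the property above can be resolved against the normalised data with a few lines of standard-library XML handling. This is a sketch under assumptions: the wrapping `<xcdl>` root element and the function `resolve_refs` are hypothetical, and the offsets are taken to be zero-based and inclusive, which is the reading under which they match the words "Ashes", "Ashes" and "more".

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a hypothetical root element so it parses.
XCDL = """<xcdl>
<normData id="1" type="text">Ashes to Ashes once more.</normData>
<property id="5" source="raw" cat="descr">
  <name>boldFace</name>
  <valueSet id="1">
    <rawVal>Ashes</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="0" end="4"/>
      <ref id="1" start="9" end="13"/>
    </dataRef>
  </valueSet>
  <valueSet id="2">
    <rawVal>more</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="20" end="23"/>
    </dataRef>
  </valueSet>
</property>
</xcdl>"""


def resolve_refs(xml_text):
    """Return (property name, referenced text) for every <ref> element."""
    root = ET.fromstring(xml_text)
    norm_data = {node.get("id"): node.text for node in root.iter("normData")}
    spans = []
    for prop in root.iter("property"):
        name = prop.findtext("name")
        for ref in prop.iter("ref"):
            text = norm_data[ref.get("id")]
            start, end = int(ref.get("start")), int(ref.get("end"))
            spans.append((name, text[start:end + 1]))  # `end` is inclusive
    return spans


bold_spans = resolve_refs(XCDL)
```

Each resolved span matches the `rawVal` of its value set, which gives a simple consistency check between the normalised data and the properties attached to it.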

XCDL: Basic Architecture

Assumption 1: A file format is a set of rules which formalize all knowledge needed to process the binary information contained within a distinct and complete block of binary information, traditionally called a file.

XCDL: Basic Architecture

Assumption 2: The extensible characterisation extraction language is designed to be able to express all such rules within a given file format. The extensible characterisation definition language is designed to be able to describe all the information contained within a file the format of which is described by a valid XCEL description.

XCDL: Basic Architecture

Assumption 3: A specific XCEL description is not required to express all the rules within a specific file format. An XCDL derived from such a partial XCEL will, therefore, potentially also contain only part of the information of a file encoded in that format.

Even when the XCEL describes a format completely, an extractor is not required to extract all characteristics of a file.

Some characteristics are important only for processing: the compression method no longer matters once decompression has succeeded.

Building Block III: Metrics

Starting in month 13.

However ...

Metrics: Basic Assumptions

Currently a bottom-up approach:

Observe characteristics occurring within files …

… and build name libraries from them.

{"color depth", "# of planes"} => colorDepth

Metrics: Basic Assumptions

Later, in parallel, a top-down approach:

Create file characteristics ontology …

… and link it to the name libraries.

"width" in image file != "width" in text file.
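
The two approaches can be pictured as two lookup tables. This is a minimal sketch, assuming a flat name library plus a domain-scoped ontology; the table contents beyond the slide's `colorDepth` example, the canonical names `imageWidth` and `lineWidth`, and the function `normalise_name` are hypothetical.

```python
# Bottom-up: names observed in files, mapped to one canonical name.
NAME_LIBRARY = {
    "color depth": "colorDepth",
    "# of planes": "colorDepth",
}

# Top-down: the same raw name can mean different things per kind of file.
ONTOLOGY = {
    ("image", "width"): "imageWidth",
    ("text", "width"): "lineWidth",
}


def normalise_name(raw_name, domain=None):
    """Map a raw property name to a canonical one, using the domain if known."""
    key = raw_name.strip().lower()
    if domain is not None and (domain, key) in ONTOLOGY:
        return ONTOLOGY[(domain, key)]
    # Fall back to the flat name library, then to the cleaned raw name.
    return NAME_LIBRARY.get(key, key)


depth_a = normalise_name("Color Depth")
depth_b = normalise_name("# of planes")
image_width = normalise_name("width", domain="image")
text_width = normalise_name("width", domain="text")
```

The flat library is enough to merge synonyms, while the domain key is what keeps "width" in an image apart from "width" in a text.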

Metrics: Example I

Percentage of bytes in a binary stream which are preserved within a range of +/- 5 of the original value.

(Images: Would scarcely be observable on screen.)

E.g. relevant when a colorspace appropriate for printing is transformed into a colorspace optimized for the screen.
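
A metric of this kind is easy to state as code. The sketch below assumes that both streams are already decoded to the same length and are compared byte by byte; the function name `fraction_within_tolerance` and the sample values are hypothetical.

```python
def fraction_within_tolerance(original, migrated, tolerance=5):
    """Fraction of bytes whose value differs by at most `tolerance`."""
    if len(original) != len(migrated):
        raise ValueError("streams must have the same length")
    if not original:
        return 1.0  # nothing to lose in an empty stream
    preserved = sum(abs(a - b) <= tolerance for a, b in zip(original, migrated))
    return preserved / len(original)


original = bytes([10, 20, 30, 40])
migrated = bytes([12, 26, 30, 35])  # differences: 2, 6, 0, 5
score = fraction_within_tolerance(original, migrated)
```

Three of the four bytes stay within the tolerance, so the score is 0.75, i.e. 75 %.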

Metrics: Example II

Degree to which the font applied recreates the original typesetting characteristics.

(Texts: derived metric from comparison of font metrics.)

Metrics: Problem

Problem not so much individual metrics but summation rules.

An image migration step preserves 98 % of the image bytes within +/- 1 %.

It also preserves 5 of 20 (= 25 %) boolean properties (creator, scanning equipment …).

Quality of the migration: (0.98 + 0.25) / 2 = 0.615?

Metrics: Problem

Possible solution: weight engineering metrics by "arbitrary" weights derived from PP.

An image migration step preserves 98 % of the image bytes within +/- 1 %.

It also preserves 5 of 20 (= 25 %) boolean properties (creator, scanning equipment …).

Quality of the migration: (0.98*w1 + 0.25*w2) / 2 =
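
The summation problem can be made concrete with a small sketch. One point here is an assumption that departs from the slide: the weighted sum is divided by the sum of the weights rather than by 2, so that the result stays between 0 and 1. The weight values and the function `weighted_quality` are hypothetical.

```python
def weighted_quality(scores, weights):
    """Weighted mean of per-metric scores, normalised by the total weight."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


scores = [0.98, 0.25]  # bytes preserved, boolean properties preserved

equal = weighted_quality(scores, [1, 1])         # plain average
content_first = weighted_quality(scores, [3, 1])  # image content counts triple
```

With equal weights the result is the plain average of 0.615. With weights of 3 and 1 it rises to 0.7975, which shows how strongly the overall score depends on where the weights come from.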

Thank you!
