Transforming TEI with OxGarage

James Cummings (@jamescummings)

14 September 2014

1/25

How hard is it to convert a Word file to other formats?

It is relatively easy to

Save to HTML from Word (it's not as bad as it used to be)

Save to HTML and then tidy up (many utilities, e.g. tidy)

Read Word into OpenOffice and use its slightly better export

but we can also access Word 2007 (and later) files more directly and more flexibly.

2/25

A simple document in Word

3/25

Semantically, this is:

    <div>
      <head>Cats</head>
      <p>Cats are nice. Really quite <hi rend="bold">nice</hi>.</p>
    </div>
    <div>
      <head>Dogs</head>
      <p>Dogs are horrid, because</p>
      <list type="ordered">
        <item>They jump around</item>
        <item>They bark</item>
        <item>They steal the sofa</item>
      </list>
    </div>
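To illustrate why this semantic form is convenient to work with, here is a minimal Python sketch that parses the fragment above with the standard library and pulls out each heading with its list items. The wrapping `<body>` element and the function name `outline` are assumptions added for the example; real TEI files also declare the TEI namespace, which is left out here as it is on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a <body> element (an assumption) so that
# it has a single root. Real TEI files also declare the TEI namespace.
TEI_FRAGMENT = """<body>
<div><head>Cats</head>
<p>Cats are nice. Really quite <hi rend="bold">nice</hi>.</p></div>
<div><head>Dogs</head>
<p>Dogs are horrid, because</p>
<list type="ordered">
<item>They jump around</item><item>They bark</item><item>They steal the sofa</item>
</list></div>
</body>"""


def outline(tei_xml):
    """Return a list of (heading, list items) pairs, one per <div>."""
    root = ET.fromstring(tei_xml)
    result = []
    for div in root.findall("div"):
        heading = div.findtext("head", default="")
        items = [item.text for item in div.iter("item")]
        result.append((heading, items))
    return result


if __name__ == "__main__":
    for heading, items in outline(TEI_FRAGMENT):
        print(heading, items)
```

The structure can be read directly from the element names, with no need to guess from fonts or indentation.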

4/25

Rendered as web page

5/25

Choices when saving as HTML

6/25

The hideous result (1)

7/25

The hideous result (2)

8/25

The contents of the package

9/25

What are the files for?

File                            Purpose
[Content_Types].xml             mime types of files
_rels/.rels                     links between names and objects
word/_rels/document.xml.rels    links between names and support files
word/document.xml               document body
word/media/image1.jpeg          picture
word/theme/theme1.xml
docProps/thumbnail.jpeg         document thumbnail
word/settings.xml               settings
word/webSettings.xml            settings for HTML export
word/styles.xml                 style definitions
word/numbering.xml              numbering schemes
docProps/core.xml               document properties
word/fontTable.xml              font details
docProps/app.xml                application details

Most of these are XML files.
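Because a DOCX file is a ZIP package, its parts can be listed and read with ordinary ZIP tools. The following is a minimal Python sketch using only the standard library; the helper names (`list_parts`, `read_part`, `make_demo_package`) are invented for illustration, and the demo package is a tiny stand-in built in memory, not a valid Word file.

```python
import io
import zipfile


def list_parts(docx):
    """Return the sorted names of the files inside a DOCX (ZIP) package."""
    with zipfile.ZipFile(docx) as package:
        return sorted(package.namelist())


def read_part(docx, name):
    """Return one part of the package, decoded as UTF-8 text."""
    with zipfile.ZipFile(docx) as package:
        return package.read(name).decode("utf-8")


def make_demo_package():
    """Build a tiny stand-in package in memory (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document><body/></document>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_package()
    print(list_parts(demo))
    print(read_part(demo, "word/document.xml"))
```

Both helpers accept either a file path or a file-like object, so passing the path of a real `.docx` file should list parts like those in the table above.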

10/25

A list item in DOCX

    <p rsidR="00272A5B" rsidRDefault="00272A5B" rsidP="00272A5B">
      <r>
        <t>Dogs are horrid, because</t>
      </r>
    </p>
    <p rsidR="00272A5B" rsidRDefault="00272A5B" rsidP="00272A5B">
      <pPr>
        <pStyle val="ListParagraph"/>
        <numPr>
          <ilvl val="0"/>
          <numId val="1"/>
        </numPr>
      </pPr>
      <r>
        <t>They jump around</t>
      </r>
    </p>
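The markup above has no list element at all: a paragraph is a list item only because its properties contain `numPr`. A converter therefore has to infer the list structure. The Python sketch below shows that detection step under simplifying assumptions: it works on the slide's prefix-free markup wrapped in a `<body>` element, whereas a real `word/document.xml` puts these elements and attributes in the WordprocessingML namespace, so the lookups would need namespace-qualified names. The function name `classify_paragraphs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's markup with namespace prefixes omitted, wrapped in a <body>
# element (an assumption) so that the fragment has a single root.
DOCX_FRAGMENT = """<body>
<p><r><t>Dogs are horrid, because</t></r></p>
<p>
  <pPr><pStyle val="ListParagraph"/>
    <numPr><ilvl val="0"/><numId val="1"/></numPr>
  </pPr>
  <r><t>They jump around</t></r>
</p>
</body>"""


def classify_paragraphs(xml_text):
    """Return (text, list level or None) for every <p> in the fragment."""
    root = ET.fromstring(xml_text)
    result = []
    for para in root.iter("p"):
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in para.iter("t"))
        # A paragraph counts as a list item only if it carries numPr/ilvl.
        ilvl = para.find("pPr/numPr/ilvl")
        level = int(ilvl.get("val")) if ilvl is not None else None
        result.append((text, level))
    return result


if __name__ == "__main__":
    for text, level in classify_paragraphs(DOCX_FRAGMENT):
        print(level, text)
```

Grouping consecutive paragraphs that share a level into a single list is the further step a real conversion would need.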

11/25

OxGarage to the rescue!

OxGarage is a web app (http://oxgarage.oucs.ox.ac.uk:8080/ege-webclient) which provides document transformations, featuring

Web and REST interface

Chained XSLT conversions

Uses headless OpenOffice for binary conversions

Uses TEI XML as pivot format

Supports Stylesheets “profiles” for variations

Open source across the board

12/25

History and dependencies

Built at the Poznań Supercomputing and Networking Center for ENRICH, an EU-funded eContent+ project.

It was called the EGE ('ENRICH Garage Engine') and designed as a pipeline conversion for converting manuscript descriptions, using conversions and libraries from University of Oxford

Now much further developed and maintained as a fork by the University of Oxford

Java servlet, running under Tomcat in current instances

Almost all work done as XSLT transforms using Saxon processor

Uses headless OpenOffice to read/write .doc, .xls, .ppt files etc.

http://www.github.com/sebastianrahtz/oxgarage

13/25

OxGarage in Oxford

OxGarage is used:

As a data/text cleanup/normalization tool by humanities researchers (e.g. converting doc to TEI XML, TEI to Excel, WordPress blog to LaTeX, TEI to Word)

As an enabling technology for the IT Services course booking system (creating Word files for download)

As a component of teaching in the Digital Humanities Summer School and other IT Learning Programme courses where the Text Encoding Initiative is covered (teaching students how to make different outputs)

As an enabling technology for schema creation by TEI users worldwide, underlying the Roma application (http://www.tei-c.org/Roma/)

OxGarage is currently unofficial in its support and maintenance; we are applying for an internal project to transition it to being a proper service.

14/25

Matrix of OxGarage conversions (1)

20/25

Matrix of OxGarage conversions (2)

21/25

OxGarage: constructing a path

http://oxgarage.oucs.ox.ac.uk:8080/ege-webservice/Conversions/format/format/?properties

Each 'format' is a name followed by a MIME type, written with percent-encoded colons (%3A) as separators. For example:

format      code
ePub        application%3Aepub+zip
XSL FO      application%3Axslfo+xml
LaTeX       application%3Ax-latex
TEI LITE    text%3Axml
ODD HTML    application%3Axhtml+xml
ODD Json    application%3Ajson
ODT         application%3Avnd.oasis.opendocument.text
RDF         application%3Ardf+xml
RELAX NG    application%3Axml-relaxng
TEI ODD     ODD%3Atext%3Axml/
TEI P5      TEI%3Atext%3Axml/
Word        docx%3Aapplication%3Avnd.openxmlformats-officedocument.wordprocessingml.document/
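Building such a path by hand is error-prone, so it may help to generate it. This is a minimal Python sketch, assuming each path segment has the form `name:type:subtype` with the colons percent-encoded, which is how the complete codes on the slides read. The helper names `format_code` and `conversion_url` are invented for illustration and are not part of OxGarage.

```python
from urllib.parse import quote

BASE = "http://oxgarage.oucs.ox.ac.uk:8080/ege-webservice/Conversions/"


def format_code(name, mime_type):
    """Encode one step, e.g. ("TEI", "text/xml") -> "TEI%3Atext%3Axml"."""
    # Assumption: the slash of the MIME type is replaced by a colon and
    # every reserved character is then percent-encoded.
    raw = name + ":" + mime_type.replace("/", ":")
    return quote(raw, safe="")


def conversion_url(steps, base=BASE):
    """Join the encoded steps into a path, each followed by a slash."""
    return base + "".join(format_code(n, m) + "/" for n, m in steps)


if __name__ == "__main__":
    docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    steps = [
        ("ODD", "text/xml"),
        ("ODDC", "text/xml"),
        ("TEI", "text/xml"),
        ("docx", docx),
    ]
    print(conversion_url(steps))
```

Note that `quote(..., safe="")` also encodes `+` as `%2B`, so `xhtml+xml` becomes `xhtml%2Bxml`.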

22/25

OxGarage web service example (1)

Process ODD to compiled ODD, then to TEI Lite, then to DOCX

    curl -s -F [email protected] -o test.docx \
      http://oxgarage.oucs.ox.ac.uk:8080/ege-webservice/Conversions/ODD%3Atext%3Axml/ODDC%3Atext%3Axml/TEI%3Atext%3Axml/docx%3Aapplication%3Avnd.openxmlformats-officedocument.wordprocessingml.document/

23/25

OxGarage web service example (2)

ODD to HTML, in French

    curl -s -F [email protected] -o test.html \
      "http://oxgarage.oucs.ox.ac.uk:8080/ege-webservice/Conversions/ODD%3Atext%3Axml/ODDC%3Atext%3Axml/oddhtml%3Aapplication%3Axhtml%2Bxml/?properties=<conversions><conversion%20index='1'><property%20id='oxgarage.lang'>fr</property></conversion></conversions>"
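The `properties` value is a small XML document that attaches settings to a numbered conversion step. The Python sketch below generates it from a dictionary. The function names `properties_xml` and `with_properties` are hypothetical, and the sketch percent-encodes the whole value rather than only the spaces as on the slide; that the service accepts the fully encoded form is an assumption.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def properties_xml(settings):
    """Build the <conversions> document from {step index: {property id: value}}."""
    parts = ["<conversions>"]
    for index in sorted(settings):
        parts.append("<conversion index='%d'>" % index)
        for prop_id, value in settings[index].items():
            # Values are XML-escaped; property ids are assumed to be plain names.
            parts.append("<property id='%s'>%s</property>" % (prop_id, escape(value)))
        parts.append("</conversion>")
    parts.append("</conversions>")
    return "".join(parts)


def with_properties(conversion_url, settings):
    """Append a fully percent-encoded ?properties= query to a conversion URL."""
    return conversion_url + "?properties=" + quote(properties_xml(settings), safe="")


if __name__ == "__main__":
    url = (
        "http://oxgarage.oucs.ox.ac.uk:8080/ege-webservice/Conversions/"
        "ODD%3Atext%3Axml/ODDC%3Atext%3Axml/oddhtml%3Aapplication%3Axhtml%2Bxml/"
    )
    print(with_properties(url, {1: {"oxgarage.lang": "fr"}}))
```

Encoding the whole value also avoids the shell-quoting problems that the raw `<`, `>` and `'` characters cause on a command line.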

24/25

Using OxGarage for your project

You might use OxGarage by:

Doing one-off conversions using the web interface

Scripting conversions using the REST web service

Scripting conversions pipelining the underlying stylesheets

Installing the OxGarage Debian package and hosting it locally

25/25