TeX2Star
A System for Converting TeX to OpenOffice
By Jeffrey Starr
Overview
● Why does conversion matter?
● Why has it not already been done?
– Why is it difficult?
● Proposal: TeX->OpenOffice
● Proposal: TeX->DVI->OpenOffice
● Solution
● Unsolved problems
What is OpenOffice?
● Open Source office suite
● Based on StarOffice, currently owned by Sun Microsystems
● Cross-Platform
● XML based, standards driven
● Semantic-based format
What is TeX?
● Written by Donald E. Knuth
● Solution to declining standards in mathematical typography
● Heavily used in mathematics and physics
● Both a program and a programming language
● Presentation-based format
Why Bother to Convert?
● TeX is rare outside mathematical circles
● Conflicts with publishing software
● Does not fit within the current word processing model
● TeX's purpose is to produce journal-quality typography, not to facilitate editing of content.
Aside: Editable Output
● TeX has many presentation outputs:
– DVI
– PostScript
– PDF
– PNG
– TIFF
– Fax
● TeX has no direct editable outputs.
Solution: TeX->OpenOffice
● Why use the outputs? Read the original document.
● Perfect knowledge of content and (presentational) intent
● Write a program that reads TeX and outputs OpenOffice, instead of DVI
Problems with TeX->OpenOffice
● TeX is a large system
– Eight years of development
– Too large for a semester
● Irregular
● Non-Balanced
● Many special cases
TeX is Irregular
● An irregular language is one in which typical rules of processing are violated
● Irregular '\atop': (TeX)
– {numerator \atop denominator}
● Regular '\frac': (LaTeX)
– \frac{numerator}{denominator}
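The infix form is awkward for a converter because the numerator is not known to be a numerator until '\atop' has been read. A minimal Python sketch of that problem, assuming the group has already been split into a flat list of tokens (the function name and token representation are hypothetical, and nested groups are ignored):

```python
def split_infix(tokens, op="\\atop"):
    """Split the tokens of one group at an infix operator.

    Returns (before, after), or None if the operator is absent.
    Everything before the operator must be buffered, because its role
    is only known once the operator has been seen.
    """
    if op not in tokens:
        return None
    i = tokens.index(op)
    return tokens[:i], tokens[i + 1:]
```

A prefix command such as \frac needs no buffering: the command name arrives first and announces how the two groups that follow are to be used.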
TeX is not balanced
● A language that is balanced will have an explicit beginning and end to each grouping
● Non-balanced font commands: (TeX)
– \bf this is bold \rm this is normal, roman text
● Balanced font commands: (LaTeX)
– \textbf{this is bold} this is back to normal
TeX has many special cases
● \par may:
– explicitly end a paragraph
– do nothing (if in math mode)
– do nothing (if in restricted horizontal mode)
– tell TeX to build the current page
● \par is also irregular (acts on material already processed and in the reverse direction) and unbalanced (may or may not be preceded by \indent, a primitive to start a paragraph)
Solution: TeX->DVI->OpenOffice
● Let TeX deal with TeX
● Run TeX on the original text
● Read the resultant DVI output
● Process the DVI output to OpenOffice
Problem: Lack of semantic data
● DVI contains font definitions, text stream, and description of black boxes
● Fonts contain characters, but do not say what those characters are
– Especially a problem with ligatures and kerning: “ﬀ” (one glyph) vs. “ff” (two letters)
– Also a problem with bold and italic text: bold and italics are their own fonts
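One way to recover characters and styling from a font name plus character codes is a pair of lookup tables. The sketch below is illustrative only: it assumes the Computer Modern text fonts in their usual encoding, where codes 0x0B to 0x0F hold the f-ligatures and letters coincide with ASCII, and it covers only three font names. The table and function names are hypothetical.

```python
# Ligature slots in the standard Computer Modern text fonts (assumed encoding).
LIGATURES = {0x0B: "ff", 0x0C: "fi", 0x0D: "fl", 0x0E: "ffi", 0x0F: "ffl"}

# Styling is implied by the font itself, so it has to be looked up by name.
FONT_STYLE = {"cmr10": "roman", "cmbx10": "bold", "cmti10": "italic"}


def decode(font_name, codes):
    """Return (style, text) for a run of character codes set in one font."""
    style = FONT_STYLE.get(font_name, "unknown")
    # Expand ligature glyphs to the letters they stand for; treat every
    # other code as ASCII, which only holds for letters and digits.
    text = "".join(LIGATURES.get(code, chr(code)) for code in codes)
    return style, text
```

For example, decode("cmbx10", [0x6F, 0x0B, 0x65, 0x72]) yields the style "bold" and the text "offer", even though the DVI file contains only four character codes.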
Solution: Add Annotations
● Use interpositioning and the TeX primitive '\special' to send extra information to the DVI file
● \special leaves comments that can be read later
● Reading the DVI with proper annotation allows the text to retain some level of semantic information
● Difference between knowing that the next character is smaller and raised versus knowing that the next character is a superscript
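A minimal sketch of how a DVI reader might use such annotations, assuming the DVI file has already been decoded into a list of (kind, value) events. The annotation strings (for example "tex2star:sup-begin", which would be emitted on the TeX side by something like \special{tex2star:sup-begin}) are hypothetical names chosen for illustration, not the ones the system actually uses.

```python
PREFIX = "tex2star:"  # hypothetical namespace for our own \special comments


def interpret(events):
    """Attach a semantic role to each character in a decoded DVI event list."""
    open_roles = []  # stack of annotations currently in effect
    result = []
    for kind, value in events:
        if kind == "special" and value.startswith(PREFIX):
            # "sup-begin" -> name "sup", action "begin"
            name, _, action = value[len(PREFIX):].rpartition("-")
            if action == "begin":
                open_roles.append(name)
            elif action == "end" and open_roles and open_roles[-1] == name:
                open_roles.pop()
        elif kind == "char":
            role = open_roles[-1] if open_roles else "plain"
            result.append((value, role))
    return result
```

Without the annotations, the reader would see only that the "2" in "x2y" is set in a smaller font at a raised position; with them, it can be recorded as a superscript.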
Problem: Unbalanced Tags
● Some primitives are balanced, but many are not
● Tags may affect the document for an arbitrary length of time or are local to a paragraph or specific block of text
Solution: Balancing
● Algorithm:
– Given: a database of tags (start tag, end tag, 'insert end tag' tags)
– Go through the list of tags and find one that needs help balancing
– Go forward along the list, finding the nearest tag that closes the previous tag, or the end of the document
– Insert the end tag into the list of tags
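The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's actual code: tokens are plain strings, the tag names and the RULES database are hypothetical, and each rule lists the end tag plus the tags that force the end tag to be inserted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRule:
    end_tag: str          # the tag that closes this start tag
    closers: frozenset    # tags that imply the end tag must be inserted


# Hypothetical tag database: a font switch ends the previous font.
RULES = {
    "<bold>": TagRule("</bold>", frozenset({"<roman>", "<italic>", "<par>"})),
    "<italic>": TagRule("</italic>", frozenset({"<roman>", "<bold>", "<par>"})),
}


def balance(tokens):
    """Return a copy of tokens in which every start tag has an end tag."""
    out = list(tokens)
    i = 0
    while i < len(out):
        rule = RULES.get(out[i])
        if rule is not None:
            # Scan forward to the nearest closing tag or end of document.
            j = i + 1
            while (j < len(out) and out[j] != rule.end_tag
                   and out[j] not in rule.closers):
                j += 1
            # Insert the end tag unless an explicit one was found.
            if j == len(out) or out[j] != rule.end_tag:
                out.insert(j, rule.end_tag)
        i += 1
    return out
```

Applied to the earlier font example, the token list "<bold>", "this is bold", "<roman>", "this is normal" gains a "</bold>" immediately before "<roman>".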
Post Document Editing
● Further balancing and insertion of tags may be necessary after first sweep through file
● Tables:
– OpenOffice format requires the number of columns to be specified
– We don't know how many columns will be needed until after we read the entire table
– Solution: After processing, go back and insert the needed information
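The "go back and insert" step can be sketched in Python by reserving a slot for the column declaration and filling it once the whole table has been read. The element and attribute names below follow the OpenOffice table vocabulary as best recalled and should be treated as assumptions; the function name is hypothetical.

```python
from xml.sax.saxutils import escape


def convert_table(rows):
    """Convert an iterable of rows (lists of cell strings) to table XML."""
    out = ["<table:table>"]
    slot = len(out)   # remember where the column declaration belongs
    out.append("")    # placeholder, patched after the last row
    max_cols = 0
    for row in rows:
        max_cols = max(max_cols, len(row))
        out.append("<table:table-row>")
        for cell in row:
            out.append("<table:table-cell><text:p>" + escape(cell)
                       + "</text:p></table:table-cell>")
        out.append("</table:table-row>")
    out.append("</table:table>")
    # Post-processing step: the column count is only known now.
    out[slot] = ('<table:table-column table:number-columns-repeated="%d"/>'
                 % max_cols)
    return "\n".join(out)
```

Because the rows are consumed as a stream, the widest row can appear anywhere in the table and the declaration is still correct.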
Unsolved Problems
● Footnotes:
– Defined by position on the page
– Automatic positioning conflicts with the paragraph detection tool
– Unable to distinguish between footnotes, extra paragraphs, headers, and footers
● Non-English alphabets
Conclusion
● Semantics of document are lost in TeX itself, so no hope of recovery
● Overt presentation can be recovered for editing
● Method works to translate an irregular, non-well-formed language into a regular, well-formed language (XML)