Office Open XML: a technical approach for OOo
OOoCon 2007, Barcelona, September 21st, 2007
Hubert Figuiere
Software Engineer, OpenOffice.org
Novell - [email protected]
Getting Started
What is Office Open XML?
An office application file format
XML based
Created by Microsoft...
...for Microsoft Office 2007
ECMA standard 376
Proposed to ISO
What is Office Open XML not?
Office Open XML is not OpenDocument (ISO 26300)
... nor the previous XML formats for Microsoft Office introduced in the last few MS-Office releases
... nor an ISO standard, though it has been proposed
Why support Open XML?
Support = importing from (and/or exporting to)
For interoperability reasons with Microsoft Office 2007
Overview of the format
The specification
Available to anybody as ECMA standard 376
5 PDF documents
Fundamentals
Open Packaging Conventions
Primer
Markup Language Reference
Markup Compatibility and Extensibility
173 + 129 + 472 + 5129 + 43 = 5946 pages
The specification (cont.)
Some have printed it.
OpenXML printed spec (photo by Pavel Janik, http://blog.janik.cz/archives/2007/05/19/T20_32_07/)
Packaging Conventions
A zip file: Open Package
Contains the main content...
... and the embedded content
Same container used for other Microsoft formats like XPS
Replaces the old OLE structured storage
In principle similar to OpenDocument, but not really.
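Because the package is a zip file, a filter can cheaply tell it apart from an old OLE structured storage file by looking at the first bytes. The following is a minimal sketch, not the OpenOffice.org detection code; the helper name looksLikeZipPackage is hypothetical, and it only checks the standard ZIP local file header signature ("PK" followed by 0x03 0x04).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an OpenOffice.org API): returns true when the
// buffer starts with the ZIP local file header signature 'P' 'K' 0x03 0x04.
// This says nothing about whether the zip really is an Open Package.
constexpr bool looksLikeZipPackage(const std::uint8_t* data, std::size_t size)
{
    constexpr std::uint8_t kSignature[4] = { 0x50, 0x4B, 0x03, 0x04 };
    if (size < 4)
        return false;
    for (std::size_t i = 0; i < 4; ++i)
    {
        if (data[i] != kSignature[i])
            return false;
    }
    return true;
}
```

A real detector would go on to open the archive and inspect its parts; the signature test only rules out files that are not zip files at all.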
Content
DrawingML
Diagrams, Charts, etc.
WordprocessingML
Word document
SpreadsheetML
Excel document
PresentationML
PowerPoint presentation
Heavily relies on DrawingML
Content (cont.)
Relationships
Maps embedded objects
Sets the relationships between fragments
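As a rough illustration of what a relationship does: a part refers to another fragment or embedded object by a relationship ID, and the relationships data maps that ID to a target. The sketch below assumes a simplified in-memory table; the struct, the function name resolveTarget, and the sample IDs and paths are all hypothetical and only meant to show the lookup.

```cpp
#include <cstddef>
#include <string_view>

// Simplified relationship entry: an ID used inside the source part and the
// target it points to. Real relationships also carry a type, omitted here.
struct Relationship
{
    std::string_view id;
    std::string_view target;
};

// Illustrative sample data, not taken from a real document.
constexpr Relationship kSlideRels[] = {
    { "rId1", "../slideLayouts/slideLayout1.xml" },
    { "rId2", "../media/image1.png" },
};

// Hypothetical helper: resolve a relationship ID to its target, or return an
// empty view when the ID is unknown.
constexpr std::string_view resolveTarget(const Relationship* rels,
                                         std::size_t count,
                                         std::string_view id)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        if (rels[i].id == id)
            return rels[i].target;
    }
    return {};
}
```

With this table, resolveTarget(kSlideRels, 2, "rId2") yields the image path, which is how a slide would find an embedded picture without naming the file directly.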
Content (cont.)
VML
Legacy format from Office 2000
Embedded objects
Sound files
Images
Can be anything!
I have seen a PowerPoint document with an OpenDocument chart in an OLE container that was referenced from a slide
OpenOffice implementation
Plans
Implement a native filter for Office Open XML
Import (in progress)
Export (Novell is committed to doing it)
Split in 2 modules
Target is tentatively 2.4
Novell's ooo-build 2.3 has it:
Ships with openSUSE 10.3
Will ship with other Linux distros
Joint effort between
Sun and Novell
[...] a team of 5 developers will implement 25 handlers a week, which means that we'd have all the XML handlers written in 44 weeks.
[...] Nevertheless, we've taken a little less than a year to get the converters reading the new file format.
[...] This is just for Word.
-- Rick Schaut, Mac Office team, about implementing the Office 2007 importer for Word for Mac, December 2006.
http://blogs.msdn.com/rick_schaut/archive/2006/12/07/open-xml-converters-for-mac-office.aspx
Microsoft released the beta version of the Word 2007 to RTF converter for MacOS in May 2007...
...and PowerPoint support was released July 31st 2007
Modules
Writerfilter
Word import
Refactoring of the RTF and binary doc filter
See Fridrich Strba's presentation for all the details
OOX
Excel and PowerPoint, but not Word
CWS xmlfilter02
Implements VML as well
Called by the writerfilter if needed
No XSLT
OOX is not an XSLT based filter.
Processes the XML and feeds it into the OpenOffice.org internal model
Written in C++
The fast SAX parser
5568 tokens are listed in our code
String comparisons for tokens are slow
The fast SAX parser is designed to
reduce the number of string comparisons by using a 32-bit hash for string tokens (including the XML namespace)
offer that API through UNO
It lives in the sax module
Of course it is generic and could be used anywhere
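The idea of trading string comparisons for integer comparisons can be sketched as follows. This is not the fast SAX parser's code: the table, the constant values and the function name getToken are assumptions for illustration, and a plain linear search stands in for the generated hash lookup. The point is that each element name is looked up once, and every handler afterwards compares integers only.

```cpp
#include <cstdint>
#include <string_view>

// Illustrative token values; the real list has thousands of entries.
constexpr std::int32_t XML_TOKEN_INVALID = -1;
constexpr std::int32_t XML_lnSpc  = 1;
constexpr std::int32_t XML_spcAft = 2;
constexpr std::int32_t XML_spcBef = 3;

struct TokenEntry
{
    std::string_view name;
    std::int32_t token;
};

constexpr TokenEntry kTokens[] = {
    { "lnSpc",  XML_lnSpc  },
    { "spcAft", XML_spcAft },
    { "spcBef", XML_spcBef },
};

// Hypothetical lookup: the only place where strings are compared.
// A linear scan is used here purely to keep the sketch short.
constexpr std::int32_t getToken(std::string_view name)
{
    for (const TokenEntry& entry : kTokens)
    {
        if (entry.name == name)
            return entry.token;
    }
    return XML_TOKEN_INVALID;
}
```

Once the parser has called something like getToken for an element name, the content handlers can dispatch on the returned integer with a switch statement.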
Fast parser details
Hash tokens are generated by gperf at compile time
From a list generated at compile time (OOX)
Each known string token is referenced by a constant like XML_token
XML namespace in the high-order bits of the token
Allows selecting the namespace with a simple bit-mask
Example
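switch( aToken )
{
    // Each case label combines a namespace constant with an element token.
    case NMSP_DRAWINGML|XML_lnSpc:
        // line spacing
        break;
    case NMSP_DRAWINGML|XML_spcBef:
        // spacing before the paragraph
        break;
    case NMSP_DRAWINGML|XML_spcAft:
        // spacing after the paragraph
        break;
    default:
        // token not handled here
        break;
}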
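A self-contained variant of the example, showing how the namespace can sit in the high-order bits and be selected with a mask. The bit layout (16 bits for the element, the rest for the namespace), the constant values, and the names getNamespace, Spacing and classifySpacing are assumptions made for this sketch; only the NMSP_DRAWINGML and XML_* identifiers come from the slide.

```cpp
#include <cstdint>

// Assumed layout: low 16 bits hold the element token, the bits above hold
// the namespace. The actual values in the OOX module may differ.
constexpr std::int32_t NMSP_SHIFT = 16;
constexpr std::int32_t NMSP_MASK  = 0x7FFF0000;
constexpr std::int32_t TOKEN_MASK = 0x0000FFFF;

constexpr std::int32_t NMSP_DRAWINGML      = 1 << NMSP_SHIFT;
constexpr std::int32_t NMSP_PRESENTATIONML = 2 << NMSP_SHIFT;

constexpr std::int32_t XML_lnSpc  = 1;
constexpr std::int32_t XML_spcBef = 2;
constexpr std::int32_t XML_spcAft = 3;

// Select the namespace part of a combined token with a bit-mask.
constexpr std::int32_t getNamespace(std::int32_t aToken)
{
    return aToken & NMSP_MASK;
}

// Hypothetical result type for the dispatch below.
enum class Spacing { None, Line, Before, After };

// Dispatch on namespace and element in a single integer comparison.
constexpr Spacing classifySpacing(std::int32_t aToken)
{
    switch (aToken)
    {
        case NMSP_DRAWINGML | XML_lnSpc:  return Spacing::Line;
        case NMSP_DRAWINGML | XML_spcBef: return Spacing::Before;
        case NMSP_DRAWINGML | XML_spcAft: return Spacing::After;
        default:                          return Spacing::None;
    }
}
```

The same element name in a different namespace produces a different integer, so it falls through to the default branch without any extra check.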
API
The OOX module depends only on the UNO API
Can't always get inspiration from the binary filters, which mostly use the internal APIs
Some UNO APIs are incomplete or missing
They need to be implemented
The data model
The Office Open XML data model is very close to that of the binary format
[...] XLSX may be ugly, but its concepts were very familiar from XLS. We already had much of the code required to handle it.
-- Jody Goldberg about Gnumeric Excel 2007 support,
http://blogs.gnome.org/jody/2007/09/10/odf-vs-oox-asking-the-wrong-questions/
Excel vs Calc
Excel 2007 has features that Calc lacks
Dealing with missing features in Calc:
Find a workaround
Downgrade the data
Problem with round-trip conversions
Implement the missing feature
Excel 2007 vs Excel 2003
No notable new features in the core
Overall structures are very similar
Shared string table that contains the cell strings
Sheet protection data contains the identical set of options.
Autofilter uses internal cell range names (not visible to the user) that are identical both in xlsx and xls.
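To make the shared string table concrete: a string cell stores an index into one table of strings instead of the text itself, so identical strings are stored only once. The sketch below is a simplified model with hypothetical names (kSharedStrings, sharedStringAt) and made-up sample strings; it is not the Calc or OOX import code.

```cpp
#include <cstddef>
#include <string_view>

// Illustrative shared string table: each distinct cell string appears once.
constexpr std::string_view kSharedStrings[] = { "Name", "Total", "Barcelona" };
constexpr std::size_t kSharedStringCount =
    sizeof(kSharedStrings) / sizeof(kSharedStrings[0]);

// Hypothetical helper: turn the index stored in a string cell into its text.
// Out-of-range indexes yield an empty view instead of reading past the table.
constexpr std::string_view sharedStringAt(std::size_t index)
{
    return index < kSharedStringCount ? kSharedStrings[index]
                                      : std::string_view{};
}
```

Two cells that both hold index 1 share the single "Total" entry, which is the same indirection an importer already has to handle for the binary format.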
Excel 2007 vs Excel 2003 (cont.)
Overall structures are very similar (cont.)
In both xls and xlsx formats, the pivot table record contains cached source data.
Excel allows rich text and field objects in the header and footer, and they are encoded. In both xls and xlsx, the same encoding scheme is used.
PowerPoint vs Impress
Pixel perfect rendering
People spend hours in airports refining their PowerPoint presentations...
...so the import has to be perfect
SmartArt
This is a big feature in PowerPoint 2007
Animation / transition
Both based on SMIL
PowerPoint 2007 vs PowerPoint 2003
Not many changes
SmartArt
Saving in PowerPoint 2007 as binary PPT turns it into an embedded OLE object
Of course this requires having the engine
DrawingML
A shared ML
Used directly by PresentationML
Encountered in WordprocessingML and SpreadsheetML documents.
Defines styles, shapes, text, charts, diagrams, audio/video, etc
Supposed to be more functional than VML, and therefore to replace it.
VML
Legacy Microsoft XML format
Still generated by the 2007 versions of MS applications
Replaces the binary EMF for OLE
Used by annotations in Excel
and a lot of drawing features in Word
Supposed to be superseded by DrawingML
Alternative Implementations
odf-converter (Free Software)
Microsoft sponsored ODF to Office OpenXML converter
XSLT based
Written in C# / .Net
Also runs with Mono (Free Software platform)
Free Software (MIT style license)
Currently shipped by Novell for SUSE and Windows
GNOME (Free Software)
libgsf
Implements Open Package reading and writing
Gnumeric
Import .xlsx files
Export .xlsx files (somewhat)
AbiWord
Import .docx
Both run on non-GNOME platforms like Windows
The initial importer was written on the flight to London for the ECMA meeting, and export was added on the flight back. Toss in a few hours of debugging and the sample file [...] was under a week of effort to read and write.
-- Jody Goldberg about Gnumeric Excel 2007 support,
http://blogs.gnome.org/jody/2007/09/10/odf-vs-oox-asking-the-wrong-questions/
Apple iWork '08 (non-Free)
Pages
Import and export .docx
Numbers
Import and export .xlsx
Keynote
Import and export .pptx
Questions?
Unpublished Work of Novell, Inc. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc. Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any
participating company to develop, deliver, or market a product. It
is not a commitment to deliver any material, code, or
functionality, and should not be relied upon in making purchasing
decisions. Novell, Inc. makes no representations or warranties with
respect to the contents
of this document, and specifically disclaims any express or implied
warranties of merchantability or fitness for any particular
purpose. The development, release, and timing of features or
functionality described for Novell products remains at the sole
discretion of Novell. Further, Novell, Inc. reserves the right to
revise this document and to make changes to its content, at any
time, without obligation to notify any person or entity of such
revisions or changes. All Novell marks referenced in this
presentation are trademarks or registered trademarks of Novell,
Inc. in the United States and other countries. All third-party
trademarks are the property of their respective owners.