Transcript
Page 1: Architecture Overview 2 - Microsoft · innovation. A compatibility pack is available to enable Office 2000, Office XP, and Office 2003 users to open, edit and save existing documents

Open XML Formats Training 2/20/2007 2-1

Architecture Overview 2

This module provides an introduction to the Ecma Office Open XML Formats. It includes an overview of

the history, features, and benefits of Open XML Formats and discusses how they have paved the way for

more sophisticated and powerful use of Microsoft Office products.

Goal & Objectives

The goal of this module is to provide a foundation for understanding Open XML File Formats.

After completing this module, participants will be able to:

Identify the components that comprise Open XML Formats documents.

Discuss the Open Packaging Conventions (OPC) and their relationship to the Open XML Formats.

Discuss how the Open XML Formats architecture enables advanced scenarios for

document management and integration.

Key Concepts

Open Packaging Conventions

The Open Packaging Conventions (OPC), as defined by Ecma, outline the generic conventions used in packaging the Open XML Formats.

Package

The package is a ZIP container (.zip) that holds the components (parts) that comprise the document,

as defined by the Open Packaging Conventions specification.

Parts

A part corresponds to one .xml file in a package. A package may contain several parts, for example,

workbook.xml and several sheet<n>.xml files.

Relationships

The method used to specify how the collection of parts comes together to form a document.

Relationships specify the connection between a source part and a target resource and are stored in

standard XML files and directories in the document package.

XML Schema Definition Tool (XSD)

The XML Schema Definition tool generates XML schema or common language runtime classes from

XDR, XML, and XSD files, or from classes in a runtime assembly.

XML Paper Specification (XPS)

The XML Paper Specification (XPS) describes electronic paper in a way that can be read by hardware,

read by software, and read by humans.

XML Path Language (XPath)

XML Path Language (XPath) is a query language that is used to search for and retrieve information

contained within the nodes of an XML document.
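As a small illustration of the XPath definition above, the following Python sketch runs one query against a made-up XML fragment. The element names, attribute values, and the helper `titles_for_book` are assumptions introduced only for this example, and the standard library's ElementTree module supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML used only for illustration; it is not an Open XML part.
SAMPLE = """
<library>
  <book id="1"><title>Packaging Basics</title></book>
  <book id="2"><title>Markup Reference</title></book>
</library>
"""

def titles_for_book(xml_text, book_id):
    """Return the titles of every <book> whose id attribute equals book_id."""
    root = ET.fromstring(xml_text)
    # Select <title> children of <book> elements with a matching id attribute.
    path = f"./book[@id='{book_id}']/title"
    return [title.text for title in root.findall(path)]

print(titles_for_book(SAMPLE, "2"))
```

A full XPath processor offers many more axes and functions than this subset; the sketch only shows the basic idea of selecting nodes by path and attribute.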

1. Open XML Formats Experiences for End Users and Administrators

The End-User Experience with the Open XML Formats

Users will continue to interact with the Open XML Format documents exactly as they do today and

will generally not be aware of any changes beyond the use of new file format extensions.

Organizations also have the capability to override defaults or to specify a compatibility mode

appropriate to their environment to ensure document exchange is a seamless experience for users

in their organizations.

The Open XML Formats are backward compatible with Office 2000 so everyone can benefit from this

innovation. A compatibility pack is available to enable Office 2000, Office XP, and Office 2003 users

to open, edit, and save existing documents in the new formats.

For Administrators: New File Extensions & Macro Management

Familiar fixtures of Microsoft Office, the .doc, .xls, and .ppt file extensions have been used since the

release of Office 97. When the 2007 versions of Word, Excel or PowerPoint are installed, by default

new documents will be saved using the new Open XML Formats.

Along with the new file formats, the 2007 Office system introduces new file name extensions. The

intent of the new extensions is to make it easy to differentiate the older binary file formats from the

new XML-based file formats, to identify files with embedded code, thus making for easy

interrogation or conversion, and to avoid confusion and save the added step of looking within the

file itself to determine compatibility.

The new default extensions borrow from the existing binary file extensions by appending the letter

“x” to the end of the suffix. Other Office document format types that leverage the new file

format, including templates, add-ins, and PowerPoint shows, have also been given new extensions.

Also introduced in the 2007 Office release are new extensions that distinguish macro-enabled files from those

that are macro-free. Macro-enabled documents include a file name extension that ends with the

letter "m" instead of an "x." For example, a macro-enabled Word 2007 document uses the .docm

extension. This allows any user or software application, before a document opens, to identify that it

contains potentially dangerous code. Files containing VBA, macros, or other executable code are

given the macro-enabled extension.

The new file extensions for the 2007 Office system document types include:

Default macro-free files end with ‘x’.

Macro-enabled files end with ‘m’.

Excel Binary Workbooks end with ‘b’.

The table below shows the 2007 file extension changes.

Table 1: New 2007 Microsoft Office system File Extensions

Word 2007 File Formats

.docx Word Document (.docx)

.docm Word Macro-enabled Document (.docm)

.dotx Word Template (.dotx)

.dotm Word Macro-enabled Document Template (.dotm)

Excel 2007 File Formats

.xlsx Excel Workbook (.xlsx)

.xlsm Excel Macro-enabled Workbook (.xlsm)

.xltx Excel Template (.xltx)

.xltm Excel Macro-enabled Workbook Template (.xltm)

.xlsb Excel Binary Workbook (.xlsb)

.xlam Excel Add-in (.xlam)

PowerPoint 2007 File Formats

.pptx PowerPoint Presentation (.pptx)

.pptm PowerPoint Macro-enabled Presentation (.pptm)

.ppsx PowerPoint Slide Show (.ppsx)

.ppsm PowerPoint Macro-enabled Slide Show (.ppsm)

.potx PowerPoint Template (.potx)

.potm PowerPoint Macro-enabled Presentation Template (.potm)

.ppam PowerPoint Add-in (.ppam)
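The naming rule in Table 1 lends itself to a quick check before a file is opened. The following Python sketch classifies a file name using only the extensions listed in the table; the function name `classify_extension` and the returned labels are assumptions made for this illustration, and the extension alone says nothing about what the package actually contains.

```python
import os

# Extensions copied from Table 1 above.
TABLE_1_EXTENSIONS = {
    ".docx", ".docm", ".dotx", ".dotm",
    ".xlsx", ".xlsm", ".xltx", ".xltm", ".xlsb", ".xlam",
    ".pptx", ".pptm", ".ppsx", ".ppsm", ".potx", ".potm", ".ppam",
}

# The last letter of the extension signals the kind of file.
LABELS = {"x": "macro-free", "m": "macro-enabled", "b": "binary"}

def classify_extension(filename):
    """Return 'macro-free', 'macro-enabled', 'binary', or 'unknown'."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in TABLE_1_EXTENSIONS:
        return "unknown"
    return LABELS[ext[-1]]

for name in ("report.docm", "Budget.XLSX", "data.xlsb", "legacy.doc"):
    print(name, "->", classify_extension(name))
```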

Macro-Enabled Files vs. Macro-free Files

Another change in the 2007 Office system extensions is support for macro-enabled versus macro-

free files. Macro-enabled formats separate documents that are allowed to execute embedded

macros / VBA projects. The file extension for Macro-enabled documents ends with the letter “m”

instead of an “x.” For example, a macro-enabled Word document will have a .docm extension, and

thereby allow any user or software program, before a document opens, to immediately identify if it

might contain macros or VBA projects.

Macro-Enabled vs. Macro-free files can be summed up as follows:

Macro-free files to ensure confidence that macros or VBA projects will not execute.

Separate macro-enabled file type for files containing executable macros or VBA projects.

Macro-enabled files include any file containing any VBA, Excel Macro-Sheets, or PowerPoint Action

Commands.

By default, the 2007 Office system documents saved in Office XML Formats are considered to be

macro-free files and therefore cannot contain macros or VBA projects. This behavior helps protect

organizations from unwanted macros and VBA projects. While documents can still contain and use

macros in the 2007 Office system, the user or developer will be required to specifically save these

documents as a macro-enabled document type. In fact, trying to save a file containing macros with a

macro-free extension will not succeed. This safeguard will not affect a developer’s ability to build

solutions, but will allow organizations to use documents with more confidence.

Macro-enabled files have the exact same file format as macro-free files, but contain additional parts

that macro-free files do not. The additional parts depend on the type of automation found in the

document. A macro-enabled file that uses VBA will contain a binary part that stores the VBA project.

Any Excel workbook that utilizes Excel 4.0-style macros (XLM macros) or any PowerPoint

presentation that contains action buttons is also saved as a macro-enabled file. If a code-specific

part is found in a macro-free file, whether placed there accidentally or maliciously, the Office

applications will not allow the code to execute—without exception.

Developers can now determine if any code exists within a 2007 Office document file before opening

it. Previously this “advance notice” wasn’t something that could be easily accomplished outside of

Office. A developer can inspect the package file for the existence of any code-based parts and

relationships without running Office and potentially risky code. If a file looks suspicious, a developer

can remove any parts capable of executing code from the file, so the code can cause no harm.
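The inspection described above can be sketched with Python's standard zipfile module, without starting any Office application. The sketch assumes that the VBA project is stored in a part whose name ends with vbaProject.bin; that name is not stated in this module and is an assumption here, as are the helper names and the tiny in-memory package used for the demonstration. A thorough check would also examine content types and relationships.

```python
import io
import zipfile

# Assumption: the binary VBA project part has a name ending in "vbaProject.bin".
VBA_PART_SUFFIX = "vbaproject.bin"

def find_code_parts(package):
    """Return the names of parts that look like VBA projects."""
    with zipfile.ZipFile(package) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith(VBA_PART_SUFFIX)]

def build_demo_package(with_macro):
    """Build a tiny in-memory ZIP that stands in for a real document."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")  # placeholder content
        if with_macro:
            zf.writestr("word/vbaProject.bin", b"\x00")  # placeholder bytes
    buf.seek(0)
    return buf

print(find_code_parts(build_demo_package(with_macro=True)))
print(find_code_parts(build_demo_package(with_macro=False)))
```

For a file on disk, the same function can be given a path instead of an in-memory buffer.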

2. Ecma Office Open XML Formats File Structure

Ecma Office Open XML Formats

The Ecma Office Open XML Formats are standard file formats that describe a family of XML

schemas, collectively called Office Open XML, which define the XML vocabularies for word-

processing, spreadsheet, and presentation documents, as well as the packaging of documents that

conform to these schemas. The goal is to enable the implementation of the Office Open XML

formats by the widest set of tools and platforms, fostering interoperability across office productivity

applications and line-of-business systems, as well as to support and strengthen document archival

and preservation.

Ecma International

Ecma International is a standards organization that is the approving authority for Ecma Office Open

XML Formats standards.

Find it here: http://www.ecma-international.org

Open XML Formats Specifications

The Open XML Formats specifications are modular in nature and are designed to provide the

appropriate level of depth to match the readers’ level of interest. The final draft of the Office Open

XML v1.0 format is available in five separate parts (in PDF format) as well as six accompanying

electronic annexes.

The five parts are:

Part 1 - Fundamentals

Part 2 - Open Packaging Conventions

Part 3 - Primer

Part 4 - Markup Language Reference

Part 5 - Markup Compatibility and Extensibility

The Open XML Formats specifications are located on the Ecma website:

http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50_final_draft.htm

These specifications are freely available to those who wish to implement the Open XML Formats.

The specifications are protected under the Microsoft Open Specification Promise (OSP). The OSP

guarantees that Microsoft will not enforce any patent claims that are necessary to implement the

functionality enclosed within the Open XML Formats specifications. This ensures that developers

who choose to implement the full Open XML Formats specification can do so without fear of

intervention from Microsoft.

The OSP is located here: http://www.microsoft.com/interop/osp/default.mspx

Alternative Representations of Final Draft

Because the Open XML Formats specifications are published by Ecma International, they are no

longer under the direct control of Microsoft. Ecma International will manage future updates to the

Open XML Formats, which ensures that external organizations will have consistency in file format usage

because they can adhere to a known standard published by a third-party standards body. Let’s look at

some of the details.

File Structure

Open XML Formats use generic Open Packaging Conventions principles (described in Section 3), and

offer a specific method for implementing OPC. Open XML Formats utilize XML reference schemas

and a ZIP container called a package. The schema and all its parts are housed in the package. The

combination of XML and ZIP allows for a robust and modular format that enables a large number of

new scenarios.

Each file is composed of a collection of any number of parts. This collection of parts and their

relationship parts is what defines the document. Document parts are held together by the

container (package) using the industry standard ZIP format. Most parts are simply XML files that

describe application data, metadata, and customer data stored inside the container file.

The file structure is summed up as follows:

ZIP Package (container) with compression.

Document parts that define the document.

Non-XML document parts, such as binary images or OLE objects.

Relationship parts that define the file structure.

Subdirectories that help structure the document files.

The XML and ZIP combination results in robust, tightly integrated but modular and highly flexible

XML file formats for Office documents that enable a large number of new scenarios.

Figure 1: Open XML Formats File Container

Let’s explore each component of the Office XML Formats in greater detail and discuss how the

three participating Office programs utilize the format.

ZIP Package

Many elements go into creating a Microsoft Office document. Some of these are commonly shared

across all Office applications, for example, document properties, style sheets, charts, hyperlinks,

diagrams, and drawings. Other elements are specific to each application, like worksheets in Excel,

slides in PowerPoint, or footnotes and references in Word.

When users save a document with either current or previous versions of Microsoft Office, a single

file is written to disk, which, subsequently, can be easily opened. The ‘single file’ metaphor is

important to how documents are stored, managed and shared in practice. By wrapping all of the

individual parts in a ZIP container, documents still remain a single file instance. The use of a single

package file to represent a single document means users enjoy the same experience as

today when saving and opening documents with the 2007 Office system. Even though the

program sees many individual parts, users continue to work with just a single file.

The ZIP package file container is summed up as follows:

Industry standard format.

Compression reduces storage requirements.

Users continue to experience a ‘single file’.

Developers can process files with standard tools.

With previous Office versions, the formats were structured to mirror the in-memory structures of

the applications and to run on low memory machines with slow hard drives. Developers looking to

manipulate the content of an Office document had to know how to read and write data according to

the structured storage defined within the binary file. This process is known to be complex and

challenging, notably because the Office binary file formats were designed to be primarily accessed

through the Office programs that did not support open standards. A huge amount of time was spent

decoding the binary structures. Because of this, altering Office binary files programmatically without

the Office applications has also been identified as a leading cause of file corruption, and has

deterred some developers from even attempting to make alterations to the files.

ZIP was chosen because it is a well-understood industry standard. There are many tools available

today to work with the ZIP format, and using ZIP provides a flexible, modular structure that allows

for an expansion of functionality, going forward. Therefore, developers will have access to the

complete contents of the 2007 Office system documents by using any of the numerous tools and

technologies that work with industry-standard ZIP files.

Once a container package file has been opened, developers can manipulate any of the document

parts found within the package that define the document. For instance, a developer can open a

Word document that uses Office XML formats, locate the XML part that represents the body of the

Word document, alter the part by using any technology capable of editing XML, and return the XML

part to the container package to create an updated Office document. This scenario is only one of

countless scenarios made possible by the new format.
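The open, alter, and return scenario described above can be sketched in Python with the standard zipfile module. This is a minimal illustration rather than a complete solution: the helper names, the part names, and the placeholder XML in the demonstration package are assumptions, and a real document part would need to stay valid against its schema after editing.

```python
import io
import zipfile

def replace_part(package_bytes, part_name, transform):
    """Return new package bytes with one part rewritten by transform(bytes)."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                data = transform(data)  # alter only the requested part
            dst.writestr(info.filename, data)  # copy every part to the new package
    return out.getvalue()

def make_demo_package():
    """Build a tiny stand-in package; names and content are placeholders."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<body>Hello</body>")
        zf.writestr("docProps/core.xml", "<props/>")
    return buf.getvalue()

updated = replace_part(
    make_demo_package(),
    "word/document.xml",
    lambda xml: xml.replace(b"Hello", b"Goodbye"),
)
with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    print(zf.read("word/document.xml").decode())
```

The sketch writes a new container instead of modifying the original in place, which keeps the source document untouched if the transformation fails.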

Note: To understand the composition of an Office XML Formats file, you might want to extract a file

and view the contents. To open the file, use a ZIP application, such as WinZip from WinZip Computing

Corporation, installed on your computer. Windows XP includes a “compressed folder” feature that

supports the ZIP format.
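The same exploration can be done from a script. The following Python sketch lists the items in a package with the standard zipfile module; the demonstration package is built in memory with placeholder content, and the helper names are assumptions made for this example.

```python
import io
import zipfile

def list_parts(package):
    """Return (name, uncompressed size) for every item stored in the package."""
    with zipfile.ZipFile(package) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

def make_demo_package():
    """Build a small stand-in package; the XML content is a placeholder."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf

for name, size in list_parts(make_demo_package()):
    print(f"{size:6d}  {name}")
```

To inspect a real document, pass its path to `list_parts` instead of the in-memory buffer.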

Parts

Within an Open XML Formats package, many logical pieces of the file are stored as individual files,

called parts. A part corresponds to one file in the package. Document parts are stored in the

container file or package using the industry-standard ZIP format. Most parts are XML files that

describe application data, metadata, and even customer data, stored inside the container file. This

modularity is one of the key characteristics of the file format. Modularity enables a developer to

quickly locate a specific part and work directly with just that part. Parts can be edited, exchanged, or

even removed depending on the desired outcome of a specific business need.

All the Office programs share some of the same types of parts, such as the document properties,

thumbnail, metadata, and relationship parts. Some parts are unique to the application document

type they represent. For example, Word 2007 creates document-related parts, Excel creates

spreadsheet-related parts, and PowerPoint 2007 creates parts related to slide presentations. A

worksheet part will only be found in an Excel document, while a slide master part will only appear in

a PowerPoint document. The types of parts and the number of parts that a package contains will

depend on the application that creates the ZIP container file and the contents of the document.

Parts can be summed up as follows:

Modular pieces that make up an Office file.

Each part is a ‘file’ itself.

Primarily XML (exceptions are files preserved in native format).

Native formats used for developer convenience (binary images, OLE objects, VBA code, etc.).

XML Parts

Parts can be different physical content types. Parts used to describe data used by Office applications

are stored as XML. These parts conform to the XML reference schema that defines the associated

Office feature or object.

For example, in an Excel file the data that represents a worksheet is found in an XML part that

adheres to the Office schema for an Excel Worksheet. Specifically, worksheets are stored in the

worksheets directory and are named sheet1.xml, sheet2.xml and so on. Then, by using standard XML

technologies, developers can apply their knowledge of the Office schemas to easily parse and create

the 2007 Office system documents.

Non-XML Parts

In some cases, it is advantageous to have parts stored in their native content type. These parts are

not stored as XML. Non-XML parts supported as native files may also be included in the package,

such as binary files representing images, OLE objects embedded in the document, or Visual Basic

scripts. These non-XML parts are stored in the /media or /embeddings sub-directory.

Images in an Office document are stored as binary files (.png, .jpg, and so on) within the document

package. Therefore, you can open the package container by using a ZIP utility and immediately view,

edit, or replace the image in its native format. Not only is this storage approach more accessible, but

it requires less internal processing and disk space than storing an image as encoded XML.

Other notable parts stored as binary parts are VBA projects and embedded OLE objects. Embedded

OLE objects are binary only if the associated OLE server provides only a binary representation.

Embedded documents embed their contents as another XML package.

Part Names

An Office Open XML part name contains only ASCII characters, in escaped or non-escaped form.

The following ASCII characters are permitted in non-escaped form:

"!", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", "."

the decimal digits "0"–"9", ":", ";", "=", "@",

the Latin alphabetic characters "A"–"Z" and "a"–"z", "_", and "~".

All other ASCII characters are permitted only when escaped as an encoded triplet of the form "%HH",

where H is a hexadecimal digit.
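The character rules above can be expressed as a short Python sketch. It follows the list literally, so "%" counts as permitted, and it handles a single name segment only; how the "/" separator between segments is treated is outside this sketch. The function names are assumptions introduced for illustration.

```python
import string

# Characters the text above lists as permitted in non-escaped form.
UNESCAPED = (
    set("!$%&'()*+,-.:;=@_~")
    | set(string.digits)
    | set(string.ascii_letters)
)

def uses_only_permitted_characters(segment):
    """Return True if every character may appear in non-escaped form."""
    return all(ch in UNESCAPED for ch in segment)

def escape_segment(text):
    """Replace other ASCII characters with %HH triplets."""
    pieces = []
    for ch in text:
        if ch in UNESCAPED:
            pieces.append(ch)
        elif ord(ch) < 128:
            pieces.append(f"%{ord(ch):02X}")  # two hexadecimal digits
        else:
            raise ValueError(f"non-ASCII character not handled by this sketch: {ch!r}")
    return "".join(pieces)

print(escape_segment("my file.xml"))
print(uses_only_permitted_characters("sheet1.xml"))
```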

Relationship Parts

Parts can be related to other parts. Relationships are the glue that holds parts together.

Relationships provide the structure for an Office file.

While the parts make up the content of the file, the relationships describe how the pieces of content

work together.

Whereas parts are the individual elements that make up an Office document, relationships specify

how the collection of parts comes together to form the actual document. Relationships are defined

by using XML, which specifies the connection between a source part and a target resource. For

example, the connection between a slide and an image that appears in that slide is identified by a

relationship. Relationships themselves are stored within XML parts or “relationship parts” in the

document container. If a source part has multiple relationships, all subsequent relationships are

listed in the same XML relationship part.

Relationships can be summed up as follows:

A part that provides the connection between two other parts.

Connections described using XML.

Defines the file format structure with easy navigation.

Can reference external, linked resources.

Relationships play a key role in Office XML Formats, and every part that appears in the document is

referenced by at least one relationship.

The implementation of relationships makes it possible for parts to never directly reference other

parts, and connections between parts are directly discoverable without having to look within the

content of parts. Within parts, all references to relationships are represented using a Relationship

ID, which allows all connections between parts to stay independent of content-specific schema.

Note: An easy way to edit a document is to edit only its relationships; however, this process may leave

orphaned XML in the package that could still be retrieved. Such XML produces nothing visible on the

screen because its relationship has been modified or deleted, but it remains in the file.

Figure 2: High-level Relationship Diagram (diagram of a workbook package: Document Properties, Application Properties, Custom Doc. Props., Workbook, Sheet 1, Sheet 2, Sheet 3, Styles, Chart, and Strings parts, connected by relationships)

The following example shows a relationship part in an Excel 2007 workbook containing two

worksheets:

It is important to note that relationships represent not only internal document references but also

external resources. For example, linked pictures or objects in a document are represented using

relationships as well. The files in native format are kept in separate directories called \media and

\embeddings. This makes links in a document to external sources easy to locate, inspect and alter. It

offers developers the opportunity to repair broken external links, validate unfamiliar sources or

remove potentially harmful links.
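Locating external links through relationships can be sketched in Python as follows. The sample relationship part is invented for illustration: its Ids, placeholder Type values, and targets are hypothetical. The relationships namespace URI is assumed to be the one used by the published packaging standard.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of relationship parts in the published standard.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical relationship part; Ids, Types, and Targets are placeholders.
SAMPLE_RELS = f"""
<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="urn:example:image" Target="media/image1.png"/>
  <Relationship Id="rId2" Type="urn:example:image"
                Target="http://example.invalid/logo.png" TargetMode="External"/>
</Relationships>
"""

def external_targets(rels_xml):
    """Return (Id, Target) for every relationship marked as external."""
    root = ET.fromstring(rels_xml)
    return [
        (rel.get("Id"), rel.get("Target"))
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
        if rel.get("TargetMode") == "External"
    ]

print(external_targets(SAMPLE_RELS))
```

Because the check reads only the relationship part, no document-specific markup has to be parsed to find the links.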

The use of relationships in Office XML Formats benefits developers in a number of ways.

Relationships simplify the process of locating content within a document because you do not need

to parse document-specific XML to find parts — neither do you need to do so to find internal and

external document resources. Relationships allow you to quickly take inventory of all the content

within a document.

For example, if you need to count the number of worksheets in an Excel workbook, you can inspect

the relationships for how many sheet parts exist. You can also use relationships to examine the type

of content in a document. This is helpful in instances where you need to identify if a document

contains a particular type of content that may be harmful, such as an OLE object that is suspect, or

helpful, as in a scenario where you want to extract all JPEG images from a document for reuse

elsewhere.
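Counting worksheets through relationships might look like the following Python sketch. The sample relationship part for a two-sheet workbook is illustrative, not taken from a real file, and the namespace and relationship type URIs are assumptions based on the published standard. The sketch matches on the end of the Type value so that it does not depend on the exact URI prefix.

```python
import xml.etree.ElementTree as ET

# Assumptions: URIs as used by the published standard.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Illustrative relationship part for a workbook with two worksheets.
SAMPLE_WORKBOOK_RELS = f"""
<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="{DOC_REL}/worksheet" Target="worksheets/sheet1.xml"/>
  <Relationship Id="rId2" Type="{DOC_REL}/worksheet" Target="worksheets/sheet2.xml"/>
  <Relationship Id="rId3" Type="{DOC_REL}/styles" Target="styles.xml"/>
  <Relationship Id="rId4" Type="{DOC_REL}/sharedStrings" Target="sharedStrings.xml"/>
</Relationships>
"""

def targets_of_type(rels_xml, type_suffix):
    """Return the targets of all relationships whose Type ends with type_suffix."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("Type", "").endswith(type_suffix)
    ]

sheets = targets_of_type(SAMPLE_WORKBOOK_RELS, "/worksheet")
print(len(sheets), sheets)
```

The same function can take a different suffix to inventory other kinds of content, such as images.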

Relationships also allow developers to manipulate documents without having to learn application

specific syntax or content markup. For example, without any knowledge of how to program

PowerPoint, a developer solution could easily remove extraneous slides from a presentation by

editing the document’s relationships.

The figure below shows the file structure of the ZIP container in a word processing document with

the files in the word folder:

Figure 3: Word File Structure & Files

The ‘document.xml’ file contains most of the text. Other files have names that describe the

contents. The media sub-directory contains binary files as well as other files that are not part of the

Open XML specification. Some non-XML content may be stored in another directory called

embeddings. The _rels sub-directory contains the XML files that define the relationships. For

example, the beginning of document.xml (the body of a WordML file) might begin like this:

Figure 4: document.xml

Relationships are defined in files contained in the _rels directory. The beginning of the

document.xml.rels file that defines relationships looks like this:

Figure 5: Relationships
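The way relationship files are named inside the _rels directory can be sketched in Python. The sketch assumes the usual packaging convention: the relationships for a part live in a _rels folder next to that part, in a file named after the part with .rels appended, and package-level relationships live in _rels/.rels. The function name is an assumption made for this example.

```python
import posixpath

def relationships_part_for(part_name):
    """Return the name of the relationship part that belongs to part_name."""
    directory, filename = posixpath.split(part_name)
    # An empty part_name stands for the package itself.
    return posixpath.join(directory, "_rels", filename + ".rels")

print(relationships_part_for("word/document.xml"))
print(relationships_part_for("xl/workbook.xml"))
print(relationships_part_for(""))
```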

The same ZIP item can be the target of multiple relationships. It is worth noting that having

multiple paths to a target can make access to that target more convenient. Certain relationships are

explicit, meaning the resource is referenced from a source part’s XML using the ID attribute of a

Relationship tag.

For example, a document part can have a relationship to a hyperlink only if that hyperlink's

Relationship element’s Id attribute value is referenced explicitly by the document part’s XML.

Because this mechanism is used generically across multiple tag types, explicit relationships can be

extracted from an Office Open XML document without prior knowledge of tag semantics. All other

relationships are implicit. The syntax for specifying an implicit relationship varies among tag types.

As another example, consider a WordProcessingML document that contains the following footnote

sentence fragment, "… produced by Ecma (http://www.ecma-international.org).", which contains a

footnote and a hyperlink to a web site. The relationship from a source to a footnote is implicit while

that to a hyperlink is explicit.

The Main Document part’s relationship file contains the following:

<Relationships …>

<Relationship Id="rId5" Type="…/footnotes"

Target="footnotes.xml"/>

<Relationship Id="rId7" Type="…/hyperlink"

Target="http://www.ecma-international.org/" TargetMode="External"/>

</Relationships>

All footnotes for a WordProcessingML document are contained in the same Footnotes part. The

following shows how the Main Document refers to the footnote. At the point at which the footnote

reference is inserted, the following XML is present:

<w:r>

<w:footnoteReference w:id="2"/>

</w:r>

The w:id="2" attribute refers to the footnote with id=2 in the Footnotes part. The relevant footnote is:

<w:footnote w:id="2">

Ecma is an international standards development organization (SDO).

</w:footnote>

In the case of the hyperlink, the main document part makes an explicit reference to this relationship

when it refers to the hyperlink, by using the following:

<w:hyperlink r:id="rId7" w:history="1">

</w:hyperlink>
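Resolving an explicit relationship such as the hyperlink above can be sketched in Python: read the r:id value from the document markup, then look up the Relationship with the same Id in the relationship part. The namespace URIs are assumptions based on the published standard, the Type values are placeholders because the originals are elided in this module, and the surrounding paragraph element is added only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Assumptions: namespace URIs as used by the published standard.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Fragment modeled on the example above; the <w:p> wrapper is illustrative.
DOCUMENT_FRAGMENT = f"""
<w:p xmlns:w="{W_NS}" xmlns:r="{R_NS}">
  <w:hyperlink r:id="rId7" w:history="1"/>
</w:p>
"""

# Type values are placeholders; the real ones are elided in the text above.
RELS = f"""
<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId5" Type="urn:example:footnotes" Target="footnotes.xml"/>
  <Relationship Id="rId7" Type="urn:example:hyperlink"
                Target="http://www.ecma-international.org/" TargetMode="External"/>
</Relationships>
"""

def resolve_hyperlinks(document_xml, rels_xml):
    """Return the target of every w:hyperlink, looked up by its r:id."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{REL_NS}}}Relationship")
    }
    doc = ET.fromstring(document_xml)
    return [
        targets.get(link.get(f"{{{R_NS}}}id"))
        for link in doc.iter(f"{{{W_NS}}}hyperlink")
    ]

print(resolve_hyperlinks(DOCUMENT_FRAGMENT, RELS))
```

Because the lookup uses only the Id attribute, the same approach works for other elements that reference relationships explicitly.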

The following is an example of a relationship schema:

Figure 6: Relationship Schema

3. Open XML Formats for Microsoft Office Word

WordProcessingML

Written almost entirely in XML, Word 2007 dramatically advances XML capabilities. For the end

user, 2007 Office improves its Word application with exciting innovations such as the new Ribbon

UI, and bibliography, citation, and equation features. The equation feature allows for professional-

looking formatting of complicated mathematical equations.

In addition to these new features, developers and template designers can achieve greater flexibility,

modularity and power into their solutions with the addition of XML mapping, content controls, and,

of course, the new WordProcessingML XML Format.

This format, like all the other Office Open XML Formats, utilizes a compressed ZIP package that

holds all of the Part items, Content Type items and Relationship items in the document. The

compressed package reduces file size, is more robust against errors in transmission and

handling, and makes it easier to manipulate file content using industry-standard ZIP-based tools. If a

portion of the file becomes corrupt, the compartmentalization of the different document parts

allows the file to open, even if one part is damaged.

Design Goals for WordProcessingML

Microsoft Office 2003 introduced the first WordML schema. The WordML schema was a huge step

forward because it was the first full-fidelity XML file format provided by Microsoft Office. With

Microsoft Office 2003 you can parse, manipulate, update, or add data to WordML files. However, a

few limitations exist. For example, you must encode binary data (such as images) as text within the

XML file itself, which increases file size when working with a document containing many images.

Additionally, Word 2003 embeds all custom XML data directly into the WordML that describes the

document, which can make custom XML difficult to access and manipulate from external processes.

The XML file format in Word 2007 solves these issues by dividing the file into document parts, each

of which defines individual pieces of the overall contents of the file. To change something in the file,

simply find the document part, such as the header, and edit it without having to worry about

accidentally modifying other document parts. Similarly, working with custom XML is easier because

it is also contained in its own part so documents can be generated programmatically with less code.

With the XML file format, developers have unprecedented access to Word files. Template designers can create robust and rich templates in a much easier and more intuitive manner than in previous versions of Word. The possibilities and ease with which developers can program using the Word XML Format are impressive and mark a significant advancement in Microsoft Office.

For example:

In Word 2003, protecting only portions of a template required detailed work and the process was not

intuitive. In Word 2007, content controls simplify this process.

In Word 2003, including custom XML required the use of XML elements on the document's surface.

This made the document fragile because an end user unfamiliar with XML could accidentally break

the document by deleting an element. Custom XML parts and XML mapping prevent this scenario. In

Word 2007, documents can use events to update their content intelligently from an XML source with

no interaction on the part of the user.

Core Architecture and WordProcessingML Parts

To facilitate construction, assembly, and reuse of Word 2007 documents by third-party processes

and tools, Word divides the contents of the ZIP package into several logical parts that each store a

specific document part. The package parts, when aggregated, compose the document.

These parts can consist of XML files, such as the document parts that contain the markup for the

Word XML Format, as well as non-XML attached contents, such as pictures or OLE–embedded files

in their native format.

A Word 2007 file could contain (but is not limited to) the following folders and files:

docProps folder. Contains the application's properties parts.

_rels folder. Stores the relationship part for any given part; for example, the relationships of ‘document.xml’ are stored in a _rels folder located next to it.

.rels file. Called a relationship part, .rels files describe the relationships that begin the document

structure. Relationship files are regular XML files and do not use any special extension. You can

recognize relationship files because they are always located in a _rels directory.

datastore folder. Contains custom XML data parts within the document. A custom XML data part is

an XML file from which you can bind nodes to content controls in the document.

item1.xml file. Contains some of the data that appears in the document. Example of a custom XML

data part.

App.xml file. Contains application-specific properties.

Core.xml file. Contains common file properties for all files based on the Open Packaging Conventions

document format.

[Content_Types].xml. Describes the content type for each part that appears in the file.

Figure 7: Sample Hierarchical file structure of a typical Word 2007 document

It is important to note that, with a few exceptions defined in the Open Packaging Conventions, the

actual file directory structure is arbitrary.

You can replace entire document parts in order to change the content, properties, or formatting of

Word 2007 documents. You can rearrange and rename the parts of a Word file inside its ZIP

container. However, the relationships of the files within the package, not the file structure itself, are

what determine file validity. The relationships need to be properly updated so the document parts

continue to relate to one another as designed. If the relationships are accurate, the file opens

without error. As long as the relationships are kept current, the file structure can change.

Developers can easily determine the composition of the document. An easy way to look inside the

new file format is to rename the .docx file with a .zip extension. Double-click the renamed file (or

choose the “Open with” feature), to open the file and look at its contents. Inside the file, you can

see the document parts that make up the file, along with the relationships that describe how the

parts interact with one another.
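
Renaming is not strictly necessary for inspection: any ZIP library can read the package directly. The sketch below uses Python's standard zipfile module to list the items in a package. The demo package it builds is a hypothetical stand-in with placeholder contents, not a complete Word document.

```python
import io
import zipfile


def list_parts(package):
    """Return (name, uncompressed_size) for each item in an Open XML package.

    package may be a path to a .docx, .xlsx, or .pptx file, or a binary
    file object.
    """
    with zipfile.ZipFile(package) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def build_demo_package():
    """Build a tiny in-memory stand-in for a .docx (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr("word/document.xml", "<document/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, size in list_parts(build_demo_package()):
        print(f"{size:>8}  {name}")
```

Passing the path of a real .docx file to list_parts should produce a similar listing of its parts and relationship files.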

The following figure displays the contents of a sample report document in a ZIP file and shows the logical parts and document parts contained within that document. Circled is the logical part called the _rels folder, which contains several document parts, including the document.xml. This relationship folder, as well as other logical parts, can contain any number of document parts and is not limited to the files shown in the example.

Figure 8: Contents of a Sample Project

Word 2007 Content Types

Each document part contains a specific content type. The content type of a part describes the

contents of that file type. In the case of the XML parts that contain the markup defined by the Word

XML Format, the content type can help determine its composition. Content types can be used when

programmatically manipulating the contents of a Word 2007 file.

A typical content type begins with the word application and is followed by the vendor name. In the content type, the word vendor is abbreviated to vnd. Content types for WordProcessingML parts begin with application/vnd.openxmlformats-officedocument.wordprocessingml, as in the examples below. If a part is an XML file, its content type ends with +xml. Non-XML content types, such as images, do not use this addition.

Some typical content types include:

Endnotes: Content type for a document part that describes endnotes within a Word document. The +xml indicates that it is an XML file.

application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml

Core document properties: Content type for a part that describes the core document properties. The

+xml indicates that it is an XML file.

application/vnd.openxmlformats-package.core-properties+xml

Image: Content type for an image. The +xml portion is not present, which indicates that this content

type is not an XML file.

image/png
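
To illustrate how content types can drive programmatic handling, the following sketch looks up a part's content type in a [Content_Types].xml part and applies the +xml rule described above. The sample XML is a shortened, illustrative fragment, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Illustrative [Content_Types].xml; a real package usually lists more parts.
SAMPLE_CONTENT_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="png" ContentType="image/png"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Override PartName="/word/endnotes.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
  <Override PartName="/docProps/core.xml"
      ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>"""


def content_type_for(part_name, types_xml):
    """Return the content type of part_name, or None if none is declared."""
    root = ET.fromstring(types_xml)
    # An Override entry for the exact part name takes precedence.
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    # Otherwise fall back to the Default entry for the file extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


def is_xml_part(content_type):
    """Apply the +xml suffix rule; plain application/xml would need its own check."""
    return content_type is not None and content_type.endswith("+xml")
```

A tool could use is_xml_part to decide whether to hand a part to an XML parser or treat it as opaque binary data.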

Identifying Non-XML Parts in Word 2007 Documents

Embedded non-XML parts in a Word 2007 document are stored in their native format. Therefore, if you add a picture to a document, you can rename the document with a .zip extension and open it as you would any ZIP file.

Within the package, you can locate the picture and open it as well. If the picture is in a .png format,

you can see and open a .png file directly from the package. Similarly, if you embed a Microsoft Office

Visio document inside a Word 2007 document, you can locate the file as a .bin file inside the package.

This creates many possibilities for developers with files stored on a server. Consider a company with

hundreds of documents on a server that all contain the same corporate logo image. If the corporate

logo changes, you can implement a simple script to replace the old logo with the new logo for every

document.

The default location for images in a package is the /word/media directory and the default location for

embedded objects in a package is /word/embeddings.
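
The logo replacement scenario can be sketched in a few lines of Python. This is one possible approach under stated assumptions: the function name is hypothetical, the logo is assumed to be stored under the same part name (such as word/media/image1.png) in every document, and the replacement image is assumed to use the same format so that content types and relationships stay valid.

```python
import io
import zipfile


def replace_part(src, dst, part_name, new_bytes):
    """Copy package src to dst, swapping the bytes of one part.

    src and dst may be paths or binary file objects. Returns True if
    part_name was found and replaced. Part names and relationships are
    left untouched.
    """
    replaced = False
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == part_name:
                data = new_bytes
                replaced = True
            zout.writestr(info.filename, data)
    return replaced


if __name__ == "__main__":
    # Build a placeholder package in memory, then swap its image part.
    src = io.BytesIO()
    with zipfile.ZipFile(src, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/media/image1.png", b"old-logo-bytes")
    src.seek(0)
    dst = io.BytesIO()
    print(replace_part(src, dst, "word/media/image1.png", b"new-logo-bytes"))
```

Note that names inside the ZIP archive carry no leading slash, so the part /word/media/image1.png is addressed as word/media/image1.png.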

The figure below shows the directory hierarchical file structure of a Word 2007 document that

contains images and embedded objects.

Figure 9: Directory Structure

It is easy to change, add, or delete data in a Word 2007 file, either programmatically or manually. The file is easily accessible using the classes in the Microsoft WinFX System.IO.Packaging namespace. Documents can be modified on a server with only a few lines of code, and custom XML data can be readily accessed and manipulated because it is stored in its own separate parts. Even events can be used to trigger the change of XML data. For example, you can map a content control to an XML element containing a stock quote and then retrieve the most recent quote programmatically each time the document opens, thereby ensuring that the user always sees the current price.

Content Controls

Content controls are predefined pieces of content that can be positioned anywhere in the document.

There are numerous types of content controls, including text boxes, drop-down menus, calendar

controls, combo boxes, and pictures.

You can map these content controls to an element in an XML file. This ability to map content

eliminates certain vulnerabilities present when working with XML in Word 2003 and results in more

robust documents. Using XML Path Language (XPath), you can programmatically map content in an

XML file to a content control, which enables you to write a simple and short application to

manipulate and modify data in a document.

Word 2003 introduced the ability to attach an XML schema to a document. You could add elements

from an XML file, provided they conformed to the schema. This helped create a robust document

structure that allowed easier access to data. However, the most significant limitation was that the

presentation and custom XML data were linked through the document editing surface. Consequently,

an end user could accidentally delete part of the XML structure used to define the document,

thereby invalidating the document's XML structure according to its schema. Word 2007 addresses

this issue with the addition of content controls.

The new features in Word 2007 are designed to make the application a highly reliable platform for

document-based solutions, including structured document assembly, data capture/extraction, and

document construction. Content controls provide template creators the ability to more easily

structure arbitrary pieces of a Word 2007 document by using semantics, content restrictions, and

behaviors.

Also, in previous versions of Word, it was difficult to lock pieces of content in a document. In Word 2007, content controls simplify this process, allowing you to lock content, either programmatically or through the UI, to prevent users from editing or deleting it. This is a great improvement in template creation.

The following figure shows a plain-text content control.

Figure 10: Plain-text Content Control

XML mapping

The XML mapping feature in Word 2007 lets you create a link between a document and an XML file. This creates true data/view separation between the document formatting and custom XML data.

Portions of a document template can be populated with data from an XML file using XML mapping. Using the object model, you can add structured custom data (stored in any number of XML files) to the document and map the data to specific content controls. With the advent of the Word 2007 XML Format, programmatic access to this data has never been easier.

XML mapping allows for many possible scenarios in which data behind the document is automatically

updated using events on Content Control objects. One scenario involves a document with stock data

attached to it. In this scenario, you can programmatically update stock quotes in XML format to

reflect new daily price changes so that the user does not have to do anything.

Events, such as the Open event of the Document object, can be used to trigger the document to

perform an action. In this scenario, when the user opens a document, add-ins can be used to retrieve

updated stock prices and push them into the document's XML data store. XPath can then be used to

map the elements in which these stock prices are stored to content controls in the document.

Imagine that, as the template author, you create a table that is meant to contain stock data. Next, you insert text controls in the cells that display stock quotes, with one quote per cell. Programmatically, each control is mapped to the appropriate element in the appropriate CustomXMLPart object. Think of a CustomXMLPart object as a data store. By default, CustomXMLPart objects are stored in a directory called datastore.

The following figure shows a content control object that is meant to contain stock data.

Figure 11: Content Control Object
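
Because the stock data lives in its own custom XML part, an external process can update it without touching the document markup. The following is a minimal sketch of that idea; the element names, ticker symbols, and function name are hypothetical, and the sample stands in for a custom XML data part such as item1.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom XML data part holding stock quotes.
SAMPLE_ITEM = """<stocks>
  <stock symbol="AAA"><price>10.00</price></stock>
  <stock symbol="BBB"><price>20.00</price></stock>
</stocks>"""


def update_quotes(item_xml, quotes):
    """Return item_xml with each matching <price> replaced by a new value.

    quotes maps a ticker symbol to its new price; symbols that do not
    appear in the XML are ignored. ElementTree supports only a subset of
    XPath, which is enough for this lookup.
    """
    root = ET.fromstring(item_xml)
    for symbol, price in quotes.items():
        node = root.find(f"./stock[@symbol='{symbol}']/price")
        if node is not None:
            node.text = f"{price:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(update_quotes(SAMPLE_ITEM, {"AAA": 12.5}))
```

Content controls mapped to these price elements would then display the updated values the next time the document is opened.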

4. Open XML Formats for Microsoft Office Excel

SpreadsheetML

From an end-user perspective, Office Excel 2007 offers a range of capabilities to analyze,

communicate, share, and manage information. With the new, results-oriented interface, users have

easy access to relevant tools and can discern key trends at a glance by applying conditional

formatting that helps you visualize your business performance in a graphically rich way. With

PivotTable views that are much easier to assemble, Excel 2007 delivers powerful tools to help users

organize and understand business data. Users can summarize their analysis in professional-looking

charts by using intuitive galleries and publish spreadsheets to Office SharePoint Server 2007 to share

and manage sensitive business information with greater confidence and control.

From a developer's perspective, Microsoft Office Open XML Formats reside at the heart of the new Office Excel 2007. A compact and robust file format, the Open XML Formats allow for better data integration between documents and back-end systems.

Design Goals for SpreadsheetML

The binary file formats currently in use were designed in 1994, before XML became widely used and

before the widespread exchange of documents and data that is common today. Many organizations

have stated a clear preference for open, XML-based file formats for the future, while some users

continue to choose binary file formats. Excel 2007 offers a broad choice in file formats, either binary

or XML, to enable people to choose the format best suited for their needs.

Although the majority of users don't really care about what kind of format they are using, some want the formats to play a more vital role in business processes, which is why Microsoft moved to the XML formats. The Excel 2007 SpreadsheetML schema underwent serious work to establish it as the default format. Developers can easily build solutions on top of the formats, but at the same time, the average end user can take advantage of new features without noticing any negative differences from the change.

Microsoft needed to restructure SpreadsheetML from the original design because the former SpreadsheetML had two issues: it wasn't full fidelity, and it wasn't optimized for performance or file size. The term "full fidelity" means that everything in the file can be saved into the format without fear of it being modified or lost. The old SpreadsheetML format didn't support a number of features, such as images, charts, and objects, so Microsoft needed to add all those additional items to the format.

The second part (performance) is an important and challenging issue. The combination of ZIP and

XML enables Excel 2007 to achieve optimized performance and the XML is written in such a way that

it can be parsed extremely efficiently so the file open and save experiences would not get

significantly slower.

If, for any reason, an Excel 2007 workbook loads or performs slowly it could be saved using the new

.xlsb (binary) format for that file, which is optimized for extremely large workbooks. Most people,

however, will not need to use this new binary format, because the overwhelming majority of Excel

workbooks will load in under a couple of seconds.

Core Architecture and SpreadsheetML Parts

A SpreadsheetML directory structure is similar to other structures in that it also contains document

and relationship parts.

Figure 12: Excel XML Directory Structure

Minimum Workbook Scenario

For the sake of simplicity, it is important to minimize the required set of workbook properties that

must be present to compose a valid workbook. The smallest possible (blank) workbook must contain

the following:

A single sheet

A sheet ID

A relationship Id that points to the location of the sheet definition

For example:

<workbook>

<sheets>

<sheet name="Sheet1" sheetId="1" r:id="rId1"/>

</sheets>

</workbook>

Consider the following graphical representation of a workbook:

Figure 13: Excel 2007 Worksheet

The above example will have the following workbook properties definition:

The elements and attributes used here are discussed in more detail in Part 3 of the Ecma

specifications documents.

Sheets

Sheets are the central structures within a workbook, and are where a user does most of their

spreadsheet work. The most common type of sheet is the worksheet, which is represented as a grid

of cells. Worksheet cells can contain text, numbers, dates, and formulas. Cells can also be formatted.

A workbook usually contains more than one sheet. To aid in the analysis of data and the making of

informed decisions, spreadsheet applications often implement features and objects which help

calculate, sort, filter, organize, and graphically display information. Since these features are often

connected very tightly with the spreadsheet grid, these are also included in the sheet definition on

disk. Other types of sheets include chart sheets and dialog sheets.

The smallest possible (blank) sheet is as follows:

<worksheet>

<sheetData/>

</worksheet>

The empty sheetData collection represents an empty grid; this element is required. As defined in the

schema, some optional sheet property collections can appear before sheetData, and some can

appear after. To simplify the logic required to insert a new sheetData collection into an existing (but

empty) sheet, the sheetData collection is required, even when empty.
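
The minimal workbook and minimal sheet shown above can be assembled into a package with a short script. The sketch below is an assumption-laden illustration: the part names follow the common xl/ layout, the namespaces and content types are those of the published Open XML specification, and whether a particular Excel version accepts this exact package has not been verified here.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_PREFIX = "application/vnd.openxmlformats-officedocument.spreadsheetml"

# Each entry maps a name inside the ZIP package to the XML text of that part.
PARTS = {
    "[Content_Types].xml": (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" '
        'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        f'<Override PartName="/xl/workbook.xml" ContentType="{CT_PREFIX}.sheet.main+xml"/>'
        f'<Override PartName="/xl/worksheets/sheet1.xml" ContentType="{CT_PREFIX}.worksheet+xml"/>'
        '</Types>'
    ),
    "_rels/.rels": (
        f'<Relationships xmlns="{PKG_REL_NS}">'
        f'<Relationship Id="rId1" Type="{REL_NS}/officeDocument" Target="xl/workbook.xml"/>'
        '</Relationships>'
    ),
    "xl/workbook.xml": (
        f'<workbook xmlns="{MAIN_NS}" xmlns:r="{REL_NS}">'
        '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets>'
        '</workbook>'
    ),
    "xl/_rels/workbook.xml.rels": (
        f'<Relationships xmlns="{PKG_REL_NS}">'
        f'<Relationship Id="rId1" Type="{REL_NS}/worksheet" Target="worksheets/sheet1.xml"/>'
        '</Relationships>'
    ),
    "xl/worksheets/sheet1.xml": f'<worksheet xmlns="{MAIN_NS}"><sheetData/></worksheet>',
}


def build_minimal_workbook(target):
    """Write the parts above into target (a path or binary file object)."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml_text in PARTS.items():
            zf.writestr(name, xml_text)


def sheet_targets(package):
    """Map each sheet name to the part its r:id relationship points at."""
    with zipfile.ZipFile(package) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
    targets = {rel.get("Id"): rel.get("Target") for rel in rels}
    return {
        sheet.get("name"): targets[sheet.get(f"{{{REL_NS}}}id")]
        for sheet in workbook.iter(f"{{{MAIN_NS}}}sheet")
    }


if __name__ == "__main__":
    buf = io.BytesIO()
    build_minimal_workbook(buf)
    buf.seek(0)
    print(sheet_targets(buf))
```

The sheet_targets helper follows the same path a consumer would: read the sheet's relationship Id from the workbook part, then resolve it through the workbook's relationships part to find the sheet definition.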

Shared String Table

A workbook may contain thousands of cells containing string (non-numeric) data. Furthermore, this

data is very likely to be repeated across many rows or columns. The goal of implementing a single

string table that is shared across the workbook is to improve performance in opening and saving the

file by only reading and writing the repetitive information once.

For example, consider a workbook summarizing information for cities within various countries. There may be a column for the name of the country, a column for the name of each city in that country, and a column containing the data for each city. In this case, the country name is repetitive, being duplicated in many cells. In many cases, the repetition is extensive, and a tremendous savings is realized by making use of a shared string table when saving the workbook.

Figure 14: Spreadsheet Example

The file architecture would look like this:

Figure 15: Excel File Architecture

There is a single shared strings part for all the strings in a workbook. This part is related to the

workbook. Each cell (in sheet1.xml, for example) containing a string value refers by index to a string

expressed in the shared strings part. The solid arrows represent relationships among the parts and

the dotted arrows represent references by index to a string in the shared strings part.
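
The idea behind the shared string table can be sketched independently of the file format: store each distinct string once and let cells refer to it by index. The function name and sample data below are hypothetical, and the sketch models only the indexing, not the XML of the shared strings part.

```python
def build_shared_strings(rows):
    """Return (table, indexed_rows) for a grid of string cells.

    table lists each distinct string once, in first-seen order;
    indexed_rows mirrors rows with each string replaced by its index
    into table.
    """
    table = []
    index_of = {}
    indexed_rows = []
    for row in rows:
        indexed = []
        for value in row:
            if value not in index_of:
                index_of[value] = len(table)
                table.append(value)
            indexed.append(index_of[value])
        indexed_rows.append(indexed)
    return table, indexed_rows


if __name__ == "__main__":
    rows = [
        ["Canada", "Toronto"],
        ["Canada", "Vancouver"],
        ["Mexico", "Oaxaca"],
    ]
    table, indexed = build_shared_strings(rows)
    print(table)
    print(indexed)
```

Here the country name Canada is stored once even though two rows use it, which is the saving the shared strings part is designed to capture at workbook scale.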

5. Open XML Formats for Microsoft Office PowerPoint

PresentationML

PresentationML is new to PowerPoint 2007 and is based on the same Open Packaging Conventions

used to structure all Open XML files. As an XML standard for presentation software it offers many

opportunities to users and developers. Developers now have full access to slides and slide notes as

text. Solutions that require searching, indexing and creating presentation content are now possible.

Data-driven presentations can be easily produced using XML. Developers can directly access slide

masters and slide layouts through XML parts to programmatically format existing or new

PowerPoint presentations.

A developer can now take a different approach to assembling or reusing content from PowerPoint

presentations by building an application that uses a catalog of slides stored independently of

existing presentations. Since slides are represented as individual XML parts, a solution could

optimize the way an organization stores and manages PowerPoint slides as data. Developers could

write a slide “viewer” that allows a user to discover and select slides to build a presentation from

outside of PowerPoint. The application could even be Web-based to allow centralized management.

Design Goals for PresentationML

PresentationML has been designed around slide re-use scenarios. It's common practice to reuse

slides from multiple presentations when creating a new presentation. PresentationML makes it easy

to access a "slide library" so you can pull a slide or slides and its resources out of the larger

presentation without losing information.

Core Architecture and PresentationML Parts

The PresentationML file format can be broken down into the following subjects:

Presentation

Slides

Slide Content

Animation

In PresentationML, the file is broken into a collection of parts so the information is in a versatile

format. The centerpiece, or root node, of any presentation is the ‘presentation.xml’ file, which

contains a list of all the slides in the presentation, as well as their sizes and some additional

presentation properties.

The presentation part contains information about the presentation itself and contains the following

structural information:

Slide lists ( e.g., slides, masters, IDs, custom shows, etc. ). While the contents for the various slides are

stored in separate parts, the actual ordering information for the slides is stored in the presentation

part.

Slide sizes (note that this applies to all slides).
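
As a sketch of how this structural information might be read, the following Python extracts the slide ordering and slide size from a presentation part. The sample XML is a trimmed, illustrative fragment rather than a complete presentation.xml, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Trimmed, illustrative presentation part with two slides.
SAMPLE_PRESENTATION = """<p:presentation
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:sldIdLst>
    <p:sldId id="256" r:id="rId2"/>
    <p:sldId id="257" r:id="rId3"/>
  </p:sldIdLst>
  <p:sldSz cx="9144000" cy="6858000"/>
</p:presentation>"""


def read_slide_list(presentation_xml):
    """Return ([(slide_id, relationship_id), ...], (width, height) or None).

    Slides are returned in presentation order; each relationship id must
    still be resolved through the presentation's relationships part to
    find the slide part itself.
    """
    root = ET.fromstring(presentation_xml)
    rid_attr = f"{{{NS['r']}}}id"
    slides = [
        (slide.get("id"), slide.get(rid_attr))
        for slide in root.findall("p:sldIdLst/p:sldId", NS)
    ]
    size = root.find("p:sldSz", NS)
    dims = (int(size.get("cx")), int(size.get("cy"))) if size is not None else None
    return slides, dims


if __name__ == "__main__":
    print(read_slide_list(SAMPLE_PRESENTATION))
```

Reordering slides therefore means reordering entries in this list; the slide parts themselves do not need to change.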

In addition to the structural information, the presentation part also contains the following

properties:

Text Properties ( e.g., embedded font list, Kinsoku settings, etc. )

Save Properties ( e.g., flags for embedding fonts, compressing pictures, etc. )

Editor Properties ( e.g., flags for using Right-to-Left mode, etc. )

Content Properties ( e.g., first slide number for footers, etc. )

The diagram below shows the basic parts that make up a PresentationML file, and how they are

related to each other (the "Presentation" part in the middle is the start part):

Figure 14: Presentation Parts

PowerPoint File Structure

The PresentationML file structure is similar to that of Excel and Word in that it contains document

parts and relationship parts contained in a ZIP package. The main difference is parts that are specific

to presentations, such as notes, slides, etc.

notesMasters1.xml

The following shows some of the parts and files that could be found in a PresentationML package and an example of a notes master schema.

Figure 16: Presentation Note Master Part and File

slideLayout.xml

The following shows some of the files that could be found in a slide layout in a PresentationML package and an example of a slide layout schema.

Figure 17: Presentation Slide Layout Part and Files

6. Executive Summary

Open XML Formats Specifications

The Open XML Formats specifications are written by the standards organization Ecma International.

The Office Open XML format documentation is available in PDF format in five separate parts. The

five parts are:

Part 1 - Fundamentals

Part 2 - Open Packaging Conventions

Part 3 - Primer

Part 4 - Markup Language Reference

Part 5 - Markup Compatibility and Extensibility

The Open XML Formats specifications are freely available under a royalty-free license located on the

Ecma website:

http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50_final_draft.htm

New XML Formats

The new XML Formats for the 2007 Office system include:

WordProcessingML for Office Word 2007

SpreadsheetML for Office Excel 2007

PresentationML for Office PowerPoint 2007

New File Extensions

The new file extensions for the 2007 Office system document types include:

Default macro-free files end with ‘x’.

Macro-enabled files end with ‘m’.

Excel Binary Workbooks end with ‘b’.

File Structure

The file structure is summed up as follows:

ZIP Package (container) with compression.

Document parts that define the document.

Non-XML document parts, such as binary images or OLE objects.

Relationship parts that define the file structure.

Subdirectories that help structure the document files.

Appendix A. XML Tags, Name Lengths and Performance

Does tag size matter? [1]

Microsoft has taken great steps to improve performance when opening and saving Open XML files. The move to these formats as the 2007 Office system default meant looking seriously at how formats are constructed so they would open and save efficiently. Looking at large, complex spreadsheet files with a lot of XML to parse, Microsoft recognized the need to optimize the formats to make them faster. One of the issues is tag name length, which can affect memory use, parsing times, and compression times. Adopting shorter tag names was one of the first obvious ways to accomplish this, such as using "<c>" instead of "<table-cell>".

Impact of long tag names vs. short tag names

Over the years we've seen that simply using shorter tag names can significantly improve

performance depending on the type of file. For an application like Excel, there’s a potential for

millions of XML tags to represent a complex spreadsheet. Some of the issues include:

Compression

Since compression (ZIP) technology is used, there isn't much of a difference in file size, since a long tag name and a short tag name will typically compress to about the same size. This means that time spent hitting the hard drive or transmitting over the wire will be about equal. When you do the actual compression, however, if the tag names are longer, there are a lot more bits to read through to execute the compression. These bits may be in memory or on disk, but either way they still need to be compressed. The same goes for decompression; the system will generate a lot more bits if the tag names are longer, even if the final compressed bits are significantly smaller.
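
A quick experiment can make this concrete. The sketch below generates the same synthetic cell data with a short and a long tag name and compares raw and DEFLATE-compressed sizes using Python's zlib module. The data is artificial and highly repetitive, so the numbers only illustrate the trend; they are not measurements of real workbooks.

```python
import zlib


def make_rows(tag, count=2000):
    """Build count repeated cell elements using the given tag name."""
    cells = (f"<{tag}>{i % 10}</{tag}>" for i in range(count))
    return "".join(cells).encode("utf-8")


def sizes(tag):
    """Return (raw_bytes, compressed_bytes) for synthetic data using tag."""
    raw = make_rows(tag)
    return len(raw), len(zlib.compress(raw))


if __name__ == "__main__":
    for tag in ("c", "table-cell"):
        raw, compressed = sizes(tag)
        print(f"{tag:>10}: raw={raw} compressed={compressed}")
```

The raw sizes differ by the extra tag characters on every cell, while the compressed sizes end up much closer together; the compressor still has to read every raw byte to get there.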

Parsing

Parsing can be done using a SAX2 parser to parse XML, and a Trie3 lookup, which, even though it is

memory intensive, is super fast. A hash can also be used. When using the hash, the full tag name

needs to be stored for a final comparison, because there is no known bound set of element values

coming in. Not only do the new formats allow for full extensibility, but they also allow for the fact

that people might make a mistake when generating the files. Microsoft needs to be able to catch

those errors. For those familiar with hashing, you know that unless you are guaranteed a perfect

1 http://blogs.msdn.com/brian_jones

2 SAX is the Simple API for XML, see SourceForge at http://www.saxproject.org/

3 prefix tree, an ordered tree data structure used to store an associative array where the keys are strings.

hash, you also need to have a second straight string compare to ensure it was a proper match. So,

for both memory and processing time, tag length has a direct impact. The time taken for a trie lookup

is directly proportional to the tag length. For a hash, it depends on how the hashing and verification are

done.
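
The two lookup strategies can be sketched as follows. This is an illustrative sketch only, not the Office implementation: the class names, the tag IDs, and the deliberately weak additive hash are all assumptions chosen to make the behavior visible. The trie walks one node per character, so its cost grows with tag length; the hash table stores the full tag name and performs a final string compare, because an unknown or misspelled tag can land in the same bucket as a valid one.

```python
from typing import List, Optional, Tuple

_END = ""  # sentinel key; cannot collide with a one-character key


class TagTrie:
    """Prefix tree mapping tag names to integer IDs."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, tag: str, tag_id: int) -> None:
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node[_END] = tag_id

    def lookup(self, tag: str) -> Tuple[Optional[int], int]:
        """Return (tag ID or None, number of character steps taken)."""
        node = self.root
        steps = 0
        for ch in tag:
            steps += 1
            node = node.get(ch)
            if node is None:
                return None, steps
        return node.get(_END), steps


def weak_hash(tag: str, buckets: int) -> int:
    """Deliberately imperfect hash: anagrams always collide."""
    return sum(ord(ch) for ch in tag) % buckets


class HashedTags:
    """Hash table that keeps the full tag name for a verifying compare."""

    def __init__(self, buckets: int = 64) -> None:
        self.buckets: List[List[Tuple[str, int]]] = [[] for _ in range(buckets)]

    def add(self, tag: str, tag_id: int) -> None:
        self.buckets[weak_hash(tag, len(self.buckets))].append((tag, tag_id))

    def lookup(self, tag: str) -> Optional[int]:
        for stored, tag_id in self.buckets[weak_hash(tag, len(self.buckets))]:
            if stored == tag:  # second, straight string compare
                return tag_id
        return None


KNOWN_TAGS = {"sheetData": 0, "row": 1, "c": 2, "v": 3}  # hypothetical IDs

trie = TagTrie()
hashed = HashedTags()
for name, ident in KNOWN_TAGS.items():
    trie.add(name, ident)
    hashed.add(name, ident)

print(trie.lookup("c"))          # one step for a one-character tag
print(trie.lookup("sheetData"))  # nine steps for a nine-character tag
print(hashed.lookup("wor"))      # same bucket as "row", rejected by the compare
```

Without the stored name and the final compare, the malformed tag "wor" would be accepted as "row", which is exactly the kind of generation mistake the text says must be caught.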

Streamed Decompression and Parsing

As the XML part is being decompressed, Microsoft streams it to the SAX parser. The parser is connected directly

to the part IStream (footnote 4), which then decompresses on demand. On the compression side, most parts are

written as a whole; however, there are some parts where that isn't the case. For those, we keep a single

deflate stream and flush the compressed data when the "current" part being written changes.
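
The read side of this pipeline can be approximated with standard library tools. The sketch below is an assumption-laden illustration, not Microsoft's code: Python's zipfile stands in for the package reader, the file-like object returned by ZipFile.open plays the role of the part IStream (it decompresses as bytes are requested), and xml.sax stands in for the SAX parser. The part name xl/worksheets/sheet1.xml and the TagCounter handler are illustrative, and the XML is the bare fragment from this appendix rather than a complete worksheet part.

```python
import io
import xml.sax
import zipfile

# The short-tag fragment from this appendix (not a complete worksheet part).
SHEET_XML = (
    b'<sheetData><row r="1" spans="1:3"><c r="A1"><v>1</v></c>'
    b'<c r="B1"><v>2</v></c><c r="C1"><v>3</v></c></row>'
    b'<row r="2" spans="1:3"><c r="A2"><v>4</v></c>'
    b'<c r="B2"><v>5</v></c><c r="C2"><v>6</v></c></row></sheetData>'
)


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags as the parser streams through the part."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: dict = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def build_package() -> io.BytesIO:
    """Create an in-memory ZIP package holding one deflated XML part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("xl/worksheets/sheet1.xml", SHEET_XML)
    buffer.seek(0)
    return buffer


def count_tags(package_file, part_name: str) -> dict:
    """Feed the part to the SAX parser while it is being decompressed."""
    handler = TagCounter()
    with zipfile.ZipFile(package_file) as package:
        # open() returns a stream that inflates data on demand as it is read.
        with package.open(part_name) as part_stream:
            xml.sax.parse(part_stream, handler)
    return handler.counts


counts = count_tags(build_package(), "xl/worksheets/sheet1.xml")
print(counts)
```

Because the parser pulls from the decompressing stream, the full uncompressed part never needs to be held in memory at once, which matters when a sheet contains millions of tags.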

Typically, parsing isn't something you would think of as a major part of XML file load

time. This is not true of office document formats, and especially spreadsheet documents.

With the latest SpreadsheetML design, we've seen that the XML parsing alone (not including the

parsing of numbers, refs, or formulas) can often range from 10-40% of the entire file load. That's just

the time it takes to read each tag and each attribute. This isn’t surprising since the internal memory

structures for a spreadsheet application should be fairly similar to the shapes used in the format

design. A big piece is reading the XML and interpreting the tags.

SpreadsheetML Example

SpreadsheetML was designed so that very short tag names could be used for any tag or attribute

that appears frequently. Elements that may only appear once in a file often have longer tag names,

since their size doesn't have nearly the same impact. Microsoft has established naming conventions

for abbreviations shared across all three formats. Currently, the most frequently used tag names are no

more than a couple of characters in length. Imagine if longer, more descriptive names were used so that each

tag was five times longer. Consider the following small, simple table:

1 2 3

4 5 6

Short tag example: Open XML - Excel 2007 SpreadsheetML

4 Compression http://foam.sourceforge.net/doc/Doxygen/html/dd/d46/classFoam_1_1Istream-members.html

<sheetData><row r="1" spans="1:3"><c r="A1"><v>1</v></c><c r="B1"><v>2</v></c><c r="C1"><v>3</v></c></row><row r="2" spans="1:3"><c r="A2"><v>4</v></c><c r="B2"><v>5</v></c><c r="C2"><v>6</v></c></row></sheetData>

Long tag example: OpenDocument (ODF) - OpenOffice Calc 2.0.2

<table:table table:name="Sheet1" table:style-name="ta1" table:print="false"><table:table-column table:style-name="co1" table:number-columns-repeated="3" table:default-cell-style-name="Default"/><table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="1"><text:p>1</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="2"><text:p>2</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="3"><text:p>3</text:p></table:table-cell></table:table-row><table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="4"><text:p>4</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="5"><text:p>5</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="6"><text:p>6</text:p></table:table-cell></table:table-row></table:table>

Tag length impact

Most developers agree that it's important to keep tag names short, especially in structures that are

highly repetitive. In SpreadsheetML, many element names are actually fairly long and descriptive,

but only where they appear in a few places and so won't be much of a burden. Any element that can

occur with high frequency is kept to a minimum length.

Imagine the file mentioned earlier had 7 million elements and 10 million attributes. If on average

each attribute and element name is about 2 characters long, then there are 34 megabytes of data to parse

(which is a ton), just in element names and attribute names. If instead, the average length of an attribute

or element name were around 10 characters, the data to parse increases to about 170 megabytes. That is

a very significant difference.
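
The arithmetic behind these figures is simple enough to write down. The sketch below assumes one byte per character, decimal megabytes, and that each element and attribute name is counted once, as the text does; closing tags, which repeat the element name, would push the real totals higher. The function name is illustrative.

```python
ELEMENTS = 7_000_000
ATTRIBUTES = 10_000_000
BYTES_PER_MEGABYTE = 1_000_000  # decimal megabytes, one byte per character


def name_bytes(avg_name_length: int) -> int:
    """Bytes spent on element and attribute names at a given average length."""
    return (ELEMENTS + ATTRIBUTES) * avg_name_length


short_total = name_bytes(2)
long_total = name_bytes(10)

print(short_total / BYTES_PER_MEGABYTE)  # 34.0 megabytes
print(long_total / BYTES_PER_MEGABYTE)   # 170.0 megabytes
print(long_total // short_total)         # 5 times as much data to parse
```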

Optimize Based on Your Design Goals

Remember, this format is supposed to be used by everyone, not just developers. Most users

would not be happy with feature loss and performance degradation just so they can save their file as

XML, especially since the average user doesn't really care about XML. That's why the Ecma

standardization is so important: it ensures that everything is fully documented so that developers

can use the formats effectively and efficiently.

For more information

http://www.microsoft.com/office/preview

Introducing the Microsoft Office (2007) Open XML File Formats

Office Open XML Document Interchange Specification

Open Packaging Conventions

Setting Word Document Properties the Office 2007 Way

http://msdn.microsoft.com/office/tool/xml/2007/

http://openxmldeveloper.org/forums/thread/638.aspx

http://msdn2.microsoft.com/en-us/library/ms771890.aspx

Brian Jones’s Blog:

http://blogs.msdn.com/brian_jones

http://msdn2.microsoft.com/en-us/library/ms771890.aspx#office2007wordfileformat_understandingthedatastore

Microsoft Windows Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime

Components

XML Schemas (XSD) Starter Kit

What is XML Schema (XSD)?

Frequently Asked Questions About Schemas

Comparing Schema Languages

Version and Conformance

Referencing XSD Schemas in Documents

XML Schemas (XSD) Reference

XML Schemas (XSD) Concepts

Table of Contents – Module 2

Goal & Objectives ................................................................................................................................... 1

Key Concepts .......................................................................................................................................... 1

Open Packaging Conventions ................................................................................................... 1

Package ........................................................................................................................................ 1

Parts ............................................................................................................................................. 1

Relationships............................................................................................................................... 1

XML Schema Definition Tool (XSD) .......................................................................................... 2

XML Paper Specification (XPS) ................................................................................................. 2

XML Path Language (XPath)...................................................................................................... 2

1. Open XML Formats Experiences for End Users and Administrators ........................................ 2

The End-User Experience with the Open XML Formats ....................................................... 2

For Administrators: New File Extensions & Macro Management ....................................... 2

Macro-Enabled Files vs. Macro-free Files ............................................................................... 4

2. Ecma Office Open XML Formats File Structure ............................................................................ 5

Ecma Office Open XML Formats .............................................................................................. 5

Ecma International ..................................................................................................................... 5

Open XML Formats Specifications ........................................................................................... 5

Alternative Representations of Final Draft ............................................................................. 6

File Structure .............................................................................................................................. 6

ZIP Package ................................................................................................................................. 7

Parts ............................................................................................................................................. 8

Relationship Parts ...................................................................................................................... 9

3. Open XML Formats for Microsoft Office Word .......................................................................... 16

WordProcessingML .................................................................................................................. 16

Design Goals for WordProcessingML .................................................................................... 16

Core Architecture and WordProcessingML Parts ................................................................ 17

4. Open XML Formats for Microsoft Office Excel ............................................................................ 23

SpreadsheetML ........................................................................................................................ 23

Design Goals for SpreadsheetML ........................................................................................... 23

Core Architecture and SpreadsheetML Parts ....................................................................... 24

5. Open XML Formats for Microsoft Office PowerPoint ................................................................. 28

PresentationML ........................................................................................................................ 28

Design Goals for PresentationML .......................................................................................... 28

Core Architecture and PresentationML Parts ...................................................................... 28

6. Executive Summary ......................................................................................................................... 32

Open XML Formats Specifications ......................................................................................... 32

File Structure ............................................................................................................................ 32

Appendix A. XML Tags, Name Lengths and Performance ............................................................. 34

Does tag size matter? .............................................................................................................. 34

Impact of long tag names vs. short tag names .................................................................... 34

Tag length impact ..................................................................................................................... 36

Optimize Based on Your Design Goals .................................................................................. 36

For more information .......................................................................................................................... 37

