Sample Office 12 XML Document

Brian Jones has posted an example Word document in the new XML format. Actually, he released a ZIP file (http://jonesxml.com/resources/BasicDocument.zip) containing the same document in three formats. Let’s look at the three files and their contents.

Basic Document.doc

Basic Document.doc is the familiar legacy binary. We can see the text, formatting, hyperlink, and embedded image. This should be well within your comfort zone. You double-click; it opens in Word. If you open it in a text editor like Notepad, you get the binary “gibberish.”

Basic Document.xml

Basic Document.xml is the same document in WordML (generated from Word 2003). Again, it’s pretty typical. We have the familiar WordML schema with all the relevant tags and the encoded binary (the picture). You double-click; it opens in Word (eventually; it probably has to go through IE first). If you open it in Notepad, you get a legible XML document, as you would expect.

Basic Document.docx

Basic Document.docx is obviously the new Office 12 XML format (the .docx is a dead giveaway). Opening this file in Notepad returns binary “gibberish,” because the new Office 12 XML format is actually ZIPped XML. Open Basic Document.docx in WinZip. (You might need to rename it Basic Document.zip.) You should see one XML file, [Content_Types].xml, and three folders: “_rels”, “docProps”, and “word”.
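
If you’d rather not rename the file and open it in WinZip, a few lines of Python can list what’s inside the container. This is only a sketch: it assumes nothing about the package beyond it being an ordinary ZIP file, and the function names are my own.

```python
import zipfile


def list_parts(path_or_file):
    """Return the names of every entry stored in the ZIP container."""
    with zipfile.ZipFile(path_or_file) as package:
        return sorted(package.namelist())


def top_level_entries(names):
    """Collapse full entry names down to their first path component."""
    return sorted({name.split("/", 1)[0] for name in names})
```

Calling top_level_entries(list_parts("Basic Document.docx")) on the sample should give you the same top-level view WinZip does: [Content_Types].xml plus the three folders.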

[Content_Types].xml

Office uses the same ZIP conventions as the new Microsoft Metro spec. [Content_Types].xml is kind of a “table of contents” for the ZIP file. It allows you to quickly see what’s in the file.

_rels Folder

The _rels folders tell you how the different document parts relate to one another. (I’m going to leave the more technical explanation to Brian.)

docProps

docProps contains two XML files, App.xml and Core.xml, both of which contain what we consider to be metadata. App.xml contains document properties such as the attached template and the number of pages, words, and characters in the document. This is the equivalent of the “Statistics” tab on the Document Properties screen (File | Properties).

Core.xml contains more of the document properties such as document title, subject, author, keywords, last editor, number of revisions, date created and date modified. These are basically the same fields as the “Summary” tab on the current Document Properties screen.

That certainly makes it much easier to see what metadata is in the document and, presumably, to remove it, too.
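
To show how easy it becomes to see the metadata, here is a rough Python sketch that pulls a part such as docProps/Core.xml out of the ZIP and dumps its values. I’m not assuming any particular element names or namespaces (those could easily change before release); the code simply collects every element that holds text. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_properties(path_or_file, part_name):
    """Return {element name: text} for every text-bearing leaf in an XML part."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read(part_name))
    properties = {}
    for element in root.iter():
        # Only leaf elements with real text are treated as property values.
        if len(element) == 0 and element.text and element.text.strip():
            # ElementTree spells namespaced tags as "{namespace}name".
            local_name = element.tag.rsplit("}", 1)[-1]
            properties[local_name] = element.text.strip()
    return properties
```

Something like read_properties("Basic Document.docx", "docProps/Core.xml") would then give you a dictionary to inspect. Note that repeated element names overwrite one another, which is fine for a quick look but not for anything serious.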

word

The word folder has all the good stuff. documentProperties.xml holds a lot of the stuff from Tools | Options that gets saved with the document, such as the view of the document (Normal view, Print Layout view, etc.), the zoom setting, and the compatibility options.

I presume fontTable.xml is the font substitution table.

styles.xml contains all of the styles and style definitions in the document. I’m guessing it’s really the styles used or defined in the document. This document defines only the Normal, Heading1, DefaultParagraphFont, and Hyperlink styles.

wordDocument.xml is the document itself. It has all the text and basic layout info (paragraphs with their associated styles and direct formatting, where the picture goes and its properties, etc.).

media

This folder contains the embedded image as a standard JPEG file.

What it all means

  • File Size - Because everything is zipped, the files will be smaller than either the legacy binary format or the current WordML document.
  • Document Corruption - Since the ZIP spec includes CRC error detection, it should be immediately apparent if there’s corruption at the document “container” level. Having the rest of the document as plain XML files should make it much easier to detect and fix corrupt elements and to manipulate the document in general.
  • Speed - The smaller file size and the better document corruption handling should make document production much faster.
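
The file size and corruption points are easy to check for yourself. Here is a minimal Python sketch, again assuming only that the package is a plain ZIP file; the function names are mine. The first function relies on the CRC-32 that ZIP stores for every entry, and the second adds up the sizes ZIP records for each entry.

```python
import zipfile


def first_corrupt_part(path_or_file):
    """Return the name of the first entry that fails its CRC-32 check, or None."""
    with zipfile.ZipFile(path_or_file) as package:
        # testzip() reads every entry and verifies it against the stored CRC.
        return package.testzip()


def size_summary(path_or_file):
    """Return (total uncompressed bytes, total compressed bytes) for all entries."""
    with zipfile.ZipFile(path_or_file) as package:
        entries = package.infolist()
    uncompressed = sum(entry.file_size for entry in entries)
    compressed = sum(entry.compress_size for entry in entries)
    return uncompressed, compressed
```

A None result from first_corrupt_part only tells you the container is intact; it says nothing about whether the XML inside is well-formed or valid. Also keep in mind that an embedded JPEG is already compressed, so most of the savings will come from the XML parts.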

Of course, a lot can change between now and the actual release. We’ll have to wait for the first beta to be available, or for Brian to release more sample files, to see what else is really going on.
