Recently in XML Category

Microsoft has now released documentation for the Office binary formats (.doc, .xls, .ppt), in addition to kicking off the project for an open source binary to Open XML converter (.doc to .docx). They threw in WMF for good measure.

The translator is great for anyone going to Open XML or those who want to work on XML-based documents.  It’s also cool to see what’s actually in the binary file format.

A Minor OpenXML Automation Epiphany

I’m a little late on this one, but I realized it’s a little harder to create OpenXML documents (like Word 2007) than I thought. While it’s still really easy to parse the actual XML code, you then have to ZIP it all with the right “stories” etc. Yuck!
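
To show what the ZIP step involves, here is a minimal sketch in Python that packages one paragraph of text as a .docx. It assumes the namespaces and content types from the published Open XML spec and the three-part layout ([Content_Types].xml, _rels/.rels, word/document.xml). The function name build_docx is made up for this example, and a real document would carry more parts (styles, properties, and so on).

```python
import zipfile
from xml.sax.saxutils import escape

# Assumption: these strings follow the published Open XML package conventions.
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

ROOT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>{text}</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def build_docx(text, target):
    """Write a one-paragraph document to `target` (a path or a file-like object)."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", ROOT_RELS)
        # Escape &, < and > so the text cannot break the XML.
        z.writestr("word/document.xml", DOCUMENT.format(text=escape(text)))
```

Calling build_docx("Hello", "hello.docx") should produce a file Word can open, as long as the assumptions above hold for your version.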

Customizing Office 2007 UI

I missed this awhile back I think (it was actually about one year ago). Jensen Harris has a great post on customizing the Ribbon.

OpenXML Community Growing

  • Older versions of Office - As you all know, folks who have older versions of Office can download a free update that allows them to read and write the open xml formats. While the downloads have only been available for about 6 months, they are already the 2nd most popular download on Microsoft.com (second only to IE 7). There are well over 4 million downloads to date.
    ET: These work pretty well. They don’t work with some DMSes currently. Both Hummingbird and OpenText are working on versions compatible with them.
  • OpenOffice - Thanks to Novell, you can read and write the OpenXML formats with OpenOffice. The Sun folks are also involved as they move from the XSLT approach to a more native support.
    ET: OpenOffice is a cool alternative to MS Office, but it’s still not as polished. Of course, free can go a long way to make up for that.
  • WordPerfect - Corel has announced support for OpenXML in an upcoming release of their office suite.
  • Palm OS - Documents To Go brings OpenXML support to smartphone and PDA devices powered by the Palm operating system.
  • Mac - NeoOffice brings OpenXML support to the Mac.
    ET: As great as NeoOffice is in many ways, it is slow and not as pretty as other Mac apps.
  • MindMapping - Mindjet’s MindManager allows you to follow the logical workflow of first brainstorming, then creating a document outline, and then writing your document. You can brainstorm your ideas in MindManager, and then convert those into a WordprocessingML document.
  • OpenXML Writer - The folks up at OpenXML.biz have built a free open source text editor called “OpenXML Writer” that allows you to edit WordprocessingML files.
  • Gnumeric - Gnumeric is an open source spreadsheet application that was one of the first applications out there to show support for SpreadsheetML.
  • Web Development (PHP) - There is an open source project up on codeplex where they are creating a set of PHP classes which allow you to read and write SpreadsheetML files.
  • Java Developers - There is a project up on sourceforge where they are creating a set of Java APIs to make programming against the openxml formats much easier for Java developers.
  • Data Reporting - In Monarch V.9.0 from Datawatch you have the ability to create reports of your data using SpreadsheetML.
  • Word and Character Counting on Mac - Word Counter 2.2.1 is an application for Mac OS X, and it supports a variety of file formats, including WordprocessingML.
  • Convert docx to simple html - The docx converter allows you to transform WordprocessingML documents into either plain text or simple html directly from their website.

From BrianJones

Word Document Structure Explained

By going the all-XML approach, Microsoft has made it much easier to understand the structure of Word documents and how the various “stories” and elements interact.  Brian Jones has an excellent post about the basics of the Word document structure and then goes further by showing how that structure translates into the Open XML format.  Things are so much easier now, assuming good document production processes and good document formatting.  It also looks like it’ll be easier to “fix” malformed documents.  This will hopefully lead to a number of quality tools that are less expensive than solutions from Microsystems or Levit & James.

Yes, this is really old news, but here it is anyway.

Once Office 2007 is released, Corel will add support for Open XML to the next version of WordPerfect.

From Brian Jones: Open XML Formats

XML in SQL Server 2005

I was looking more at the new XML format in Office 12 and what you can do with it. Web services can be huge and do a lot of cool document automation and generation. I came across this article at MSDN.

I’m not a database or SQL guy, so my apologies if I butcher this. Here’s what I gleaned from what’s new in SQL Server 2005 as it relates to XML.

* There’s a new native datatype: XML
* XML values are stored in an internal format as large binary objects (BLOB) in order to support the XML model characteristics more faithfully such as document order and recursive structures.
* SQL Server 2005 provides XML schema collections as a way to manage W3C XML Schemas as metadata
* XML instances can be retrieved using the T-SQL SELECT statement. Five built-in methods on the XML data type are provided to query and modify XML instances.
* The XML data type methods accept XQuery, which includes the navigational language XPath 2.0.
* A mechanism for indexing XML columns is provided to speed up queries. (Query execution processes each XML instance at runtime; this becomes expensive whenever the XML value is large in size or the query is evaluated on a large number of rows in a table. )
* XML schema information is used in storage and query optimizations.
* Users can store both relational and XML data within the same database; the database engine knows how to honor the XML data model in addition to the relational data model.
* The existing FOR XML functionality has been enhanced in several ways. OPENXML (the T-SQL rowset function, not the Office 12 file format) has also been enhanced: sp_xml_preparedocument now accepts the XML data type, and the rowset can generate XML and new SQL type columns.
* Data is validated (against an XML schema) during insertion and modification according to the target namespace of each top-level element.
* During query compilation, XML schemas are used for type checking and static errors are issued for type mismatch. The query compiler also uses XML schemas for query optimizations.
* Almost all of the W3C XML Schema 1.0 specification is supported.
* You can create logical XML Views of your relational data using SQLXML mapping technology. An XML View, also referred to as a “mapping” or an “annotated schema,” is created by adding special annotations to a given XSD schema.
* Once you’ve created an XML View of your database, you can query that view as if it were an actual XML document using the XPath query language.
* Two new ways to access SQLXML functionality have been added:
** SQLXML Managed Classes
** SQLXML Web Services
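
To make the “query XML with XPath” and “shred XML into a rowset” ideas a bit more concrete, here is a small analogy in Python. This is not SQL Server code: it uses ElementTree from the Python standard library, which supports only a limited XPath subset, and the sample XML and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this sketch.
SAMPLE = """<matters>
  <matter id="1001"><client>Acme</client><doc type="brief">Opening brief</doc></matter>
  <matter id="1002"><client>Globex</client><doc type="memo">Research memo</doc></matter>
</matters>"""


def shred(xml_text):
    """Flatten each <matter> into a (id, client, doc) row, roughly what a rowset view does."""
    root = ET.fromstring(xml_text)
    rows = []
    for matter in root.findall("./matter"):
        rows.append((matter.get("id"), matter.findtext("client"), matter.findtext("doc")))
    return rows


def docs_of_type(xml_text, doc_type):
    """Select document titles with an XPath-style attribute predicate."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.findall(f".//doc[@type='{doc_type}']")]
```

In SQL Server itself the equivalent work would be done with the XML data type methods or OPENXML; the point here is only the shape of the operations.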

From the Conclusion:
* Server-side features include a native implementation for XML storage, indexing, and query processing.
* Existing features such as FOR XML and OPENXML have also been enhanced.
* Client-side support consists of enhancements to ADO.NET to support the XML data type. The SQLXML mapping technology Web release enhancements have now been incorporated into SQL Server 2005.
* The XML data type provides a simple mechanism of storing XML data by inserting XML data into an untyped XML column.
* XML data type preserves document order and is useful for applications such as document management.
** It can also handle recursive XML schemas.

WordML, CALS, and Tables

John Durant posts about WordProcessingML and support for CALS.

I’ll leave the real discussion to John. Here are some auxiliary links, though.

CALS stands for Computer-aided Acquisition and Logistic Support (from Tim Berners-Lee’s proposal for the “Global Hypertext Project” in 1989, quoted on XML.com).

CALS definition from a post by Betty Harvey at XML.com:

The CALS initiative was a DoD initiative. The CALS standards were developed by tri-service working groups which included the Army, Navy and Air Force. The CALS table model was a result [of] the military standard of MIL-PRF-28001C. CALS table model was adopted by DOCBOOK, as well as TEI, ISO 12074, AECMA, ATA, etc. because it was the only comprehensive table model available universally.

Creating Word Documents Using Movable Type

Anil Dash wrote an article for SixApart last year about using Movable Type to create a Microsoft Word document via WordML (download the schema from Microsoft).

It looks like the template on the SixApart website went missing. You can download the wordml.xml file it references here.

Ed Bott has a post about XML support in Office 2003 vs. Office 12. Anil posted a comment about Movable Type’s ability to create Word documents.

Sample Office 12 XML Document

Brian Jones has posted an example Word document in the new XML format. Actually, he released a ZIP file with three formats of the same document. You can download the ZIP file here: http://jonesxml.com/resources/BasicDocument.zip Let’s look at the three files and their contents.

Basic Document.doc

Basic Document.doc is the familiar legacy binary. We can see the text, formatting, hyperlink, and embedded image. This should be well in your comfort zone. You double-click; it opens in Word. If you open it in a text editor like notepad, you get the binary “gibberish.”

Basic Document.xml

The Basic Document.xml is the same document in WordML (generated from Word 2003). Again, it’s pretty typical. We have the familiar WordML schema with all the relevant tags and the encoded binary (the picture). You double-click; it opens in Word (eventually; it probably has to go through IE first). If you open it in notepad, you get a legible XML document as you would expect.

Basic Document.docx

Basic Document.docx is obviously the new Office 12 XML format (the .docx is a dead giveaway). Opening this file in notepad returns binary “gibberish.” The new Office 12 XML format is actually ZIPped XML. Open Basic Document.docx in WinZip. (You might need to rename it Basic Document.zip). You should see one XML file [Content_Types].xml, and three folders: “_rels”, “docProps”, and “word”.
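
If you’d rather not rename the file and fire up WinZip, the same peek inside can be sketched with the zipfile module in Python’s standard library. The function names below are made up for this example; the part names you see will depend on the file you open.

```python
import zipfile


def list_parts(source):
    """Return every entry name in the package (source is a path or file-like object)."""
    with zipfile.ZipFile(source) as z:
        return z.namelist()


def top_level_entries(source):
    """Return the distinct top-level files and folders, sorted."""
    names = list_parts(source)
    return sorted({name.split("/", 1)[0] for name in names})
```

For example, top_level_entries("Basic Document.docx") should return the one XML file and the three folders described above, assuming the sample is laid out as described.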

[Content_Types].xml

Office uses the same ZIP conventions as the new Microsoft Metro Spec. [Content_Types].xml is kind of a “table of contents” for the ZIP file. It allows you to quickly see what kinds of content are in the file.
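
Here is a rough Python sketch for reading that file. It assumes the layout from the published spec, where Default elements map a file extension to a content type and Override elements map a specific part name to one. Element names are matched by local name only, so a pre-release namespace should not matter; the helper names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_content_types(source):
    """Return (defaults, overrides) dictionaries from [Content_Types].xml."""
    with zipfile.ZipFile(source) as z:
        root = ET.fromstring(z.read("[Content_Types].xml"))
    defaults, overrides = {}, {}
    for el in root:
        if local(el.tag) == "Default":
            defaults[el.get("Extension")] = el.get("ContentType")
        elif local(el.tag) == "Override":
            overrides[el.get("PartName")] = el.get("ContentType")
    return defaults, overrides
```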

_rels Folder

The _rels folder tells you what the relationships are between the different document parts. (I’m going to leave the more technical explanation to Brian.)
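
For the curious, here is a small Python sketch that lists the package-level relationships. It assumes the root relationships part is named _rels/.rels and that each Relationship element carries Id, Type, and Target attributes, as in the published spec; the function names are invented here.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_relationships(source, rels_part="_rels/.rels"):
    """Return a list of (Id, Type, Target) tuples from one relationships part."""
    with zipfile.ZipFile(source) as z:
        root = ET.fromstring(z.read(rels_part))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root
        if local(rel.tag) == "Relationship"
    ]
```

Pass a different rels_part to look at the relationships of an individual part instead of the whole package.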

docProps

docProps contains two XML files: App.xml and Core.xml, both of which contain what we consider to be metadata. App.xml contains the document properties such as attached template, number of pages, words, and characters in the document, etc. This is the equivalent of the “Statistics” tab on the Document Properties screen. (File | Properties)

Core.xml contains more of the document properties such as document title, subject, author, keywords, last editor, number of revisions, date created and date modified. These are basically the same fields as the “Summary” tab on the current Document Properties screen.

That certainly makes it much easier to see what metadata is in the document and, presumably, to remove it, too.
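
As a quick illustration of how easy the “see what’s there” part becomes, here is a Python sketch that dumps the simple properties from every XML file under docProps. It only reads; it does not remove anything. The function names are invented, the folder match is case-insensitive because file name casing may vary, and properties with nested children are skipped to keep the output flat.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_doc_props(source):
    """Return {part name: {property name: text}} for each XML part in docProps."""
    props = {}
    with zipfile.ZipFile(source) as z:
        for name in z.namelist():
            lowered = name.lower()
            if lowered.startswith("docprops/") and lowered.endswith(".xml"):
                root = ET.fromstring(z.read(name))
                # Keep only leaf elements; nested structures are skipped.
                props[name] = {
                    local(el.tag): (el.text or "") for el in root if len(el) == 0
                }
    return props
```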

word

The word folder has all the good stuff. documentProperties.xml holds a lot of the stuff from Tools | Options that gets saved with the document, such as the view of the document (Normal view, Print Layout view, etc.), the zoom setting, and the compatibility options.

I presume fontTable.xml is the font substitution table.

styles.xml contains all of the styles and style definitions in the document. I’m guessing it’s really the styles used or defined in the document. This document only has the Normal style, Heading1, DefaultParagraphFont, and Hyperlink styles defined.

wordDocument.xml is the document itself. It has all the text and basic layout info (paragraphs with their associated styles and direct formatting, where the picture goes and its properties, etc.)
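
Pulling the plain text out of the main document part can be sketched in a few lines of Python. This assumes paragraphs are elements with the local name p and text runs live in elements with the local name t, as in WordprocessingML. The part name is a parameter because it differs between this sample (wordDocument.xml) and the name used in the published format (document.xml). The function names are invented, and nested content such as text boxes may show up twice with this simple approach.

```python
import zipfile
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(source, part="word/document.xml"):
    """Return the text of each paragraph in the given document part."""
    with zipfile.ZipFile(source) as z:
        root = ET.fromstring(z.read(part))
    paragraphs = []
    for el in root.iter():
        if local(el.tag) == "p":
            # Join all text runs inside this paragraph.
            text = "".join(t.text or "" for t in el.iter() if local(t.tag) == "t")
            paragraphs.append(text)
    return paragraphs
```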

media

This folder contains the embedded image as a standard jpg.

What it all means

  • File Size - Because everything is zipped, the files will be smaller than either the legacy binary format or the current WordML document.
  • Document Corruption - Since the Zip spec has CRC error detection, it should be immediately apparent if there’s corruption at the document “container” level. Having the rest of the document as plain XML files should make it much easier to detect and fix corrupt elements and manipulate the document in general.
  • Speed - The smaller file size and the better document corruption handling should make document production much faster.
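
The file size and CRC points are easy to check for yourself. Here is a minimal Python sketch using the standard zipfile module: testzip() reads every member and returns the name of the first one whose CRC or header is bad (or None if all are fine), and the size fields come from the ZIP directory. The function name container_report is invented for this example.

```python
import zipfile


def container_report(source):
    """Summarize container health and compression for a ZIP-based document."""
    with zipfile.ZipFile(source) as z:
        first_bad = z.testzip()  # None means every member passed its CRC check
        infos = z.infolist()
        return {
            "first_bad_member": first_bad,
            "compressed_bytes": sum(info.compress_size for info in infos),
            "uncompressed_bytes": sum(info.file_size for info in infos),
        }
```

A file that is damaged badly enough may fail to open at all, in which case zipfile raises BadZipFile before the report is built.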

Of course, a lot can change between now and the actual release. We’ll have to wait for the first beta to be available or Brian to release more sample files to see what else is really going on.