Why not docx?
This gem is an HTML to Word converter. There are several such converters out there, both in Ruby and in other languages, and they all have DOCX as their target. This gem has DOC as its target instead. To explain why, we present what this gem is trying to do, and why DOC is a better fit for it than DOCX.
Converters can be used to convert found HTML pages on the Web into Word documents. As we discuss below, there is a way of doing that with DOCX that is relatively straightforward. But that is not the goal we had in creating this gem. This gem was created, instead, to use HTML as a means of authoring Word documents: we are using it extensively in the Metanorma suite of standard authoring tools, to convert (in the first instance) AsciiDoc markup into both HTML and Word output. The target format is extremely rich: we needed to include footnotes, tables of contents, mathematical formatting, and comments. So we needed far more than found HTML online would cater for.
Our approach in this gem was motivated by Sébastien Sauvage, as explained on http://sebsauvage.net/wiki/doku.php?id=word_document_generation . Sauvage evaluates and rejects several alternatives for authoring Word documents, including authoring binary DOC, and authoring DOCX:
BANNED. I don’t have time to read a 7500 pages specification no-one is capable of implementing - not even Microsoft !
The solution Sauvage did alight on was using Word HTML, the variant of HTML that Microsoft Word saves its Word files as. This has several clear advantages:
- HTML is very familiar to users; certainly much more familiar than OOXML, let alone the Microsoft variants of OOXML. There is not much of a learning curve, in contrast with learning OOXML at the level of detail we needed. (Recall: we didn’t just need italics and bold; we needed close to the full range of formatting Word could support.)
- HTML is very simple to integrate into a DOC document: all you need to do, in fact, as Sauvage documents, is wrap the Word HTML in a MIME package with its images.
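The MIME wrapping can be sketched roughly as follows. This is a minimal illustration, not this gem's actual code: the method name `mime_package`, the `IMAGE_TYPES` table, the boundary string, and the `file:///C:/Doc/...` locations are all assumptions made for the example.

```ruby
# Guess a MIME type from a file extension (small illustrative table only).
IMAGE_TYPES = { ".png" => "image/png", ".gif" => "image/gif",
                ".jpg" => "image/jpeg", ".jpeg" => "image/jpeg" }.freeze

# Wrap Word HTML and its images in a multipart/related MIME package.
# html:   Word HTML source (String)
# images: Hash of file name => binary content (hypothetical input shape)
def mime_package(html, images, boundary: "----=_NextPart_sketch")
  # First part: the HTML document itself. The C:/Doc location is a placeholder.
  parts = [[
    "Content-Location: file:///C:/Doc/document.htm",
    'Content-Type: text/html; charset="utf-8"',
    "",
    html
  ].join("\r\n")]

  # One further part per image, base64-encoded in 76-character lines.
  images.each do |name, data|
    type = IMAGE_TYPES.fetch(File.extname(name).downcase, "application/octet-stream")
    encoded = [data].pack("m0").scan(/.{1,76}/).join("\r\n")
    parts << [
      "Content-Location: file:///C:/Doc/document_files/#{name}",
      "Content-Transfer-Encoding: base64",
      "Content-Type: #{type}",
      "",
      encoded
    ].join("\r\n")
  end

  header = "MIME-Version: 1.0\r\n" \
           "Content-Type: multipart/related; boundary=\"#{boundary}\"\r\n\r\n"
  header + parts.map { |part| "--#{boundary}\r\n#{part}\r\n" }.join + "--#{boundary}--\r\n"
end
```

Under these assumptions, the returned string would be written out with a `.doc` extension, and image references in the HTML would need to point at the matching `document_files/` names.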
There is a downside to the approach: Word HTML is not identical to real HTML, but is rather a variant of HTML 4.0, with enhanced CSS but a very primitive implementation of CSS selectors. As a result, Word HTML cannot cope with the typical HTML 5.0 + CSS 3.0 found online now. It also does not deal well with empty tags, which need to be adjusted before ingest into Word. The documentation of Word HTML is not as readily available as when Sauvage wrote his post; and (when I did track a copy down) it’s not as complete either: we had to work out by trial and error how continuous vs restarted numbering of lists works in Word HTML.
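One way to adjust empty tags is to expand self-closing elements into explicit open/close pairs before handing the markup to Word. The sketch below assumes that this is the adjustment needed; `expand_empty_tags` and `VOID_ELEMENTS` are hypothetical names, and a regular expression is only an approximation of a real parser (it will misread attribute values that contain `/>`).

```ruby
# HTML elements that are legitimately empty and can stay self-closed.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta param source track wbr].freeze

# Rewrite <tag attrs/> as <tag attrs></tag> for every non-void element.
def expand_empty_tags(xhtml)
  xhtml.gsub(%r{<([A-Za-z][\w:.-]*)((?:\s[^<>]*?)?)\s*/>}) do |match|
    name = Regexp.last_match(1)
    attrs = Regexp.last_match(2)
    VOID_ELEMENTS.include?(name.downcase) ? match : "<#{name}#{attrs}></#{name}>"
  end
end

puts expand_empty_tags('<td/><br/><span class="x"/>')
```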
But (again in contrast with OOXML), trial and error is in fact straightforward. If you want to know how Word HTML encodes a particular piece of formatting, all you need to do is create a Word document with that formatting, save it as HTML, and inspect the results. So even without good documentation, the approach is self-documenting. And so long as you use Word’s version of CSS, building on the stylesheet that Word exports to HTML, the document is a real Word document: it has native handling of Normal styles and lists, for example, and will respond to Word style templates.
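Inspecting the results can be partly automated. As a rough sketch, the following lists the Word-specific `mso-*` CSS properties that occur in a saved Word HTML file, with a count for each. The method name `mso_properties` and the sample markup are illustrative assumptions, not part of this gem.

```ruby
# Count each Word-specific (mso-*) CSS property name found in Word HTML.
# Returns a Hash of property name => occurrences, most frequent first.
def mso_properties(word_html)
  word_html
    .scan(/(?<![\w-])(mso-[a-z0-9-]+)\s*:/i) # property names followed by a colon
    .flatten
    .map(&:downcase)
    .tally
    .sort_by { |name, count| [-count, name] }
    .to_h
end

# Illustrative input; in practice this would be the text of a file saved
# from Word via "Save as Web Page".
sample = '<p style="mso-pagination:widow-orphan">x</p>' \
         "<span style='mso-tab-count:1'></span>" \
         '<p style="mso-pagination:none">y</p>'
mso_properties(sample).each { |name, count| puts "#{name}: #{count}" }
```

Comparing the output for a document with and without a given piece of formatting is one quick way to see which properties that formatting introduces.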
There’s only one catch with this approach: it does not generate DOCX but DOC.
DOC is now a legacy format, and DOCX was introduced as far back as 2007. Eventually, DOC support will be withdrawn, at least from Microsoft Office (though it may well persist in Open/Libre Office, which after all can still open up Mac Microsoft Word 5 files that Mac Microsoft Office can’t). So this approach does have an expiry date, even if that date is possibly a decade away. It also imposes the nuisance that you (may) need to save the DOC document it generates as DOCX, for further processing.
So what if you want to generate DOCX, and are not prepared to go the route of HTML > DOC > DOCX?
As noted, most approaches do actually author DOCX by writing out native OOXML, whether that OOXML is generated by mapping HTML (as in https://github.com/MuhammetDilmac/Html2Docx), or through a DSL (as in https://github.com/jetruby/puredocx). (These tools also provide the necessary packaging around the document, which for DOCX is a ZIP archive rather than a MIME wrapper; the files that need to be included are much more complicated than for DOC.) Tools such as these do not have full coverage of the formatting possibilities of OOXML, the "7500 pages specification" that Sauvage did not wish to engage with. (You may think Sauvage was exaggerating about the 7500 pages. He was not. The spec is available at http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html under ISO/IEC 29500, in four parts.)
Html2Docx will do italics and images, for example, but it does not even provide for tables, let alone footnotes or mathematical formatting. If you want to add unsupported formatting to the tool, you can’t just save a Word document as HTML and inspect the results (and you can’t just feed the outputted Word HTML through to the tool): you have to learn OOXML, and poke around the much more complex structure of that specification. (There is an Open XML SDK, which is what https://github.com/kannan-ar/MariGold.OpenXHTML relies on; but of course that just means you have to learn a complex SDK instead of a familiar HTML mapping.)
If and when Microsoft Word drops support for DOC, that might be worthwhile. But for now, getting the OOXML coverage of tools like Html2Docx or puredocx up to the level we needed was, frankly, far more work than we were prepared to do, for marginal benefit.
There is one more approach to converting HTML worth describing, which is used by https://github.com/evidenceprime/html-docx-js. An HTML document can be embedded in a DOCX shell, as an MHT blob. So long as the images in the HTML document are inline-encoded, the document will open up as a native Word document, just as the DOC output of this gem does.
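To make the idea of a DOCX shell concrete, here is a rough sketch of the parts such a package might contain, based on our understanding of the OOXML `altChunk` mechanism. It is not taken from html-docx-js: the method name `docx_shell_parts`, the part name `word/afchunk.mht`, and the relationship ids are assumptions for illustration. The parts would still need to be written into a ZIP archive, which is not shown here.

```ruby
# Relationship type that marks a part as an alternative-format chunk.
AFCHUNK_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"

# Return the parts of a minimal DOCX shell as a Hash of path => content.
# mht: the MIME-wrapped HTML document to embed (String)
def docx_shell_parts(mht, chunk_id: "htmlChunk")
  {
    "[Content_Types].xml" => <<~XML,
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
        <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
        <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
        <Override PartName="/word/afchunk.mht" ContentType="message/rfc822"/>
      </Types>
    XML
    "_rels/.rels" => <<~XML,
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="rDoc" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/word/document.xml"/>
      </Relationships>
    XML
    # The document body holds nothing but a pointer to the embedded chunk.
    "word/document.xml" => <<~XML,
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
        <w:body><w:altChunk r:id="#{chunk_id}"/></w:body>
      </w:document>
    XML
    "word/_rels/document.xml.rels" => <<~XML,
      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="#{chunk_id}" Type="#{AFCHUNK_REL}" Target="/word/afchunk.mht"/>
      </Relationships>
    XML
    "word/afchunk.mht" => mht
  }
end
```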
The good news with this approach is that, unlike this gem, the MHT import deals with full contemporary CSS and HTML 5 beautifully. If you put in HTML and CSS you find online, the result in Word will look pretty much the same. (It won’t understand any of the JavaScript of course.)
The bad news is that this approach does only as much as HTML 5.0 + CSS 3.0 does. Word adds to CSS in its Word HTML variant, including such styling as page breaks, footnotes, headers and footers, tables of contents, and comments: formatting that is bound to a page model rather than a browser model of rendering. The MHT import understands some of this markup, but not enough:
Word HTML that the MHT import understands:
- Table borders
- Section breaks
- Page breaks
- Dotted tabs
- List formatting (bullet choice, ordered list number style and styling)
- Paragraph Keep With Next
- Footnotes
- Font settings for styles (but not consistently!)
Word HTML that the MHT import does not understand:
- Tables of contents
- Spacing of paragraphs
- Footnote references (the numbers are missing both from the text and from the footnote)
- MathML
- OOXML mathematical formatting
- Headers and footers (which are in a separate file in Word HTML anyway)
- Comments
For most purposes, the latter list is not likely to be a deal-breaker. (Paragraph spacing is likely the most pressing, but it can be faked with empty paragraphs.) In fact, we would recommend using the MHT import, as done by https://github.com/evidenceprime/html-docx-js, as a future-proof way of generating Word documents.
For our purposes, however, too much formatting was missing; mathematical formatting was critical. And because we are importing Word HTML back into the document, if the Word HTML formatting is rejected, we don’t have an alternative recourse: if the MHT import understands Word HTML footnotes but not Word HTML footnote references, it’s not obvious what else we can try to make it understand footnote references. For that reason, we are persevering with DOC as our output format.