09 Mar 2014

In a recent discussion on Google Plus about XML vs JSON, I was sent a webpage, written by David Lee, which successfully illustrates that, under specialized conditions, XML and JSON are comparable in compactness and thus in transmission and storage efficiency. It fails to convince me, however, that XML compares favorably against JSON in parsing performance, an issue that became a real concern only fairly recently. It further fails to convince me, in toto, that I should even consider XML again for anything outside of document markup, especially as more and more enterprises take greater interest in their transactions-per-watt metric.

First, I want to say that I mostly agree with David’s findings, given the narrow scope of his research thesis. I’m still not satisfied with the report’s lack of illustrations using namespaces, but other than that, it seems pretty comprehensive and well researched. No doubt, with a skilled practitioner defining an XML format, an end-user may find XML quite pleasant to work with, from an operational as well as a usability point of view. In some cases, XML may even yield smaller uncompressed encodings than the equivalent JSON representation. If you serialize your object fields as attributes instead of nested constructs, and you don’t use characters which require entity expansion, you actually save four characters per attribute over JSON strings, and two characters for numbers and booleans. It only takes a handful of attributes to recover XML’s framing overhead in that narrow case, assuming very short tag names. The Books corpus tests in figure 9 show this nicely. Alas, as every other corpus test shows, not everything renders so nicely.
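To make that arithmetic concrete, here is a minimal Python sketch that encodes one flat record three ways and compares the lengths. The record, its field names, and the one-letter tag `b` are invented for illustration and are not taken from David’s corpus. The per-attribute savings quoted above correspond to JSON written with a space after each colon and comma (Python’s default); with fully compact separators the margin shrinks, which is why both JSON variants are shown.

```python
import json
from xml.sax.saxutils import quoteattr

# Hypothetical flat record; field names and values are illustrative only.
record = {
    "title": "Dune",
    "author": "Herbert",
    "lang": "en",
    "year": 1965,
    "inprint": True,
}


def to_xml_attrs(tag, fields):
    """Serialize a flat dict as a single empty element with one attribute per field."""
    parts = []
    for key, value in fields.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        # quoteattr() adds the surrounding quotes and escapes &, <, > and quotes.
        parts.append(f"{key}={quoteattr(text)}")
    return f"<{tag} " + " ".join(parts) + "/>"


xml_text = to_xml_attrs("b", record)
json_spaced = json.dumps(record)                          # ", " and ": " separators
json_compact = json.dumps(record, separators=(",", ":"))  # no optional whitespace

for label, text in (("xml attrs", xml_text),
                    ("json spaced", json_spaced),
                    ("json compact", json_compact)):
    print(f"{label:13} {len(text):3}  {text}")
```

Lengthen the tag name, nest the fields as child elements, or add a value that needs entity escaping, and the XML figure grows past both JSON figures, which is the "specialized conditions" caveat in practice.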

This explains why I said specialized conditions above; it takes active software engineering effort to make XML as compact as JSON. Indeed, throughout the whole article, it’s as if David continually paraphrases the mantra, “See, if we only perform this best-practice, XML can be as compact as JSON.” [1] This is telling: it illustrates how JSON exhibits the compact-by-design property more than XML does (though I certainly feel we can compact JSON further; I find the need for quotes around key names largely superfluous, for example). It’s pretty clear, just from looking at the available applications of XML in the real world, that few willingly expend the effort necessary to ensure a good-quality, compact XML format. Anyone who’s had to work with Spring configuration files or Maven dependency files, fix corrupted IDE project configurations, synthesize SOAP payloads for integration testing, or … can tell you just how much of a nightmare XML is, purely from a usability point of view and, more rarely, from an operational point of view as well.

Though the research suggests that JSON offers no real gain over well-designed XML, it never successfully states the contrapositive, namely that if you fail to exercise discipline with JSON format design, you can end up with JSON as fat as typical XML. The closest contraposition found in the research comes from JSON naively auto-generated from a source of already suboptimal XML. While I can’t prove a negative, I can speculate that this has never happened in the real world for any commonly used wire transmission or storage format. Do detail-oriented developers simply have a predisposition to use JSON? David’s research cannot answer that. Nonetheless, without a supporting contrapositive taken from a corpus of real-world JSON, comparable to the real-world XML data David uses, one cannot reliably refer to David’s research to justify XML over JSON as an over-the-wire or storage format.

Up to this point, I have only discussed data which equates JSON and XML. Already, we find little incentive to reconsider XML as a viable format for much of anything outside of legacy applications. However, careful examination of the data in David’s research may hint at a reduction in the number of new XML applications going forward.

Presumably in an effort to show the superiority of XML over JSON in parsing performance, figure 16 shows JSON parsing taking longer than XML parsing for most payload sizes. However, that test uses a parser that the greater JSON and JavaScript communities shunned on account of its known performance and security hazards. Let me reiterate: nobody I know of uses JavaScript’s eval() function to parse anything but the most trivial JSON payloads, and even then, only for illustrative purposes, typically to demonstrate security vulnerabilities. No secure, high-performance production environment, be it on the client side or the server side, ever uses eval(), period. This explains why jQuery and Node have their own, custom-implemented JSON parsers in the first place. eval() also bypasses any JavaScript JIT, hence its lackluster performance. For these reasons, I consider that specific test categorically invalid.
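The security half of that objection is easy to demonstrate. The sketch below uses Python purely as a stand-in for JavaScript, on the assumption that the hazard is the same in any language whose eval() runs arbitrary expressions; it is not a reproduction of David’s test, and the `hostile` payload and the `parse_strict` helper are hypothetical names made up for this example.

```python
import json

# Hypothetical hostile "payload": not JSON at all, but a valid Python expression.
hostile = '__import__("os").getcwd()'

# eval() executes whatever expression it is handed. Here it merely reads the
# working directory, but it could just as easily do something destructive.
leaked = eval(hostile)
print("eval() ran attacker-controlled code and returned:", leaked)


def parse_strict(text):
    """Parse text as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# A real JSON parser only recognizes the JSON grammar, so the hostile input is
# rejected as a syntax error instead of being executed.
print("strict parser on hostile input:", parse_strict(hostile))
print("strict parser on real JSON:   ", parse_strict('{"ok": true}'))
```

A dedicated parser never hands its input to the language runtime, so malformed or malicious text can at worst produce a parse error.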

jQuery’s own JSON parser, thanks to modern tracing-JIT technology, better approximates the performance found in such languages as Go, Java, PyPy, et al., as ultimately we’re executing real machine instructions. To illustrate, figure 17 paints a different picture, which David completely ignores in his conclusion [2]. Suddenly the built-in XML parsers start to look pretty slow in comparison. As your payloads get bigger, even with well-formatted XML, parsing XML seems to demand more CPU, from the Y-axis onward.
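If you would rather measure than take anyone’s graph on faith, a rough harness along the following lines lets you time both parsers on your own payloads. It is only a sketch: the record layout and sizes are invented, it uses Python’s standard-library parsers rather than the browser parsers in David’s figures, and its timings say nothing about JavaScript engines. Treat the output as specific to your machine and your data.

```python
import json
import timeit
import xml.etree.ElementTree as ET

# Hypothetical payload: 1000 flat records, encoded once as JSON and once as
# attribute-style XML so that both documents carry the same information.
records = [{"id": i, "name": f"item{i}", "price": i * 1.5} for i in range(1000)]

json_text = json.dumps(records, separators=(",", ":"))
xml_text = "<items>" + "".join(
    f'<item id="{r["id"]}" name="{r["name"]}" price="{r["price"]}"/>'
    for r in records
) + "</items>"


def parse_json():
    return json.loads(json_text)


def parse_xml():
    return ET.fromstring(xml_text)


# Parse each document 50 times and report the total wall-clock time.
runs = 50
json_seconds = timeit.timeit(parse_json, number=runs)
xml_seconds = timeit.timeit(parse_xml, number=runs)

print(f"JSON: {len(json_text)} chars, {json_seconds:.4f} s for {runs} parses")
print(f"XML:  {len(xml_text)} chars, {xml_seconds:.4f} s for {runs} parses")
```

Swapping in a payload captured from a real service, rather than synthetic records, is the more honest test of either format.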

So why do I find this important? More and more enterprises and individuals alike host their corporate functions on VMs [3], either internally or externally via providers like Rackspace. Having efficient parsers not only means less performance drain for your own application, it also means greater performance for your fellow tenants on the physical host. This means reduced IT and support loads for both the enterprise and the hosting provider as well. As enterprises increasingly pay attention to their electric bill, higher performance translates to more transactions per second per watt consumed, which translates into more efficient use of compute resources for the dollars spent.

While I agree with the individual who sent me the link that XML has been abused over the years, it’s clear that David’s research does nothing to convince me to return to XML for any reason whatsoever, nor does it convince me that, in the real world, XML should even be considered as anything but a legacy format. David’s thesis, despite being validated with research, only goes so far as to say that XML approaches the JSON asymptote, and exceeds it only in the most specialized of circumstances. In fact, David’s own data works against the thesis that one can justify new applications of XML, as I’ve pointed out above. This implies that JSON remains the superior textual format for over-the-wire use. XML, like HTML and SGML before it, remains a document markup format. [4] But, I digress. If we’re going to argue about using the right tool for the right job, at least provide a compelling use-case for your side, preferably one which doesn’t include data supporting your opposition.

[1] He’s right, of course, which explains why I still agree with David’s thesis while concurrently disagreeing with how his thesis is being used to justify continued use of XML. After all, nobody except David asked the question, “Given the subset of applications where XML and JSON compete, can XML be as compact and useful as JSON, for all applications of its use?” The question remains to this day, “Given the subset of applications where XML and JSON compete, is XML as compact and useful as JSON, for all applications of its use?” David’s research shows that it can be, as long as you pay attention to your schema. Yet, I find his data set too narrow to satisfy my skepticism in the general case. Put more simply, theory and reality are only theoretically related; I want to know to what degree they diverge.

[2] David writes, “Pure JavaScript parsing generally performs better with XML then with JSON but not always,” yet his own data directly contradicts this analysis.

[3] And, increasingly, in containers within a single VM instance.

[4] However, I, and many others, question even this, as formats like ROFF, GNU Info, Markdown, and AsciiDoc all have the benefit of working far better with revision control tools like Git and Subversion. Besides containing minimal syntax, which allows me to focus more on the problem I’m documenting and a lot less on the syntax of the markup, documents in these formats tend to organize around lines, which are natural units of work for diff and related tools. What about their relative extensibility? Examining ROFF in particular, we find a rich ecosystem surrounding the ROFF format that provided embedded mathematical equations, camera-ready tables and line figures, and more, long before SVG and MathML, two XML applications, came around. Thus, ROFF proved every bit as “extensible” as XML claims to be. Compared to HTML, it merely lacked anchor points and browsers capable of hypertext navigation, but nothing fundamentally prevents their inclusion except that nobody uses ROFF anymore except to write Unix man-pages. Note that GNU Info format, AsciiDoc, and Markdown directly support hyperlinking, and at least Markdown and AsciiDoc provide means to escape to HTML and/or XML when native markup proves insufficient (which is rare).