One more reason not to use XML

Posted Oct 23, 2005 17:09 UTC (Sun) by man_ls (subscriber, #15091)
In reply to: One more reason not to use XML by zooko
Parent article: Small company makes big claims on XML patents (ZDNet)

Wow, XML almost looks nice in comparison to this horror.

Seriously, though: I think XML started well, but it was possessed by creeping featuritis right about the time of XML schemas. Without the accursed namespaces, a parser can be lean and mean.
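To make the namespace complaint concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element names and the urn:example:cfg URI are invented for illustration. It only shows how one xmlns attribute changes every lookup the consumer has to write; it says nothing about parser size.

```python
import xml.etree.ElementTree as ET

# Two documents that differ only by a default namespace declaration.
plain = ET.fromstring("<config><port>8080</port></config>")
spaced = ET.fromstring(
    '<config xmlns="urn:example:cfg"><port>8080</port></config>'
)

# Without namespaces, element names are exactly what appears in the file.
plain_port = plain.find("port").text

# With a default namespace, ElementTree qualifies every tag as {uri}name,
# so the same unqualified lookup finds nothing.
missed = spaced.find("port")
spaced_port = spaced.find("{urn:example:cfg}port").text
```

Here plain_port and spaced_port both hold the string 8080, while missed is None.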



One more reason not to use XML

Posted Oct 23, 2005 20:12 UTC (Sun) by zblaxell (subscriber, #26385)

The horror would be much less horrible if it wasn't in "canonical" mode. "canonical" mode seems to be optimized for cheap parsing, while "transport" mode is optimized for offending the smallest number of possible transports. The "advanced" mode looks like something a human would want to read or write.

A nice feature (which isn't supported by either S-expressions or XML AFAIK) is the ability to choose what the string delimiters are. MIME (--nextPartfoo242af23), Perl (<<'FOOBAR123124151231'), and PostgreSQL ($foo@AE@V@ET!%$) use this technique. All one has to do is generate a random delimiter string of sufficient length; then nobody has to worry about premeasuring strings of arbitrary length or about quoting and unquoting properly.
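A minimal sketch of the random-delimiter idea in Python, assuming a made-up line-based framing: the first line names the delimiter, and the payload ends at the next line equal to it. The wrap and unwrap names and the framing itself are hypothetical; this is not MIME, a Perl heredoc, or PostgreSQL dollar quoting.

```python
import secrets

def wrap(payload: str) -> str:
    """Frame payload between two copies of a freshly generated delimiter."""
    delim = "--" + secrets.token_hex(16)  # 128 random bits, hex encoded
    return f"{delim}\n{payload}\n{delim}\n"

def unwrap(framed: str) -> str:
    """Recover the payload; the first line of the frame is the delimiter."""
    delim, _, rest = framed.partition("\n")
    # The payload runs up to the first line consisting of the delimiter.
    end = rest.index("\n" + delim + "\n")
    return rest[:end]
```

The payload goes through untouched: no quoting, no unquoting, no length prefix. The price is the tiny but nonzero chance that the payload already contains the delimiter line.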

One more reason not to use XML

Posted Oct 23, 2005 23:19 UTC (Sun) by Zarathustra (guest, #26443)

The Perl quoting style you mention is the Bourne shell quoting style ;)

And really, quoting is not that hard if you do it right, like rc (that is one of its great advantages over sh, which has rather messy quoting): strings can be delimited by ', and if so, a ' character inside the string is represented by ''. E.g., "it's easy" becomes 'it''s easy'. And you don't need any other escaping or quoting than that, which is very easy to parse and generate.
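The rule as stated fits in a few lines of Python. This is a sketch of the doubling rule only, not rc's actual lexer, and the function names rc_quote and rc_unquote are made up for the example.

```python
def rc_quote(s: str) -> str:
    """Delimit s with single quotes, doubling any single quote inside."""
    return "'" + s.replace("'", "''") + "'"

def rc_unquote(q: str) -> str:
    """Invert rc_quote, rejecting input that is not a well-formed string."""
    if len(q) < 2 or q[0] != "'" or q[-1] != "'":
        raise ValueError("not a quoted string")
    body = q[1:-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "'":
            # Inside the delimiters a quote is only legal as a doubled pair.
            if i + 1 >= len(body) or body[i + 1] != "'":
                raise ValueError("lone quote inside string")
            out.append("'")
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

With these definitions, rc_quote("it's easy") gives 'it''s easy' (two single-quote characters in the middle, not one double quote), and rc_unquote turns it back into the original.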

One more reason not to use XML

Posted Oct 24, 2005 9:32 UTC (Mon) by philips (guest, #937)

Seems you never really parsed that.
It is hell to parse that. And it is hell to convert.
Pascal used that for quoting the single quote, and it was a nightmare. C's backslash is an order of magnitude more manageable.

The beauty of XML is in its uniformity. XML files are XML. Stylesheets are XML. Schemas are XML too.

Anyway, the true beauty starts when you convert from XML to s-exprs to ASN.1 and back to XML. I have several apps where XML<->ASN.1 conversion is performed constantly and it just works. Putting a file into database/flash? - ASN.1/TLV. Wanna read it on screen? - XML, here you go. Two Python scripts do all the silly work.
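For readers who have not seen such a round trip, here is a toy sketch in Python. It is not the poster's scripts and not real ASN.1 BER/DER. It assumes a flat record, a one-byte tag, a two-byte big-endian length, and a hypothetical tag table (TAGS); all names are invented for illustration.

```python
import struct
import xml.etree.ElementTree as ET

TAGS = {"name": 0x01, "serial": 0x02}      # hypothetical tag table
NAMES = {v: k for k, v in TAGS.items()}

def xml_to_tlv(xml_text: str) -> bytes:
    """Encode each child of the root element as tag, length, value."""
    out = b""
    for child in ET.fromstring(xml_text):
        value = (child.text or "").encode("utf-8")
        out += struct.pack(">BH", TAGS[child.tag], len(value)) + value
    return out

def tlv_to_xml(blob: bytes) -> str:
    """Rebuild a <record> element from the TLV byte string."""
    root = ET.Element("record")
    pos = 0
    while pos < len(blob):
        tag, length = struct.unpack_from(">BH", blob, pos)
        pos += 3                            # 1 tag byte + 2 length bytes
        value = blob[pos:pos + length]
        ET.SubElement(root, NAMES[tag]).text = value.decode("utf-8")
        pos += length
    return ET.tostring(root, encoding="unicode")
```

The compact form goes to storage and the XML form goes to whoever has to read it. The catch is that both sides must share the tag table.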

Do not forget: XML is not the best format; it is the best compromise of all formats. I personally prefer ASN.1/TLVs, and xxd is my best friend. Other people who have to read the files all day prefer XML since we can add syntactic sugar to it.

What's new here? - both are possible at the same time.

One more fly in the quoting ointment

Posted Oct 24, 2005 14:00 UTC (Mon) by Max.Hyre (subscriber, #1054)

[A] ' character inside the string is represented by ''. Eg., "it's easy" becomes 'it''s easy'. And you don't need any other escaping or quoting

I'm not understanding something here. Using this quoting style, She said "it's easy". becomes 'She said "it"s easy".'. Run that through the de-quoting algorithm.

There's simply no way to delimit random text without being able to escape the delimiter somehow. Even MIME's mechanism has a greater than zero chance of failure. (Admittedly, it's something like 10**-1000000000 [that's a billion*] or less.) As a HW engineer colleague says: The question isn't ``Does it work?'', but rather ``Can it fail?''.

*Or a thousand million, for my British readers :-)

One more fly in the quoting ointment

Posted Oct 24, 2005 18:18 UTC (Mon) by zblaxell (subscriber, #26385)

Beware evil fonts. "It''s easy" != 'It"s easy'. One is a string of length 10, with two ticks, the other a string of length 9, with one quote. They looked obviously different in my mail client when I got the LWN comment notification, but they look exactly the same in my browser window.

MIME's mechanism has a >0 chance of failure, but if implemented correctly it has a much lower chance of failure than the chance of randomly guessing someone's GPG session key (defeating privacy) and finding an SHA1 collision (defeating authentication). Undetected CRC errors or rotting RAM bits are many orders of magnitude more likely to mangle the message.

If you absolutely must avoid failure, a MIME encoder can always hide the delimiter string using quoted-printable encoding (I've seen some encoders transform 'From ' into '=46rom ' at the beginning of some paragraphs to avoid another infamous email delimiter string), or read the entire string in advance and search for a delimiter that doesn't appear within it.
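Both escape hatches are short enough to sketch in Python. The names pick_boundary and hide_from are hypothetical, and neither function is a real MIME encoder: the first regenerates a candidate until it is absent from the already-read payload, and the second shows only the 'From ' rewrite (a real quoted-printable encoder would also have to escape '=' itself, among other things).

```python
import secrets

def pick_boundary(payload: str) -> str:
    """Return a delimiter that is guaranteed not to occur in payload."""
    while True:
        candidate = "--nextPart" + secrets.token_hex(8)
        if candidate not in payload:       # requires the whole payload up front
            return candidate

def hide_from(line: str) -> str:
    """Rewrite a leading 'From ' as '=46rom ' ('F' is 0x46 in ASCII)."""
    if line.startswith("From "):
        return "=46" + line[1:]
    return line
```

The loop in pick_boundary almost never runs twice. What it costs is having to read the entire payload before emitting the first byte, which is exactly the premeasuring that random delimiters were meant to avoid.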

choosing the right tool for the job

Posted Oct 24, 2005 10:59 UTC (Mon) by zooko (subscriber, #2589)

That sample you linked to shows that netstrings and s-expressions are not optimized for human readability, but instead they hit a "sweet spot" with nearly optimal portability, simplicity, efficiency, etc., while also having decent readability for those occasions where you need to read through a protocol dump or a crypto certificate to debug it.
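For reference, a netstring is a decimal length, a colon, the raw bytes, and a trailing comma. A minimal Python sketch follows; the function names are made up, and the decoder is deliberately lenient about things like leading zeros in the length.

```python
from typing import Tuple

def netstring_encode(data: bytes) -> bytes:
    """Encode data as <decimal length>:<bytes>,"""
    return str(len(data)).encode("ascii") + b":" + data + b","

def netstring_decode(buf: bytes) -> Tuple[bytes, bytes]:
    """Decode one netstring from the front of buf.

    Returns (payload, remaining bytes) so that callers can loop.
    """
    length_text, sep, rest = buf.partition(b":")
    if not sep or not length_text.isdigit():
        raise ValueError("missing or malformed length prefix")
    n = int(length_text)
    if len(rest) < n + 1 or rest[n:n + 1] != b",":
        raise ValueError("truncated payload or missing trailing comma")
    return rest[:n], rest[n + 1:]
```

Because the length comes first, the payload never needs quoting and the parser never scans it. The length prefix is also why a dump is readable but not pleasantly hand-editable.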

If you are looking for a format for your config file, I suggest YAML or a basic "key=value\n" format. If you are looking for a format for some "heavy lifting" data such as a database backend or performance-intensive data storage/transport, then I suggest your own custom binary format or (if you are careful) ASN.1. If you are looking for a format for a wire protocol or a lightweight data format (such as for certificates) then I suggest Rivest's s-expressions.

If you are looking for a format for humans to read and edit to add markup to some text, then by all means use XML. Except of course when you should just use HTML. In fact, I'm not sure that I've actually personally experienced a case where XML would have been the best choice.

Some people appear to think (and I used to think) that by choosing a well-suited format instead of XML for each of these tasks that we were inhibiting interoperation and data migration. However, in practice it has turned out that (a) the value of easy data migration between most of these things is often less than the value of a well-fitting format for each of them, and (b) interop and data migration is not actually improved much if at all when both sides were using XML in the first place. It's not that hard in the first place, and the addition of XML everywhere hasn't made it any easier in my experience.

One tool to rule them all

Posted Oct 24, 2005 14:15 UTC (Mon) by man_ls (subscriber, #15091)

netstrings and s-expressions are not optimized for human readability, but instead they hit a "sweet spot" with nearly optimal portability, simplicity, efficiency, etc. while also having decent readability for those occasions where you need to read through a protocol dump or a crypto certificate to debug it.

If that is the sweet spot, I don't want to taste bitterness. I know I can quickly spot the trouble in an XML file and repair it, probably with the help of tools or just a simple text editor. With netstrings and s-expressions it looks like I can go mad while trying to decipher it, but maybe for an experienced user it's easy.

If you are looking for a format for your config file, I suggest YAML or a basic "key=value\n" format.

Key-value pairs quickly become hellish when you try to represent hierarchical data. Don't know about YAML.

If you are looking for a format for some "heavy lifting" data such as a database backend or performance-intensive data storage/transport, then I suggest your own custom binary format or (if you are careful) ASN.1.

Thanks, but if I wanted to use binary formats I would not be looking at XML, YAML or anything else. It's a different use case.

If you are looking for a format for a wire protocol or a lightweight data format (such as for certificates) then I suggest Rivest's s-expressions.

I would suggest simple XML without namespaces or schemas. The verbosity of XML has advantages: the added redundancy can help repair corrupted files.

In fact, I'm not sure that I've actually personally experienced a case where XML would have been the best choice.

I would reverse your statement: XML is not optimal for any of the mentioned tasks, but IMHO it is near the "sweet spot" you mentioned earlier for some of them. It is reasonably apt for configuration files, user data and even transport. For performance-intensive tasks it is not so good, but may be acceptable if you can compress it or spare the bandwidth.

interop and data migration is not actually improved much if at all when both sides were using XML in the first place.

I agree with this part. In fact SOAP is the Simple Protocol From Hell, and all subsequent extensions and protocols have been invented to make people hate XML and computing in general. Likewise with programming languages written in XML. Keep it as a format for data storage, and you will generally be fine.
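On the earlier point about key-value pairs and hierarchical data, here is a small Python sketch of the usual workaround. It assumes a dotted-key convention that "key=value\n" files do not define by themselves; the parse_flat name and the sample keys are invented.

```python
def parse_flat(text: str) -> dict:
    """Turn 'a.b.c=value' lines into nested dictionaries."""
    tree: dict = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                        # skip blanks and comments
        key, _, value = line.partition("=")
        *parents, leaf = key.strip().split(".")
        node = tree
        for part in parents:
            # Descend, creating intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value.strip()
    return tree

config = parse_flat(
    "server.host=example.org\n"
    "server.port=8080\n"
    "server.tls.cert=/etc/cert.pem\n"
)
```

The hierarchy exists only in the naming convention: every line repeats its full path, lists need yet another convention, and a key used both as a value and as a parent (server=x followed by server.port=y) breaks this sketch with a TypeError.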

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds