Yet again, a significant root cause of issues is C here

Posted Jun 26, 2025 8:18 UTC (Thu) by parametricpoly (subscriber, #143903)
Parent article: Libxml2's "no security embargoes" policy

Parsers should be written in memory-safe, preferably total functional (i.e. not Turing-complete) languages. The comment by the author

> As you may have noticed, most of our fuzzers inject malloc failures to cover code paths handling such failures, see #344 (closed). In the past, I have fixed ~100 bugs related to handling of malloc failures. I do not consider these issues security-critical,

tells a lot. It wasn't mentioned here, but I bet the performance is also abysmal; see, for example, https://gitlab.gnome.org/GNOME/libxml2/-/issues/212. It's simply the wrong tool for the job.

Posted Jun 26, 2025 13:58 UTC (Thu) by pizza (subscriber, #46)

> Parsers should be written in memory safe, preferably total functional languages (not turing complete).

That's all fine and dandy, except for the minor problem that no such languages existed [1] over two decades ago when libxml2 was first written.

Then there's the other problem where one's internal data structures are usually pretty closely tied to the parser's API (and data structures), which makes it quite hard to retrofit existing code to a different parser.

Then there's the third problem where XML is a particularly sadistic, infinitely-recursive freeform beast. XML parsers have to be malleable and adaptable to handle whatever arbitrary input you are looking to consume. There is no getting around that inherent complexity, and it's better to have as much as possible handled within the parser itself so application writers can minimize the number of footguns they are juggling.

Yes, these goals are fundamentally in conflict. Welcome to the wonderful world of engineering tradeoffs and technical debt.

[1] At least not with a stable C-compatible ABI. Even Rust is only a decade old (1.0 was released in May 2015), and WUFFS (v0.1 released in 2019) still considers XML a "long-term" roadmap item.

Posted Jun 26, 2025 19:41 UTC (Thu) by wahern (subscriber, #37304)

> WUFFS (v0.1 released in 2019) still considers XML a "long-term" roadmap item.

While WUFFS includes a JSON decoder, it's just a JSON tokenizer, not a parser. (It includes an example JSON parser, but only the tokenizer is using WUFFS.) I would assume the roadmap for XML is just that, an XML tokenizer, unless they're contemplating extending WUFFS's capability set in tandem with an XML parser. But both JSON and XML are designed to be trivial to tokenize[1], and I would be surprised if there were any security CVEs in major implementations purely rooted in tokenization. In my recollection, bugs in libxml2 and libxslt (and libexpat and others, for that matter) have been in the higher levels of the implementation stack.

OTOH, I guess WUFFS does enforce a streaming approach to tokenization and, by extension, parsing. That's generally a good idea anyhow, but I guess it also serves a channeling function for those inclined toward approaches that would unnecessarily rely on too much ad hoc dynamic allocation and pointer chasing.

[1] At least ignoring interactions with DTDs, like the ability to define new named entities.
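As a rough illustration of what "streaming tokenization" means here, the following is a minimal Python sketch, not WUFFS code and not how libxml2 works. The function name `stream_tokens` and the token shapes are made up for this example. It deliberately ignores everything that makes real XML hard (comments, CDATA, processing instructions, entities, DTDs, `>` inside attribute values), which is exactly the caveat in the footnote above.

```python
def stream_tokens(chunks):
    """Yield ("tag", text) and ("text", text) tokens from an iterable of string chunks.

    Hypothetical, heavily simplified tokenizer: input is consumed chunk by chunk,
    no tree is built, and only an incomplete tag is ever held back in the buffer.
    """
    buf = ""
    for chunk in chunks:
        buf += chunk
        while buf:
            if buf[0] == "<":
                end = buf.find(">")
                if end == -1:
                    break  # tag is incomplete; wait for the next chunk
                yield ("tag", buf[1:end])
                buf = buf[end + 1:]
            else:
                lt = buf.find("<")
                if lt == -1:
                    # Emit text as soon as it arrives; a text run may therefore
                    # be split across several tokens at chunk boundaries.
                    yield ("text", buf)
                    buf = ""
                else:
                    yield ("text", buf[:lt])
                    buf = buf[lt:]
    if buf:
        raise ValueError("input ended inside a tag")


tokens = list(stream_tokens(["<a>he", "llo</", "a>"]))
```

The caller only ever sees a flat sequence of tokens, so anything resembling a document tree, and the allocation that goes with it, has to be built deliberately in a layer above.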

Posted Jun 27, 2025 9:07 UTC (Fri) by chris_se (subscriber, #99706)

> While WUFFS includes a JSON decoder, it's just a JSON tokenizer, not a parser. (It includes an example JSON parser, but only the tokenizer is using WUFFS.)

Just took a look at the JSON examples in WUFFS after you mentioned this, and yeah...

For JSON, the difference between a pure tokenizer and a SAX-like parser is small enough that it doesn't really matter, which is why that's fine. But I don't think this still holds for XML, especially if you include all the features libxml2 supports.

Plus, the main appeal of libxml2 is its support for DOM and other more advanced features, not just having a SAX parser (those are a dime a dozen), so even a SAX parser would be at most maybe 10% of libxml2...
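To show how thin that layer can be for JSON, here is a small Python sketch under stated assumptions: it is not the WUFFS example code, the names `tokenize` and `sax_events` are invented for illustration, and it performs no well-formedness checking (malformed input may raise or produce nonsense events). The event layer needs little more than a stack and one flag on top of the tokenizer.

```python
import re

# Hypothetical token pattern for JSON; escapes inside strings are skipped, not decoded.
TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<punct>[{}\[\]:,])
      | (?P<string>"(?:[^"\\]|\\.)*")
      | (?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)
      | (?P<literal>true|false|null)
    )''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, raw_text) pairs; this is the pure tokenizer."""
    pos, end = 0, len(text.rstrip())
    while pos < end:
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"bad input at offset {pos}")
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()


def sax_events(text):
    """Turn tokens into SAX-like events; the only extra state is a stack and a flag."""
    stack = []        # currently open containers, "{" or "["
    key_next = False  # True when the next string is an object key
    for kind, value in tokenize(text):
        if kind == "punct":
            if value in "{[":
                stack.append(value)
                yield ("start_object" if value == "{" else "start_array", None)
                key_next = value == "{"
            elif value in "}]":
                stack.pop()
                yield ("end_object" if value == "}" else "end_array", None)
                key_next = False
            elif value == ",":
                key_next = bool(stack) and stack[-1] == "{"
            else:  # ":" separates a key from its value
                key_next = False
        elif kind == "string" and key_next:
            yield ("key", value[1:-1])
            key_next = False
        else:
            yield ("value", value)  # raw token text, left undecoded


events = list(sax_events('{"a": [1, true]}'))
```

Nothing comparable exists for XML once namespaces, entities, DTDs and the rest enter the picture, which is where the extra layers (and their bugs) come from.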

> I would be surprised if there were any security CVEs in major implementations purely rooted in tokenization. In my recollection, bugs in libxml2 and libxslt (and libexpat and others, for that matter) have been in the higher levels of the implementation stack.

Yes, the pure tokenization of XML is probably the easiest part of parsing XML, so I don't expect any mature XML parser to have bugs remaining in the tokenization logic.