The Journal - a proposed syslog replacement

Posted Nov 20, 2011 6:51 UTC (Sun) by skissane (subscriber, #38675)
In reply to: The Journal - a proposed syslog replacement by jmorris42
Parent article: The Journal - a proposed syslog replacement

I disagree completely. The UNIX emphasis on plain text had its value in its original historical context of trying to get something working quickly, but I think it holds us back to keep treating it like some kind of unquestionable religious dogma.

What is bad about binary formats? Nothing inherently wrong with them. Sure, if they are poorly designed, poorly documented, lack good tools, etc., then no doubt they can cause a lot of grief, but those are problems with poorly designed and poorly implemented binary formats, not with binary formats per se.

What I would like to see, is a simple, flexible, self-describing, binary format. ASN.1 or binary XML are both good places to start, although I think both suffer from useless extra complexity. Maybe something like Google protocol buffers would be a good choice?

If you really want text, you can easily write a tool to dump the binary out to text. Hey, you could have a format with two official serialisations, a text-based and a binary one.

But the problem I see with most Unix-style plain text formats is that every one is different. There is a lack of consistency/standardisation, especially when it comes to how to escape special characters, etc.


The Journal - a proposed syslog replacement

Posted Nov 20, 2011 7:11 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

What makes you think that binary formats would be any more standardized than the existing text formats?

There are ways to do self-describing text formats, but developers don't do it.

With text formats it's a lot easier to examine the file and reverse engineer the format than it is from a binary format.

Damaged/lost files are also a place where text files are easier to recover than binary files.

In theory none of this should ever be needed and binary files are just fine. But this is where the quote "in theory, theory and practice are the same, but in practice they are not" applies.

The Journal - a proposed syslog replacement

Posted Nov 20, 2011 7:55 UTC (Sun) by skissane (subscriber, #38675) [Link]

I think, if you want to stick to text, it would be much better if tools output in some standardised text format, e.g. XML, JSON, YAML, etc.

But then, once you have a standardised text format, why not save some space and processing time with an efficient binary serialization of XML/JSON/YAML/what-have-you?

And then you can have a tool, e.g. bin2text, which reads the binary format on standard input and writes the text format on standard output, and vice versa. With such a tool, reverse-engineering/examination should be no harder than with a plain text format.
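A minimal sketch of what such a bin2text/text2bin pair could look like, assuming a made-up binary layout (a field count, then length-prefixed UTF-8 keys and values) and JSON lines as the text form. Neither the layout nor the function names come from any real tool; they are only there to show that the round trip is mechanical once both serialisations are defined.

```python
import json
import struct


def encode_record(record):
    """Pack a dict of str -> str into one length-prefixed binary record.

    Hypothetical layout: uint32 field count, then for each field a
    uint16 key length, the key, a uint32 value length, the value.
    """
    out = [struct.pack(">I", len(record))]
    for key, value in record.items():
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        out.append(struct.pack(">H", len(k)) + k)
        out.append(struct.pack(">I", len(v)) + v)
    return b"".join(out)


def decode_records(data):
    """Yield dicts from a byte string of concatenated binary records."""
    pos = 0
    while pos < len(data):
        (nfields,) = struct.unpack_from(">I", data, pos)
        pos += 4
        record = {}
        for _ in range(nfields):
            (klen,) = struct.unpack_from(">H", data, pos)
            pos += 2
            key = data[pos:pos + klen].decode("utf-8")
            pos += klen
            (vlen,) = struct.unpack_from(">I", data, pos)
            pos += 4
            record[key] = data[pos:pos + vlen].decode("utf-8")
            pos += vlen
        yield record


def text2bin(text):
    """Convert JSON-lines text (one object per line) to the binary form."""
    return b"".join(
        encode_record(json.loads(line))
        for line in text.splitlines()
        if line.strip()
    )


def bin2text(data):
    """Convert the binary form back to JSON-lines text."""
    return "".join(
        json.dumps(rec, sort_keys=True) + "\n" for rec in decode_records(data)
    )
```

Because the values are length-prefixed, a message containing newlines or quotes needs no escaping in the binary form; the escaping question only comes up in the text serialisation, where JSON already answers it.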

I think this would be better than the rather poorly-defined text formats used at present by many tools, and binary is more efficient than text.

The point you make about recovery from corrupted files being easier when they are in text is true, but how often do you have to deal with that? If good-quality libraries were provided (say C with bindings to other common languages such as C++, Java, Perl, Python, etc.), the odds of a corrupt file due to programmer error should be low, outside of some mid-transaction failure scenario. And if we had transaction support in the library or the underlying filesystem, we could avoid that problem too.

The Journal - a proposed syslog replacement

Posted Nov 23, 2011 22:12 UTC (Wed) by cas (subscriber, #52554) [Link]

> But then, once you have a standardised text format, why not save some space and processing time with an efficient binary serialization of XML/JSON/YAML/what-have-you?

  • space is irrelevant these days. multi-terabyte disks are cheap, readily available consumer products
  • in my experience, XML etc *greatly* complicates most jobs, increasing processing time, difficulty of programming, difficulty of understanding WTF is going on. it turns what should be a quick and simple one liner to extract information into a multi-hour programming effort reading API docs, parsing the data in whatever obscured format it's in (and possibly parsing other things like the DTD).
  • it's completely missing the point of XML, JSON, YAML etc - they're data *transfer* protocols, not data *storage* methods. their purpose is to unambiguously transfer data from one system to another, not to store data in yet another obscure special purpose file format
  • it violates the KISS principle. but, then, everything Lennart is involved in does that.

The Journal - a proposed syslog replacement

Posted Nov 23, 2011 23:38 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

even multi-terabyte disks are expensive if you need a lot of them.

I store my logs at 10:1 compression (or better) and I still have 10's of TB of logs to deal with.

The Journal - a proposed syslog replacement

Posted Nov 20, 2011 8:12 UTC (Sun) by drag (subscriber, #31333) [Link]

Well they tried. It ended up being XML. :(

The Journal - a proposed syslog replacement

Posted Nov 20, 2011 19:27 UTC (Sun) by skissane (subscriber, #38675) [Link]

The problem with XML is:
1) a syntax originally designed for marking up documents got reused for
data, with the result that XML provides distinctions which are
unnecessary for data purposes (e.g. element vs. attribute distinction)
2) historical baggage, e.g. DTDs
Certainly you can define new syntaxes which avoid those two problems that
XML has. On the other hand, whatever its warts, XML is an industry standard,
and practical considerations often imply choosing the imperfect industry
standard over some technically superior but rarely used alternative.

But, JSON is quite common now, and addresses some of the issues above. (But
I think it has its own deficiencies too)

The Journal - a proposed syslog replacement

Posted Nov 21, 2011 14:50 UTC (Mon) by sorpigal (subscriber, #36106) [Link]

> I think the UNIX emphasis on plain text, while it had its value in its original historical context of trying to get something working quickly, it holds us back to keep it like it was some kind of unquestionable religious dogma.

It seems logical to say that a standard binary format is just as good as a standard text format, which is why this is carefully documented as one bullet point in the Unix philosophy: use text. The extra overhead was a lot worse back when this idea was developed, yet they stuck with it anyway. If you don't meditate on "Why" then you will invent a non-text system that's "as good or better" than text and suffer as a result. You can either accept received wisdom and "just do it," ignore this sage advice at your own peril or embrace the idea wholeheartedly.

It's not true...

Posted Nov 21, 2011 18:41 UTC (Mon) by khim (subscriber, #9252) [Link]

> It seems logical to say that a standard binary format is just as good as a standard text format

This is not true. To pull useful data from a corrupted text file you need a human being. To pull it from a binary format with embedded CRC checks you only need to rigorously apply one very fast function. Sure, this only protects the data from accidental changes (zeroed-out pages in the middle, bit flips, that kind of thing), but the funny thing is that when people describe how they heroically recovered data from a corrupt disk or filesystem, it's almost always accidental corruption.
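A rough sketch of the kind of recovery meant here, assuming a hypothetical framing in which every record carries a sync marker, its payload length and a CRC32 of the payload. The marker value and the function names are invented for illustration and are not the Journal's actual on-disk format.

```python
import struct
import zlib

MAGIC = b"\xfeLG1"                # hypothetical sync marker
HEADER = struct.Struct(">4sII")   # marker, payload length, CRC32 of payload


def frame(payload):
    """Wrap one payload in a header carrying its length and checksum."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def recover(data):
    """Return every payload whose checksum still matches.

    Damaged records are skipped by searching for the next sync marker,
    so no human has to look at the file.
    """
    good = []
    pos = data.find(MAGIC)
    while pos != -1 and pos + HEADER.size <= len(data):
        _, length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            good.append(payload)
            pos = data.find(MAGIC, start + length)
        else:
            # Header or payload is damaged: resynchronise on the next marker.
            pos = data.find(MAGIC, pos + 1)
    return good
```

As said above, a CRC only catches accidental damage; someone editing the file on purpose can recompute the checksum.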

> The extra overhead was a lot worse back when this idea was developed, yet they stuck with it anyway.

Actually it's much worse today. When you had hundreds of kilobytes or maybe a few megabytes of logs, a human as the "recovery system" worked. When there are gigabytes, terabytes and petabytes of logs, it's hopeless.

The Journal - a proposed syslog replacement

Posted Nov 23, 2011 22:03 UTC (Wed) by cas (subscriber, #52554) [Link]

what is wrong with binary formats is that I have to use a *different* tool for every different binary format. i'll never use any of them often enough to truly master them and, worse, anything i learn about their usage is trapped within that single usage context.

with plain text formats I can use the *same* collection of tools for everything, and every new thing i learn about sh or sed or grep or awk or perl or whatever is automatically useful in hundreds of other contexts, not just the context in which i originally learnt it.

binary formats suck.

The Journal - a proposed syslog replacement

Posted Nov 27, 2011 2:23 UTC (Sun) by HelloWorld (guest, #56129) [Link]

> with plain text formats I can use the *same* collection of tools for everything, and every new thing i learn about sh or sed or grep or awk or perl or whatever is automatically useful in hundreds of other contexts, not just the context in which i originally learnt it.

I found that in most cases, it is easier to learn a new tool that is specialized for the job at hand than to try to get "standard unix" tools to do what you want, especially if you want to do it in a robust and maintainable way. For example, people keep asking again and again how to handle some XML format with things like sed or awk, which is just a bad idea given the existence of specialized tools like xmlstarlet.
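To illustrate the point with something runnable, here is a small sketch that uses Python's standard XML parser as a stand-in for xmlstarlet, next to a line-oriented pattern of the sort one might write for sed or grep. The sample document and both function names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample: the second entry has its attributes in a different
# order, is split across two lines, and contains escaped characters.
DOC = """<log>
  <entry unit="sshd" level="info">login ok</entry>
  <entry level="err"
         unit="cron">job &lt;backup&gt; failed</entry>
</log>"""


def units_with_regex(text):
    """Line-oriented pattern that assumes one fixed attribute order."""
    return re.findall(r'<entry unit="([^"]+)" level="[^"]+">', text)


def entries_with_parser(text):
    """Use a real XML parser, which handles order, wrapping and escapes."""
    root = ET.fromstring(text)
    return [(e.get("unit"), e.get("level"), e.text) for e in root.iter("entry")]
```

On this sample the pattern silently misses the second entry, while the parser returns both and also decodes the escaped angle brackets in the message text.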

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds