They should be paying attention to the lumberjack project

Posted Apr 14, 2012 1:25 UTC (Sat) by dlang (subscriber, #313)
In reply to: They should be paying attention to the lumberjack project by aliguori
Parent article: Toward more reliable logging

Do you have a suggestion for a serialization format to use that would be better?

among the selection criteria are (not in any particular order)

1. size of the resulting serialized string (XML struggles here)

2. availability of libraries in all languages to deal with the serialization format.

3. ability to be transported through traditional logging mechanisms (syslog for example)

4. ability for existing logging tools that can deal with ad-hoc formats to be able to interact with the serialized data

5. human readability of the resulting string

6. ability to represent hierarchical structures.

7. avoiding people saying 'why did you use X' (JSON obviously fails here as well, but it does have a fairly large mindshare)
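To make the trade-offs concrete, here is a sketch (Python; the field names are invented for illustration, not the actual lumberjack/CEE field list) of a structured record serialized to a single JSON line that could travel over a traditional syslog transport:

```python
import json

# A hypothetical structured log record; field names are illustrative only.
record = {
    "timestamp": "2012-04-14T01:25:00Z",
    "host": "example.host",
    "severity": "info",
    "msg": "link up",
    "interfaces": ["eth0"],   # hierarchical data: a list inside the record
}

# One compact, human-readable line with no embedded newlines.
line = json.dumps(record, sort_keys=True)
```

This sketch touches most of the criteria above: compact (criterion 1), libraries everywhere (2), a single newline-free text line (3, 4), readable (5), and hierarchical (6).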



They should be paying attention to the lumberjack project

Posted Apr 14, 2012 2:46 UTC (Sat) by slashdot (guest, #22014) [Link]

Are hierarchical structures really needed in logging? Why?

How do you plan to index and search a set of records consisting of hierarchical structures?

A list of key/value pairs seems much better, and can support indexing and SQL queries trivially.

They should be paying attention to the lumberjack project

Posted Apr 14, 2012 23:21 UTC (Sat) by dlang (subscriber, #313) [Link]

> Are hierarchical structures really needed in logging? Why?

I was not involved with the discussions, but a few things come to mind

1. the data you want to log may include structures

2. you may need to log multiple data items of the same type (like a list of filenames)

you can always choose to flatten a hierarchical structure if you need to, but it's much harder to re-create the structure after you have flattened it

not all storage is SQL.

They should be paying attention to the lumberjack project

Posted Apr 14, 2012 23:35 UTC (Sat) by man_ls (guest, #15091) [Link]

Nice! Now all we need is a MongoDB database inside the kernel. And it could be used for all kinds of things: device trees, file structures, even memory management come to mind.

Just joking. I think I get the least-common-denominator motivation. But isn't getting JSON into any kind of logging facility going to stir immediate "designed-by-committee" fears in kernel developers, and lead them to ignore the lumberjack project?

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 0:01 UTC (Sun) by dlang (subscriber, #313) [Link]

remember that what we are talking about is not anything consumed inside the kernel, just the format of a one-way feed of output from the kernel to userspace

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:01 UTC (Sun) by man_ls (guest, #15091) [Link]

OK, that makes sense. JSON is trivial to generate.

They should be paying attention to the lumberjack project

Posted Apr 17, 2012 16:38 UTC (Tue) by k8to (subscriber, #15413) [Link]

To be fair, a list of say filenames isn't hard to do in kv pairs.

filename:foo filename:bar filename:baz

Sure, that's more trouble to parse than assuming you can't have repeats, but it's not much work.

However, it starts getting tedious if you need to say something like:

severity:fatal message="corruption in filesystem regarding following items" filename:foo inode:3 filename:bar inode:6 filename:baz inode:8

At this point you really want more structure, or else the consuming end has to intuit to group things.
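The grouping the consuming end would have to intuit can be sketched like this (Python; the function name and the "repeat starts a new record" rule are my assumptions, not anything the thread specifies):

```python
def group_kv(pairs):
    """Group a flat list of (key, value) pairs into records,
    starting a new record whenever a key repeats."""
    records, current = [], {}
    for key, value in pairs:
        if key in current:          # repeated key: previous record is done
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records

# Flat token stream from the filesystem-corruption example above.
pairs = [("filename", "foo"), ("inode", "3"),
         ("filename", "bar"), ("inode", "6"),
         ("filename", "baz"), ("inode", "8")]
```

Note how fragile the heuristic is: reorder two tokens and the grouping silently changes, which is exactly why real structure is preferable.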

They should be paying attention to the lumberjack project

Posted Apr 21, 2012 8:19 UTC (Sat) by neilbrown (subscriber, #359) [Link]

.... filename-1:foo inode-1:3 filename-2:bar inode-2:6 filename-3:baz inode-3:8

Explicit structure embedded in the names of name:value pairs.

Yes, it's ugly. But it's simple and if most cases don't need any real structure, then the ugliness will hardly be noticed.
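Rebuilding the records from those suffixed names is mechanical; a minimal sketch (Python, function name assumed) of a consumer undoing the name-N convention:

```python
import re
from collections import defaultdict

def unflatten(kv):
    """Rebuild a list of records from name-N:value pairs,
    ordered by their numeric suffix."""
    records = defaultdict(dict)
    for name, value in kv.items():
        m = re.fullmatch(r"(.+)-(\d+)", name)
        if m:                      # base name and index from the suffix
            records[int(m.group(2))][m.group(1)] = value
    return [records[i] for i in sorted(records)]

kv = {"filename-1": "foo", "inode-1": "3",
      "filename-2": "bar", "inode-2": "6",
      "filename-3": "baz", "inode-3": "8"}
```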

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:56 UTC (Sun) by dlang (subscriber, #313) [Link]

to some extent, hierarchical structures can be simulated by item names

for example, a firewall log message needs to have the following items in it

the device generating the log message
the device that is the source of the traffic
the device that is the destination of the traffic

each of these device definitions may include more than one piece of information (hostname, FQDN, IP address, port number)

you could have

loghost: hostname, logip: 1.1.1.1, sourcehost: hostname2, sourceIP: 2.2.2.2, sourceport: 1234, destinationhost: hostname3, destinationIP: 3.3.3.3, destinationport:1234

or you could have
logsource { name: hostname, ip: 1.1.1.1}, source { name: hostname2, ip: 2.2.2.2, port: 1234}, destination { name: hostname3, ip: 3.3.3.3, port: 1234 }

personally, I find the second arrangement better and less likely to get confused by people adding extra information to a particular component

as another example, think of all the contexts that a userid can appear in, including what user the application writing the log message is running as. Should all these different possible contexts use a different tag, or are we better off using the same tag everywhere and using the hierarchy information to determine the context?
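The "flatten if you need to" direction is easy, which is the point: a sketch (Python; the underscore-joining convention is my assumption) of collapsing the second, nested arrangement into the first, flat one:

```python
def flatten(nested, prefix=""):
    """Flatten a nested dict by joining key paths with '_'.
    Easy to do, but the reverse mapping is ambiguous without
    an agreed convention -- structure is lost."""
    flat = {}
    for key, value in nested.items():
        name = key if not prefix else prefix + "_" + key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# The firewall example from above, in its hierarchical form.
msg = {"source": {"name": "hostname2", "ip": "2.2.2.2", "port": 1234},
       "destination": {"name": "hostname3", "ip": "3.3.3.3", "port": 1234}}
```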

They should be paying attention to the lumberjack project

Posted Apr 14, 2012 3:20 UTC (Sat) by drag (subscriber, #31333) [Link]

This may come off as a bit abusive and probably is full of fail, but what I'd like to see in a log format is null-delimited strings.

And it would look something like this:
log_version\0 machine_ident\0 machine_fqdn\0 timestamp\0 service_ident\0 service_string\0 process_id\0 severity\0 data\0 checksum\0\0\n

something simple like that. The *ident fields are UUIDs and are completely arbitrary.

The 'machine_ident' would be generated when the syslog-like daemon first starts up, like ssh keys are. When the logging daemon connects to a service or starts a new log file it just pukes out a log entry with various useful system identification strings that can be easily picked up by any log-parsing software, like how browsers do when they connect to a web server. That way it makes it easy to identify the machine by UUID. As long as you can read the first log entry in any file, or any time it connects to a network logging daemon, then you can figure out what it is pretty easily.

Timestamps are just x.xxxx seconds from the unix epoch, GMT. The timestamp can be as fine-grained as the application warrants and the system can deliver.

Severity level is similar to how Debian does their apt-pinning. Just a number, like 0-1000. And that number maps to different severity levels:
0-250 - debug
250-500 - info
500-750 - warning
750-1000 - error

That way application developers have a way of saying "well this error is more of an error than that error", which seems important.
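The band mapping sketched above is a one-liner-ish function (the band boundaries are drag's; the function name and the "upper bound exclusive" reading are my assumptions):

```python
def severity_name(level):
    """Map a 0-1000 severity number to a coarse level name,
    per the bands above (lower bound inclusive; 1000 counts
    as error)."""
    if level < 250:
        return "debug"
    if level < 500:
        return "info"
    if level < 750:
        return "warning"
    return "error"
```

The nice property is that tools can still filter on the coarse names while applications rank errors against each other within a band.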

The actual data field can be whatever you want. Any data as long as no nulls. Probably more structuring can be layered on later, but this makes it easy to incorporate legacy logging data into this format. Just take the string as delivered by the application/server, stuff the entire thing into <data> and wrap it in those other fields as well as can be done. <data> being JSON would be fine by me and the fact that it's JSON or whatever would be recorded as part of the version string.

I know something like that would make my job a lot easier. :)
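For concreteness, assembling and splitting such a record might look like this (Python; the field contents are made up, and the exact placement of the trailing \0\0\n is my reading of the layout above):

```python
# Hypothetical field values matching the layout sketched above:
# version, machine_ident, machine_fqdn, timestamp, service_ident,
# service_string, process_id, severity, data, checksum.
fields = ["1", "uuid-machine", "host.example.com", "1334364000.1234",
          "uuid-service", "sshd", "4242", "800",
          "Failed password for root", "deadbeef"]

# The one hard rule: no field may itself contain a NUL byte.
assert all("\0" not in f for f in fields)

record = "\0".join(fields) + "\0\0\n"   # \0\0\n terminates the record
parsed = record.rstrip("\n").rstrip("\0").split("\0")
```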

They should be paying attention to the lumberjack project

Posted Apr 14, 2012 23:23 UTC (Sat) by dlang (subscriber, #313) [Link]

The biggest problem with your approach is that it requires throwing away all existing logging and log processing tools. Since you aren't going to get everyone to buy into the new scheme at once and modify every program in the world to use it, the probable result is that nothing will happen instead.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 3:38 UTC (Sun) by drag (subscriber, #31333) [Link]

I guess so.

I figured it would be the logging daemon's job to fill in all the fields as well as it can, but shovel the log from the application into the 'data' section. If it leaves the 'severity' section empty or whatever, then that would be legal. It's a 'best effort' type thing rather than requiring strict compliance.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 4:03 UTC (Sun) by dlang (subscriber, #313) [Link]

The idea here (lumberjack and CEE) is to support and encourage the applications (including the kernel) to create structured logs so that the data that you are referring to as the 'data' section is easier (and thus more reliable) to deal with.

the first step is to have the normal message just stuck in the 'data' section, and the libumberlog library ( http://algernon.github.com/libumberlog/umberlog.html ) is designed to do just that. It can be LD_PRELOADed for any application, and it modifies the syslog() call to log a structured log (a JSON structure with added metadata). It then allows the application programmer to change syslog() calls to ul_syslog() calls and add additional name-value pairs.

the next step is to create a more complete logging API that allows the application programmer to more easily create structured logs. Debate over how that could/should work is ongoing.
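What that first step amounts to can be sketched in Python rather than C (this mimics the idea only; it is not libumberlog's actual output format, field names, or API):

```python
import json
import os
import socket
import time

def structured_log(priority, message, **extra):
    """Wrap a classic one-line message in a JSON envelope with added
    metadata, the way a structured syslog() shim might, and let the
    caller attach extra name-value pairs."""
    entry = {
        "msg": message,
        "priority": priority,
        "pid": os.getpid(),
        "host": socket.gethostname(),
        "timestamp": time.time(),
    }
    entry.update(extra)              # caller-supplied name-value pairs
    return json.dumps(entry)

line = structured_log("info", "service started", user="root")
```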

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 0:45 UTC (Sun) by jzbiciak (subscriber, #5246) [Link]

YAML, perhaps? I guess YAML is a superset of JSON anyway these days.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:08 UTC (Sun) by dlang (subscriber, #313) [Link]

what is it about YAML that would make it preferable? What can you encode in YAML that you can't do in JSON? (remember we are talking about logs here, not arbitrary documents) What can you encode in YAML better (more clearly, more simply) than you can in JSON?

doing a quick google search for YAML vs JSON, I am not seeing anything that strikes me as being drastically better about YAML, and the fact that the JSON spec is considerably simpler seems to be an advantage.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:27 UTC (Sun) by jzbiciak (subscriber, #5246) [Link]

Nothing in particular. It was the only other format that seemed to fit all your selection criteria. I wasn't too surprised to find out that YAML 1.2 is a strict superset of JSON. JSON probably wins on simplicity.

Some additional things in YAML that might be helpful that I don't think are in JSON (but could be mistaken): Explicit typecasting, and the ability to have internal cross-references.

I'm not entirely certain internal cross-references would be useful, although maybe they're useful to refer back to a component of an earlier log message. (Flip side: There's value in redundancy in logs, especially when records go missing.) Explicit typecasting might be useful if there's ever a case where a given value looks like a number but really ought to be treated as a string.
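For concreteness, the two YAML features mentioned look roughly like this (a sketch only, not a proposed log schema; the field names are invented):

```yaml
event:
  # An anchor (&src) names a node; an alias (*src) is an internal
  # cross-reference back to it.
  source: &src {name: hostname2, ip: 2.2.2.2}
  reported_by: *src
  # An explicit tag casts the scalar: this stays a string, not a number.
  build: !!str 1234
```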

All that said, those are dubious benefits, and JSON probably wins on simplicity. I only mentioned YAML because it was the only other format I could think of offhand that survives the selection criteria fairly well.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:46 UTC (Sun) by dlang (subscriber, #313) [Link]

Given that filtering criteria may mean that prior log entries are not available, references to them can't count on working. And I don't see much likelihood of them being useful within a single log message (a full document yes, a single log message no).

My query about different serialization protocols was serious. I don't pretend that I know all of them and the advantages of each, so it is very possible that there is something out there that's better.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 2:41 UTC (Sun) by jzbiciak (subscriber, #5246) [Link]

The fact that very few standard formats survive the selection criteria illustrates the challenge, too. Good luck!

I mentioned YAML because I've found it very lightweight for the things I've used it for, and it is very human-friendly. I didn't realize that JSON is a proper subset of YAML until I looked up some comparisons. So, JSON wins similarly in the human-friendly department, and its simpler spec makes it easier to adopt.

Simple Declarative Language looks interesting. It appears to be a modest step up from JSON, adding explicit types to containers and the ability to add attributes to the type. Sure, you can capture that in a JSON serialization by adding explicit fields, but making it a first class aspect of the syntax has a certain economy to it. I hadn't heard of SDL before today. It looks interesting. Unfortunately, the list of languages that have SDL APIs seems out of line with my usual requirements of C and Perl.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 3:00 UTC (Sun) by jzbiciak (subscriber, #5246) [Link]

Expanding on my SDL comment... You could easily imagine capturing many repeated aspects of a log entry in the entry type and attributes, rather than fields within the entry record itself. eg:

Example record from my /var/log/messages:

Apr  8 14:23:44 elysium kernel: [9234662.980516] r8169 0000:03:00.0: eth0: link up

One possible way to split between attributes and keys within the container:

entry date=1333913564 host=elysium source=kernel level=info timestamp=9234662.980516 \
     { message="r8169 0000:03:00.0: eth0: link up" }

Or something...

Honestly, I go back and forth between the value of attributed types vs. just embedding the information as fields within the structure. What color do I want my bikeshed today?

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 6:33 UTC (Sun) by lindi (subscriber, #53135) [Link]

You'd also want to have a way to extract that "eth0" in a programmatic way.

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 21:18 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

You'd also want to extract the "0000:03:00.0", "r8169" (device driver name), and possibly "up".

And the date, host, and source values aren't from the kernel, so they wouldn't be in there.

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 21:29 UTC (Fri) by dlang (subscriber, #313) [Link]

the information may not be from the kernel, but by the time anything other than the log transport sees the data, it will need to be there (and arguably the timestamp should be put there by the kernel)

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 22:28 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

Aren't we talking about in what form the kernel should produce log messages?

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 22:30 UTC (Fri) by dlang (subscriber, #313) [Link]

I had wandered a bit from that, but yes, that's where we started.

And the kernel should put the timestamp on the messages it generates, you don't know how long it's going to be before some other process picks them up and could add a timestamp to them.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 3:03 UTC (Sun) by dlang (subscriber, #313) [Link]

I agree about SDL, but whatever is used needs to be supported in every language (or be something that's easy enough to create manually)

Keep in mind that JSON is just the least common denominator, the 'everything must support this' option. It is expected that most logging libraries, and the logging transports (i.e. syslog daemons) will support additional options. At the moment the other options expected for later are

BSON (more efficient transport with type information)

XML (because someone will want it, it's hard to do structured stuff and ignore XML ;-)

but others can be added as/if needed.

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 7:54 UTC (Fri) by man_ls (guest, #15091) [Link]

What about protocol buffers? Have they fallen out of grace already?

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 15:11 UTC (Fri) by dlang (subscriber, #313) [Link]

protocol buffers are good for some things, but they serialize into a binary format, which is not compatible with existing logging tools.

Also (as I understand them) protocol buffers require absolute agreement between the sender and the receiver on the data structures to be passed. This is hard to do for logging libraries that will be written in many different languages, multiple log transport tools, and the vast array of log analysis/storage tools.

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 17:23 UTC (Fri) by smurf (subscriber, #17840) [Link]

No, they're backwards compatible.
From the documentation:

>> You can add new fields to your message formats without
>> breaking backwards-compatibility; old binaries simply
>> ignore the new field when parsing

https://developers.google.com/protocol-buffers/docs/overview

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 17:31 UTC (Fri) by dlang (subscriber, #313) [Link]

Ok, but in any case, they won't work with the existing (text based) logging tools.

Yes, any change to the message being logged 'breaks' existing tools that depend on exact matches of known log messages, but as long as the new log format is still text based, all the existing tools can be tweaked (new regex rules) and handle the log messages.

If you switch to something other than text streams for your messages, you will require that all logging tools be re-written to handle your new format. Since this is unlikely to happen, there is a very large emphasis on being compatible with the existing tools.

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 21:10 UTC (Fri) by man_ls (guest, #15091) [Link]

Protocol buffers are a binary protocol, like BSON. If binary formats are being considered (as I deduced from your message), then protocol buffers should be on the list. (I myself think that BSON has a much brighter future, but I was just wondering.)

They should be paying attention to the lumberjack project

Posted Apr 20, 2012 21:37 UTC (Fri) by dlang (subscriber, #313) [Link]

nxlog already has a binary transport, but it can only be used from nxlog to nxlog. There is talk of having a binary transport, but that's a bit further out, as the discussion is still focusing on the right way to generate the data and what tags are going to be used.

CEE is supposed to be releasing a 1.0beta spec, and the initial fields planned are documented at https://fedorahosted.org/lumberjack/wiki/FieldList#Unifie...

for the API, the initial focus is on trying to get a good C API that can replace the syslog() call. RedHat has a largish project that they've been calling ELAPI (Enhanced Logging API https://fedorahosted.org/ELAPI/wiki/WikiPage/Architecture ) that they are now realizing largely overlaps with the capabilities of the modern syslog daemons, so they are going through the code they wrote for that and ripping out lots of it to keep only what's needed. There is some question of whether the result is still in the 'sledgehammer to swat a fly' category, and so you have libumberlog working from the other direction.

NOT YAML

Posted Apr 15, 2012 2:54 UTC (Sun) by dskoll (subscriber, #1630) [Link]

No, not YAML! YAML used to be a nice simple language, but the spec has become bloated and byzantine. It includes features that are completely inappropriate for structured logging like the ability to represent self-referential data structures.

I recently went through a large amount of pain converting our (commercial) product to use JSON for serialization rather than YAML just because YAML was becoming such a PITA.

Compare the one-page JSON spec with the monstrous YAML 1.2 spec.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 0:54 UTC (Sun) by aliguori (subscriber, #30636) [Link]

I really don't have a suggestion. The closest thing I can think of is S-Expressions or YAML.

JSON is woefully underspecified. There is no guidance in the spec about how implementations should handle numerics. It defers to ECMAScript for any specification ambiguity, and ECMAScript is pretty clear that all numbers are represented as IEEE double precision floating point numbers.

AFAICT, a conforming implementation must truncate any number > 2^52 or at least treat two numbers as equivalent if they truncate to the same number.
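The loss being described is easy to demonstrate. (The usual statement of the cutoff for IEEE doubles is 2^53, since the significand carries 53 bits of precision; the parent comment's 2^52 presumably refers to the 52 stored fraction bits.)

```python
# IEEE 754 doubles have a 53-bit significand, so not every integer
# above 2**53 can be represented exactly: adjacent values collapse
# to the same double.
exact_limit = 2 ** 53
lost = float(exact_limit) == float(exact_limit + 1)    # True: collapsed
still_exact = float(exact_limit - 1) != float(exact_limit)  # True: distinct
```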

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 1:36 UTC (Sun) by dlang (subscriber, #313) [Link]

what is the numeric range in YAML?

2^52 is a rather large number, how likely are you to need larger numbers in log messages? Are those cases distinct enough that you could just use the 'string' type for the number?

S-Expressions seem even more under specified than JSON from what I see from a quick google search

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 2:10 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

2^64-1 is quite likely to appear at least in some log entries.

They should be paying attention to the lumberjack project

Posted Apr 15, 2012 3:02 UTC (Sun) by dskoll (subscriber, #1630) [Link]

According to the JSON specification, the following is a perfectly valid number:

18446744073709551615 (AKA 2^64-1)

It may be that some (most?) JSON libraries can't handle such a large number correctly, but it's valid according to the JSON grammar.
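Whether a given parser copes is indeed implementation-specific; Python's standard json module, for one, parses integers with arbitrary precision, so this grammar-valid number survives a round trip intact:

```python
import json

# Python's json module maps JSON integers to Python ints, which are
# arbitrary precision, so no truncation occurs here.
n = json.loads("18446744073709551615")
round_tripped = json.dumps(n)
```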

They should be paying attention to the lumberjack project

Posted Apr 27, 2012 15:45 UTC (Fri) by mirabilos (subscriber, #84359) [Link]

Only floats. Integers are not limited.

They should be paying attention to the lumberjack project

Posted Apr 16, 2012 16:48 UTC (Mon) by jhoblitt (subscriber, #77733) [Link]

The use of JSON (and/or YAML and/or a serialization format that handles nested data types) is absolutely the right approach as "soon or later" someone will want to log data that doesn't fit well into $whatever_new_format_we_just_came_up_with.

Why wasn't the author of logstash included in the lumberjack discussions?

They should be paying attention to the lumberjack project

Posted Apr 16, 2012 18:20 UTC (Mon) by dlang (subscriber, #313) [Link]

> Why wasn't the author of logstash included in the lumberjack discussions?

I don't know. I found out about the Lumberjack discussions via announcements (I don't remember exactly where; I think I remember seeing an announcement here on lwn). It's an open list: lumberjack-developers@lists.fedorahosted.org

RedHat had some sort of meeting back in February where they had several of the syslog developers there, and the lumberjack project was announced right about the beginning of March.

the volume on the mailing list varies greatly; we had several days of 40-50 messages/day, and then there have been no messages for over a week now.

They should be paying attention to the lumberjack project

Posted Apr 26, 2012 12:27 UTC (Thu) by b0ti (guest, #81465) [Link]

> Why wasn't the author of logstash included in the lumberjack discussions?
He was.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds