The logger meets linux-kernel
The logger is a simple device designed to collect log data from applications and make that data available to developers and the central logging system. It was designed around a number of goals, most of which are driven by the untrusted nature of Android applications: the system wants to allow those applications to send messages to the log in a reliable and trustworthy manner. So applications should not be able to fake who they are, consume too much memory with log data, or spam the log at the expense of data from the kernel or other applications. Writing to the log is also meant to be a fast operation; lots of log data is generated, but little of it is read, so the write operation is the one that should be the fastest.
The driver implements a handful of logging "devices" for different streams; they are known as "main," "events," "radio," and "minor." These devices are hardwired into the code; there is no way to change the set of devices at run time. Each device is implemented as a simple ring buffer in kernel space; writing to the device adds a message to the buffer (annotated with timestamp and process ID information) while any number of readers can pull data out of the buffer. Each log has 256KB of memory dedicated to it; that size, too, is wired into the code. There is a small set of ioctl() operations provided to get the size of the log or the amount of unread logging data stored there, or to flush all data from the log.
The first question to be raised was: why is this facility needed at all? The kernel already has a mechanism for letting user space add entries to the log stream. In mainline kernels, that is done by writing to /dev/kmsg; that device takes the data given to it and passes it straight to printk(). There are a few reasons why Android would not want to use this interface: it does not allow for the separation of logging streams (and, thus, is limited to the root account), it does not add process ID information, and printk() is quite a bit more expensive than simply copying the log data into kernel memory. While adding a duplicate logging infrastructure is obviously problematic, a case could be made that something like logger should be merged and used as a replacement for /dev/kmsg.
That said, "something like logger" need not be the logger itself. If nothing else, it is hard to imagine the current code being merged without corrections for the most obvious problems: the hardwired log streams, for example. Logger is entirely unaware of namespaces, making it unsuited to environments where different process groups may want their own logging streams. It also does not really succeed in keeping processes from spamming the logs and overwriting data from other processes; each of the four streams is isolated from the others, but applications still end up logging to the same streams.
There is also the question of whether the user-space API is the right one. Adding more ioctl() calls to the kernel is never a popular thing to do. Some participants have suggested that an entirely user-space-based logging daemon would work just as well. But a user-space solution has some disadvantages: it would not be able to provide trusted process ID information, and it would likely be slower. Communicating log data to a user-space daemon and adding information like time stamps would involve more system calls than simply logging the data through the kernel. It would also be hard to merge kernel and user-space log data (and, in particular, to ensure that the ordering between events is correct) with a user-space scheme.
Neil Brown suggested that a filesystem interface could be used instead of logger's char device ABI. A logging stream could be established simply by creating a file within that filesystem; regular file permissions could then be used to control access to the streams. The ioctl() calls could be replaced with regular filesystem calls; flushing a log would be done with truncate(), for example. A filesystem-based logger would, clearly, involve moving to a different user-space interface, but it seems like it should be able to satisfy the use case that led to the creation of logger with a more Unix-like ABI.
Android developer Brian Swetland has indicated that there might be interest in trying out an alternative logging implementation, but it would have to meet their needs and it's not something that they would be interested in creating themselves:
If somebody wants to go write a complete compatible replacement that just drops in, we certainly could take a look at it and see how it works, but given that the benefits are not clear to us, we're not interested in going off and doing that work ourselves.
There is another point to consider as well: the longstanding interest in adding a more structured logging interface to the kernel seems to be getting stronger. Adding a new logging mechanism that doesn't address the concerns of translators, documentation writers, and people maintaining automated network management systems is going to be a hard sell. Kay Sievers, one of the developers behind "The Journal," has a plan of his own for logging; it involves adding more structured information to the existing message stream. He thinks there is no need to add a new, incompatible logging mechanism; he is also strongly opposed to the idea of separating message streams as logger does. That separation, he says, just creates problems when messages from multiple sources need to be merged back together in the correct order.
Brian actually seems receptive to some of these ideas, but has expressed concerns around rate-limiting and controlling access to log data. Lennart Poettering believes these needs can be met by the Journal with a little help from control groups. It seems possible that something could be done to make both groups happy - though the code to actually do so has not been posted.
As of this writing, that is where the discussion stands. On the surface, it looks like another piece of Android technology has gotten tangled up in unrelated requirements from various kernel developers, with the result that it will sink like its predecessors. But there is the potential for a number of possible solutions here. One, of course, would be to simply merge a moderately cleaned-up logger for Android's use and be done with it. Another would be to, as Neil suggested, focus on Android's user-space ABI for logging. If that ABI were formalized in a way that would allow multiple underlying implementations, the mainline could merge something other than logger and still be that much closer to being able to run Android's user space.
Perhaps the most intriguing outcome, though, would come if developers from
Android, the Journal, and others with an interest in structured logging
could get together and agree on what the next generation of logging should
look like. Their solution would be likely to start with a critical mass of
users
from the outset, improving its chances of being merged and adopted by
others. Given the number of structured logging efforts that have
floundered in the past, getting a widely-accepted solution in place would
be a major accomplishment.
Posted Jan 4, 2012 20:04 UTC (Wed)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
The fancy rate limit logic of the journal that takes cgroup info and disk space into account is actually in systemd git master since a while.
And I don't see why timestamping log streams in userspace would be "slow", as the kernel automatically attaches a timestamp to all datagrams anyway, which is accessible via SO_TIMESTAMP/SO_TIMESTAMPNS. Basically on /dev/log the kernel attaches those timestamps it's just a matter of querying them from userspace.
And I don't think Neil Browns file system idea would work, since you'd need to a sane way to communicate the current write and read pointer into those log buffers to userspace (i.e. since we use it as a ring buffer, right?), but that would be necessarily racy, unless the files on those file systems get very very magic seek semantics at which point they'd probably be a lot more like a char device, than a normal file.
Posted Jan 4, 2012 23:21 UTC (Wed)
by neilbrown (subscriber, #359)
[Link]
I imagine that when you open and write, the message simply gets appended - i.e. O_APPEND is forced.
The important difference to a char device is that you can O_CREATE new ones.
Posted Jan 4, 2012 20:44 UTC (Wed)
by dlang (guest, #313)
[Link] (2 responses)
Posted Jan 18, 2012 5:39 UTC (Wed)
by kevinm (guest, #69913)
[Link] (1 responses)
Posted Jan 18, 2012 19:53 UTC (Wed)
by dlang (guest, #313)
[Link]
Posted Jan 4, 2012 20:51 UTC (Wed)
by dlang (guest, #313)
[Link] (25 responses)
actually, they first need to explain why the existing structured logging that syslog can do is insufficient.
before someone goes off to say that existing logging needs to be scrapped, they need to be familiar with the current RFCs and capabilities of what they are saying needs to be scrapped.
Posted Jan 4, 2012 21:05 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jan 4, 2012 21:11 UTC (Wed)
by dlang (guest, #313)
[Link]
any new proposal, by definition, is going to have to start behind this as they will first have to define a standard and write software to handle it before it does any good to worry about software generating structured info.
I actually have seen software that generates structured syslog messages. I agree it's not common, but then again, I don't believe that any number of 'structured logging proposals' is ever going to solve the problem of the programmer being lazy and not writing good structured logs (they can write bad structured logs easily, it happens all the time on windows in the eventlog)
Posted Jan 4, 2012 23:31 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (22 responses)
So passing messages into the kernel needs to be fast and pulling them out simply should not happen in normal usage - they just get discarded.
Posted Jan 5, 2012 0:08 UTC (Thu)
by cmccabe (guest, #60281)
[Link] (5 responses)
In order to make it truly fast, it had to be lockless (at least most of the time), thread-local (no cacheline ping-pong) and entirely in userspace.
I don't see how anything that involves a syscall can ever truly be "free." Even a system which just involves an mmap'ed ring buffer protected by a single mutex will tend to sharply limit scalability on the multi-core hardware of the future (and present).
Maybe Journald can accumulate a bunch of messages and send them to the kernel all at once. However, that begs the question: what happens to unflushed messages on a crash? They could use a signal handler to take care of this problem, but a lot of programs handle signals themselves.
Posted Jan 5, 2012 0:59 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (4 responses)
Android can implement it with a trivial library that writes out to their in-kernel ring buffer. You can implement it using your awesome fast_log, and both can still run all the wonderful Android applications, each on the kernel of their choice.
Then when all the app developers start saying to Android "The log system from that McCabe guy is much better than yours - could you start providing that one instead", then we can all forget about the android-logger in the kernel and move on.
Posted Jan 5, 2012 1:27 UTC (Thu)
by dlang (guest, #313)
[Link]
Posted Jan 5, 2012 1:42 UTC (Thu)
by cmccabe (guest, #60281)
[Link] (2 responses)
It seems like the only reason they wanted this in the kernel in the first place was speed. Sending messages to a daemon over a socket or pipe was considered too slow. Unfortunately, they ended up adding a lot of policy like the log stream names "radio", "main", "events", etc. This was also the rationale for putting Binder and friends in kernel space-- to get that nice zero-copy performance along with the security they wanted. Maybe what we really need is a secure circular buffer IPC primitive rather than all these special-purpose widgets.
It's interesting that the Android guys went with unstructured logging. They control the entire userspace, so presumably it would be a lot easier for them to force developers to write structured log messages than it will be for Red Hat to get people to use structured Journal logs. There is also something bizarre about the fact that 1970s-era UNIX systems were fine using unstructured logging, and small cell phone systems are fine using it, but people claim that it is too inefficient to run on servers. However, maybe there's something I'm overlooking and the current focus on structured journald logging will be useful. Time will tell, I guess.
Posted Jan 9, 2012 15:07 UTC (Mon)
by sorpigal (guest, #36106)
[Link]
I chalk it up to simple Windows-envy.
Posted Jan 14, 2012 3:40 UTC (Sat)
by swetland (guest, #63414)
[Link]
Unstructured (or really a bit semistructured -- timestamps, pid, loglevel are structured, message content is not) logs are partially to avoid having to have knowledge of the precise software running or maintaining an ever-expanding definition of log message formats to interpret the logs from an arbitrary device that's encountered. Very valuable in a large ecosystem with a huge variety of devices, software versions, etc.
The name/sizes stuff should just move to platformdata / module params / or the like. An oversight that's easily enough corrected.
Posted Jan 5, 2012 0:16 UTC (Thu)
by dlang (guest, #313)
[Link] (15 responses)
however, I'll bet that you are really talking about much lower data rates, I'd be surprised if you needed much over 100 logs/sec and amazed if you need over 1000 logs/sec (remember, we're not talking about logs from the kernel here, we're talking about logs from userspace, the type of thing that is sent to syslog on a traditional linux distro, or is written to a app-specific log file)
the fact that the kernel is involved at all is simply a desire from the android developers to avoid having a userspace daemon.
for the trivial capabilities that the logger currently has (listen to a filesystem name, accept anything written to it into a circular buffer, and regurgitate the buffer if requested), it seems to me that a pretty trivial userspace daemon should be able to do this. I still haven't seen any explanation (other than the "we didn't want another daemon running" one) for why the kernel should be involved at all.
Posted Jan 5, 2012 0:52 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
I don't think we are just talking about the stuff that a normal distro sends to syslog. Brian observed that app developers complain that other apps send too many message to the log so "their" much more important messages fall off the end of the log. This suggests a high rate of log-spam.
The kernel is *always* involved. Even if you use shared memory, who do you think it is that maps the memory into each process on the relevant page faults?
Yes: by pre-faulting large shared memory images and using futexs you could interleave logs from multiple processes with little interaction with the kernel (if each process ran on its own processor) but that doesn't fit the android security model. Each process has it's own uid and doesn't trust other processes (or at least the log system doesn't trust any of it's clients beyond the fixed API). So shared-memory logging doesn't really meet the need.
Posted Jan 5, 2012 1:26 UTC (Thu)
by dlang (guest, #313)
[Link]
given that they have a rather tiny fixed-size buffer (256K was mentioned), having enough logs generated to fall off the end does not really equate to that large a data rate.
But even if high performance is needed (and note that I am still not agreeing that it is), that doesn't, by itself, justify implementing something in the kernel. That's why we no longer have the tux webserver in the kernel, and there have been many other things proposed that "required" kernel level implementations for performance reasons, that ended up not going in to the kernel.
the fact that we haven't been able to get anyone to talk about data rates other than the very generic "overrunning a 256K buffer in seconds" doesn't do much to justify this going into the kernel.
I'll bet just about any linux system generates > 256k of syslog messages in 'seconds' as it boots up.
In fact, I have a server that I just rebooted, in it's first second, syslog shows >100K of message data in syslog (and this is a pretty stripped down kernel and boot, I'll bet standard distro kernels end up logging quite a bit more)
Posted Jan 5, 2012 16:41 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (12 responses)
Posted Jan 5, 2012 17:39 UTC (Thu)
by dlang (guest, #313)
[Link] (11 responses)
and in any case, where are the numbers?
how much traffic are we actually talking about here, what how much processor does it take to do this in kernel space, how much more in userspace (and even if it's 100x more, is it enough to actually matter?)
Posted Jan 5, 2012 18:03 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (10 responses)
So the sensitivity to CPU consumption is way higher than I would have guessed a couple of years ago. In addition, there are other well-known issues that arise when relying on a separate daemon in addition to efficiency. I have run into them many times over the years (make that decades), and I would guess that you have as well.
It is instructive to turn the question around. First, start with the assumption that the Android guys actually are competent in their area of expertise. Then ask yourself why a userspace daemon might be problematic for them.
Posted Jan 5, 2012 18:58 UTC (Thu)
by dlang (guest, #313)
[Link] (9 responses)
I really believe that this is premature optimization and micro-benchmark optimization.
to restate an example I've posted before, several years ago at $work, the engineers ran a benchmark that showed a 60x performance improvement (from 30 seconds to .5 seconds) for a read_config() function by creating a shared memory segment and populating that rather than reading from disk.
the problem is that this benchmark was for 100,000 reads of the config file, and the real-world usage showed that this function was executed 50,000 times in an hour across >50 machines. This 6-month engineering project saved a total of 15 cpu seconds in a peak hour across 200 cores.
sometimes "it's faster", even for high multipliers of how much faster just don't matter because the function just isn't called frequently enough.
I have some firewalls that run 'inefficient' fork-for-every-connection proxies (horribly inefficient, right), but on a peak day these systems (on low-end hardware) still don't hit a loadaverage of 0.2, so the inefficiency really doesn't matter.
application logging just isn't that heavy, even on a 'slow' cell phone processor.
Posted Jan 5, 2012 20:48 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (7 responses)
Fair enough.
Just don't be surprised when others return the favor by being unwilling to assume that you are competent within your area of expertise.
Posted Jan 5, 2012 21:26 UTC (Thu)
by dlang (guest, #313)
[Link] (6 responses)
In general, this is exactly how it works in kernel development.
Posted Jan 5, 2012 22:10 UTC (Thu)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (5 responses)
How did you determine that the Android guys' avoidance of user-mode daemons was strictly due to performance issues?
Posted Jan 5, 2012 22:51 UTC (Thu)
by dlang (guest, #313)
[Link] (4 responses)
Posted Jan 5, 2012 23:13 UTC (Thu)
by dlang (guest, #313)
[Link] (3 responses)
Posted Jan 6, 2012 16:35 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
One caution: It is difficult to get requirements from the Android guys. Not because they are uncooperative, but because they are only conscious of some of the requirements. The rest are so second-nature to them that they don't think of them. This means that working on a suitable replacement is an iterative process, where you learn the requirements by repeatedly implementing your best guess at each point in time.
John Stultz has gone through much of this process with his alarm-timers work, and he does appear to be getting close to a solution that is useful to both the Android guys and to non-smartphone software.
I have been on both sides of this. Back in the 1990s, those of us at Sequent habitually wrote parallel programs, but could not describe how to do it to others. We did try, but the results were not so good. Instead, we used apprenticeship-style training: We put the new guy in a cube surrounded by people who knew parallel programming, and after a few months, the new guy would have no problem writing competent parallel code.
I was on the other side while gathering wakelock requirements in 2010 and 2011. The “Avoid Thundering Herds of Timers” requirement is an example of an important requirement that was not at all obvious to me, but so obvious to the Android guys that they didn't think to state it.
Sort of like you would probably not see the need to put “the acceleration of gravity at the site is 9.8 meters per second squared” when creating specs for a builder. ;–)
Posted Jan 12, 2012 23:20 UTC (Thu)
by Kluge (subscriber, #2881)
[Link] (1 responses)
Does this mean that Android developers are from Mars, kernel hackers from Venus? ;-)
Posted Jan 13, 2012 0:54 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Jan 18, 2012 6:02 UTC (Wed)
by kevinm (guest, #69913)
[Link]
Rather, it's the fact that every extra cycle wasted is an extra cycle that the CPU must remain in a higher power state, and thus represents an incremental hit to the battery life. Recall in particular that context-switching from one userspace process to another requires a cache-flush on older ARM processors.
Posted Jan 4, 2012 23:01 UTC (Wed)
by dw (subscriber, #12017)
[Link] (7 responses)
Isn't this exactly what unix(7) SO_PASSCRED is for?
Posted Jan 5, 2012 2:08 UTC (Thu)
by mezcalero (subscriber, #45103)
[Link] (6 responses)
Posted Jan 6, 2012 10:43 UTC (Fri)
by geuder (subscriber, #62854)
[Link] (5 responses)
Overall I think I can see both points in this discussion.
Why doing some limited hard-coded stuff in the kernel if much more fexible functionality can be done in a user space daemon?
I have not worked with Android, but having seen power and other resource consumption of Maemo's daemons (udev, hal, DBus, tracker, what else...) I can understand people saying that Android took a very strict standpoint against user space daemons. Actually Maemo has syslog, but it's not installed in factory devices because it is too resource hungry.
Maybe the truth is just that one size does not fit all.
In a server the most advanced poetterish journal daemon will not consume any measurable amount of resources and the server admin might benefit from all the flexibility and functionality.
On a phone you don't even want to waste the context switches and the code size and 4 hard coded logs are just good enough.
Of course the kernel is configurable. So wouldn't that just mean that the most limited version is in the kernel (didn't Linus say that if it is used in millions of devices it deserves to be merged). But everybody from desktop upwards just configures it out, because they use a full-fledged user space daemon.
Posted Jan 8, 2012 3:08 UTC (Sun)
by dmag (guest, #17775)
[Link] (4 responses)
On the other hand, I think the filesystem idea has merit. I wonder if tmpfs has disk quotas? Each app could create it's own file in /dev/shm.
Posted Jan 9, 2012 2:16 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
System calls are cheap. The cost of jumping to a userspace daemon is trivial. If pumping messages over a unix socket is good enough for X11, it's good enough for logging, which needs a tiny fraction of X's throughput.
No optimization without quantification.
Posted Jan 9, 2012 2:39 UTC (Mon)
by neilbrown (subscriber, #359)
[Link]
No credibility without code. And measurements. :-)
Posted Jan 9, 2012 23:26 UTC (Mon)
by geuder (subscriber, #62854)
[Link] (1 responses)
If you want to make a $ 30-50 phone using a 500 mAh battery the definition of trivial changes.
How many instructions does a Linux context switch take these days? I know it's architecture specific and depends on dozens of environment factors, so you are wlecome to parametrize your answer :)
> If pumping messages over a unix socket is good enough for X11
Actually that's not good enough on a small phone.
Even if logging might not be the worst problem, from my experience as a system architect I know that sometimes you have to have simple & strict rules in big organizations. If the rule is that no user space daemons, then you must stick to it even if a certain daemon would not cause real harm. If there is a dozen of teams trying to get their daemon in, it's just easier to say no to everybody than yes to to 1 team and no to 11 others.
(I don't know whether this was the reason for Android to do what they did.)
Posted Jan 9, 2012 23:55 UTC (Mon)
by dlang (guest, #313)
[Link]
if something is 100x more expensive to run, but still only averages one cpu second per hour, it really doesn't matter.
device wakeups are similar, if you have hundreds per second it's a major problem, but once you get to around one per second the benefits of extending the sleep time further is of much less value (the average amount of time/power spent awake starts to become a fraction of the power spent 'asleep but able to wake up' and as a result, further reducing that fraction makes little practical difference.
Posted Jan 6, 2012 14:45 UTC (Fri)
by dps (guest, #5725)
[Link]
IMHO assumptions about the niceness and bug freeness of user space in the kernel are an unambiguously bad idea.
Posted Jan 8, 2012 1:24 UTC (Sun)
by mlankhorst (subscriber, #52260)
[Link] (1 responses)
Posted Jan 10, 2012 22:37 UTC (Tue)
by mezcalero (subscriber, #45103)
[Link]
The logger meets linux-kernel
The logger meets linux-kernel
When you open and read you just read from the start of the ring. When you get to the end you get EOF, though you can use 'poll' to find out when more data is available (in the same way that various /sys and /proc files support 'poll').
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
It seems that the sort of logging that this is used for is debugging traces. i.e. hundreds of messages that you will almost never want to look at - so it needs to almost be free.
You only look when a problem occurs and then you want to see everything that happened in the last few minutes so that you can piece the history together.
The logger meets linux-kernel
See https://github.com/cmccabe/redfish/blob/master/util/fast_...
Also the .h, etc.
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
What matters
What matters
What matters
What matters
What matters
What matters
What matters
It would be really cool if you did get a chance to try this out!
What matters
What matters
Mars and Venus
What matters
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
- logging size could be limited by quota
- file owner is proof of ownership
- to log, the application calls gettime() and writes to memory (I assume we don't care if the app lies about time.)
- a daemon periodically reads (and zeros) the logs
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
I think even the kernel community has some rules with no discussion and no exceptions...
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel
The logger meets linux-kernel