Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.27-rc2, released on August 5. There are a lot of changes here, many of which are fixes or include file reorganizations (architecture-specific include files are moving from include/asm-xxx to arch/xxx/include), but there's also a driver for the SGI "GRU" system management device, support for the MIPS architecture in the common kgdb debugger, a new subsystem for the management of voltage and current regulators, some core memory management and VFS locking changes, a driver for the SPI master controller on Orion chips, and the removal of the long-deprecated cli() and sti() functions. See the short-form changelog for details, or the full changelog for lots of details.
As of this writing, no changesets have been merged into the mainline repository since the 2.6.27-rc2 release.
The current -mm tree is 2.6.27-rc1-mm1. Recent changes to -mm consist mainly of a large reduction in size as hundreds of patches flow into the mainline.
The current stable 2.6 kernel is 2.6.26.2, released on August 6. It contains a large set of fixes for a wide variety of problems. Previously, 2.6.26.1 (also a large set of fixes) was released on August 1.
For 2.6.25 users: 2.6.25.14 (August 1) and 2.6.25.15 (August 6) continue the series of fixes for that kernel release.
Kernel development news
Quotes of the week
And arguably, if the goal is security theater, much like the security lines in airports, perhaps it doesn't matter. If there are silly CIO's that are willing to pay for such a thing, regardless of whether or not it is actually *necessary* to maintain security, one school of capitalism would say it doesn't matter if it actually provides any functional value or not.
On the other hand, it seems pretty clear there are plenty of LKML developers who aren't buying it. :-)
Notes from the Ottawa Linux Power Management Summit
The Ottawa Linux power management summit was held on July 22, 2008 - immediately prior to this year's Linux Symposium. For those who could not be there, Len Brown has posted a set of notes from the meeting. The discussion covered a wide variety of topics, including the OMAP3 processor, snapshot boot, USB power management, server power management, and more.
File descriptor handling changes in 2.6.27
One of the changes merged for 2.6.27 is a set of system call extensions designed to get around some longstanding security issues with POSIX file descriptors; LWN covered these extensions back in May. Now the author of that work (Ulrich Drepper) has posted a description of these changes, why they are important, and how they will be used in the C library. Worth a look, especially for developers working on threaded applications.
Can user-space bugs be kernel regressions?
Adding new functionality to the kernel while maintaining the interfaces for user space is the standard kernel development practice. Sometimes, though, that can tickle bugs in user-space programs in unpleasant ways. When that happens, it is clearly a regression—something that worked before no longer does—but is it a kernel regression? In the end, it doesn't matter, it seems, because the kernel needs to change to keep the user-space program working, even at the expense of "ugliness".
Clearly for purely internal kernel functionality, there is no mandate for compatibility across kernel versions. But, when the user-space interface is involved, things get a bit trickier. A change that alters the way a documented interface works is essentially never done; user-space interfaces are maintained forever. When new functionality properly uses a documented interface, but breaks a user-space program, it gets murkier.
That situation came up recently when Andrew Morton noticed that the linux-next tree broke the X server on his laptop. The problem was quickly traced to a bug in the Synaptics touchpad driver for X. An array that was being passed to an ioctl() was sized based on the number of bits, rather than bytes, it should contain. Thus the maximum buffer length passed was off by a factor of eight.
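A minimal sketch of that sizing mistake, written as stand-alone C rather than taken from the synaptics driver. The bitmap is declared with enough unsigned long words to hold a given number of bits, but the length handed to the kernel is the bit count itself. The names SKETCH_KEY_BITS, bitmap_bytes(), correct_len(), and buggy_len() are illustrative assumptions; the value 511 is the figure quoted in the discussion.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption for illustration: the number of key bits being queried. */
#define SKETCH_KEY_BITS 511

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes occupied by an unsigned long array big enough to hold nbits bits. */
static size_t bitmap_bytes(size_t nbits)
{
    size_t words;

    assert(nbits > 0);
    words = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
    return words * sizeof(unsigned long);
}

/* The correct length argument: the size of the buffer in bytes. */
static size_t correct_len(size_t nbits)
{
    return bitmap_bytes(nbits);
}

/* The buggy length argument: a bit count passed where bytes are expected. */
static size_t buggy_len(size_t nbits)
{
    return nbits;
}
```

With these definitions, correct_len(SKETCH_KEY_BITS) is 64 while buggy_len(SKETCH_KEY_BITS) is 511, so the caller claims room for roughly eight times as much data as the buffer can hold.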
As a solution, Dmitry Torokhov offered up a patch, not to kernel code, but to the synaptics X driver. That didn't sit particularly well with Morton and others, eventually leading to a pronouncement from Linus Torvalds:
Torokhov clearly felt that it was the driver, not his changes, that was at fault, which is entirely understandable because it's true. That doesn't alter the fact that new kernels would break existing, working configurations on laptops everywhere. The kernel change just fully used an existing, documented interface, as Torokhov explained:
Declaring an array of 64 bytes, but telling the kernel it can store up to 511 bytes into it is obviously a bug. But, as Morton points out:
What _does_ matter is that people's stuff will break. Apparently lots of people's. That's a problem. A _practical_ problem. Can we pleeeeeeze be practical and find some way of preventing it?
Since the code was in linux-next, it was targeted at the 2.6.28 kernel. In Torokhov's thinking, this would allow something approaching six months for distributions to update the synaptics driver. But that is a fundamental misunderstanding of how and when kernels are upgraded—it is not only by way of distributions. Introducing a change like this would result in many messages to linux-kernel from unhappy folks with broken X servers.
Kernel hackers purposely build and run kernels on a wide variety of hardware and distributions. That includes older distributions that no longer get updates so they would be stuck with the buggy driver, thus non-working X server, essentially forever. Obviously, they could rebuild the synaptics driver—kernel hackers have been known to compile things other than kernels—but that isn't the point.
There are major benefits to also having lots of regular users update their kernels frequently. Trying to ensure that there won't be any unnecessary barriers to doing that can only help. Torvalds describes it this way:
Torvalds and Torokhov worked out a fix that preserved the old behavior for a specific passed-in buffer length, while allowing the new events to be delivered to any other users of the ioctl() that passed in the proper length. Torvalds commented:
"Yeah, it's not pretty, but pragmatism before beauty."
It is, to some extent, a gray area. Regressions are bad for any number of reasons, but maintaining hackarounds for buggy user-space programs has its own set of problems. The hope is that eventually the need for the workaround goes away so that it can be removed. It would seem difficult to determine when the last user of the old synaptics driver finally upgrades, so this code could be with us for a long time. Given the alternative, the price seems worth it.
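The shape of that workaround can be sketched in a few lines of C. This is an illustration of the idea only, not the code that was merged: the function name copy_len_sketch() and both constants are assumptions chosen to match the 64-byte and 511 figures mentioned in the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values for illustration only. */
#define OLD_BITS_SKETCH  511  /* bit count that buggy callers pass as a length */
#define OLD_BYTES_SKETCH 64   /* bytes that older kernels actually copied */

/*
 * Decide how many bytes to copy to user space.
 *   user_len: buffer length claimed by the caller
 *   avail:    bytes of bitmap data the kernel now has to offer
 */
static size_t copy_len_sketch(size_t user_len, size_t avail)
{
    assert(avail >= OLD_BYTES_SKETCH);

    /* A length equal to the old bit count marks a known-buggy caller. */
    if (user_len == OLD_BITS_SKETCH)
        return OLD_BYTES_SKETCH;

    /* Everyone else gets what they asked for, up to what exists. */
    return user_len < avail ? user_len : avail;
}
```

The special case costs one comparison per call, but it must stay in place for as long as any buggy caller might still be in use.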
Though Torvalds was absolute in condemning any known regression, even for programs that are clearly misusing an interface, there must be a line somewhere. If some obscure program, with few users, gets broken by the kernel doing something documented and reasonable, it is hard to imagine that this kind of workaround will be required. This particular problem was relatively easy to decide; the next might not be.
A kernel message catalog
Kernel developers will often use printk() to output a message when something goes wrong. Such messages tend to be helpful to kernel developers; if nothing else, they can be used to find the place in the source where the message is emitted, and that, in turn, is most useful for somebody trying to figure out what the message is really saying. So, if your kernel tells you, for example, "lguest is afraid of being a guest," a quick dig through the source turns up a comment reading "Lguest can't run under Xen, VMI or itself. It does Tricky Stuff." Problem solved - or, at least, understood.
But, for the bulk of Linux users and administrators, the act of printk() interpretation by recourse to the kernel source is, itself, Tricky Stuff. If the kernel cannot tell them directly what the problem is, they would much rather have a more straightforward means of translating messages into some sort of useful English.
Or maybe not: for many Linux users, English may not be much more helpful than straight kernel-speak. It would be really nice to translate those messages into some sort of useful French, or Chinese, etc. What it comes down to, in the end, is that printk() alone will never be able to provide sufficient information to users in a way which can be understood and used to solve problems.
Just over one year ago, LWN looked at some proposals for adding structure to kernel messages. After that, the discussion went quiet, to the point that it seemed like not much was happening in the messaging area. But one should not forget that we are dealing with companies like IBM which have been creating massive binders full of kernel message documentation for several decades. They're not going to give up so easily. So the posting (by Martin Schwidefsky) of a new kernel messaging proposal is not an entirely surprising event.
In the latest scheme, each source file which generates structured messages defines a macro KMSG_COMPONENT as a string naming the specific subsection. This name will often match the name of the module which is created from that code, but that is not necessarily the case. The name, once chosen, is supposed to remain fixed forevermore; it becomes, in essence, part of the user-space interface and should always match the documentation.
Then, each message is assigned an integer identification number. The combination of the component name and the message number should be unique throughout the kernel; it is used by various tools to associate a more detailed explanation of whatever the message is intended to communicate. The message number is used with one of a number of new printk()-like functions:
kmsg_alert(id, format, args...);
kmsg_err(id, format, args...);
kmsg_warn(id, format, args...);
kmsg_info(id, format, args...);
kmsg_notice(id, format, args...);
kmsg_dev_alert(id, dev, format, args...);
/* ... */
The "_dev" versions take an additional struct device argument (like dev_printk()) and encode the device name in the resulting message. That message (for all variants) will include the component name and the message number in any output. So, for example, the S/390 "xpram" driver includes the following:
#define KMSG_COMPONENT "xpram"
/* ... */
if (devs <= 0 || devs > XPRAM_MAX_DEVS) {
	kmsg_err(1, "%d is not a valid number of XPRAM devices\n", devs);
Should this particular error check trigger, the resulting message will look like this:
xpram.1: 42 is not a valid number of XPRAM devices
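How the tag ends up in front of the text can be sketched with a user-space stand-in. This is not the implementation from the patch: kmsg_format_sketch() and kmsg_err_sketch() are hypothetical names, and the output goes into a caller-supplied buffer rather than the kernel log. Only KMSG_COMPONENT and the "component.id: message" layout come from the proposal as described above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define KMSG_COMPONENT "xpram"

/* User-space stand-in: writes "<component>.<id>: <message>" into buf. */
static int kmsg_format_sketch(char *buf, size_t size, const char *component,
                              int id, const char *fmt, ...)
{
    va_list ap;
    size_t used;
    int n;

    assert(buf != NULL && size > 0);

    /* The tag comes first, built from the component name and number. */
    n = snprintf(buf, size, "%s.%d: ", component, id);
    if (n < 0)
        return n;
    used = strlen(buf);  /* bytes actually stored, even if truncated */

    /* The ordinary printk()-style message follows the tag. */
    va_start(ap, fmt);
    n = vsnprintf(buf + used, size - used, fmt, ap);
    va_end(ap);
    if (n < 0)
        return n;

    return (int)strlen(buf);
}

/* Picks up KMSG_COMPONENT from the including source file. */
#define kmsg_err_sketch(buf, size, id, ...) \
    kmsg_format_sketch(buf, size, KMSG_COMPONENT, id, __VA_ARGS__)
```

Calling kmsg_err_sketch(buf, sizeof(buf), 1, "%d is not a valid number of XPRAM devices\n", 42) fills buf with the tagged line shown above.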
Thus far, our user is probably not feeling much better informed than before. But there is additional information which is made available and associated with that message tag. In this particular case, it looks like this:
/*?
 * Tag: xpram.1
 * Text: "%d is not a valid number of XPRAM devices"
 * Severity: Error
 * Parameter:
 *   @1: number of partitions
 * Description:
 * The number of XPRAM partitions specified for the 'devs' module parameter
 * or with the 'xpram.parts' kernel parameter must be an integer in the
 * range 1 to 32. The XPRAM device driver created a maximum of 32 partitions
 * that are probably not configured as intended.
 * User action:
 * If the XPRAM device driver has been compiled as a separate module,
 * unload the module and load it again with a correct value for the
 * 'devs' module parameter. If the XPRAM device driver has been compiled
 * into the kernel, correct the 'xpram.parts' parameter in the kernel
 * parameter line and restart Linux.
 */
Here, we have a more verbose description of the message. Even more helpfully (one hopes), there is a discussion of what can be done to make this message go away. This information can be provided within the source or in a separate documentation file; it can also, presumably, be nicely formatted and distributed to paying customers as a binder for the system administrator's bookshelf. It can be translated into other languages for Linux users worldwide (and beyond: one could have a lot of fun with the Klingon translation for this kind of material).
The patch includes a script (written in Perl with undocumented messages, of course) which (when invoked with make D=1) will go through the source and make sure that every kernel message has an associated description block; it can also format the descriptions into man pages if desired. There are checks for missing descriptions or overloaded message ID numbers; the script does not, at the moment, check for a change in the message text.
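The duplicate-ID check is simple enough to sketch. The real script is written in Perl and works on the kernel source; the C below only illustrates the uniqueness rule for (component, number) pairs, and the struct and function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One message tag: a component name plus a message number. */
struct msg_tag_sketch {
    const char *component;
    int id;
};

/*
 * Return the index of the first tag that repeats an earlier
 * (component, id) pair, or -1 if every pair is unique.
 */
static int first_duplicate_sketch(const struct msg_tag_sketch *tags, size_t n)
{
    size_t i, j;

    assert(tags != NULL || n == 0);

    for (i = 1; i < n; i++)
        for (j = 0; j < i; j++)
            if (tags[i].id == tags[j].id &&
                strcmp(tags[i].component, tags[j].component) == 0)
                return (int)i;
    return -1;
}
```

The same number may appear under two different components without conflict; only a repeated pair is reported.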
Martin's first posting made this work specific to the S/390 architecture; following a suggestion from Andrew Morton, he made it generic in later versions. The cost of this work is zero for those who do not use it, so there is a reasonable chance that it will find its way into the mainline eventually. Before the message catalog system can be truly useful, though, developers will have to go through and document a substantial portion of the messages created by the kernel - and keep that documentation current as the kernel evolves.
The TALPA molehill
The TALPA malware scanning API was covered here in December, 2007. Several months later, TALPA is back - in the form of a patch set posted by a Red Hat employee. The resulting discussion has certainly not been what the TALPA developers would have hoped for; it is, instead, a good example of how a potentially useful idea can be set back by poor execution and presentation to the kernel community.
The idea behind TALPA is simple: various companies in the virus-scanning business would like a hook into the kernel which allows them to check for malware and prevent its spread. So the patch adds a hook into the VFS code which intercepts every file open operation. A series of filters can be attached to this intercept, with the most important one being a mechanism which makes the file being opened available to a user-space process as a read-only file descriptor. That process can scan the file and tell the kernel whether the open operation should be allowed to proceed or not. In this way, the scanning process can prevent any sort of access to files which are deemed to contain bits with evil intentions.
There are a few other details, of course. A caching mechanism prevents rescanning of unchanged files, increasing performance considerably. There is also a hook on close() calls which can trigger the rescanning of a file. Processes can exempt themselves from scanning if it might get in their way; scanning can also be turned off for specific files, such as those used for relational database storage. But the patch set is relatively small, as it really does not have that much to do.
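The decision made at open time, as described above, might be modeled roughly as follows. Everything here is hypothetical: the structure, the field names, and the callback standing in for the user-space scanning process are assumptions for illustration and do not reflect the actual TALPA data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct file_sketch {
    unsigned long version;          /* bumped whenever the contents change */
    unsigned long scanned_version;  /* version seen by the last scan */
    bool has_verdict;               /* true once a scan result is cached */
    bool cached_clean;              /* the cached scan result */
    bool excluded;                  /* scanning turned off for this file */
};

/* Stand-in for the user-space scanner; returns true if the file is clean. */
typedef bool (*scanner_fn)(const struct file_sketch *f);

static int scan_calls_sketch;       /* counts scans, to show the cache working */

static bool scan_clean_sketch(const struct file_sketch *f)
{
    (void)f;
    scan_calls_sketch++;
    return true;
}

static bool scan_infected_sketch(const struct file_sketch *f)
{
    (void)f;
    scan_calls_sketch++;
    return false;
}

/* Should this open() be allowed to proceed? */
static bool open_allowed_sketch(struct file_sketch *f, bool process_exempt,
                                scanner_fn scan)
{
    assert(f != NULL && scan != NULL);

    /* Exempt processes and excluded files are never scanned. */
    if (process_exempt || f->excluded)
        return true;

    /* An unchanged file reuses the cached verdict. */
    if (f->has_verdict && f->scanned_version == f->version)
        return f->cached_clean;

    /* Otherwise ask the scanner and remember the answer. */
    f->cached_clean = scan(f);
    f->scanned_version = f->version;
    f->has_verdict = true;
    return f->cached_clean;
}
```

In this model a rescan happens only when the version counter has moved, which corresponds to the close()-time hook marking a file as changed.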
This capability could well prove to be useful. Even if one is not concerned about malware infections on Linux systems, a lot of files destined for more vulnerable platforms can pass through Linux servers. There is also the potential for the detection of attempted exploits of the Linux host. Normally, in the Linux world, the way we respond to knowledge of a specific vulnerability is to patch the problem rather than scan for exploits, but there may be systems which cannot be restarted on short notice, and which could benefit from an updated scanning database while running code with known vulnerabilities. Also, as Alan Cox pointed out, this feature could be useful for entirely different objectives, such as efficient indexing of files as they change.
What might be best of all, though, is that this hook could replace a number of rather less pleasant things being done by anti-malware vendors now. Some of these products use binary-only modules, plant hooks into the system call table, and generally behave in unwelcome ways. Moving all of that to a user-space process behind a well-defined API could be beneficial for everybody involved.
The patches have gotten a generally hostile reception on the kernel mailing lists, though. Some developers are uninspired about the ultimate objective:
That's an objection which can be worked around; the kernel developers do not normally want to determine which applications will or will not be supported by the system as a whole.
Another objection, though, might be harder: this hook is said not to be the best solution to the problem. Instead of putting a hook deep within the VFS layer, the anti-malware people could simply hook into the C library (perhaps with LD_PRELOAD), put the malware scanning directly into the processes (mail clients or web servers, say) which are passing files through the system, or embed the scanning into a stackable filesystem implemented with FUSE (or a similar mechanism). That has led to counterarguments that scanning implemented in this manner could be evaded by a hostile application - by performing system calls directly, for example, instead of going through the C library. Certain kinds of attacks, it is said, could get around a purely user-space solution.
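The direct-system-call objection can be made concrete with a short, Linux-specific sketch. An LD_PRELOAD library can only interpose on calls that go through the C library's open() symbol; a program that invokes the system call itself never reaches that symbol. The function name below is an assumption for this example; syscall(), SYS_openat, and AT_FDCWD are the standard Linux interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a file read-only by invoking the openat system call directly.
 * The C library's open() wrapper is not called, so a scanner hooked
 * in through LD_PRELOAD would not see this request.
 */
static int open_without_libc_wrapper(const char *path)
{
    assert(path != NULL);
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

A hook placed inside the kernel sees this open like any other, which is the argument made for the in-kernel approach.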
That argument, however, highlights the real problem with this posting. The patch includes a set of 13 "requirements," including intercepting file opens, caching results, exempting processes, and so on. But none of these requirements describe the problem which is really being solved. In particular, as noted by Al Viro and others, there is no description of the threat which this patch is intended to mitigate:
If the scanning host could be infected, then a scanning mechanism which could be circumvented by a rogue program is indeed a problem. But that is a very different threat from simply trying to prevent evil attachments from creating mayhem on Windows boxes; it does not appear to be a threat which these patches are trying to address.
The lack of a clearly described problem has caused the discussion of these patches to go around in circles; it is not possible to evaluate (1) whether the goals of these patches are worth supporting, or (2) whether the patches can actually be successful in achieving those goals. The code, in other words, cannot be reviewed. Until the TALPA developers can clarify that situation, their work will look like an example of "shoot first, then aim." That kind of code tends not to make it into the mainline, even if it could be useful in the end.
Page editor: Jonathan Corbet
