
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.38-rc4, released on February 7. "There's nothing much that stands out here. Some arch updates (arm and powerpc), the usual driver updates: dri (radeon/i915), network cards, sound, media, scisi, some filesystem updates (cifs, btrfs), and some random stuff to round it all out (networking, watchpoints, tracepoints, etc)." The short-form changelog is in the announcement, or see the full changelog for all the details.

Stable updates: the long-term update was released on February 7 with a long list of important fixes.

The update was released on February 9. This one contains a couple dozen important fixes.


Quotes of the week

Anyone who programs ATA controllers on the basis of common sense rather than documentation, errata sheets and actually testing rather than speculating is naïve.
-- Alan Cox

I'm invoking the anti-discrimination statutes here on behalf of those of us who don't like beer.
-- James Bottomley pushes the limits of tolerance

Realize that 50% of today's professional programmers have never written a line of code that had to be compiled.
-- Casey Schaufler

As you can see in these posts, Ralink is sending patches for the upstream rt2x00 driver for their new chipsets, and not just dumping a huge, stand-alone tarball driver on the community, as they have done in the past. This shows a huge willingness to learn how to deal with the kernel community, and they should be strongly encouraged and praised for this major change in attitude.
-- Greg Kroah-Hartman

There are no "rules", things have to work and that is the only rule.
-- Markus Rechberger


Linus Torvalds: looking back, looking forward (ITWire)

ITWire has posted a lengthy interview with Linus Torvalds. "On the other hand, one of the things I've always enjoyed in Linux development has been how it's stayed interesting by evolving. So maybe it's less 'fun' in the crazy-go-lucky sense, but on the other hand the much bigger development team and the support brought in by all the companies around Linux has also added its own very real fun. It's a lot more social, for example. So the project may have lost something, but it gained something else to compensate."


Increasing the TCP initial congestion window

By Jonathan Corbet
February 9, 2011
The TCP slow start algorithm, initially developed by Van Jacobson, was one of the crucial protocol tweaks which made TCP/IP actually work on the Internet. Slow start works by limiting the amount of data which can be in flight over a new connection and ramping the transmission speed up slowly until the carrying capacity of the connection is found. In this way, TCP is able to adapt to the actual conditions on the net and avoid overloading routers with more data than can be accommodated. A key part of slow start is the initial congestion window, which puts an upper bound on how much data can be in flight at the very beginning of a conversation.

That window has been capped by RFC 3390 at four segments (just over 4KB) for the better part of a decade. In the meantime, connection speeds have increased and the amount of data sent over a given connection has grown despite the fact that connections live for shorter periods of time. As a result, many connections never ramp up to their full speed before they are closed, so the four-segment limit is now seen as a bottleneck which increases the latency of a typical connection considerably. That is one reason why contemporary browsers use many connections in parallel, despite the fact that the HTTP specification says that a maximum of two connections should be used.

Some developers at Google have been agitating for an increase in the initial congestion window for a while; in July 2010 they posted an IETF draft pushing for this change and describing the motivation behind it. Evidently Google has run some large-scale tests and found that, by increasing the initial congestion window, user-visible latencies can be reduced by 10% without creating congestion problems on the net. They thus recommend that the window be increased to 10 segments; the draft suggests that 16 might actually be a better value, but more testing is required.
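The effect of the initial window on short connections can be illustrated with a small model. This is only a sketch under idealized assumptions that are not part of the draft: no packet loss, a window that doubles every round trip, and a receiver that never limits the sender. The 30-segment transfer size is a made-up example.

```python
def round_trips(total_segments, initial_window):
    """Count round trips needed to deliver total_segments in idealized slow start."""
    cwnd = initial_window
    delivered = 0
    trips = 0
    while delivered < total_segments:
        delivered += cwnd   # send one full window during this round trip
        cwnd *= 2           # idealized: the window doubles once it is all acknowledged
        trips += 1
    return trips


# Hypothetical 30-segment response, compared across initial windows.
for window in (4, 10, 16):
    print(window, round_trips(30, window))
```

In this model, a 30-segment response takes four round trips with an initial window of four segments and two round trips with a window of ten, which is the kind of saving the draft is after.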

David Miller has posted a patch increasing the window to 10; that patch has not been merged into the mainline, so one assumes it's meant for 2.6.39.

Interestingly, Google's tests worked with a number of operating systems, but not with Linux, which uses a relatively small initial receive window of 6KB. Most other systems, it seems, use 64KB instead. Without a receive window at least as large as the congestion window, a larger initial congestion window will have little effect. That problem will be fixed in 2.6.38, thanks to a patch from Nandita Dukkipati raising the initial receive window to 10 segments.
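The interaction between the two windows comes down to a minimum: the sender can have no more data in flight than the smaller of its congestion window and the receiver's advertised window. A minimal sketch, assuming a 1460-byte segment size (a typical value for Ethernet, not a figure from the article):

```python
MSS = 1460  # assumed segment size in bytes; typical for Ethernet, not from the article


def first_flight_bytes(initial_cwnd_segments, receive_window_bytes, mss=MSS):
    """Bytes the sender may have in flight before the first ACK arrives."""
    return min(initial_cwnd_segments * mss, receive_window_bytes)


old_linux_receiver = first_flight_bytes(10, 6 * 1024)   # 6KB initial receive window
other_receiver = first_flight_bytes(10, 64 * 1024)      # 64KB initial receive window
print(old_linux_receiver, other_receiver)
```

With a 6KB receive window, the ten-segment congestion window is cut to 6144 bytes, roughly four segments; with 64KB the full 14600 bytes can be sent.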


Bounding GPL compliance times

A quick look at any development conference will reveal that quite a few Linux hackers are currently carrying phones made by HTC. They obviously like the hardware, but kernel developers have been getting increasingly annoyed by HTC's policy of delaying source code releases for up to 120 days after a given handset ships. In response, Matthew Garrett has suggested an addition to the top-level COPYING file in the kernel source:

While this version of the GPL does not place an explicit timeframe upon fulfilment of source distribution under section 3(b), it is the consensus viewpoint of the Linux community that such distribution take place as soon as is practical and certainly no more than 14 days after a request is made.

About the only response so far has been from Alan Cox, who has suggested that getting a lawyer's opinion on the matter might be useful. Linus, over whose name the new text would appear, has not commented on it. So it's not clear if the change will go in or whether it will inspire any changes in vendor behavior if it is merged. But it does, at least, make the developers' feelings on the matter known.


Kernel development news

Removing ext2 and/or ext3

By Jonathan Corbet
February 9, 2011
The ext4 filesystem has, at this point, moved far beyond its experimental phase. It is now available in almost all distributions and is used by default in many of them. Many users may soon be in danger of forgetting that the ext2 and ext3 filesystems even exist in the kernel. But those filesystems do exist, and they require ongoing resources to maintain. Keeping older, stable filesystems around makes sense when the newer code is stabilizing, but somebody is bound to ask, sooner or later, whether it is time to say "goodbye" to the older code.

The question, as it turns out, came sooner - February 3, to be exact - when Jan Kara suggested that removing ext2 and ext3 could be discussed at the upcoming storage, filesystems, and memory management summit. Jan asked:

Of course it costs some effort to maintain them all in a reasonably good condition so once in a while someone comes and proposes we should drop one of ext2, ext3 or both. So I'd like to gather input what people think about this - should we ever drop ext2 / ext3 codebases? If yes, under what condition do we deem it is OK to drop it?

One might protest that there will be existing filesystems in the ext3 (and even ext2) formats for the indefinite future. Removing support for those formats is clearly not something that can be done. But removing the ext2 and/or ext3 code is not the same as removing support: ext4 has been very carefully written to be able to work with the older formats without breaking compatibility. One can mount an ext3 filesystem using the ext4 code and make changes; it will still be possible to mount that filesystem with the ext3 code in the future.

So it is possible to remove ext2 and ext3 without breaking existing users or preventing them from going back to older implementations. Beyond that, mounting an ext2/3 filesystem under ext4 allows the system to use a number of performance enhancing techniques - like delayed allocation - which do not exist in the older implementations. In other words, ext4 can replace ext2 and ext3, maintain compatibility, and make things faster at the same time. Given that, one might wonder why removing the older code even requires discussion.

There appear to be a couple of reasons not to hurry into this change, both of which have to do with testing. As Eric Sandeen noted, some of the more ext3-like options are not tested as heavily as the native modes of operation:

ext4's more, um ... unique option combinations probably get next to no testing in the real world. So while we can say that noextent, nodelalloc is mostly like ext3, in practice, does that ever really get much testing?

There is also concern that ext4, which is still seeing much more change than its predecessors, is more likely to introduce instabilities. That's a bit of a disturbing idea; there are enough production users of ext4 now that the introduction of serious bugs would not be pleasant. But, again, the backward-compatible operating modes of ext4 may not be as heavily tested as the native mode, so one might argue that operation with older filesystems is more likely to break regardless of how careful the developers are.

So, clearly, any move to get rid of ext2 and ext3 would have to be preceded by the introduction of better testing for the less-exercised corners of ext4. The developers involved understand that clearly, so there is no need to be worried that the older code could be removed too quickly.

Meanwhile, there are also concerns that the older code, which is not seeing much developer attention, could give birth to bugs of its own. As Jan put it:

The time I spend is enough to keep ext3 in a good shape I believe but I have a feeling that ext2 is slowly bitrotting. Sometime when I look at ext2 code I see stuff we simply do differently these days and that's just a step away from the code getting broken... It would not be too much work to clean things up and maintain but it's a work with no clear gain (if you do the thankless job of maintaining old code, you should at least have users who appreciate that ;) so naturally no one does it.

Developers have also expressed concern that new filesystem authors might copy code from ext2, which, at this point, does not serve as a good example for how Linux filesystems should be written.

The end result is that, once the testing concerns have been addressed, everybody involved might be made better off by the removal of ext2 and ext3. Users with older filesystems would get better performance and a code base which is seeing more active development and maintenance. Developers would be able to shed an older maintenance burden and focus their efforts on a single filesystem going forward. Thanks to the careful compatibility work which has been done over the years, it may be possible to safely make this move in the relatively near future.


Supporting multiple LSMs

By Jake Edge
February 9, 2011

With some regularity, the topic of allowing multiple Linux Security Modules (LSMs) to all be active comes up in the Linux kernel community. There have been some attempts at "stacking" or "chaining" LSMs in the past, but nothing has ever made it into the mainline. On the other hand, though, every time a developer comes up with some kind of security hardening patch for the kernel, they are generally directed toward the LSM interface. Because the "monolithic" security solutions (like SELinux, AppArmor, and others) tend to have already taken the single existing LSM slot in many distributions, these simpler, more targeted LSMs are generally unable to be used. But a discussion on the linux-security-module mailing list suggests that work is being done that just might solve this problem.

The existing implementation of LSMs uses a single set of function pointers in a struct security_operations for the "hooks" that get called when access decisions need to be made. Once a security module gets registered (typically at boot time using the security= flag), its implementation is stored in the structure and any other LSM is out of luck. The idea behind LSM stacking would be to keep multiple versions of the security_operations structure around and to call each registered LSM's hooks for an access decision. While that sounds fairly straightforward, there are some subtleties that need to be addressed, especially if different LSMs give different answers for a particular access.

This problem with the semantics of "composing" two (or more) LSMs has been discussed at various points, without any real global solution for composing arbitrary LSMs. As Serge E. Hallyn warned over a year ago:

The problem is that composing any two security policies can quickly have subtle, unforeseen, but dangerous effects. That's why so far we have stuck with the status quo where only one LSM is 'active', but that LSM can manually call hooks from other LSMs.

There is one example of stacking LSMs as Hallyn describes in the kernel already; the capabilities LSM is called directly from other LSMs where necessary. That particular approach is not very general, of course, as LSM maintainers are likely to lose patience with adding calls for every other possible LSM. A more easily expandable solution is required.

David Howells posted a set of patches that would add that expansion mechanism. It does that by allowing multiple calls to the register_security() initialization function, each with its own set of security_operations. Instead of the current situation, where each LSM manages its own data for each kind of object (credentials, keys, files, inodes, superblocks, IPC, and sockets), Howells's security framework will allocate and manage that data for the LSMs.

The security_operations structure gets new *_data_size and *_data_offset fields for each kind of object, with the former filled in by the LSM before calling register_security() and the latter being managed by the framework. The data size field tells the framework how much space is needed for the LSM-specific data for that type of object, and the offset is used by the framework to find each LSM's private data. For struct cred, struct key, struct file, and struct super_block, the extra data for each registered LSM is tacked onto the end of the structure rather than going through an intermediate pointer (as is required for the others). Wrappers are defined that will allow an LSM to extract its data from an object based on the new fields in the operations table.
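One way to picture the size and offset bookkeeping is as a running sum over the registered LSMs. The sketch below is a toy model in Python, not kernel code: the function name, the LSM names, and the sizes are hypothetical, and it assumes offsets are handed out in registration order with no alignment padding, which the description above does not confirm.

```python
def assign_offsets(base_size, data_sizes):
    """Place each LSM's private data after the base structure, in registration order.

    base_size  -- size in bytes of the object itself (hypothetical figure)
    data_sizes -- list of (lsm_name, bytes_requested) pairs, as each LSM
                  would declare before registering
    Returns (offsets, total_size). Alignment is ignored in this sketch.
    """
    offsets = {}
    next_offset = base_size
    for name, size in data_sizes:
        offsets[name] = next_offset   # the framework fills in the offset
        next_offset += size           # the next LSM's data follows directly
    return offsets, next_offset


offsets, total = assign_offsets(100, [("lsm_a", 16), ("lsm_b", 8)])
print(offsets, total)
```

Each LSM then finds its own data by adding its offset to the address of the object, which is the job the wrappers mentioned above would do.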

The framework then maintains a list of registered LSMs and puts the capabilities LSM in the first slot of the list. When one of the security hooks is called, the framework iterates over the list and calls the corresponding hook for each registered LSM. Depending on the specific hook, different kinds of iterators are used, but the usual iterator looks for a non-zero response from an LSM's hook, which would indicate a denial of some kind, and returns that to the framework. The other iterators are used for specialized calls, for example when there is no return value or when only the first hook found should be called. The upshot is that the hooks for registered LSMs get called in order (with capabilities coming first), and the first to deny the access "wins". Because the capabilities calls are pulled out separately, that also means that the other LSMs no longer have to make those calls themselves; instead the framework will handle it for them.
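The "first denial wins" iterator can be modeled in a few lines. This is a toy illustration in Python rather than the kernel's C; the LSM names, the hook name, and the use of -13 as a denial code are assumptions made for the example.

```python
calls = []  # records which LSMs were consulted, in order


def make_hook(lsm_name, result):
    """Build a hook that logs its invocation and returns a fixed result."""
    def hook(path):
        calls.append(lsm_name)
        return result
    return hook


def call_hooks(lsms, hook_name, *args):
    """Call hook_name on each LSM in list order; stop at the first non-zero result."""
    for _name, hooks in lsms:
        hook = hooks.get(hook_name)
        if hook is None:
            continue            # this LSM does not implement the hook
        result = hook(*args)
        if result != 0:
            return result       # first denial wins; later LSMs are not consulted
    return 0


# Capabilities sits in the first slot, followed by two hypothetical LSMs.
lsms = [
    ("capabilities", {"file_open": make_hook("capabilities", 0)}),
    ("lsm_a", {"file_open": make_hook("lsm_a", -13)}),
    ("lsm_b", {"file_open": make_hook("lsm_b", 0)}),
]
decision = call_hooks(lsms, "file_open", "/etc/passwd")
print(decision, calls)
```

Here the second hypothetical LSM is never consulted, because the first one has already denied the access.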

But there are a handful of hooks that do not work very well in a multi-LSM environment, in particular the secid (an LSM-specific security label ID) handling routines (e.g. secid_to_secctx(), task_getsecid(), etc.). Howells's current implementation just calls the hook of the first LSM it finds that implements it, which is not going to make it possible to use multiple LSMs that all implement those hooks (currently just SELinux and Smack). Howells's solution is to explicitly ban that particular combination:

I think the obvious thing is to reject any chosen module that implements any of these interfaces if we've already selected a module that implements them. That would mean you can choose either Smack or SELinux, but not both.

But Smack developer Casey Schaufler isn't convinced that is the right course: "That kind of takes the wind out of the sails, doesn't it?" He would rather see a more general solution that allows multiple secids, and the related secctxs (security contexts), to be handled by the framework:

It does mean that there needs to be a standard for a secctx that allows for the presence of multiple concurrent LSMs. There will have to be an interface whereby either the composer/stacker can break a secctx into its constituent parts or with which an LSM can pull the bit it cares about out. In either case the LSMs may need to be [updated] to accept a secctx in a standardized format.

Another interesting part of Schaufler's message is that he has been working on an "alternative approach" to the multi-LSM problem that he calls "Glass". The code is, as yet, unreleased, but Schaufler describes Glass as an LSM that composes other LSMs:

The Glass security blob is an array of pointers, one for each available LSM, including commoncap, which is always in the last slot. The Glass LSM is always registered first. As subsequent LSMs register they are added to the glass LSM vector. When a hook is invoked glass goes through its vector and if the LSM provides a hook it gets called, and the return remembered. If any other LSM provided a hook the commoncap hook is skipped, but if no LSM was invoked commoncap is called.

Unlike Howells's proposal, Glass would leave the calls to the capabilities LSM (aka commoncap) in the existing LSMs, and only call commoncap if no LSM implemented a given hook. The idea is that the LSMs already handle the capabilities calls in their hooks as needed, so a call into commoncap is required only when none of those hooks get called. In addition, Glass leaves the allocation and management of the security "blobs" (LSM-specific data for objects) to the LSMs rather than centralizing them in the framework as Howells's patches do.

In addition to various other differences, there is a more fundamental difference in the way that the two solutions handle multiple LSMs that all have hooks for a particular security operation. Glass purposely calls each hook in each registered LSM, whereas Howells's proposal typically short-circuits the chain of hooks once one of them has denied the access. Schaufler's idea is that an LSM should be able to maintain state, which means that skipping its hooks could potentially skew the access decision:

My dreaded case is an LSM that bases controls on statistical frequency of access to files. There is no way you could skip any of its hooks, and I don't see off hand any file access hook it wouldn't use. I have heard people (think credit card companies) suggest such things, so although I don't have use for it I can't discount the potential for it.
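The Glass behavior can be modeled the same way. Since the code is unreleased, this sketch rests on Schaufler's description alone; in particular, how the "remembered" return values are combined is not stated, so returning the first non-zero one is an assumption, as are all of the names used here.

```python
access_counts = {}  # state kept by the hypothetical frequency-tracking LSM


def deny_everything(path):
    return -13  # assumed denial code for the example


def count_access(path):
    """Stateful hook: it must see every access to keep its statistics right."""
    access_counts[path] = access_counts.get(path, 0) + 1
    return 0


def glass_call(lsms, commoncap, hook_name, *args):
    """Call every LSM that provides hook_name; fall back to commoncap if none does."""
    remembered = 0
    invoked = False
    for hooks in lsms:
        hook = hooks.get(hook_name)
        if hook is None:
            continue
        invoked = True
        rc = hook(*args)          # always called, even after an earlier denial
        if remembered == 0:
            remembered = rc       # assumption: the first non-zero result is kept
    if not invoked:
        cap_hook = commoncap.get(hook_name)
        return cap_hook(*args) if cap_hook else 0
    return remembered


lsms = [{"file_open": deny_everything}, {"file_open": count_access}]
commoncap = {"file_open": lambda path: 0, "capable": lambda cap: -1}

result = glass_call(lsms, commoncap, "file_open", "/etc/passwd")
fallback = glass_call(lsms, commoncap, "capable", "CAP_SYS_ADMIN")
print(result, access_counts, fallback)
```

The access is denied, but the counting hook still runs and records it; the "capable" call, which neither LSM implements, falls through to commoncap.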

There are plenty of other issues to resolve, including things like handling /proc/self/attr/current (which contains the security ID for the current process) because various user-space programs already parse the output of that file, though it is different depending on which LSM is active. A standardized format for that file, which takes multiple LSMs into account, might be better, but it would break the kernel ABI and is thus not likely to pass muster. Overall, though, Howells and Schaufler were making some good progress on defining the requirements for supporting multiple LSMs. Schaufler is optimistic that the collaboration will bear fruit: "I think that we may be able to get past the problems that have held multiple LSMs back this time around."

So far, there is only the code from Howells to look at, but Schaufler has promised to make Glass available soon. With luck, that will lead to a multi-LSM solution that the LSM developers can coalesce behind, whether it comes from Howells, Schaufler, or a collaboration between them. There may still be a fair amount of resistance from Linus Torvalds and other kernel hackers, but the lack of any way to combine LSMs comes up too often for it to be ignored forever.


Mesh networking with batman-adv

By Jonathan Corbet
February 8, 2011
Your editor has recently seen two keynote presentations on two continents which, using two very different styles, conveyed the same message: the centralization of the Internet and the services built on it has given governments far too much control. Both speakers called for an urgent effort to decentralize the net at all levels, including the transport level. An Internet without centralized telecommunications infrastructure can be hard to envision; when people try, the term that usually comes out is "mesh networking." As it happens, the kernel has a mesh networking implementation which made the move from the staging tree into the mainline proper in 2.6.38.

Mesh networking, as its name implies, is meant to work via a large number of short-haul connections without any sort of centralized control. A proper mesh network should configure itself dynamically, responding to the addition and removal of nodes and changes in connectivity. In a well-functioning mesh, networking "just happens" without high-level coordination; such a net should be quite hard to disrupt. What the kernel offers now falls somewhat short of that ideal, but it is a good demonstration of how hard mesh networking can be.

The "Better Approach To Mobile Ad-hoc Networking" (BATMAN) protocol is described in this draft RFC. A BATMAN mesh is made up of a set of "originators" which communicate via network interfaces - normal wireless interfaces, for example. Every so often, each originator sends out an "originator message" (OGM) as a broadcast to all of its neighbors to tell the world that it exists. Each neighbor is supposed to note the presence of the originator and forward the message onward via a broadcast of its own. Thus, over time, all nodes in the mesh should see the OGM, possibly via multiple paths, and thus each node will know (1) that it can reach the originator, and (2) which of its neighbors has the best path to that originator. Each node maintains a routing table listing every other node it has ever heard of and the best neighbor by which to reach each one.
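The way flooding builds the routing tables can be sketched with a small simulation. This is a simplification and not the real protocol: it assumes lossless links and treats the neighbor that delivers the first copy of an OGM as the best one, so hop count stands in for whatever link-quality metric BATMAN actually uses. The five-node ring is a made-up topology.

```python
from collections import deque


def build_routing_tables(links):
    """Flood one OGM from every originator and record each node's best next hop.

    links maps each node to the list of neighbors it can hear.
    Returns tables[node][originator] = neighbor to forward through.
    """
    tables = {node: {} for node in links}
    for originator in links:
        seen = {originator}
        queue = deque([originator])
        while queue:
            sender = queue.popleft()
            for neighbor in links[sender]:
                if neighbor in seen:
                    continue    # a later copy arrived by a longer path; ignore it
                seen.add(neighbor)
                tables[neighbor][originator] = sender
                queue.append(neighbor)  # the neighbor rebroadcasts the OGM
    return tables


# Hypothetical mesh: five nodes in a ring, A-B-C-D-E-A.
ring = {
    "A": ["B", "E"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D", "A"],
}
tables = build_routing_tables(ring)

# The same mesh after node B has gone away.
without_b = {"A": ["E"], "C": ["D"], "D": ["C", "E"], "E": ["D", "A"]}
tables_without_b = build_routing_tables(without_b)
print(tables["A"], tables_without_b["A"])
```

Node A reaches C through B while B is present and through E once B has gone away; no node ever needs to know the whole ring.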

This protocol has the advantage of building and maintaining the routing tables on the fly; no central coordination is needed. It should also find near-optimal routes to each node. If a node goes away, the routing tables will reconfigure themselves to function in its absence. There is also no node in the network which has a complete view of how the mesh is built; nodes only know who is out there and the best next hop. This lack of knowledge should add to the security and robustness of the mesh.

Nodes with a connection to the regular Internet can set a bit in their OGMs to advertise that fact; that allows others without such a connection to route packets to and from the rest of the world.

The original BATMAN protocol uses UDP for the OGM messages. That design allows routing to be handled with the normal kernel routing tables, but it also imposes a couple of unfortunate constraints: nodes must obtain an IP address from somewhere before joining the mesh, and the protocol is tied to IPv4. The BATMAN-adv protocol found in the Linux kernel has changed a few things to get around these problems, making it a rather more flexible solution. BATMAN-adv works entirely at the link layer, exchanging non-UDP OGMs directly with neighboring nodes. The routing table is maintained within a special virtual network device, which makes all nodes on the mesh appear to be directly connected via that virtual interface. Thus the system can join the mesh before it has a network address, and any protocol can be run over the mesh.

BATMAN-adv removes some of the limitations found in BATMAN, but readers who have gotten this far will likely be thinking of the limitations that remain. The flooding of broadcast OGMs through the net can only scale so far before a significant amount of bandwidth is consumed by network overhead. The protocol trims OGMs which are obviously not of interest - those which describe a route which is known to be worse than others, for example - but the OGM traffic will still be significant if the mesh gets large. The routing tables will also grow, since every node must keep track of every other node in existence. The overhead for these tables is probably manageable for a mesh of 1,000 nodes; it is probably hopeless for 1,000,000 nodes. Mobile devices - which are targeted by this protocol - are especially likely to suffer as the table gets larger.
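A rough calculation shows why the overhead grows so quickly. The figures below are an upper bound under assumptions of this sketch, not measurements: every node originates one OGM per interval, every other node rebroadcasts each OGM exactly once, and none of the trimming described above takes place.

```python
def mesh_overhead(nodes):
    """Upper-bound overhead for a flooded mesh of the given size."""
    return {
        # every node tracks every other node
        "table_entries_per_node": nodes - 1,
        # each of the N OGMs is sent once by its originator and
        # rebroadcast once by each of the other N - 1 nodes
        "transmissions_per_interval": nodes + nodes * (nodes - 1),
    }


small = mesh_overhead(1000)
large = mesh_overhead(1000000)
print(small, large)
```

The table size grows linearly with the mesh, but the broadcast traffic in this model grows with the square of the node count: a million transmissions per interval at 1,000 nodes, and a million times that at 1,000,000 nodes.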

Security is also a concern in this kind of network. Simple bandwidth-consuming denial of service attacks would seem relatively straightforward. Sending bogus OGMs could cause the size of routing tables to explode or disrupt the routing within the mesh. A more clever attack could force traffic to route through a hostile node, enabling man-in-the-middle exploits. And so on. The draft RFC quickly mentions some of these issues, but it seems clear that security has not been a major design goal.

So it would seem clear that BATMAN-adv, while interesting, is not the solution to the problem of an overly-centralized network. It could be a useful way to extend connectivity through a building or small neighborhood, but it is not meant to operate on a large scale or in an overtly hostile environment. The bigger problem is a hard one to solve, to say the least. The experience gained with protocols like BATMAN-adv may well prove valuable in the search for that solution, but there is clearly some work to be done still.


Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds