User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc4, released on August 28. "Anyway, go wild and please do test. The appended shortlog gives a reasonable idea of the changes, but they really aren't that big. So I definitely *am* hoping that -rc5 will be smaller. But at the same time, I continue to be pretty happy about the state of 3.1 so far. But maybe it's just that my meds are finally working." See the full changelog for all the details.

Stable updates: the,, and 3.0.4 updates were released on August 29 with the usual set of important fixes.

Comments (none posted)

Quotes of the week

POSIX has been wrong before. Sometimes the solution really is to say "sorry, you wrote that 20 years ago, and things have changed".
-- Linus Torvalds

It only takes a little multicast to mess up your whole day.
-- Dave Täht

For the last ~6 months the Broadcom team has been working on getting their driver out of staging. I have to believe that they would have rather been working on updating device support during that time. I can only presume that they would make that a priority in the long run.

How many times has b43 been > ~6 months behind on its hardware support? Despite Rafał's recent heroic efforts at improving that, I can't help but wonder how long will it be before b43 is again dreadfully behind?

-- John Linville

Bad English and a github address makes me unhappy.
-- Linus Torvalds

Comments (4 posted) compromised

The main page is currently carrying a notice that the site has suffered a security breach. "Earlier this month, a number of servers in the infrastructure were compromised. We discovered this August 28th. While we currently believe that the source code repositories were unaffected, we are in the process of verifying this and taking steps to enhance security across the infrastructure." As the update mentions, there's little to be gained by tampering with the git repositories there anyway.

Comments (71 posted)

Kernel development news

The x32 system call ABI

By Jonathan Corbet
August 29, 2011
The 32-bit x86 architecture has a number of well-known shortcomings. Many of these were addressed when this architecture was extended to 64 bits by AMD, but running in 64-bit mode is not without problems either. For this reason, a group of GCC, kernel, and library developers has been working on a new machine model known as the "x32 ABI." This ABI is getting close to ready, but, as a recent discussion shows, wider exposure of x32 is bringing some new issues to the surface.

Classic 32-bit x86 has easily-understood problems: it can only address 4GB of memory and its tiny set of registers slows things considerably. Running a current processor in the 64-bit mode fixes both of those problems nicely, but at a cost: expanding variables and pointers to 64 bits leads to expanded memory use and a larger cache footprint. It's also not uncommon (still) to find programs that simply do not work properly on a 64-bit system. Most programs do not actually need 64-bit variables or the ability to address massive amounts of memory; for that code, the larger data types are a cost without an associated benefit. It would be really nice if those programs could take advantage of the 64-bit architecture's additional registers and instructions without simultaneously paying the price of increased memory use.

That best-of-both-worlds situation is exactly what the x32 ABI is trying to provide. A program compiled to this ABI will run in native 64-bit mode, but with 32-bit pointers and data values. The full register set will be available, as will other advantages of the 64-bit architecture like the faster SYSCALL64 instruction. If all goes according to plan, this ABI should be the fastest mode available on 64-bit machines for a wide range of programs; it is easy to see x32 widely displacing the 32-bit compatibility mode.

One should note that the "if" above is still somewhat unproven: actual benchmarks showing the differences between x32 and the existing pure modes are hard to come by.

One outstanding question - and the spark for the current discussion - has to do with the system call ABI. For the most part, this ABI looks similar to what is used by the legacy 32-bit mode: the 32-bit-compatible versions of the system calls and associated data structures are used. But there is one difference: the x32 developers want to use the SYSCALL64 instruction just like native 64-bit applications do for the performance benefits. That complicates things a bit, since, to know what data size to expect, the kernel needs to be able to distinguish system calls made by true 64-bit applications from those running in the x32 mode, regardless of the fact that the processor is running in the same mode in both cases. As an added challenge, this distinction needs to be made without slowing down native 64-bit applications.

The solution involves using an expanded version of the 64-bit system call table. Many system calls can be called directly with no compatibility issues at all - a call to fork() needs little in the translation of data structures. Others do need the compatibility layer, though. Each of those system calls (92 of them) is assigned a new number starting at 512. That leaves a gap above the native system calls for additions over time. Bit 30 in the system call number is also set whenever an x32 binary calls into the kernel; that enables kernel code that cares to implement "compatibility mode" behavior.

Linus didn't seem to mind the mechanism used to distinguish x32 system calls in general, but he hated the use of compatibility mode for the x32 ABI. He asked:

I think the real question is "why?". I think we're missing a lot of background for why we'd want yet another set of system calls at all, and why we'd want another state flag. Why can't the x32 code just use the native 64-bit system calls entirely?

There are legitimate reasons why some of the system calls cannot be shared between the x32 and 64-bit modes. Situations where user space passes structures containing pointers to the kernel (ioctl() and readv() being simple examples) will require special handling since those pointers will be 32-bit. Signal handling will always be special. Many of the other system calls done specially for x32, though, are there to minimize the differences between x32 and the legacy 32-bit mode. And those calls are the ones that Linus objects to most strongly.

It comes down, for the most part, to the format of integer values passed to the kernel in structures. The legacy 32-bit mode, naturally, uses 32-bit values in most cases; the x32 mode follows that lead. Linus is saying, though, that the 64-bit versions of the structures - with 64-bit integer values - should be used instead. At a minimum, doing things that way would minimize the differences between the x32 and native 64-bit modes. But there is also a correctness issue involved.

One place where the 32- and 64-bit modes differ is in their representation of time values; in the 32-bit world, types like time_t, struct timespec, and struct timeval are 32-bit quantities. And 32-bit time values will overflow in the year 2038. If the year-2000 issue showed anything, it's that long-term drop-dead days arrive sooner than one tends to think. So it's not surprising that Linus is unwilling to add a new ABI that would suffer from the 2038 issue:

2038 is a long time away for legacy binaries. It's *not* all that long away if you are introducing a new 32-bit mode for performance.

The width of time_t cannot change for legacy 32-bit binaries. But x32 is an entirely new ABI with no legacy users at all; it does not have to retain any sort of past compatibility at this point. Now is the only time that this kind of issue can be fixed. So it is probably entirely safe to say that an x32 ABI will not make it into the mainline as long as it has problems like the year-2038 bug.

At this point, the x32 developers need to review their proposed system call ABI and find a way to rework it into something closer to Linus's taste; that process is already underway. Then developers can get into the serious business of building systems under that ABI and running benchmarks to see whether it is all worth the effort. Convincing distributors (other than Gentoo, of course) to support this ABI will take a fairly convincing story, but, if this mode lives up to its potential, that story might just be there.

Comments (58 posted)

No-I/O dirty throttling

By Jonathan Corbet
August 31, 2011
"Writeback" is the process of writing dirty pages back to persistent storage, allowing those pages to be reclaimed for other uses. Making writeback work properly has been one of the more challenging problems faced by kernel developers in the last few years; systems can bog down completely (or even lock up) when writeback gets out of control. Various approaches to improving the situation have been discussed; one of those is Fengguang Wu's I/O-less throttling patch set. These changes have been circulating for some time; they are seen as having potential - if only others could actually understand them. Your editor doesn't understand them either, but that has never stopped him before.

One aspect to getting a handle on writeback, clearly, is slowing down processes that are creating more dirty pages than the system can handle. In current kernels, that is done through a call to balance_dirty_pages(), which sets the offending process to work writing pages back to disk. This "direct reclaim" has the effect of cleaning some pages; it also keeps the process from dirtying more pages while the writeback is happening. Unfortunately, direct reclaim also tends to create terrible I/O patterns, reducing the bandwidth of data going to disk and making the problem worse than it was before. Getting rid of direct reclaim has been on the "to do" list for a while, but it needs to be replaced by another means for throttling producers of dirty pages.

That is where Fengguang's patch set comes in. He is attempting to create a control loop capable of determining how many pages each process should be allowed to dirty at any given time. Processes exceeding their limit are simply put to sleep for a while to allow the writeback system to catch up with them. The concept is simple enough, but the implementation is less so. Throttling is easy; performing throttling in a way that keeps the number of dirty pages within reasonable bounds and maximizes backing store utilization while not imposing unreasonable latencies on processes is a bit more difficult.

If all pages in the system are dirty, the system is probably dead, so that is a good situation to avoid. Zero dirty pages is almost as bad; performance in that situation will be exceedingly poor. The virtual memory subsystem thus aims for a spot in the middle where the ratio of dirty to clean pages is deemed to be optimal; that "setpoint" varies, but comes down to tunable parameters in the end. Current code sets a simple threshold, with throttling happening when the number of dirty pages exceeds that threshold; Fengguang is trying to do something more subtle.

Since developers have complained that his work is hard to understand, Fengguang has filled out the code with lots of documentation and diagrams. This is how he depicts the goal of the patch set:

  ^ task rate limit
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
                          setpoint^                  limit^  dirty pages

The goal of the system is to keep the number of dirty pages at the setpoint; if things get out of line, increasing amounts of force will be applied to bring things back to where they should be. So the first order of business is to figure out the current status; that is done in two steps. The first is to look at the global situation: how many dirty pages are there in the system relative to the setpoint and to the hard limit that we never want to exceed? Using a cubic polynomial function (see the code for the grungy details), Fengguang calculates a global "pos_ratio" to describe how strongly the system needs to adjust the number of dirty pages.

This ratio cannot really be calculated, though, without taking the backing device (BDI) into account. A process may be dirtying pages stored on a given BDI, and the system may have a surfeit of dirty pages at the moment, but the wisdom of throttling that process depends also on how many dirty pages exist for that BDI. If a given BDI is swamped with dirty pages, it may make sense to throttle a dirtying process even if the system as a whole is doing OK. On the other hand, a BDI with few dirty pages can clear its backlog quickly, so it can probably afford to have a few more, even if the system is somewhat more dirty than one might like. So the patch set tweaks the calculated pos_ratio for a specific BDI using a complicated formula looking at how far that specific BDI is from its own setpoint and its observed bandwidth. The end result is a modified pos_ratio describing whether the system should be dirtying more or fewer pages backed by the given BDI, and by how much.

In an ideal world, throttling would match the rate at which pages are being dirtied to the rate that each device can write those pages back; a process dirtying pages backed by a fast SSD would be able to dirty more pages more quickly than a process writing to pages backed by a cheap thumb drive. The idea is simple: if N processes are dirtying pages on a BDI with a given bandwidth, each process should be throttled to the extent that it dirties 1/N of that bandwidth. The problem is that processes do not register with the kernel and declare that they intend to dirty lots of pages on a given BDI, so the kernel does not really know the value of N. That is handled by carrying a running estimate of N. An initial per-task bandwidth limit is established; after a period of time, the kernel looks at the number of pages actually dirtied for a given BDI and divides it by that bandwidth limit to come up with the number of active processes. From that estimate, a new rate limit can be applied; this calculation is repeated over time.

That rate limit is fine if the system wants to keep the number of dirty pages on that BDI at its current level. If the number of dirty pages (for the BDI or for the system as a whole) is out of line, though, the per-BDI rate limit will be tweaked accordingly. That is done through a simple multiplication by the pos_ratio calculated above. So if the number of dirty pages is low, the applied rate limit will be a bit higher than what the BDI can handle; if there are too many dirty pages, the per-BDI limit will be lower. There is some additional logic to keep the per-BDI limit from changing too quickly.

Once all that machinery is in place, fixing up balance_dirty_pages() is mostly a matter of deleting the old direct reclaim code. If neither the global nor the per-BDI dirty limits have been exceeded, there is nothing to be done. Otherwise the code calculates a pause time based on the current rate limit, the pos_ratio, and number of pages recently dirtied by the current task and sleeps for that long. The maximum sleep time is currently set to 200ms. A final tweak tries to account for "think time" to even out the pauses seen by any given process. The end result is said to be a system which operates much more smoothly when lots of pages are being dirtied.

Fengguang has been working on these patches for some time and would doubtless like to see them merged. That may yet happen, but adding core memory management code is never an easy thing to do, even when others can easily understand the work. Introducing regressions in obscure workloads is just too easy to do. That suggests that, among other things, a lot of testing will be required before confidence in these changes will be up to the required level. But, with any luck, this work will eventually result in better-performing systems for us all.

Comments (9 posted)

Broadcom's wireless drivers, one year later

By Jonathan Corbet
August 29, 2011
On September 9, 2010, Broadcom announced that it was releasing an open source driver for its wireless networking chipsets. Broadcom had long resisted calls for free drivers for this hardware, so this announcement was quite well received in the community, despite the fact that the quality of the code itself was not quite up to contemporary kernel standards. One year later, this driver is again under discussion, but the tone of the conversation has changed somewhat. After a year of work, Broadcom's driver may never see a mainline release.

Broadcom's submission was actually two drivers: brcmsmac for software-MAC chipsets, and brcmfmac for "FullMAC" chipsets with hardware MAC support. These drivers were immediately pulled into the staging tree with the understanding that there were a lot of things needing fixing before they could make the move to the mainline proper. In the following year, developers dedicated themselves to the task of cleaning up the drivers; nearly 900 changes have been merged in this time. The bulk of the changes came from Broadcom employees, but quite a few others have contributed fixes to the drivers as well.

This work has produced a driver that is free of checkpatch warnings, works on both small-endian and big-endian platforms, uses kernel libraries where appropriate, and generally looks much better than it originally did. On August 24, Broadcom developer Henry Ptasinski decided that the time had come: he posted a patch moving the Broadcom drivers into the mainline. Greg Kroah-Hartman, maintainer of the staging tree, said that he was fine with the driver moving out of staging if the networking developers agreed. Some of those developers came up with some technical issues, but it appeared that these drivers were getting close to ready to make the big move out of staging.

That was when Rafał Miłecki chimed in: "Henry: a simple question, please explain it to me, what brcmsmac does provide that b43 doesn't?" Rafał, naturally, is the maintainer of the b43 driver; b43, which has been in the mainline for years, is a driver for Broadcom chipsets developed primarily from reverse-engineered information. It has reached a point where, Rafał claims, it supports everything Broadcom's driver does with one exception (BCM4313 chipsets) that will be fixed soon. Rafał also claims that the b43 driver performs better, supports hardware that brcmsmac does not, and is, in general, a better piece of code.

So Rafał was clear on what he thought of the brcmsmac driver (brcmfmac didn't really enter into the discussion); he was also quite clear on what he would like to see happen:

We would like to have b43 supported by Broadcom. It sounds much better, I've shown you a lot of advantages of such a choice. Switching to brcmsmac on the other hand needs a lot of work and improvements.

On one hand, Rafał is in a reasonably strong position. The b43 driver is in the mainline now; there is, in general, a strong resistance to the merging of duplicate drivers for the same hardware. Code quality is, to some extent, in the eye of the beholder, but there have been few beholders who have gone on record to say that they like Broadcom's driver better. Looking at the situation with an eye purely on the quality of the kernel's code base in the long term, it is hard to make an argument that the brcmsmac driver should move into the mainline.

On the other hand, if one considers the feelings of developers and the desire to have more hardware companies supporting their products with drivers in the Linux kernel, one has to ask why Broadcom was allowed to put this driver into staging and merge hundreds of patches improving it if that driver was never going to make it into the mainline. Letting Broadcom invest that much work into its driver, then asking it to drop everything and support the reverse-engineered driver that it declined to support one year ago seems harsh. It's not a story that is likely to prove inspirational for developers in other companies who are considering trying to work more closely with the kernel community.

What seems to have happened here (according mainly to a history posted by Rafał, who is not a disinterested observer) is that, one year ago, the brcmsmac driver supported hardware that had no support in b43. Since then, b43 has gained support for that additional hardware; nobody objected to the addition of duplicated driver support at that time (as one would probably expect, given that the Broadcom driver was in staging). Rafał doesn't say whether the brcmsmac driver was helpful to him in filling out hardware support in the b43 driver. In the end, it doesn't matter; it would appear that the need for brcmsmac has passed.

One of the most important lessons for kernel developers to learn is that one should focus on the end result rather than on the merging of a specific piece of code. One can argue that Broadcom has what it wanted now: free driver support for its hardware in the mainline kernel. One could also argue that Broadcom should have worked on adding support to b43 from the beginning rather than posting a competing driver. Or, failing that, one could say that the Broadcom developers should have noticed the improvements going into b43 and thought about the implications for their own work. But none of that is going to make the Broadcom developers feel any better about how things have turned out. They might come around to working on b43, but one expects that it is not a hugely appealing alternative at the moment.

The kernel process can be quite wasteful - in terms of code and developer time lost - at times. Any kernel developer who has been in the community for a significant time will have invested significant time into code that went straight into the bit bucket at some time or other. But that doesn't mean that this waste is good or always necessary. There would be value in finding more reliable ways to warn developers when they are working on code that is unlikely to be merged. Kernel development is distributed, and there are no managers paying attention to how developers spend their time; it works well in general, but it can lead to situations where developers work at cross purposes and somebody, eventually, has to lose out.

That would appear to have happened here. In the short term, the kernel and its users have come out ahead: we have a choice of drivers for Broadcom wireless chipsets and can pick the best one to carry forward. Even Broadcom can be said to have come out ahead if it gains better support for its hardware under Linux. Whether we will pay a longer-term cost in developers who conclude that the kernel community is too hard to work with is harder to measure. But it remains true that the kernel community can, at times, be a challenging place to work.

Comments (73 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management



Benchmarks and bugs


Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds