Brief items
The current development kernel is 3.1-rc4, released on August 28. "Anyway, go
wild and please do test. The appended shortlog gives a reasonable idea of
the changes, but they really aren't that big. So I definitely *am* hoping
that -rc5 will be smaller. But at the same time, I continue to be pretty
happy about the state of 3.1 so far. But maybe it's just that my meds are
finally working." See the full changelog for all the details.
Stable updates: the 2.6.32.46, 2.6.33.19, and 3.0.4 updates were released on August 29
with the usual set of important fixes.
Comments (none posted)
POSIX has been wrong before. Sometimes the solution really is to
say "sorry, you wrote that 20 years ago, and things have changed".
--
Linus Torvalds
It only takes a little multicast to mess up your whole day.
--
Dave Täht
For the last ~6 months the Broadcom team has been working on
getting their driver out of staging. I have to believe that they
would have rather been working on updating device support during
that time. I can only presume that they would make that a priority
in the long run.
How many times has b43 been > ~6 months behind on its hardware
support? Despite Rafał's recent heroic efforts at improving that,
I can't help but wonder how long will it be before b43 is again
dreadfully behind?
--
John Linville
Bad English and a github address makes me unhappy.
--
Linus Torvalds
Comments (4 posted)
The main kernel.org page is currently carrying a notice that the site has
suffered a security breach. "Earlier this month, a number of servers in the
kernel.org infrastructure were compromised. We discovered this August 28th.
While we currently believe that the source code repositories were
unaffected, we are in the process of verifying this and taking steps to
enhance security across the kernel.org infrastructure." As the update
mentions, there's little to be gained by tampering with the git
repositories there anyway.
Comments (71 posted)
Kernel development news
By Jonathan Corbet
August 29, 2011
The 32-bit x86 architecture has a number of well-known shortcomings. Many
of these were addressed when this architecture was extended to 64 bits by
AMD, but running in 64-bit mode is not without problems either. For this
reason, a group of GCC, kernel, and library developers has been working on
a new machine model known as the "x32 ABI." This ABI is getting close to
ready, but, as a recent discussion shows, wider exposure of x32 is bringing
some new issues to the surface.
Classic 32-bit x86 has easily-understood problems: it can only address 4GB
of memory and its tiny set of registers slows things considerably. Running
a current processor in the 64-bit mode fixes both of those problems nicely,
but at a cost: expanding variables and pointers to 64 bits leads to
expanded memory use and a larger cache footprint. It's also not uncommon
(still) to find programs that simply do not work properly on a 64-bit
system. Most programs do not
actually need 64-bit variables or the ability to address massive amounts of
memory; for that code, the larger data types are a cost without an
associated benefit. It would be really nice if those programs could take
advantage of the 64-bit architecture's additional registers and instructions
without simultaneously paying the price of increased memory use.
That best-of-both-worlds situation is exactly what the x32 ABI is trying to
provide. A program compiled to this ABI will run in native 64-bit mode,
but with 32-bit pointers and data values. The full register set will be
available, as will other advantages of the 64-bit architecture like the
faster SYSCALL64 instruction. If all goes according to plan, this
ABI should be the fastest mode available on 64-bit machines for a wide
range of programs; it is easy to see x32 widely displacing the 32-bit
compatibility mode.
One should note that the "if" above is still somewhat unproven: actual
benchmarks showing the differences between x32 and the existing pure modes
are hard to come by.
One outstanding question - and the spark for
the current discussion - has
to do with the system call ABI. For the most part, this ABI looks similar
to what is used by the legacy 32-bit mode: the 32-bit-compatible versions
of the system calls and associated data structures are used. But there is
one difference: the x32 developers want to use the SYSCALL64
instruction just like
native 64-bit applications do for the performance benefits. That
complicates things a bit, since, to know what data size to expect, the
kernel needs to be able to distinguish
system calls made by true 64-bit applications from those running in the x32
mode, regardless of the fact that the processor is running in the same mode in
both cases. As an added challenge, this distinction needs to be made
without slowing down native 64-bit applications.
The solution involves using an expanded version of the 64-bit system call
table. Many system calls can be called directly with no compatibility
issues at all - a call to fork(), for example, needs no translation of
data structures. Others do need the compatibility layer, though. Each
of those system calls (92 of them) is assigned a new number starting at
512. That leaves a gap above the native system calls for additions over
time. Bit 30 in the system call number is also set
whenever an x32 binary calls into the kernel; that enables kernel code that
cares to implement "compatibility mode" behavior.
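As a rough sketch of the numbering scheme described above: the constants (bit 30, the 512 base for compatibility entries) are those from the discussion, but the decoding logic here is an illustrative model in Python, not the kernel's actual entry code.

```python
# Illustrative model of the x32 system call numbering: bit 30 is set on
# every call from an x32 binary, and the x32-specific compatibility
# entries start at 512 in the expanded 64-bit table.
X32_SYSCALL_BIT = 1 << 30
X32_COMPAT_BASE = 512

def classify_syscall(nr):
    """Return (table_index, is_x32, needs_compat) for a raw number."""
    is_x32 = bool(nr & X32_SYSCALL_BIT)
    index = nr & ~X32_SYSCALL_BIT
    needs_compat = is_x32 and index >= X32_COMPAT_BASE
    return index, is_x32, needs_compat

# A native 64-bit call, a shared x32 call, and an x32 compat call:
print(classify_syscall(0))                      # (0, False, False)
print(classify_syscall(X32_SYSCALL_BIT | 0))    # (0, True, False)
print(classify_syscall(X32_SYSCALL_BIT | 515))  # (515, True, True)
```

Because native 64-bit binaries never set bit 30, their system call path needs no extra test at all, which is how the "no slowdown for native code" requirement is met.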
Linus didn't seem to mind the mechanism used to distinguish x32 system
calls in general, but he hated the use of
compatibility mode for the x32 ABI. He asked:
I think the real question is "why?". I think we're missing a lot of
background for why we'd want yet another set of system calls at
all, and why we'd want another state flag. Why can't the x32 code
just use the native 64-bit system calls entirely?
There are legitimate reasons why some of the system calls cannot be shared
between the x32 and 64-bit modes. Situations where user space passes
structures containing pointers to the kernel (ioctl() and
readv() being simple examples) will require special handling since
those pointers will be 32-bit. Signal handling will always be special.
Many of the other system calls done specially for x32, though, are there to
minimize the differences between x32 and the legacy 32-bit mode. And those
calls are the ones that Linus objects to
most strongly.
It comes down, for the most part, to the format of integer values passed to
the kernel in structures. The legacy 32-bit mode, naturally, uses 32-bit
values in most cases; the x32 mode follows that lead. Linus is saying,
though, that the 64-bit versions of the structures - with 64-bit integer
values - should be used instead. At a minimum, doing things that way would
minimize the differences between the x32 and native 64-bit modes. But
there is also a correctness issue involved.
One place where the 32- and 64-bit modes differ is in their representation
of time values; in the 32-bit world, types like time_t, struct
timespec, and struct timeval are 32-bit quantities. And
32-bit time values will overflow in the year 2038. If the year-2000 issue
showed anything, it's that long-term drop-dead days arrive sooner than one
tends to think. So it's not surprising that Linus is unwilling to add a new ABI that would suffer
from the 2038 issue:
2038 is a long time away for legacy binaries. It's *not* all that
long away if you are introducing a new 32-bit mode for performance.
The width of time_t cannot change for legacy 32-bit binaries. But
x32 is an entirely new ABI with no legacy users at all; it does not have to
retain any sort of past compatibility at this point. Now is the only time
that this kind of issue can be fixed. So it is probably entirely safe to
say that an x32 ABI will not make it into the mainline as long as it has
problems like the year-2038 bug.
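The arithmetic behind the 2038 deadline is easy to check; this fragment (plain Python, purely for illustration) shows exactly where a signed 32-bit count of seconds since 1970 runs out.

```python
from datetime import datetime, timezone

# A signed 32-bit time_t can count seconds since the 1970 epoch only
# up to 2**31 - 1; one second later the value wraps negative.
INT32_MAX = 2**31 - 1
last_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_moment)  # 2038-01-19 03:14:07+00:00
```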
At this point, the x32 developers need to review their proposed system call
ABI and find a way to rework it into something closer to Linus's taste;
that process is already underway.
Then developers can get into the serious business of building systems under
that ABI and running benchmarks to see whether it is all worth the effort.
Convincing distributors (other than Gentoo, of course) to support this ABI
will take a fairly convincing story, but, if this mode lives up to its
potential, that story might just be there.
Comments (58 posted)
By Jonathan Corbet
August 31, 2011
"Writeback" is the process of writing dirty pages back to persistent
storage, allowing those pages to be reclaimed for other uses. Making
writeback work properly has been one of the more challenging problems faced
by kernel developers in the last few years; systems can bog down completely
(or even lock up) when writeback gets out of control. Various approaches
to improving the situation have been discussed; one of those is Fengguang
Wu's I/O-less throttling patch set. These changes have been circulating
for some time; they are seen as having potential - if only others could
actually understand them. Your editor doesn't understand them either, but
that has never stopped him before.
One aspect to getting a handle on writeback, clearly, is slowing down
processes that are creating more dirty pages than the system can handle.
In current kernels, that is done through a call to
balance_dirty_pages(), which sets the offending process to work
writing pages back to disk. This "direct reclaim" has the effect of
cleaning some pages; it also keeps the process from dirtying more pages
while the writeback is happening. Unfortunately, direct reclaim also tends
to create terrible I/O patterns, reducing the bandwidth of data going to
disk and making the problem worse than it was before. Getting rid of
direct reclaim has been on the "to do" list for a while, but it needs to be
replaced by another means for throttling producers of dirty pages.
That is where Fengguang's patch set comes
in. He is attempting to create a control loop capable of determining how
many pages each process should be allowed to dirty at any given time.
Processes exceeding their limit are simply put to sleep for a while to
allow the writeback system to catch up with them. The concept is simple
enough, but the implementation is less so. Throttling is easy; performing
throttling in a way that keeps the number of dirty pages within reasonable
bounds and maximizes backing store utilization while not imposing
unreasonable latencies on processes is a bit more difficult.
If all pages in the system are dirty, the
system is probably dead, so that is a good situation to avoid. Zero dirty
pages is almost as bad; performance in that situation will be exceedingly
poor. The virtual memory subsystem thus aims for a spot in the middle
where the ratio of dirty to clean pages is deemed to be optimal; that
"setpoint" varies, but comes down to tunable parameters in the end.
Current code sets a simple threshold, with throttling happening when the
number of dirty pages exceeds that threshold; Fengguang is trying to do
something more subtle.
Since developers have complained that his work is hard to understand,
Fengguang
has filled out the code with lots of documentation and diagrams. This is
how he depicts the goal of the patch set:
  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                    *
  |                      *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                  *
  |                               .                      *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages
The goal of the system is to keep the number of dirty pages at the
setpoint; if things get out of line, increasing amounts of force will be
applied to bring things back to where they should be. So the first order
of business is to figure out the current status; that is done in two
steps. The first is to look at the global situation: how many dirty pages
are there in the system relative to the setpoint and to the hard limit that
we never want to exceed? Using a cubic polynomial function (see the code
for the grungy details), Fengguang calculates a global "pos_ratio" to
describe how strongly the system needs to adjust the number of dirty
pages.
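The shape of that control function can be modeled in a few lines. The real code uses scaled fixed-point arithmetic and the exact polynomial below is this sketch's assumption, but the intended behavior - 1.0 at the setpoint, 0.0 at the hard limit, above 1.0 when dirty pages are scarce - looks roughly like this:

```python
def pos_ratio(dirty, setpoint, limit):
    """Cubic feedback ratio: how strongly to scale the dirtying rate.
    Simplified floating-point model of the curve described above."""
    x = (setpoint - dirty) / (limit - setpoint)
    return max(0.0, 1.0 + x ** 3)

print(pos_ratio(1000, 1000, 2000))  # 1.0   - at the setpoint, no correction
print(pos_ratio(2000, 1000, 2000))  # 0.0   - at the limit, throttle fully
print(pos_ratio(500, 1000, 2000))   # 1.125 - below the setpoint, ease up
```

The cubic keeps the response gentle near the setpoint (small deviations are nearly ignored) while ramping up the correction sharply as the dirty count approaches the limit.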
This ratio cannot really be calculated, though, without taking the backing
device (BDI) into account. A process may be dirtying pages stored on a
given BDI, and the system may have a surfeit of dirty pages at the moment,
but the wisdom of throttling that process depends also on how many dirty
pages exist for that BDI. If a given BDI is swamped with dirty pages, it
may make sense to throttle a dirtying process even if the system as a whole
is doing OK. On the other hand, a BDI with few dirty pages can clear its
backlog quickly, so it can probably afford to have a few more, even if the
system is somewhat more dirty than one might like. So the patch set tweaks
the calculated pos_ratio for a specific BDI using a complicated formula
looking at how far that specific BDI is from its own setpoint and its
observed bandwidth. The end result is a modified pos_ratio describing whether the
system should be dirtying more or fewer pages backed by the given BDI, and
by how much.
In an ideal world, throttling would match the rate at which pages are being
dirtied to the rate that each device can write those pages back; a process
dirtying pages backed by a fast SSD would be able to dirty more pages more
quickly than
a process writing to pages backed by a cheap thumb drive. The idea is simple:
if N processes are dirtying pages on a BDI with a given bandwidth, each
process should be throttled to the extent that it dirties 1/N of that
bandwidth. The problem is that processes do not register with the kernel
and declare that they intend to dirty lots of pages on a given BDI, so the
kernel does not really know the value of N. That is handled by
carrying a running estimate of N. An initial per-task bandwidth limit is
established; after a period of time, the kernel looks at the number of
pages actually dirtied for a given BDI and divides it by that bandwidth limit to
come up with the number of active processes. From that estimate, a new
rate limit can be applied; this calculation is repeated over time.
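One step of that estimation loop can be sketched as follows; this is a simplified model of the description above, and the function name and the smoothing-free arithmetic are this sketch's assumptions.

```python
def next_task_ratelimit(dirtied_rate, write_bandwidth, task_ratelimit):
    """One iteration of the feedback loop: infer N from how fast pages
    were actually dirtied against the current per-task limit, then give
    each of the N implied tasks an equal share of the BDI's bandwidth."""
    n_tasks = max(1.0, dirtied_rate / task_ratelimit)
    return write_bandwidth / n_tasks

# Two tasks each dirtying at the current 100 pages/s limit, on a
# device observed to write back 150 pages/s:
limit = 100.0
for _ in range(10):
    limit = next_task_ratelimit(2 * limit, 150.0, limit)
print(limit)  # settles at 75.0 pages/s per task
```

Note that the kernel never needs to know which processes are involved: the ratio of observed dirtying rate to the advertised limit implies N on its own.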
That rate limit is fine if the system wants to keep the number of dirty
pages on that BDI at its current level. If the number of dirty pages (for
the BDI or for the system as a whole) is out of line, though, the per-BDI
rate limit will be tweaked accordingly. That is done through a simple
multiplication by the pos_ratio calculated above. So if the number of
dirty pages is low, the applied rate limit will be a bit higher than what
the BDI can handle; if there are too many dirty pages, the per-BDI limit
will be lower. There is some additional logic to keep the per-BDI limit
from changing too quickly.
Once all that machinery is in place, fixing up
balance_dirty_pages() is mostly a matter of deleting the old
direct reclaim code. If neither the global nor the per-BDI dirty limits have
been exceeded, there is nothing to be done. Otherwise the code calculates
a pause time based on the current rate limit, the pos_ratio, and the number
of pages recently dirtied by the current task, then sleeps for that long. The maximum
sleep time is currently set to 200ms. A final tweak tries to account for
"think time" to even out the pauses seen by any given process. The end
result is said to be a system which operates much more smoothly when lots
of pages are being dirtied.
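Under those rules the pause computation itself is short. This sketch combines the pieces described above; the names are illustrative, not the kernel's, and the think-time adjustment is omitted.

```python
MAX_PAUSE = 0.2  # the 200ms cap mentioned above, in seconds

def dirty_pause(pages_dirtied, task_ratelimit, pos_ratio):
    """Seconds a dirtying task should sleep: the pages it just dirtied
    divided by its pos_ratio-corrected rate limit, capped at 200ms."""
    rate = task_ratelimit * pos_ratio
    if rate <= 0:
        return MAX_PAUSE
    return min(MAX_PAUSE, pages_dirtied / rate)

print(dirty_pause(32, 1000.0, 1.0))  # 0.032s - near the setpoint
print(dirty_pause(32, 1000.0, 0.1))  # 0.2s   - far over it, capped
```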
Fengguang has been working on these patches for some time and would
doubtless like to see them merged. That may yet happen, but adding core
memory management code is never an easy thing to do, even when others can
easily understand the work. Introducing regressions in obscure workloads
is just too easy to do. That suggests that, among other things, a lot of
testing will be required before confidence in these changes will be up to
the required level. But, with any luck, this work will eventually result
in better-performing systems for us all.
Comments (9 posted)
By Jonathan Corbet
August 29, 2011
On September 9, 2010, Broadcom
announced
that it was releasing an open source driver for its wireless networking
chipsets. Broadcom had long resisted calls for free drivers for this
hardware, so this announcement was quite well received in the community,
despite the fact that the quality of the code itself was not quite up to
contemporary kernel standards. One year later, this driver is again under
discussion, but the tone of the conversation has changed somewhat.
After a year of work, Broadcom's driver may never see a mainline release.
Broadcom's submission was actually two drivers: brcmsmac for software-MAC
chipsets, and brcmfmac for "FullMAC" chipsets with hardware MAC support.
These drivers were immediately pulled into the staging tree with the understanding that
there were a lot of things needing fixing before they could make the
move to the mainline proper. In the following year, developers dedicated
themselves to the task of cleaning up the drivers; nearly 900 changes have
been merged in this time. The bulk of the changes came from Broadcom
employees, but quite a few others have contributed fixes to the drivers as
well.
This work has produced a driver that is free of checkpatch warnings, works
on both little-endian and big-endian platforms, uses kernel libraries where
appropriate, and generally looks much better than it originally did. On
August 24, Broadcom developer Henry Ptasinski decided that the time had
come: he posted a patch moving the Broadcom
drivers into the mainline. Greg Kroah-Hartman, maintainer of the staging
tree, said that he was fine with the driver
moving out of staging if the networking developers agreed. Some of those
developers came up with some technical issues, but it appeared that these
drivers were getting close to ready to make the big move out of staging.
That was when Rafał Miłecki chimed
in: "Henry: a simple question, please explain it to me, what
brcmsmac does provide that b43 doesn't?" Rafał, naturally, is
the maintainer of the b43 driver; b43, which has been in the mainline for
years, is a driver for Broadcom chipsets developed primarily from
reverse-engineered information. It has reached a point where, Rafał
claims, it supports everything Broadcom's driver does with one exception
(BCM4313 chipsets) that will be fixed
soon. Rafał also claims that the b43 driver performs better, supports hardware that brcmsmac does not, and
is, in general, a better piece of code.
So Rafał was clear on what he thought of the brcmsmac driver (brcmfmac
didn't really enter into the discussion); he was
also quite clear on what he would like to
see happen:
We would like to have b43 supported by Broadcom. It sounds much
better, I've shown you a lot of advantages of such a
choice. Switching to brcmsmac on the other hand needs a lot of work
and improvements.
On one hand, Rafał is in a reasonably strong position. The b43 driver
is in the mainline now; there is, in general, a strong resistance to the
merging of duplicate drivers for the same hardware. Code quality is, to
some extent, in the eye of the beholder, but there have been few beholders
who have gone on record to say that they like Broadcom's driver better.
Looking at the situation with an eye purely on the quality of the kernel's
code base in the long term, it is hard to make an argument that the
brcmsmac driver should move into the mainline.
On the other hand, if one considers the feelings of developers and the
desire to have more hardware companies supporting their products with
drivers in the Linux kernel, one has to ask why Broadcom was allowed to put
this driver into staging and merge hundreds of patches improving it if that
driver was never going to make it into the mainline. Letting Broadcom
invest that much work into its driver, then asking it to drop everything
and support the reverse-engineered driver that it declined to support one
year ago seems harsh. It's not a story that is likely to prove
inspirational for developers in other companies who are considering trying
to work more closely with the kernel community.
What seems to have happened here (according mainly to a history posted by Rafał, who is not a
disinterested observer) is that, one year ago, the brcmsmac driver
supported hardware that had no support in b43. Since then, b43 has gained
support for that additional hardware; nobody objected to the addition of
duplicated driver support at that time (as one would probably expect, given
that the Broadcom driver was in staging). Rafał doesn't say whether
the brcmsmac driver was helpful to him in filling out hardware support in
the b43 driver. In the end, it doesn't matter; it would appear that the
need for brcmsmac has passed.
One of the most important lessons for kernel developers to learn is that
one should focus on the end result rather than on the merging of a specific
piece of code. One can argue that Broadcom has what it wanted now: free
driver support for its hardware in the mainline kernel. One could also
argue that Broadcom should have worked on adding support to b43 from the
beginning rather than posting a competing driver. Or, failing that, one
could say that the Broadcom developers should have noticed the improvements
going into b43 and thought about the implications for their own work.
But none of that is
going to make the Broadcom developers feel any better about how things have
turned out. They might come around to working on b43, but one expects that
it is not a hugely appealing alternative at the moment.
The kernel process can be quite wasteful - in terms of code and developer
time lost - at times. Any kernel developer who has been in the community
for a significant time will have invested significant time into code that
went straight into the bit bucket at some time or other. But that doesn't
mean that this waste is good or always necessary. There would be value in
finding more reliable ways to warn developers when they are working on code
that is unlikely to be merged. Kernel development is distributed, and
there are no managers paying attention to how developers spend their time;
it works well in general, but it can lead to situations where
developers work at cross purposes and somebody, eventually, has to lose
out.
That would appear to have happened here. In the short term, the kernel and
its users have come out ahead: we have a choice of drivers for Broadcom
wireless chipsets and can pick the best one to carry forward. Even
Broadcom can be said to have come out ahead if it gains better support for
its hardware under Linux. Whether we will pay a longer-term cost in
developers who conclude that the kernel community is too hard to work with
is harder to measure. But it remains true that the kernel community can,
at times, be a challenging place to work.
Comments (73 posted)
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet