LWN.net Weekly Edition for April 24, 2008

The Grumpy Editor encounters the Hardy Heron

Your editor is not always known for making life easy for himself. Perhaps one of the most clear examples of masochistic behavior would be a certain preference for running development distributions on mission-critical systems. That said, your editor has stuck with a stable distribution on his laptop through a round of intensive travel earlier this year. But that was too easy, so, shortly before heading off to the Linux Foundation's Collaboration Summit, the laptop got moved to the Ubuntu "Hardy Heron" distribution. Needless to say, there have been some interesting ups and downs (literally) since then.

There is always a certain thrill that comes with upgrading a system and finding that important features no longer work. In this case, the problem was suspend and resume, which your editor uses heavily. In fact, the system would suspend just fine - as long as one failed to notice that, behind the cleverly darkened screen, the laptop's backlight had been left on. Needless to say, this new behavior is not helpful if one's goal is to save power while the system is suspended, but it gets worse than that. Your editor discovered this nice surprise after carrying the computer in a backpack for a few hours; by the time it came out, it was almost too hot to hold. Happily, no permanent damage appears to have been done. Or, perhaps, unhappily. Your editor has been looking for an excuse to get a new laptop for a while.

The problem turned out to be a HAL configuration error combined with a strange internal model number which makes your editor's Thinkpad X31 different from, seemingly, every other X31 on the planet. Once your editor found the bug report and attached a "me too" comment, the solution was quick in coming. On the net, one can find complaints that Ubuntu is unresponsive to bug reports, but that was certainly not the experience here.
As an aside, it seems worth noting that life seems to have gotten more complicated, with a lot more code wrapped around the kernel than there once was. The problematic configuration file was /usr/share/hal/fdi/information/10freedesktop/20-video-quirk-pm-ibm.fdi - not a place where your editor, who is not a HAL expert, would have thought to look. That, it seems, is the price of more capable hardware and software, but sometimes your editor pines for the days when it seemed possible to carry a full understanding of the system within a single brain.

GNOME developers are (perhaps unjustly in recent years) known for taking a minimal approach to configuration options. That can be irritating, but just as annoying is their tendency to reset the options they do provide over major updates. Once suspend and resume work, your editor demands something else of a laptop when traveling: absolute silence. So the return of beeps to gnome-terminal was not appreciated. Those were easily silenced, but the GNOME developers also saw fit to bring back the blinking cursor - and they took away the configuration option which abolishes that intolerable feature.

Your editor first ran into the unstoppable blink with Rawhide; a query to the developers there turned up a quick answer. It seems that the GNOME developers have decided to create a single, system-wide parameter to control blinking cursors. Now, your editor approves of the concept of being able to turn off that behavior everywhere with a single switch - but only as long as that switch isn't hidden where nobody will ever find it. In this case, the GNOME developers have taken this feature, wrapped it in old newspapers, and stashed it behind the furnace in the basement; then they put a trunk on top of it. It is a rare user who will find it unassisted. In the hopes that it may save one or two readers from some time spent with a search engine, your editor will now divulge the top-secret incantation which turns blinking cursors off:
gconftool-2 --type bool --set /desktop/gnome/interface/cursor_blink false
Naturally, a terminal window is required to run this command. It would have been nice if the developers who packaged this code for Hardy Heron had found a way to smooth over this change, but no such luck; as far as your editor can tell, no distributor has made that effort.

Another bit of fun is that your editor is no longer able to set the desktop background; the relevant configuration windows are ineffective. In this case, it would appear that the task of implementing the user's background choices has been moved to nautilus - just the place your editor would have thought to look for it. As it happens, your editor has no use for file managers and does not run nautilus - and is punished with an immutable Ubuntu-brown background for that sin. Happily, your editor still knows how to run xsetroot.

All of the above is a set of relatively minor grumbles, all of which are rectified in relatively short order. Once those details have been taken care of, the Hardy Heron release works quite well. One of the biggest aggravations from previous upgrades - having OpenOffice.org reformat the slides in all of your editor's presentations - was not present this time around. Hopefully we are moving into an era where "it didn't mangle my documents" is not something considered worthy of mention.

There was one very nice surprise as well. Your editor's laptop previously required almost 12 watts of power when running unplugged. This laptop is not at the bleeding edge of current technology, so the amount of time it was able to run without a recharge has been dropping for a while. With the Hardy release, steady-state power consumption has dropped to just over 9 watts - a big improvement. The credit for this change belongs to developers at all levels: kernel, applications, distributors, etc. The end result is a system which runs much more efficiently, and that is a good thing.
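As a rough illustration of what that drop in power consumption means for time away from a wall socket, the arithmetic can be sketched as follows. The 12 W and 9 W values are rounded from the "almost 12" and "just over 9" figures above, and the 40 Wh battery capacity is purely hypothetical, chosen only to make the numbers concrete.

```python
def runtime_hours(capacity_wh, draw_watts):
    """Idealized runtime: battery capacity divided by steady-state draw."""
    return capacity_wh / draw_watts


CAPACITY_WH = 40.0  # hypothetical battery capacity, not from the article
BEFORE_W = 12.0     # rounded from "almost 12 watts"
AFTER_W = 9.0       # rounded from "just over 9 watts"

before = runtime_hours(CAPACITY_WH, BEFORE_W)
after = runtime_hours(CAPACITY_WH, AFTER_W)

saving = (BEFORE_W - AFTER_W) / BEFORE_W  # fraction of power saved
gain = after / before - 1.0               # fractional increase in runtime

print(f"power saved: {saving:.0%}, runtime gain: {gain:.0%}")
```

Under these assumptions the runtime gain does not depend on the battery capacity at all; a quarter less power drawn means roughly a third more time on the same charge.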
All told, your editor is reasonably content; this distribution looks like one which might just be worth keeping around. That's a good thing, since Ubuntu plans to maintain it as a "long-term support" release. Not that your editor intends to make much use of that long-term support; there should be a new development series starting soon, after all. One of the nice things about development distributions is that support never ends as long as one stays on the treadmill and the project itself remains alive.
ELC: A taste of the conference

Technical conferences generally provide a wealth of choices, to the point where participants have to make tough decisions at times to pick the session to sit in on. This year's Embedded Linux Conference was no exception; there were multiple slots where the author had to wish that he could be in more than one place at a time. But he did manage to take notes in some of those that he attended; hopefully some of the conference flavor can come through in the following report.

Power management

MontaVista's Kevin Hilman presented an approach for handling power management on embedded devices that focused on changes that can be made to the kernel, but noted that there is much that can be done by applications too. Because of the time and money budgets available for embedded projects, many do not have the resources to do a complete job of tuning the kernel to get the best possible power performance. There is also no "one size fits all" solution for power management; there are too many device-specific issues to allow that.

Hilman's approach is to target specific "building blocks" that embedded developers can incorporate into their project. Each block will provide some savings, so the project can stop when the desired performance is reached - or it is time to ship the device.

One of the easier steps is to customize the idle loop in the kernel, putting the processor to sleep when there is no work to be done. There are different kinds of sleep, though, generally trading off power savings and wakeup latency. The cpuidle subsystem provides a means to specify those values in an architecture-independent way, which, along with a platform-independent "governor", can put the processor into various sleep modes. The only platform-dependent pieces are the hooks to enter each of the different sleep states.
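The trade-off between power savings and wakeup latency can be illustrated with a toy governor. This is a minimal sketch and not the kernel's cpuidle code: the state names, power figures, and latencies are invented for illustration, and the selection rule (the lowest-power state whose exit latency fits both the expected idle period and a latency limit) is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SleepState:
    name: str             # hypothetical state name
    power_mw: float       # power drawn while in this state
    exit_latency_us: int  # time needed to wake up again


# Invented example table, ordered from shallowest to deepest.
STATES = [
    SleepState("wait-for-interrupt", 100.0, 1),
    SleepState("clock-gated", 30.0, 50),
    SleepState("power-gated", 5.0, 2000),
]


def pick_state(states, expected_idle_us, latency_limit_us):
    """Return the lowest-power state whose exit latency is acceptable."""
    usable = [
        s for s in states
        if s.exit_latency_us <= latency_limit_us
        and s.exit_latency_us < expected_idle_us
    ]
    if not usable:
        return states[0]  # fall back to the shallowest state
    return min(usable, key=lambda s: s.power_mw)


print(pick_state(STATES, expected_idle_us=10_000, latency_limit_us=100).name)
```

In this sketch the table of states plays the role of the platform description, while `pick_state` stands in for the platform-independent governor; only the act of entering the chosen state would need hardware-specific code.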
A similar approach is taken by the CPUfreq subsystem, which can reduce the clock frequency of the CPU to reduce power consumption using the Dynamic Voltage and Frequency Scaling (DVFS) feature of some processors. "Operating points" (OPs) - voltage and frequency tuples - are defined for the hardware. There are various generic CPUfreq governors that can then be used to determine when to change OPs and which to change to. The governor will invoke a platform-specific driver to effect that change.

In addition, power management "quality of service" is currently being discussed to allow applications to request a certain level of performance that may override some of the lower-level sleep or frequency decisions.

Embedded SELinux

SELinux has a well-earned reputation for being able to restrict processes to only use those resources that have been specifically allowed by policy, but it is rather resource intensive. Yuichi Nakamura presented Hitachi's research into bringing SELinux into a more resource constrained embedded environment. One of the first problems they encountered was the need for flash filesystems that support extended attributes (xattrs), which is where SELinux stores labels for files. Only jffs2 currently supports xattrs, so that is the one they used.

The next big hurdle was trying to get a set of policies that were stripped down to the needs of an embedded platform. Nakamura started with the SELinux reference policy (refpolicy) and started removing rules. The sheer number of rules and policies that needed to be removed was daunting - as was the need to understand what was being removed. He also ran into strange dependencies: removing a sendmail policy caused a problem in the apache rules. The solution was to create a simplified policy language and policy editor that reduced the problem to something more tractable for the embedded world. In the process it greatly reduced the size of the policy files, from 4.6M down to 60K.
Another problem encountered was the performance and size of SELinux, which is a common embedded woe. Through some hand optimization of the read/write path, along with removing some unused permissions checks, they were able to increase the performance by a factor of ten on their SuperH reference platform. By changing some static buffers in SELinux to a dynamic allocation they also saved 250K of runtime memory. Much of that work was merged into 2.6.24. There is still work to be done, but with the changes, SELinux is viable for embedded platforms.

GCC and kernel hacking

Two sessions provided various tips and tricks for embedded development, with Gene Sally of Timesys focused on GCC, while IBM's Hugh Blemings shared some of the things he has learned from the kernel hackers he works with. Sally discussed the different ways that developers could get a GCC toolchain for their target processor.

One of the bigger hurdles that an embedded developer faces is getting a cross-compiler toolchain - one that runs on his development workstation, but generates code for the target platform. There are several ways to get the toolchain: as a tarball for popular development/target combinations, by using helper tools like crosstool or buildroot from uClibc, or by building it from source directly. Building from source is the most difficult, of course, but allows for the most customizations and flexibility.

Sally went on to describe a handful of useful GCC command-line options for helping to debug cross-compilers or just to better understand what GCC is doing.
Blemings concentrated on the development infrastructure by describing the lab that he used to port the kernel to a Taishan PowerPC-based evaluation board. When undertaking a project like that, "get to know your hardware team" because they will have lots of important information and shortcuts that can be used as part of the board "bringup". At IBM in Canberra, where Blemings is based, they have gotten to the point where they can bring up Linux on any board where they can "access memory and point the PC [program counter] at it"; his tips have come out of that environment.

One of the most important things is to realize that you will be building kernels over and over again, so optimizing your environment for that will save lots of time. His suggestion was to start with a "honkin'" compile box; he described an IBM multi-processor box as an excellent choice but noted that the cost was so high he couldn't get one. It would, however, do "3k/sec" - that's compile 3 kernels per second. In the absence of something like that, he suggested borrowing cycles by using ccache and distcc to reduce and parallelize the compilation that needs to be done. Even adding relatively modest machines into the distcc pool can significantly reduce time spent waiting for a new kernel.

Ubuntu mobile and embedded (UME) and Maemo

One of the hottest areas in embedded Linux these days is the mobile internet device (MID) market. There were two talks on MID-focused distributions, with Canonical's David Mandala giving an overview of Ubuntu Mobile and Embedded (UME) and Nokia's Kate Alhola talking about the status and future directions of Maemo Mobile Linux. UME is a relatively recent addition to the mobile device space - they are anxiously awaiting hardware to run on - whereas Maemo has been around for a while, powering the Nokia N770, N800, and N810 internet tablets.

UME is an effort to apply the Ubuntu distribution and philosophy to touchscreen devices.
Mandala explained that they are taking existing Linux applications and adapting them for small screens that use fingers, rather than keyboard and mouse, as the input device. The resolution of the displays is typically something approaching that of low-end desktops, but the physical space they take up is far smaller (i.e. the dots per inch or DPI is high) making it difficult to do development without actual hardware. The UME project is working with Intel's Moblin.org project to target Atom processor based systems. It uses the Hildon application framework atop GNOME Mobile, running on an Ubuntu 8.04 (Hardy Heron) distribution. Mandala stressed that Linux should be "invisible" on these devices as users just want applications that work to browse the web, use email, and the like. The main focus of UME has, so far, been on the user interface, though power consumption, memory footprint, and speeding up boot times are all on their radar. Canonical is very interested in fostering a community around UME, but that has been "a bit of a challenge", mostly due to a lack of hardware to run on. Mandala expects a few different hardware devices to be available "soon" and that will make it easier to attract a development community. As should come as no surprise after Nokia's purchase of Trolltech early this year, Alhola announced that Maemo would be supporting both GTK and Qt in the near future. This is part of Nokia's belief that there is "no single truth", so Maemo supports multiple paths to development on the platform. Maemo directly supports C, C++, and Python, while the community has added support for Java, Objective C, Vala, and Mono. Nokia makes a very clear distinction in its product line between phones, which are largely closed platforms, and tablets, which are open. Open source software is an essential part of their strategy as they want to build an application ecosystem around their products. "We are taking open source to the consumer mainstream," Alhola explains. 
One of the interesting tools that Nokia is working on as part of Maemo is Scratchbox, which is a toolkit geared towards making cross-compilation easier. It does this by making the development environment look and act like the execution environment, using QEMU to simulate the target hardware. Scratchbox supports both ARM and x86 targets, with experimental support for additional architectures. It uses standard toolchains and distributions where possible and is released under the GPL.

LogFS

LogFS is a flash filesystem that is targeted at the larger flash devices that are becoming more widespread. Unlike some filesystems currently in use, most notably jffs2, LogFS is specifically designed to avoid some of the performance and scalability problems that come with larger devices. Jörn Engel is the developer of LogFS, with some support from the Consumer Electronics Linux Forum (sponsor of ELC), so he gave an update on the status of the project.

Engel used an unconventional scale (the sucks/rules meter) to measure the progress that had been made in the last year. The scale runs from -10 to 10 and measures the "suckiness" of particular features of the filesystem. Taking a page from This Is Spinal Tap, the score for the mount speed of LogFS was measured at 11 both last year and this. It is clearly the feature that Engel is most proud of as it takes 10-60ms to mount a filesystem; a similarly sized jffs2 takes on the order of one second.

Engel looked at around ten separate attributes of the filesystem, first rating them on where LogFS was a year ago, then re-rating based on where it is today. The conclusion is that the average measure has moved from -2.75 to -0.55, so that "on average, it hardly sucks". He says he is getting confident enough to submit it to Andrew Morton for inclusion in his tree, hopefully on its way into the mainline.
Engel is clearly somewhat frustrated with people who are waiting until it is "done" to start using LogFS - though there are some fairly serious usability problems that would tend to limit testers - proclaiming: "LogFS is finished, try it now, today!"

In conclusion

There were more talks, of course, as well as an active "hallway track" for the roughly 175 participants. ELC is a well-run and very interesting conference that is worth consideration for anyone who uses, or plans to use, Linux as an embedded operating system. This year's venue, the Computer History Museum, was a nice facility for a conference of this size. It also had some great exhibits that will bring back memories for anyone who has been using computers, calculators, or game systems over the past 50 years or so - well worth a visit when one is in Silicon Valley.
OLPC at a turning point

It looks like hard times for the One Laptop Per Child project. Quite a few key developers have left, including Mary Lou Jepsen, Ivan Krstić, Andres Salomon, and Walter Bender. Laptop deployments are far below the several million that the project had hoped for by this time, and many of the goals for the system's software have not been achieved. There is persistent talk of supporting Windows, with suggestions that Linux could be dropped altogether. An ongoing thread on the project's development mailing list shows that quite a few participants are concerned about where things are going. To many, it seems, OLPC is about to go down as a noble failure.

These rumors may be just a bit premature, though. When considering what may really come of OLPC, it's worth keeping a few things in mind. One of those is the fact that the project has just completed a major push to its first mass-production system. Your editor has watched the project closely enough to see that, as with many such efforts, the people involved have been putting in lots of long hours to get the job done. When this kind of pressure is lifted, it is natural to take a break, catch up on the house work, and, perhaps, find a new job. So the departure of some key staff at this stage is not entirely surprising.

A look at the state of OLPC's software suggests that the project had set an overly ambitious set of goals for its first release. When that happens, one must jettison some objectives; the later that this is done, the more likely it is that the wrong objectives will be tossed overboard. There are signs that OLPC tried to do too much for too long, with an end result which is not as stable, as fast, or as fully-featured as one would like. As many people close to the project have noted, the laptop's software remains immature. But, as former president Walter Bender put it:
While [we] have heard a lot of noise about performance in the media and
from some members of the development community, it has not, in my
experience been a major road-block in the school trials and
deployments. There are lots of bugs and lots of things that could
be improved upon, and these should certainly be addressed, but the
characterizations being made in this thread do not reflect the
realities of the OLPC deployments--the children and teachers are
using the laptops and are learning.
Finally, the number of laptops delivered to children is far below the level the project had planned upon. Fewer deployments means a lower impact for the project, but it also cannot be helping to create the economies of scale the project had counted on to push the cost down. There have also been some embarrassing failures along the way, including the misplacing of a large number of "Give one get one" orders until after it was too late to include them in the manufacturing run. All of the above points to a need to make some changes in how the project is run. Changes always create uncertainty, so it would be surprising if OLPC participants were not a little nervous at the moment. What happens in the next few months will likely determine OLPC's fate. The project's leadership has famously said in the past that OLPC is an education project, not a laptop project. Some people have recently expressed concerns that, in fact, OLPC is turning into a laptop project, with deployment numbers being the main goal. Nicholas Negroponte doesn't help when he allows himself to be quoted as being "mainly concerned with putting as many laptops as possible in children's hands." If OLPC becomes primarily a low-cost laptop vendor, and especially if it goes to proprietary operating systems as a means toward that end, it will lose much of the community that has grown up around the project. And that would be a shame. There is great beauty in the idea of putting a well-designed learning tool into the hands of children and empowering those children by providing a system which is completely open and hackable. A large and motivated community of highly-capable people came together behind that vision and did their best to rethink how this technology should work and create something better. Deployment groups in a number of countries have gotten the resulting systems into the hands of thousands of children, and many of them are reporting good results. 
A lot of good things have happened here, and it doesn't have to end now. But it might end soon. To pull things together, the project will have to communicate a clearer vision of where it plans to go with its software at all levels; Mr. Negroponte's statement of continued support for Sugar appears to be an attempt to start this process. The operational side of the project needs to get its act together. Some transparency on, for example, what is being done with donation money and what agreements have been made with outside corporations, would be most helpful. And, most of all, the group of volunteers working with this project have to be convinced anew that they are not wasting their time. If the project's leadership can manage all of that, there may well be great things coming from OLPC in the future.
Page editor: Jonathan Corbet

Security

Image handling vulnerabilities

Bugs that linger for eight years without a fix are probably annoying to whoever reported them; perhaps others as well. When those bugs have possible security implications, it is hard to see how they can remain unfixed for even eight months, let alone years, but that appears to be the case with some GTK image handling bugs. Code to handle image formats has been the source of numerous vulnerabilities along the way, which makes it even harder to see why these have languished so long.

A call for ideas for a hackfest on the GNOME foundation mailing list seems like a bit of a strange place to find information about vulnerabilities, but in the ensuing thread, Michael Chudobiak brought up some bugs that he would like to see addressed, perhaps as part of a hackfest:
I'd like to suggest one possible topic: The pixbuf loaders. They're slow
and memory intensive, and this drags down anything that needs thumbnails
(Nautilus, etc). There is a lot of opportunity to improve the
responsiveness of the desktop here.
The bugs he listed were from 2002 (80925), 2004 (142428), and 2008 (522803), but Alan Cox mentioned that he reported one of them as a GNOME security bug "about eight years ago". In his opinion all of the bugs were of the "well known, never fixed" variety. Because the code in question lives in GTK—used by many GNOME applications—"quite a few gnome apps fed small compressed images explode". The basic problem is that the routines handling images create the full-resolution image in memory regardless of the size requested. In addition, various memory-intensive techniques are used to scale the image to the requested size. This impacts Nautilus and other GNOME programs that create thumbnails of large images. Presumably, a denial of service, at a minimum, can result from these operations, though there may be other ways to exploit any program crashes that result. Cox has a plan to see them get fixed:
Unfortunately they are well known but nobody seems to care. I'll forward
your message to the vendor security list and we'll see what happens.
Probably the bug just needs to be made *very* public to incentivise
people to fix it 8)
The vendor security list, often abbreviated vendor-sec, is a closed mailing list for distribution security teams to exchange information about vulnerabilities in various programs. It is closed so that bugs that are not publicly known can be freely discussed. Whether Cox's posting to that list spurs any action remains to be seen. It is a rare week where LWN does not report some kind of image handling botch as a new vulnerability. This week, a cups vulnerability in handling PNG files could lead to a denial of service; last week we reported an Opera vulnerability in handling images in HTML canvas elements that could possibly lead to arbitrary code execution. Image handling is an area where all bugs need to be scrutinized carefully for potential security issues. Hopefully, part of the problem is that the GNOME hackers did not realize the security implications of the bugs. There does seem to be ample complaint about performance problems, though, to get some kind of action over the last six or eight years. This is a set of related bugs that have seemingly been overlooked for a long time. Perhaps that time is now coming to an end.
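The cost of decoding an image at full resolution when only a thumbnail was requested, as described earlier in this article, can be estimated with simple arithmetic. This sketch assumes 4 bytes per pixel (8-bit RGBA) and uses made-up image dimensions; it does not call any GTK or gdk-pixbuf API.

```python
BYTES_PER_PIXEL = 4  # assumption: 8-bit RGBA


def decoded_bytes(width, height, bpp=BYTES_PER_PIXEL):
    """Memory needed to hold an uncompressed image of the given size."""
    return width * height * bpp


# Hypothetical source image and requested thumbnail size.
full = decoded_bytes(10_000, 10_000)
thumb = decoded_bytes(128, 128)

print(f"full decode: {full / 2**20:.0f} MiB")
print(f"thumbnail:   {thumb / 2**10:.0f} KiB")
print(f"ratio:       {full // thumb}x")
```

A small compressed file can describe a very large image, so the size of the file on disk gives no hint of how much memory the decoder will need.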
New vulnerabilities

clamav: buffer overflows
clamav: multiple vulnerabilities
cups: arbitrary code execution
dbmail: authentication bypass
fedora-ds-admin: privilege escalation and arbitrary command execution
feh: shell command injection
firefox: denial of service
ikiwiki: cross-site request forgery
mplayer: arbitrary code execution
mt-daapd: integer overflow
openfire: denial of service
openoffice.org: multiple vulnerabilities
php-toolkit: denial of service
poppler: arbitrary code execution
python2.4: arbitrary code execution
speex: insufficient boundary checks
sun java: multiple vulnerabilities
suphp: privilege escalation
vlc: multiple vulnerabilities
WebKit: cross-site scripting and code execution
Page editor: Jake Edge
Kernel development

Release status

Kernel release status

The 2.6.26 merge window is open, so there is no current 2.6 development release. See the article below for a summary of the patches merged for 2.6.26 so far.

The current -mm tree is 2.6.25-mm1. Recent changes to -mm include some read-copy-update enhancements and the OLPC architecture support code, but mostly -mm is just getting ready for the big flow of patches into the mainline. See the -mm merge plans document for Andrew's plans for 2.6.26.

The current stable 2.6 kernel is 2.6.25, released on April 16. After nearly three months of development and the merging of over 12,000 patches from almost 1200 developers, this kernel is now considered ready for wider use. Highlights of this release include the ath5k (Atheros wireless) driver, a bunch of realtime work including realtime group scheduling, preemptable RCU, LatencyTop support, a number of new ext4 filesystem features, support for the controller area network protocol, more network namespace work, the return of the timerfd() system call, the page map patches (providing much better information on system memory use), the SMACK security module, better kernel support for Intel and ATI R500 graphics chipsets, the memory use controller, ACPI thermal regulation support, MN10300 architecture support, and much more. See the KernelNewbies 2.6.25 page for lots of details, or the full changelog for unbelievable amounts of detail.

2.6.24.5 was released on April 18. It contains a relatively long list of fixes for significant 2.6.24 problems.

For older kernels: 2.4.36.3 was released on April 19. "Nothing outstanding here, I've just decided to release pending fixes. Those already running 2.4.36.2 have no particular reason to upgrade, unless they already experience troubles in the fixed areas."
Kernel development news

Quotes of the week
In any case, we'll continue to use the fact that mprotect is also
broken to get our WC mapping working (using mprotect PROT_NONE
followed by mprotect PROT_READ|PROT_WRITE causes the CD and WT bits
to get cleared). We're fortunate in this case that we've found a
bug to exploit that gives us the desired behaviour.
-- Keith Packard
Nice-looking code - kgdb has improved rather a lot. I'm glad we
finally got it in. Maybe one day I'll get to use it again.
-- Andrew Morton
/me duly notes this request to break Andrew's systems even more frequently ;-)
-- Ingo Molnar
The 2.6.26 merge window opens

That shiny new 2.6.25 kernel which was released on April 16 is now ancient history; some 3500 changesets have been merged into the mainline git repository since then. Some of the most significant user-visible changes include:
Changes visible to kernel developers include:
Needless to say, this development series is still young and, as of this writing, the merge window has over a week to run. So there will be a lot more code going into the mainline before the shape of 2.6.26 becomes clear.
4K stacks by default?

The kernel stack is a rather important chunk of memory in any Linux system. The unpleasant kernel memory corruption that results from overflowing it is something that is to be avoided at all costs. But the stack is allocated for each process and thread in the system, so those who are looking to reduce memory usage target the 8K stack used by default on x86. In addition, an 8K stack requires two physically contiguous pages (an "order 1" allocation) which can be difficult to satisfy on a running system due to fragmentation.

Linux has had optional support for 4K stacks for nearly four years now, with Fedora and RHEL enabling it on the kernels they ship, but a recent patch to make it the default for x86 has raised some eyebrows. Andrew Morton sees it as bypassing the normal patch submission process:
This patch will cause kernels to crash.
It has no changelog which explains or justifies the alteration. afaict the patch was not posted to the mailing list and was not discussed or reviewed.
It is not surprising that patch author Ingo Molnar sees things a little differently:
what mainline kernels crash and how will they crash? Fedora and other
distros have had 4K stacks enabled for years [ ... ]
and we've conducted tens of thousands of bootup tests with all sorts of
drivers and kernel options enabled and have yet to see a single crash
due to 4K stacks. So basically the kernel default just follows the
common distro default now. (distros and users can still disable it)
As described in an earlier LWN article, the main concerns about only providing 4K for the kernel stack are for complicated storage configurations or for people using NDISwrapper. There is fairly high disdain for the latter case—as it is done to load proprietary Windows drivers into the kernel—but it could lead to a pretty hideous failure in the former. Data corruption certainly seems like a possibility, but, regardless, a kernel crash is definitely not what an administrator wants to have to deal with. Arjan van de Ven summarized the current state, noting that NDISwrapper really requires 12K stacks, so having 8K only makes it less likely those kernels will crash. The stacking of multiple storage drivers (network filesystems, device mapper, RAID, etc.) is a bigger issue:
we need to know which they are, and then solve them, because even on x86-64
with 8k stacks they can be a problem (just because the stack frames are
bigger, although not quite double, there).
Proponents of default 4K stacks seem to be puzzled why there is objection to the change since there have been no problems with Red Hat kernels. But Andi Kleen notes:
One way they do that is by marking significant parts of the kernel
unsupported. I don't think that's an option for mainline.
The xfs filesystem, which is not supported in RHEL or Fedora, can potentially use a great deal of stack. This leads some kernel hackers to worry that a complicated configuration that uses it, an "nfs+xfs+md+scsi writeback" configuration as Eric Sandeen puts it, could overflow. Work is already proceeding to reduce the xfs stack usage, but it clearly is a problem that xfs hackers have seen. David Chinner responds to a question about stack overflows:
We see them regularly enough on x86 to know that the first question
to any strange crash is "are you using 4k stacks?". In comparison,
I have never heard of a single stack overflow on x86_64....
It would seem premature to make 4K stacks the default. There is good reason to believe that folks using xfs could run into problems. But there is a larger issue, one that Morton brought up in his initial message, then reiterated later in the thread:
Anyway. We should be having this sort of discussion _before_ a patch
gets merged, no?
The memory savings can be significant, especially in the embedded world. Coupled with the elimination of order 1 allocations each time a process gets created, there is good reason to keep working toward 4K stacks by default. As of this writing, 4K stacks remain the default in Linus's tree, but that could change before long.
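To put rough numbers on the savings, here is a minimal sketch of the arithmetic. It assumes 4096-byte pages (the usual x86 value); the helper names and the 1000-thread figure are hypothetical and chosen only for illustration.

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL  /* assumption: x86 page size */

/* Allocation order: smallest n such that (page size << n) covers the stack. */
unsigned int stack_alloc_order(unsigned long stack_size)
{
    unsigned int order = 0;

    while ((PAGE_SIZE_BYTES << order) < stack_size)
        order++;
    return order;
}

/* Total kernel-stack memory consumed by nthreads processes and threads. */
unsigned long stack_total_bytes(unsigned long nthreads, unsigned long stack_size)
{
    return nthreads * stack_size;
}

/* Worked example: 1000 threads (a hypothetical workload). */
void stack_examples(void)
{
    assert(stack_alloc_order(8192) == 1);  /* two contiguous pages */
    assert(stack_alloc_order(4096) == 0);  /* a single page */
    assert(stack_total_bytes(1000, 8192) -
           stack_total_bytes(1000, 4096) == 4096000UL);
}
```

With these assumptions, 1000 threads save a little under 4MB, and every stack allocation becomes a single-page request that fragmentation cannot block.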
ELC: Morton and Saxena on working with the kernel community In many ways, Andrew Morton's keynote set the tone for this year's Embedded Linux Conference (ELC) by describing the ways that embedded companies and developers can work with the kernel community in a way that will be "mutually beneficial". Morton provided reasons, from a purely economic standpoint, why it makes sense for companies to get their code into the mainline kernel. He also provided concrete suggestions on how to make that happen. The theme of the conference seemed to be "working with the community" and Morton's speech provided an excellent example of how and why to do just that. Conference organizer Tim Bird introduced the keynote as "the main event" for ELC, noting that he often thought of Morton as "kind of like the adult in the room" on linux-kernel. Readers of that mailing list tend to get the impression that there's more than one of him around because of all that he does. He also noted that it was surprising to some that Morton has an embedded Linux background—from his work at Digeo. Morton believes that embedded development is underrepresented in kernel.org work relative to its economic importance. This is caused by a number of factors, not least the financial constraints under which much embedded development is done. An exceptional case is the chip and board manufacturers who have a big interest in seeing Linux run well on their hardware so that they can attract more customers. But even those do not contribute as much as he would like to see to kernel development. An effect of this underrepresentation is a risk that it will tilt kernel development more toward the server and desktop. The kernel team is already accused of being server-centric, and there is some truth to that, "but not as much as one might think". Kernel hackers do care about the desktop as well as embedded devices, but without an advocate for embedded concerns, sometimes things get missed. 
Something Morton would like to see is a single full-time "embedded maintainer". That person would serve as the advocate for embedded concerns, ensuring that they didn't get overlooked in the process. An embedded maintainer could make a significant impact for embedded development. Not all kernel contributions need to be code, he said. There is a need just to hear the problems that are being faced by the embedded community along with lists of things that are missing. "Senior, sophisticated people" are needed to help prioritize the features that are being considered as well. Morton often finds out things he didn't know at conferences, things that he should have known about much earlier: "That's bad!" Morton is trying to incite the embedded community to interact with the kernel hackers more on linux-kernel. He said that a great way to get the attention of the team is to come onto the mailing list and make them look bad. Unfavorable comparisons to other systems or earlier kernels, for example, especially when backed up with numbers, are noticed quickly. He said that it is important to remember that the person who makes the most noise gets the most attention. One of the areas that he is most concerned about is the practice of "patch hoarding": holding on to kernel changes as patches without submitting them upstream to the kernel hackers. It is hopefully only due to a lack of resources, but he has heard that some are doing it to try and gain a competitive advantage. This is simply wrong, he said, companies have a "moral if not legal obligation" to submit those patches. There are many good reasons for getting code merged upstream that Morton outlined.
The code will be better because of the review done by the kernel hackers; once it is done, the maintenance cost falls to near zero as well. He also touted the competitive advantage, noting that getting your code merged means that you have won—competing proposals won't get in. Being the first to merge a feature can make it easier on yourself and harder on your competition. There are downsides to getting your code upstream as well. Most of those stem from not getting code out there early enough for review. The kernel developers can ask for significant changes to the code especially in the area of user space interfaces. If a company already has lots of code using the new feature and/or interface, it could be very disruptive; "sorry, there's no real fix for that except getting your code out early enough". Another downside that companies may run into is with competitors being brought into the process. Morton and other kernel hackers will try to find others who might have a stake in a new feature to get them involved so that everybody's needs are taken into account. This can blunt the "win" of getting your feature merged. Some are also concerned that competitors will get access to the code once it has been submitted; "tough luck" Morton said, everything in the kernel is GPL. Morton had specific suggestions for choosing a kernel version to use for an embedded project. 2.6.24 is not a lot better than 2.4.18 for embedded use, but it has one important feature: the kernel team will be interested in bugs in the current kernel. He suggests starting with the current kernel, upgrading it while development proceeds, freezing it only when it is time to ship the product. He also suggests that a company create an internal kernel team with one or two people who are the interface to linux-kernel. This will help with name recognition on the mailing list, which will in turn get patches submitted more attention. 
Over time, by participating and reviewing others' code, the interface people will build up "brownie points" that will allow them to call in favors to get their code reviewed, or to help smooth the path for inclusion. The kernel.org developers appear to give free support, generally very good support, Morton said, but it is not truly free. Kernel hackers do it as a "mutually beneficial transaction"; they don't do it to make more money for your company, they do it to make the kernel better. Morton is definitely a big part of that, inviting people to email him, especially if "five minutes of my time can save months of yours". The decision about when to merge a new feature is hard for some to understand. Many consider Linux a dictatorship, which is incorrect, it is instead "a democracy that doesn't vote". The merge decision is made on the model of the "rule of law" with kernel hackers playing the role of judges. Unfortunately, there are few written rules. Some of the factors that go into his decision about a particular feature are its maintainability, whether there will be an ongoing maintenance team, as well as the general usefulness of the feature. Depending on the size of the feature, an ongoing maintenance team can be the deciding factor. It is not so important for a driver, but a new architecture, for example, needs ongoing maintenance that can only be done by people with knowledge of and access to the hardware. MontaVista kernel hacker, Deepak Saxena, gave a presentation entitled "Appropriate Community Practices: Social and Technical Advice" later in the conference that mirrored many of Morton's points. He showed some examples of hardware vendors making bad decisions that got shot down by the kernel developers, mostly because they didn't "release early and release often". There is a dangerous attitude that "it's Linux, it's open source, I can do anything I want" which is true, but won't get you far with the community. 
Saxena has high regard for the benefits of working with the system: if your competitor is active in the community, they are getting an advantage that you aren't. Like Morton, he believes that some members of the development team need to get involved in kernel.org activities. "The community is an extension of your team, your team is an extension of the community." He also has specific advice for hardware vendors: avoid abstraction layers, recognize that your hardware is not unique, and think beyond the reference board implementation. Generalizing your code so that others can use it will make it much more acceptable, as will talking with the developers responsible for the subsystems you are touching. Abstraction layers may be helpful for hardware vendors trying to support multiple operating systems, but they make it difficult for the kernel hackers to understand and maintain the code. The kernel.org folks are not interested in finding and fixing bugs in an abstraction layer. He also points out additional benefits of getting code merged. Once it is in the kernel, the company's team will no longer have to keep up with kernel releases, updating their patches to follow the latest changes. The code will still need to be maintained, but day-to-day changes will be handled by the kernel.org folks. An additional benefit is that the code will be enhanced by various efforts to automatically find bugs in mainline kernel code with tools like lockdep. It is clear that the kernel hackers are making a big effort to not only get code from the embedded folks, but also some of their expertise. There are various outreach efforts to try and get more people involved in the Linux development process; these two talks are certainly a part of that. By making it clear that there are benefits to both parties, they hope to make an argument that will reach up from engineering to management resulting in a better kernel for all.
Integrating and Validating dynticks and Preemptable RCU IntroductionRead-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU is most frequently described as a replacement for reader-writer locking, but it has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. In early 2008, a preemptable variant of RCU was accepted into mainline Linux in support of real-time workloads, a variant similar to the RCU implementations in the -rt patchset since August 2005. Preemptable RCU is needed for real-time workloads because older RCU implementations disable preemption across RCU read-side critical sections, resulting in excessive real-time latencies. However, one disadvantage of the -rt implementation was that each grace period required work to be done on each CPU, even if that CPU is in a low-power “dynticks-idle” state, and thus incapable of executing RCU read-side critical sections. The idea behind the dynticks-idle state is that idle CPUs should be physically powered down in order to conserve energy. In short, preemptable RCU can disable a valuable energy-conservation feature of recent Linux kernels. Although Josh Triplett and Paul McKenney had discussed some approaches for allowing CPUs to remain in low-power state throughout an RCU grace period (thus preserving the Linux kernel's ability to conserve energy), matters did not come to a head until Steve Rostedt integrated a new dyntick implementation with preemptable RCU in the -rt patchset. This combination caused one of Steve's systems to hang on boot, so in
October, Paul coded up a dynticks-friendly modification to preemptable RCU's
grace-period processing.
Steve coded up rcu_irq_enter() and rcu_irq_exit() interfaces called from the irq_enter() and irq_exit() interrupt entry/exit functions. Paul reviewed the code repeatedly from October 2007 to February 2008, and almost always found at least one bug. In one case, Paul even coded and tested a fix before realizing that the bug was illusory; in fact, the “bug” turned out to be illusory in every case. Near the end of February, Paul grew tired of this game. He therefore decided to enlist the aid of Promela and spin, as described in the LWN article Using Promela and Spin to verify parallel algorithms. This article presents a series of seven increasingly realistic Promela models, the last of which passes, consuming about 40GB of main memory for the state space.
Quick Quiz 1:
Yeah, that's great!!!
Now, just what am I supposed to do if I don't happen to have a machine with
40GB of main memory???
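More important, Promela and Spin did find a very subtle bug for me!!!
This article is organized as follows: These sections are followed by conclusions and answers to the Quick Quizzes.
Introduction to Preemptable RCU and dynticks
The per-CPU dynticks_progress_counter variable is central to the interface between dynticks and preemptable RCU: as the checks in the code below indicate, it has an even value whenever the corresponding CPU is in dynticks-idle mode, and an odd value otherwise.
Preemptable RCU's grace-period machinery samples the value of the dynticks_progress_counter variable in order to determine whether a given CPU is in dynticks-idle mode and therefore need not be waited for.
The following three sections give an overview of the task interface, the interrupt/NMI interface, and the use of the dynticks_progress_counter variable by the grace-period machinery.
Task Interface
When a given CPU enters dynticks-idle mode because it has no more tasks to run, it invokes rcu_enter_nohz():
1 static inline void rcu_enter_nohz(void)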
2 {
3 mb();
4 __get_cpu_var(dynticks_progress_counter)++;
5 WARN_ON(__get_cpu_var(dynticks_progress_counter) & 0x1);
6 }
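This function simply increments dynticks_progress_counter and checks that the result is even, but first executes a memory barrier.
Similarly, when a CPU that is in dynticks-idle mode prepares to start executing a newly runnable task, it invokes rcu_exit_nohz():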
1 static inline void rcu_exit_nohz(void)
2 {
3 __get_cpu_var(dynticks_progress_counter)++;
4 mb();
5 WARN_ON(!(__get_cpu_var(dynticks_progress_counter) & 0x1));
6 }
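This function again increments dynticks_progress_counter, but follows the increment with a memory barrier and then checks that the result is odd.
The Interrupt Interface
The rcu_irq_enter() and rcu_irq_exit() functions handle interrupt entry and exit. Interrupt entry is handled by the rcu_irq_enter() function shown below:
1 void rcu_irq_enter(void)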
2 {
3 int cpu = smp_processor_id();
4
5 if (per_cpu(rcu_update_flag, cpu))
6 per_cpu(rcu_update_flag, cpu)++;
7 if (!in_interrupt() &&
8 (per_cpu(dynticks_progress_counter, cpu) & 0x1) == 0) {
9 per_cpu(dynticks_progress_counter, cpu)++;
10 smp_mb();
11 per_cpu(rcu_update_flag, cpu)++;
12 }
13 }
Quick Quiz 2:
Why not simply increment rcu_update_flag, and then only
increment dynticks_progress_counter if the old value
of rcu_update_flag was zero???
Line 3 fetches the current CPU's number, while lines 5 and 6
increment the rcu_update_flag nesting counter if it
is already non-zero.
Quick Quiz 3:
But if line 7 finds that we are the outermost interrupt, wouldn't
we always need to increment dynticks_progress_counter?
Lines 7 and 8 check to see whether we are the outermost level of
interrupt, and, if so, whether dynticks_progress_counter
needs to be incremented.
If so, line 9 increments dynticks_progress_counter,
line 10 executes a memory barrier, and line 11 increments
rcu_update_flag.
As with rcu_exit_nohz(), the memory barrier ensures that
any other CPU that sees the effects of an RCU read-side critical section
in the interrupt handler (following the rcu_irq_enter()
invocation) will also see the increment of
dynticks_progress_counter.
Interrupt exit is handled similarly by rcu_irq_exit():
1 void rcu_irq_exit(void)
2 {
3 int cpu = smp_processor_id();
4
5 if (per_cpu(rcu_update_flag, cpu)) {
6 if (--per_cpu(rcu_update_flag, cpu))
7 return;
8 WARN_ON(in_interrupt());
9 smp_mb();
10 per_cpu(dynticks_progress_counter, cpu)++;
11 WARN_ON(per_cpu(dynticks_progress_counter, cpu) & 0x1);
12 }
13 }
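Line 3 fetches the current CPU's number, as before.
Line 5 checks to see if rcu_update_flag is non-zero; if it is zero, the function does nothing further. Otherwise, line 6 decrements rcu_update_flag, and line 7 returns if the result is still non-zero, meaning that we are still within a nested interrupt handler. For the outermost interrupt, line 8 checks that we are no longer in interrupt context, line 9 executes a memory barrier, line 10 increments dynticks_progress_counter, and line 11 checks that the counter is now even.
These two sections have described how the dynticks_progress_counter variable is maintained as tasks and interrupts enter and exit dynticks-idle mode.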
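The even/odd convention and the interrupt nesting count can be exercised in a small single-threaded user-space sketch. This is an illustration only, not the kernel code: the per-CPU variables become plain globals for one CPU, the memory barriers are dropped because a single thread has nothing to order, WARN_ON() becomes assert(), and irq_depth, irq_enter_sim() and irq_exit_sim() are hypothetical stand-ins for in_interrupt() and the kernel's interrupt entry/exit paths. The sketch also assumes that rcu_irq_enter() runs before the interrupt depth is raised and rcu_irq_exit() after it is lowered, and that the CPU starts out idle.

```c
#include <assert.h>

long dynticks_progress_counter = 0; /* even: dynticks-idle; odd: not idle */
int rcu_update_flag = 0;            /* nesting count for irqs taken while idle */
int irq_depth = 0;                  /* hypothetical stand-in for in_interrupt() */

void rcu_enter_nohz(void)           /* CPU goes idle: counter becomes even */
{
    dynticks_progress_counter++;
    assert((dynticks_progress_counter & 0x1) == 0);
}

void rcu_exit_nohz(void)            /* CPU leaves idle: counter becomes odd */
{
    dynticks_progress_counter++;
    assert((dynticks_progress_counter & 0x1) == 1);
}

void rcu_irq_enter(void)
{
    if (rcu_update_flag)
        rcu_update_flag++;          /* nested irq: only bump the nesting count */
    if (irq_depth == 0 && (dynticks_progress_counter & 0x1) == 0) {
        dynticks_progress_counter++; /* outermost irq from idle: go odd */
        rcu_update_flag++;
    }
}

void rcu_irq_exit(void)
{
    if (rcu_update_flag) {
        if (--rcu_update_flag)
            return;                 /* still inside a nested handler */
        assert(irq_depth == 0);
        dynticks_progress_counter++; /* outermost irq done: back to even */
        assert((dynticks_progress_counter & 0x1) == 0);
    }
}

/* Assumed call ordering around the interrupt-depth bookkeeping. */
void irq_enter_sim(void) { rcu_irq_enter(); irq_depth++; }
void irq_exit_sim(void)  { irq_depth--; rcu_irq_exit(); }

/* Two nested interrupts arriving while the CPU is idle. */
void nested_irq_example(void)
{
    long start = dynticks_progress_counter;

    assert((start & 0x1) == 0);     /* precondition: CPU is idle */
    irq_enter_sim();
    assert(dynticks_progress_counter == start + 1);
    irq_enter_sim();                /* nested: counter must not move */
    assert(dynticks_progress_counter == start + 1);
    irq_exit_sim();
    assert(dynticks_progress_counter == start + 1);
    irq_exit_sim();                 /* outermost exit: idle again */
    assert(dynticks_progress_counter == start + 2);
}
```

In this sketch the counter moves only on the outermost interrupt entry and exit, so a nested pair of interrupts taken from idle advances it by exactly two, while an interrupt taken on a non-idle CPU leaves it untouched.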
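Grace-Period Interface
Of the four preemptable RCU grace-period states (described in
The Design of Preemptable Read-Copy Update),
only the two served by the rcu_try_flip_waitack_needed() and rcu_try_flip_waitmb_needed() functions need to wait for other CPUs to respond.
Of course, if a given CPU is in dynticks-idle state, we shouldn't
wait for it.
Therefore, just before entering one of these two states,
the preceding state takes a snapshot of each CPU's
dynticks_progress_counter variable, placing the snapshot in rcu_dyntick_snapshot: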
1 static void dyntick_save_progress_counter(int cpu)
2 {
3 per_cpu(rcu_dyntick_snapshot, cpu) =
4 per_cpu(dynticks_progress_counter, cpu);
5 }
The rcu_try_flip_waitack_needed() function is shown below:
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if ((curr - snap) > 2 || (snap & 0x1) == 0)
13 return 0;
14 return 1;
15 }
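Lines 7 and 8 pick up current and snapshot versions of
dynticks_progress_counter, respectively. Lines 10 and 11 return zero (no need to wait for this CPU) if the counter is unchanged and even, that is, if the CPU has remained in dynticks-idle state since the snapshot was taken. Lines 12 and 13 also return zero if the counter has advanced by more than two, or if the snapshot was even, meaning that the CPU was in dynticks-idle state when the snapshot was taken. Otherwise, line 14 returns one.
For its part, the rcu_try_flip_waitmb_needed() function is shown below:
1 static inline int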
2 rcu_try_flip_waitmb_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if (curr != snap)
13 return 0;
14 return 1;
15 }
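This is quite similar to rcu_try_flip_waitack_needed(), the difference being in lines 12 and 13, which return zero if the counter has changed at all since the snapshot was taken.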
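For readers who want to experiment with the two checks, here is a user-space sketch that transcribes their conditions unchanged, passing the current and snapshot counter values as parameters instead of reading per-CPU variables. The shortened function names and the sample values are hypothetical; the sketch only shows what the conditions as written return and makes no claim about whether they are the right conditions.

```c
#include <assert.h>

/* Conditions copied from rcu_try_flip_waitack_needed(). */
int waitack_needed(long curr, long snap)
{
    if ((curr == snap) && ((curr & 0x1) == 0))
        return 0;   /* unchanged and even: idle the whole time */
    if ((curr - snap) > 2 || (snap & 0x1) == 0)
        return 0;   /* advanced by more than two, or idle at snapshot */
    return 1;
}

/* Conditions copied from rcu_try_flip_waitmb_needed(). */
int waitmb_needed(long curr, long snap)
{
    if ((curr == snap) && ((curr & 0x1) == 0))
        return 0;   /* unchanged and even: idle the whole time */
    if (curr != snap)
        return 0;   /* any change at all since the snapshot */
    return 1;
}

/* A few sample (curr, snap) pairs and what the code above returns. */
void needed_examples(void)
{
    assert(waitack_needed(4, 4) == 0);  /* idle then, idle now */
    assert(waitack_needed(5, 4) == 0);  /* idle at snapshot time */
    assert(waitack_needed(5, 5) == 1);  /* non-idle and unchanged */
    assert(waitack_needed(9, 5) == 0);  /* advanced by four */
    assert(waitmb_needed(5, 5) == 1);   /* non-idle and unchanged */
    assert(waitmb_needed(6, 5) == 0);   /* changed since snapshot */
}
```

Trying further pairs, particularly where the counter has advanced by exactly one or two, is a reasonable way to approach the Quick Quiz that follows.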
Quick Quiz 4:
Can you spot any bugs in any of the code in this section?
We have now seen all the code involved in the interface between RCU and the dynticks-idle state. The next section builds up the Promela model used to validate this code.
Validating Preemptable RCU and dynticks
This section develops a Promela model for the interface between dynticks and RCU step by step, with each of the following sections illustrating one step, starting with the process-level code, adding assertions, interrupts, and finally NMIs.
Basic Model
This section translates the process-level dynticks entry/exit
code and the grace-period processing into
Promela.
We start with rcu_exit_nohz() and rcu_enter_nohz(), which are modeled by the dyntick_nohz() process below:
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5
6 do
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
8 :: i < MAX_DYNTICK_LOOP_NOHZ ->
9 tmp = dynticks_progress_counter;
10 atomic {
11 dynticks_progress_counter = tmp + 1;
12 assert((dynticks_progress_counter & 1) == 1);
13 }
14 tmp = dynticks_progress_counter;
15 atomic {
16 dynticks_progress_counter = tmp + 1;
17 assert((dynticks_progress_counter & 1) == 0);
18 }
19 i++;
20 od;
21 }
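Lines 6 and 20 define a loop.
Line 7 exits the loop once the loop counter i has reached the limit MAX_DYNTICK_LOOP_NOHZ.
Lines 9-13 model the increment and check in rcu_exit_nohz(), and lines 14-18 model those in rcu_enter_nohz().
Quick Quiz 5:
Why isn't the memory barrier in
rcu_exit_nohz()
and rcu_enter_nohz() modeled in Promela?
Quick Quiz 6:
Isn't it a bit strange to model rcu_exit_nohz() followed by rcu_enter_nohz()?
Each pass through the loop therefore models a CPU exiting dynticks-idle mode (for example, starting to execute a task), then re-entering dynticks-idle mode (for example, that same task blocking).
The next step is to model the interface to RCU's grace-period
processing.
For this, we need to model dyntick_save_progress_counter(), rcu_try_flip_waitack_needed(), and rcu_try_flip_waitmb_needed(), which the grace_period() process below does: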
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 atomic {
7 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
8 snap = dynticks_progress_counter;
9 }
10 do
11 :: 1 ->
12 atomic {
13 curr = dynticks_progress_counter;
14 if
15 :: (curr == snap) && ((curr & 1) == 0) ->
16 break;
17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
18 break;
19 :: 1 -> skip;
20 fi;
21 }
22 od;
23 snap = dynticks_progress_counter;
24 do
25 :: 1 ->
26 atomic {
27 curr = dynticks_progress_counter;
28 if
29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
31 :: (curr != snap) ->
32 break;
33 :: 1 -> skip;
34 fi;
35 }
36 od;
37 }
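Lines 6-9 print out the loop limit (but only into the .trail file
in case of error) and model the snapshot taken by dyntick_save_progress_counter().
Lines 10-22 model the relevant code in rcu_try_flip_waitack_needed().
Line 23 models another snapshot taken by dyntick_save_progress_counter().
Finally, lines 24-36 model the relevant code in rcu_try_flip_waitmb_needed().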
Quick Quiz 7:
Wait a minute!
In the Linux kernel, both
dynticks_progress_counter and
rcu_dyntick_snapshot are per-CPU variables.
So why are they instead being modeled as single global variables?
The resulting model, when run with the runspin.sh script, generates 691 states and passes without errors, which is not at all surprising given that it completely lacks the assertions that could find failures. The next section therefore adds safety assertions.
Validating Safety
A safe RCU implementation must never permit a grace period to
complete before the completion of any RCU readers that started
before the start of the grace period.
This is modeled by a grace_period_state variable that can take on three states as follows:
1 #define GP_IDLE 0
2 #define GP_WAITING 1
3 #define GP_DONE 2
4 byte grace_period_state = GP_DONE;
The grace_period() process sets this variable as it progresses through the grace-period phases, as shown below:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 grace_period_state = GP_IDLE;
7 atomic {
8 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
9 snap = dynticks_progress_counter;
10 grace_period_state = GP_WAITING;
11 }
12 do
13 :: 1 ->
14 atomic {
15 curr = dynticks_progress_counter;
16 if
17 :: (curr == snap) && ((curr & 1) == 0) ->
18 break;
19 :: (curr - snap) > 2 || (snap & 1) == 0 ->
20 break;
21 :: 1 -> skip;
22 fi;
23 }
24 od;
25 grace_period_state = GP_DONE;
26 grace_period_state = GP_IDLE;
27 atomic {
28 snap = dynticks_progress_counter;
29 grace_period_state = GP_WAITING;
30 }
31 do
32 :: 1 ->
33 atomic {
34 curr = dynticks_progress_counter;
35 if
36 :: (curr == snap) && ((curr & 1) == 0) ->
37 break;
38 :: (curr != snap) ->
39 break;
40 :: 1 -> skip;
41 fi;
42 }
43 od;
44 grace_period_state = GP_DONE;
45 }
Quick Quiz 8:
Given there are a pair of back-to-back changes to
grace_period_state on lines 25 and 26,
how can we be sure that line 25's changes won't be lost?
Lines 6, 10, 25, 26, 29, and 44 update this variable (combining
atomically with algorithmic operations where feasible) to
allow the dyntick_nohz() process to validate the basic
RCU safety property.
The form of this validation is to assert that the value of the
grace_period_state variable cannot jump from
GP_IDLE to GP_DONE during a time period
over which RCU readers could plausibly persist.
The dyntick_nohz() process implements this validation as shown below:
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle || grace_period_state != GP_DONE);
19 }
20 atomic {
21 dynticks_progress_counter = tmp + 1;
22 assert((dynticks_progress_counter & 1) == 0);
23 }
24 i++;
25 od;
26 }
Line 13 sets a new old_gp_idle flag if the value of grace_period_state is GP_IDLE when the CPU leaves dynticks-idle mode, and the assertion on line 18 fires if grace_period_state has advanced to GP_DONE while that flag is set.