Kernel development [LWN.net]

Kernel release status

The current 2.6 development kernel is 2.6.27-rc5, released on August 28. "The most exciting (well, for me personally - my life is apparently too boring for words) was how we had some stack overflows that totally corrupted some basic thread data structures. That's exciting because we haven't had those in a long time. The cause turned out to be a somewhat overly optimistic increase in the maximum NR_CPUS value, but it also caused some introspection about our stack usage in general." More excitement can be found in the full changelog.

Fixes continue to flow into the mainline repository; the 2.6.27-rc6 prepatch can be expected sometime soon.

No stable kernel releases have been made over the last week. The 2.6.25.17 and 2.6.26.4 stable updates are in the review process as of this writing; they can be expected on or after September 6.

Comments (4 posted)

Quotes of the week

Quite frankly, most programmers aren't "supposedly bad". And if you think that the hard-RT "real man" programmers aren't bad, I really have nothing to say.

-- Linus Torvalds

"real man" programmers stare at the code in Zen contemplation and debug by powercycling - thats one thing even hard RT processes can't beat.

-- Alan Cox

The last burst of checkins has brought Tux3 to the pointer where it undeniably acts like a filesystem: one can write files, go away, come back later and read those files by name. We can see some of the hoped for attractiveness starting to emerge: Tux3 clearly does scale from the very small to the very big at the same time. We have our Exabyte file with 4K blocksize and we can also create 64 Petabyte files using 256 byte blocks. How cool is that? Not much chance for internal fragmentation with 256 byte blocks.

-- Daniel Phillips

Comments (2 posted)

Linux 3.0?

By Jake Edge
September 3, 2008

The Linux kernel summit is happening this month, so various discussion topics are being tossed around on the Ksummit-2008-discuss mailing list. Alan Cox suggested a Linux release that would "throw out" some accumulated, unmaintained cruft as a topic to be discussed. Cox would like to see that release be well publicized, with a new release number, so that the intention of the release would be clear. While there will be disagreements about which drivers and subsystems can be removed, participants in the thread seem favorably disposed to the idea—at least enough that it should be discussed.

There is already a process in place for deprecating and eventually removing parts of the kernel that need it, but it is somewhat haphazardly used. Cox proposes:

At some point soon we add all the old legacy ISA drivers (barring the odd ones that turn up in embedded chipsets on LPC bus) into the feature-removal list and declare an 'ISA death' flag day which we brand 2.8 or 3.0 or something so everyone knows that we are having a single clean 'throw out' of old junk.

It would also be a chance to throw out a whole pile of other "legacy" things like ipt_tos, bzImage symlinks, ancient SCTP options, ancient lmsensor support, V4L1 only driver stuff etc.

Cox's list sparked immediate protest about some of the items on it, but the general idea was well received. There are certainly sizable portions of the kernel, especially for older hardware, that are unmaintained and probably completely broken. No one seems to have any interest in carrying that stuff forward, but, without a concerted effort to identify and remove crufty code, it is likely to remain. Cox has suggested one way to make that happen; discussion at the kernel summit might refine his idea or come up with something entirely different.

Part of the reason that unmaintained code tends to hang around is that the kernel hackers have gotten much better at fixing all affected code when they make an API change. While that is definitely a change for the better, it does have the effect of sometimes hiding code that might be ready to be removed. In earlier times, dead code would have become unbuildable after an API change or two leading to either a maintainer stepping up or the code being removed.

The need to make a "major" kernel release, with a corresponding change to the major or minor release number is the biggest question that the kernel hackers seem to have. Greg Kroah-Hartman asks:

Can't we do all of the above today in our current model? Or is it just a marketing thing to bump to 3.0? If so, should we just pick a release and say, "here, 2.6.31 is the last 2.6 kernel and for the next 3 months we are just going to rip things out and create 3.0"?

There is an element of "marketing" to Cox's proposal. Publicizing a major release, along with the intention to get rid of "legacy" code, will allow interested parties to step up to maintain pieces that they do not want to see removed. As Cox, puts it:

I thought it might be useful to actually draw some definite lines so we can actually get around to throwing stuff out rather than letting it rot forever and also if its well telegraphed both give people a chance to fix where the line goes and - yes - as a marketing thing as much as anything else to define the line in a way that non-techies, press etc get.

Plus it appeals to my sense of the open source way of doing things differently - a major release about getting rid of old junk not about adding more new wackiness people don't need 8)

Arjan van de Ven thinks that gathering the list of things to be removed is a good exercise:

I like the idea of at least discussing this, and for a bunch of people making a long list of what would go. Based on that whole list it becomes a value discussion/decision; is there enough of this to make it worth doing.

Once the list has been gathered and discussed, van de Ven notes, it may well be that it can be done under the current development model, without a major release. "But let's at least do the exercise. It's worth validating the model we have once in a while ;)"

This may not be the only discussion of kernel version numbers that takes place at the summit. Back in July, Linus Torvalds mentioned a bikeshed painting project that he planned to bring up. It seems that Torvalds is less than completely happy with how large the minor release number of the kernel is; he would like to see numbers that have more meaning, possibly date-based:

The only thing I do know is that I agree that "big meaningless numbers" are bad. "26" is already pretty big. As you point out, the 2.4.x series has much bigger numbers yet.

And yes, something like "2008" is obviously numerically bigger, but has a direct meaning and as such is possibly better than something arbitrary and non-descriptive like "26".

Version numbers are not important, per se, but having a consistent, well-understood numbering scheme certainly is. The current system has been in place for four years or so without much need to modify it. That may still be the case, but with ideas about altering it coming from multiple directions, there could be changes afoot as well.

For the kernel hackers themselves, there is little benefit—except, perhaps, preventing the annoyance of ever-increasing numbers—but version numbering does provide a mechanism to communicate with the "outside world". Users have come to expect the occasional major release, with some sizable and visible chunk of changes, but the current incremental kernel releases do not provide that numerically; instead, big changes come with nearly every kernel release. There may be value in raising the visibility of one particular release, either as a means to clean up the kernel or to move to a different versioning scheme—perhaps both at once.

Comments (31 posted)

High- (but not too high-) resolution timeouts

By Jonathan Corbet
September 2, 2008

Linux provides a number of system calls that allow an application to wait for file descriptors to become ready for I/O; they include select(), pselect(), poll(), ppoll(), and epoll_wait(). Each of these interfaces allows the specification of a timeout putting an upper bound on how long the application will be blocked. In typical fashion, the form of that timeout varies greatly. poll() and epoll_wait() take an integer number of milliseconds; select() takes a struct timeval with microsecond resolution, and ppoll() and pselect() take a struct timespec with nanosecond resolution.

They are all the same, though, in that they convert this timeout value to jiffies, with a maximum resolution between one and ten milliseconds. A programmer might program a pselect() call with a 10 nanosecond timeout, but the call may not return until 10 milliseconds later, even in the absence of contention for the CPU. An error of six orders of magnitude seems like a bit much, especially given that contemporary hardware can easily support much more accurate timing.

Arjan van de Ven recently surfaced with a patch set aimed at addressing this problem. The core idea is simple: have the code implementing poll() and select() use high-resolution timers instead of converting the timeout period to low-resolution jiffies. The implementation relied on a new function to provide the timeouts:

    long schedule_hrtimeout(struct timespec *time, int mode);

Here, time is the timeout period, as interpreted by mode (which is either HRTIMER_MODE_ABS or HRTIMER_MODE_REL).

High-resolution timeouts are a nice feature, but one can immediately imagine a problem: higher-resolution timeouts are less likely to coincide with other events which wake up the processor. The result will be more wakeups and greater power consumption. As it happens, there are few developers who are more aware of this fact than Arjan, who has done quite a bit of work aimed at keeping processors asleep as much as possible. His solution to this problem was to only use high-resolution timeouts if the timeout period is less than one second. For longer timeout periods, the old, jiffie-based mechanism was used as before.

Linus didn't like that solution, calling it "ugly." His preference, instead, was to have schedule_hrtimeout() apply an appropriate amount of fuzz to all timeout values; the longer the timeout, the less resolution would be supplied. Alan Cox suggested that a better mechanism would be for the caller to supply the required accuracy with the timeout value. The problem with that idea, as Linus pointed out, is that the current system call interfaces provide no way for an application to supply the accuracy value. One could create more poll()-like system calls - as if there weren't enough of them already - with an accuracy parameter, but that looks like a lot of trouble to create a non-standard interface which few programmers would bother to use.

A different solution came in the form of Arjan's range-capable timer patch set. This patch extends hrtimers to accept two timeout values, called the "soft" and "hard" timeouts. The soft value - the shorter of the two - is the first time at which the timeout can expire; the kernel will make its best effort to ensure that it does not expire after the hard period has elapsed. In between the two, the kernel is free to expire the timer at any convenient time.

It's a useful feature, but it comes at the cost of some significant API changes. To begin with, the expires field of struct hrtimer goes away. Rather than manipulate expires directly, kernel code must now use one of the new accessor functions:

    void hrtimer_set_expires(struct hrtimer *timer, ktime_t time);
    void hrtimer_set_expires_tv64(struct hrtimer *timer, s64 tv64);
    void hrtimer_add_expires(struct hrtimer *timer, ktime_t time);
    void hrtimer_add_expires_ns(struct hrtimer *timer, unsigned long ns);
    ktime_t hrtimer_get_expires(const struct hrtimer *timer);
    s64 hrtimer_get_expires_tv64(const struct hrtimer *timer);
    s64 hrtimer_get_expires_ns(const struct hrtimer *timer);
    ktime_t hrtimer_expires_remaining(const struct hrtimer *timer);

Once that's done, the range capability is added to hrtimers. By default, the soft and hard expiration times are the same; code which wishes to set them independently can use the new functions:

    void hrtimer_set_expires_range(struct hrtimer *timer, ktime_t time, 
                                   ktime_t delta);
    void hrtimer_set_expires_range_ns(struct hrtimer *timer, ktime_t time,
                                      unsigned long delta);
    ktime_t hrtimer_get_softexpires(const struct hrtimer *timer);
    s64 hrtimer_get_softexpires_tv64(const struct hrtimer *timer)

In the new "set" functions, the specified time is the soft timeout, while time+delta provides the hard timeout value. There is also another form of schedule_timeout():

    int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
				 const enum hrtimer_mode mode);

With this infrastructure in place, poll() and friends can be given approximate timeouts; the only remaining question is just how wide the range of times should be. In Arjan's patch, that range comes from two different sources. The first is a new field in the task structure called timer_slack_ns; as one might expect, it specifies the maximum expected timer accuracy in nanoseconds. This value can be adjusted via the prctl() system call. The default value is set to 50 microseconds - approximate to a certain degree, but still far more accurate than the timeouts in current kernels.

Beyond that, though, there is a heuristic function which provides an accuracy value depending on the requested timeout period. In the case of especially long timeouts - more than ten seconds - the accuracy is set to 100ms; as the timeouts get shorter, the amount of acceptable error drops, down to a minimum of 10ns for very brief timeouts. Normally, poll() and company will use the value returned by the heuristic, but with the exception that the accuracy will never exceed the value found in timer_slack_ns.

The end result is the provision of more accurate timeouts on the polling functions while, simultaneously, preserving the ability to combine timeouts with other system events.

Comments (14 posted)

SCHED_FIFO and realtime throttling

By Jonathan Corbet
September 1, 2008

The SCHED_FIFO scheduling class is a longstanding, POSIX-specified realtime feature. Processes in this class are given the CPU for as long as they want it, subject only to the needs of higher-priority realtime processes. If there are two SCHED_FIFO processes with the same priority contending for the CPU, the process which is currently running will continue to do so until it decides to give the processor up. SCHED_FIFO is thus useful for realtime applications where one wants to know, with great assurance, that the highest-priority process on the system will have full access to the processor for as long as it needs it.

One of the many features merged back in the 2.6.25 cycle was realtime group scheduling. As a way of balancing CPU usage between competing groups of processes, each of which can be running realtime tasks, the group scheduler introduced the concept of "realtime bandwidth," or rt_bandwith. This bandwidth consists of a pair of values: a CPU time accounting period, and the amount of CPU that the group is allowed to use - at realtime priority - during that period. Once a SCHED_FIFO task causes a group to exceed its rt_bandwidth, it will be pushed out of the processor whether it wants to go or not.

This feature is required if one wants to allow multiple groups to split a system's realtime processing power. But it also turns out to have its uses in the default situation, where all processes on the system are contained within a single, default group. Kernels shipped since 2.6.25 have set the rt_bandwidth value for the default group to be 0.95 out of every 1.0 seconds. In other words, the group scheduler is configured, by default, to reserve 5% of the CPU for non-SCHED_FIFO tasks.

It seems that nobody really noticed this feature until mid-August, when Peter Zijlstra posted a patch which set the default value to "unlimited." At that point it became clear that some developers have a different idea about how this kind of policy should be set than others do.

Ingo Molnar disagreed with the patch, saying:

The thing is, i got far more bugreports about locked up RT tasks where the lockup was unintentional, than real bugreports about anyone _intending_ for the whole box to come to a grinding halt because a high-prio RT tasks is monopolizing the CPU.

Ingo's suggestion was to raise the limit to ten seconds of CPU time. As he (and others) pointed out: any SCHED_FIFO application which needs to monopolize the CPU for that long has serious problems and needs to be fixed.

There are real problems associated with letting a SCHED_FIFO process run indefinitely. Should that process never get around to relinquishing the CPU, the system will simply hang forevermore; there is no possibility of the administrator slipping in with a kill command. This process will also block important things like kernel threads; even if it releases the processor after ten seconds, it will have seriously degraded the operation of the rest of the system. Even on a multiprocessor system, there will typically be processes bound to the CPU where the SCHED_FIFO process is running; there will be no way to recover those processes without breaking their CPU affinity, which is not a step anybody wants to take.

So, it is argued, the rt_bandwidth limit is an important safety breaker. With it in place, even a runaway SCHED_FIFO cannot prevent the administrator from (eventually) regaining control of the system and figuring out what is going on. In exchange for this safety, this feature only robs SCHED_FIFO tasks of a small amount of CPU time - the equivalent of running the application on a slightly weaker processor.

Those opposed to the default rt_bandwidth limit cite two main points: it is a user-space API change (which also breaks POSIX compliance) and represents an imposition of policy by the kernel. On the first point, Nick Piggin worries that this change could lead to broken applications:

It's not common sense to change this. It would be perfectly valid to engineer a realtime process that uses a peak of say 90% of the CPU with a 10% margin for safety and other services. Now they only have 5%.

Or a realtime app could definitely use the CPU adaptively up to 100% but still unable to tolerate an unexpected preemption.

What could make the problem worse is that the throttle might not cut in during testing; it could, instead, wait until something unexpected comes up in a production system. Needless to say, that is a prospect which can prove scary for people who create and deploy this kind of system.

The "policy in the kernel" argument was mostly shot down by Linus, who pointed out that there's lots of policy in the kernel, especially when it comes to the default settings of tunable parameters. He says:

And the default policy should generally be the one that makes sense for most people. Quite frankly, if it's an issue where all normal distros would basically be expected to set a value, then that value should _be_ the default policy, and none of the normal distros should ever need to worry.

Linus carefully avoided taking a position on which setting makes sense for the most people here. One could certainly argue that making systems resistant to being taken over by runaway realtime processes is the more sensible setting, especially considering that there is a certain amount of interest in running scary applications like PulseAudio with realtime priority. On the other hand, one can also make the case that conforming to the standard (and expected) SCHED_FIFO semantics is the only option which makes sense at all.

There has been some talk of creating a new realtime scheduling class with throttling being explicitly part of its semantics; this class could, with a suitably low limit, even be made available to unprivileged processes. Meanwhile, as of this writing, the 0.95-second limit - the one option that nobody seems to like - remains unchanged. It will almost certainly be raised; how much is something we'll have to wait to see.

Comments (25 posted)

Linus Torvalds Linux 2.6.27-rc5 ?

Steven Rostedt 2.6.26.3-rt4 ?

Yinghai Lu resource/x86: add sticky resource type ?

Andi Kleen [1/3] MCE: Implement dynamic machine check banks support ?

Thomas Gleixner Fix TSC calibration issues ?

Nick Piggin queueing spinlocks? ?

Vegard Nossum bitfields API ?

Nick Piggin bitops: remove volatile ?

Gregory Haskins seqlock: serialize against writers ?

Paul E. McKenney v3 scalable classic RCU implementation ?

Arjan van de Ven Nano/Microsecond resolution for select() and poll() ?

Arjan van de Ven Turn hrtimers into a range capable timer ?

Steven Rostedt ftrace stack tracer ?

Subrata Modak [ANNOUNCE] The Linux Test Project has been Released for AUGUST 2008 ?

Andy Whitcroft checkpatch: update to version 0.23 ?

Tejun Heo [PATCH RESEND] char_dev: add cdev->release() and convert cdev_alloc() to use it ?

Tejun Heo CUSE: implement CUSE ?

Tejun Heo OSS Proxy using CUSE ?

Li Yang usb: add Freescale QE/CPM USB peripheral controller driver ?

Elias Oltmanns Disk shock protection in GNU/Linux (take 2) ?

Andrew Patterson detect online disk resize ?

Ron Mercer qlge: New Qlogic 10Gb Ethernet Driver. ?

Mike Anderson scsi: scsi_decide_dispostion update ?

Guennadi Liakhovetski soc-camera: add API documentation ?

Tejun Heo FUSE: extend FUSE to support more operations ?

Christoph Hellwig XFS status update for August 2008 ?

KOSAKI Motohiro hugepage: support ZERO_PAGE() ?

Luis R. Rodriguez cfg80211: Add new wireless regulatory infrastructure ?

Jeff Kirsher pkt_sched: Add multiqueue scheduler support ?

Jeff Kirsher IPROUTE: add support for multiq qdisc ?

Gerrit Renker dccp: Feature negotiation - last call for comments ?

Julius Volz Add first IPv6 support to IPVS ?

Serge Hallyn file capabilities: add no_file_caps switch (v2) ?

Geert Uytterhoeven [RFC] zlib crypto module ?

Paul Moore Labeled networking patches for 2.6.28 ?

Andrey Mirkin OpenVZ kernel based checkpointing/restart ?

Rafael J. Wysocki 2.6.27-rc5-git2: Reported regressions from 2.6.26 ?

Kay Sievers udev 127 release ?

Hans de Goede libv4l release: 0.4.3 ?

Kernel development

Brief items

Kernel release status

Kernel development news

Quotes of the week

Linux 3.0?

High- (but not too high-) resolution timeouts

SCHED_FIFO and realtime throttling

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous