
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.38-rc5, released on February 15. The patch volume is dropping (a bit) as this kernel stabilizes, so there are not a lot of new features, but there are some important bug fixes. Details can be found in the full changelog.

Stable updates: the 2.6.32.29 (115 patches), 2.6.36.4 (176 patches), and 2.6.37.1 (272 patches!) updates are currently in the review process; these updates can be expected on or after February 18.

Comments (none posted)

Quotes of the week

So never _ever_ mark anything "deprecated". If you want to get rid of something, get rid of it and fix the callers. Don't say "somebody else should get rid of it, because it's deprecated".

And yes, next time this discussion comes up, I _will_ remove that piece-of-sh*t. It's a disease. It's just a stupid way to say "somebody else should deal with this problem". It's a way to make excuses. It's crap. It was a mistake to ever take any of that to begin with.

-- Linus Torvalds

Hey, if that's what it takes to get __deprecated removed i'll bring it up tomorrow!!
-- Ingo Molnar

Comments (7 posted)

Remnant: The Proc Connector and Socket Filters

Scott James Remnant has posted a surprisingly detailed description of how to use the process connector to get process events from the kernel, combined with use of socket filters to reduce the information flow. "As I mentioned before, the proc connector is built on top of the generic connector and that itself is on top of netlink so sending that subscription message also involves embedding a message, inside a message inside a message. If you understood Christopher Nolan's Inception, you should do just fine."
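
For those who want to see the nesting concretely, here is a minimal sketch in C of that subscription message, following the kernel's connector and proc connector headers; the helper name and error handling are this page's illustration, not code taken from Scott's post:

    /* Subscribe to proc connector events on an already-bound
     * NETLINK_CONNECTOR socket: a netlink header wrapping a connector
     * message wrapping the proc connector multicast operation. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/connector.h>
    #include <linux/cn_proc.h>

    static int subscribe_proc_events(int sock)
    {
        struct __attribute__ ((aligned(NLMSG_ALIGNTO))) {
            struct nlmsghdr nl_hdr;                 /* outermost: netlink */
            struct __attribute__ ((__packed__)) {
                struct cn_msg cn_msg;               /* middle: connector */
                enum proc_cn_mcast_op mcast_op;     /* innermost: proc connector */
            };
        } msg;

        memset(&msg, 0, sizeof(msg));
        msg.nl_hdr.nlmsg_len = sizeof(msg);
        msg.nl_hdr.nlmsg_pid = getpid();
        msg.nl_hdr.nlmsg_type = NLMSG_DONE;

        msg.cn_msg.id.idx = CN_IDX_PROC;
        msg.cn_msg.id.val = CN_VAL_PROC;
        msg.cn_msg.len = sizeof(enum proc_cn_mcast_op);

        msg.mcast_op = PROC_CN_MCAST_LISTEN;

        return send(sock, &msg, sizeof(msg), 0) < 0 ? -1 : 0;
    }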

Comments (10 posted)

The MD roadmap

By Jonathan Corbet
February 16, 2011
Users of the MD (multiple disk or RAID) subsystem in Linux may be interested in the MD roadmap posted by maintainer Neil Brown. It discusses a number of things he has planned for MD in quite a bit of detail; as Neil put it:

A particular need I am finding for this road map is to make explicit the required ordering and interdependence of certain tasks. Hopefully that will make it easier to address them in an appropriate order, and mean that I waste less time saying "this is too hard, I might go read some email instead".

There are a lot of enhancements in the pipeline. A bad block log would allow RAID arrays to continue functioning in the presence of bad blocks without needing to immediately eject the offending drive. There is a variant on "hot replace" which would allow a new drive to be inserted before removing the old one, thus allowing the array to continue with a full complement of drives while the new one is being populated. Tracking of areas which are known not to contain useful data would reduce synchronization costs. A number of proposed enhancements to the "reshape" functionality would make it more robust and flexible and allow operations to be undone. A number of other changes are contemplated as well; see Neil's post for the full list.

Comments (4 posted)

CFS bandwidth control

By Jonathan Corbet
February 16, 2011
The CFS scheduler does its best to divide the available CPU time between contending processes, keeping the CPU utilization of each about the same. The scheduler will not, however, insist on equal utilization when there is free CPU time available; rather than let the CPU go idle, the scheduler will give any left-over time to processes which can make use of it. This approach makes sense; there is little point in throttling runnable processes when nobody else wants the CPU anyway.

Except that, sometimes, that's exactly what a system administrator may want to do. Limiting the maximum share of CPU time that a process (or group of processes) may consume can be desirable if those processes belong to a customer who has only paid for a certain amount of CPU time or in situations where it is necessary to provide strict resource-use isolation between processes. The CFS scheduler cannot limit CPU use in that manner, but the CFS bandwidth control patches, posted by Paul Turner, may change that situation.

The patch set adds two new control files to the CPU control group mechanism: cpu.cfs_period_us defines the period over which the group's CPU usage is to be regulated, and cpu.cfs_quota_us controls how much CPU time is available to the group over that period. With these two knobs, the administrator can easily limit a group to a certain amount of CPU time and also control the granularity with which that limit is enforced.
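
As a rough illustration of how those knobs might be used (the cgroup mount point and group name here are assumptions, and the files exist only with the bandwidth control patches applied), limiting a group to one quarter of a CPU could look like this:

    /* Sketch: confine the hypothetical "customer1" group to 25ms of CPU
     * time in every 100ms period - roughly 25% of one CPU. */
    #include <stdio.h>

    static int write_value(const char *path, long value)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%ld\n", value);
        return fclose(f);
    }

    int main(void)
    {
        /* Enforcement period: 100ms, expressed in microseconds. */
        write_value("/sys/fs/cgroup/cpu/customer1/cpu.cfs_period_us", 100000);
        /* Quota available within each period: 25ms. */
        write_value("/sys/fs/cgroup/cpu/customer1/cpu.cfs_quota_us", 25000);
        return 0;
    }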

Paul's patch is not the only one aimed at solving this problem; the CFS hard limits patch set from Bharata B Rao provides nearly identical functionality. The implementation is different, though; the hard limits patch tries to reuse some of the bandwidth-limiting code from the realtime scheduler to impose the limits. Paul has expressed concerns about the overhead of using this code and how well it will work in situations where the CPU is almost fully subscribed. These concerns appear to have carried the day - there has not been a hard limits patch posted since early 2010. So the CFS bandwidth control patches look like the form this functionality will take in the mainline.

Comments (3 posted)

Kernel development news

Go's memory management, ulimit -v, and RSS control

By Jonathan Corbet
February 15, 2011
Many years ago, your editor ported a borrowed copy of the original BSD vi editor to VMS; after all, using EDT was the sort of activity that lost its charm relatively quickly. DEC's implementation of C for VMS wasn't too bad, so most of the port went reasonably well, but there was one hitch: the vi code assumed that two calls to sbrk() would return virtually contiguous chunks of memory. That was true on early BSD systems, but not on VMS. Your editor, being a fan of elegant solutions to programming problems, solved this one by simply allocating a massive array at the beginning, thus ensuring that the second sbrk() call would never happen. Needless to say, this "fix" was never sent back upstream (the VMS uucp port hadn't been done yet in any case) and has long since vanished from memory.

That said, your editor was recently amused by this message on the golang-dev list indicating that the developers of the Go language have adopted a solution of equal elegance. Go has memory management and garbage collection built into it; the developers believe that this feature is crucial, even in a systems-level programming language. From the FAQ:

One of the biggest sources of bookkeeping in systems programs is memory management. We feel it's critical to eliminate that programmer overhead, and advances in garbage collection technology in the last few years give us confidence that we can implement it with low enough overhead and no significant latency.

In the process of trying to reach that goal of "low enough overhead and no significant latency," the Go developers have made some simplifying assumptions, one of which is that the memory being managed for a running application comes from a single, virtually-contiguous address range. Such assumptions can run into the same problem your editor hit with vi - other code can allocate pieces in the middle of the range - so the Go developers adopted the same solution: they simply allocate all the memory they think they might need (they figured, reasonably, that 16GB should suffice on a 64-bit system) at startup time.

That sounds like a bit of a hack, but an effort has been made to make things work well. The memory is allocated with an mmap() call, using PROT_NONE as the protection parameter. This call is meant to reserve the range without actually instantiating any of the memory; when a piece of that range is actually used by the application, the protection is changed to make it readable and writable. At that point, a page fault on the pages in question will cause real memory to be allocated. Thus, while this mmap() call will bloat the virtual address size of the process, it should not actually consume much more memory until the running program actually needs it.
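
A minimal sketch of that reserve-then-commit pattern (in plain C rather than the Go runtime's actual code; the 16GB arena size and the function names are illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (16ULL << 30)   /* 16GB of address space, no memory yet */

    /* Reserve the range: with PROT_NONE no physical pages are allocated. */
    void *arena_reserve(void)
    {
        void *arena = mmap(NULL, ARENA_SIZE, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return arena == MAP_FAILED ? NULL : arena;
    }

    /* Commit part of it: pages fault in only when they are first touched. */
    int arena_commit(void *addr, size_t len)
    {
        return mprotect(addr, len, PROT_READ | PROT_WRITE);
    }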

This mechanism works fine on the developers' machines, but it runs into trouble in the real world. It is not uncommon for users to use ulimit -v to limit the amount of virtual memory available to any given process; the purpose is to keep applications from getting too large and causing the entire system to thrash. When users go to the trouble to set such limits, they tend, for some reason, to choose numbers rather smaller than 16GB. Go applications will fail to run in such an environment, even though their memory use is usually far below the limit that the user set. The problem is that ulimit -v does not restrict memory use; it restricts the maximum virtual address space size, which is a very different thing.

One might argue that, given what users typically want to do with ulimit -v, it might make more sense to have it restrict resident set size instead of virtual address space size. Making that change now would be an ABI change, though; it would also make Linux inconsistent with the behavior of other Unix-like systems. Restricting resident set size is also simply harder than restricting the virtual address space size. But even if this change could be made, it would not help current users of Go applications, who may not update their kernels for a long time.

One might also argue that the Go developers should dump the contiguous-heap assumption and implement a data structure which allows allocated memory to be scattered throughout the virtual address space. Such a change also appears not to be in the cards, though; evidently that assumption makes enough things easy (and fast) that they are unwilling to drop it. So some other kind of solution will need to be found. According to the original message, that solution will be to shift allocations for Go programs (on 64-bit systems) up to a range of memory starting at 0xf800000000. No memory will be allocated until it is needed; the runtime will simply assume that nobody else will take pieces of that range in between allocations. Should that assumption prove false, the application will die messily.
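
A sketch of that hint-based scheme (the address comes from the message cited above; the abort-on-surprise policy and function name are illustrative):

    #include <stddef.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    void *alloc_at_hint(size_t len)
    {
        void *want = (void *) 0xf800000000UL;
        /* Without MAP_FIXED the address is only a hint; the kernel may
         * return something else if the range is already in use. */
        void *got = mmap(want, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (got != want)
            abort();    /* the runtime's assumption has been violated */
        return got;
    }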

For now, that assumption is good; the Linux kernel will not hand out memory in that range unless the application asks for it explicitly. As with many things that just happen to work, though, this kind of scheme could break at any time in the future. Kernel policy could change, the C library might begin doing surprising things, etc. That is always the hazard of relying on accidental, undocumented behavior. For now, though, it solves the problem and allows Go programs to run on systems where users have restricted virtual address space sizes.

It's worth considering what a longer-term solution might look like. If one assumes that Go will continue to need a large, virtually-contiguous heap, then we need to find a way to make that possible. On 64-bit systems, it should be possible; there is a lot of address space available, and the cost of reserving unused address space should be small. The problem is that ulimit -v is not doing exactly what users are hoping for; it regulates the maximum amount of virtual memory an application can use, but it has relatively little effect on how much physical memory an application consumes. It would be nice if there were a mechanism which controlled actual memory use - resident set sizes - instead.

As it turns out, we have such a mechanism in the memory controller. Even better, this controller can manage whole groups of processes, meaning that an application cannot increase its effective memory limit by forking. The memory controller is somewhat resource-intensive to use (though work is being done to reduce its footprint) and, like other control group-based mechanisms, it's not set up to "just work" by default. With a bit of work, though, the memory controller could replace ulimit -v and do a better job as well. With a suitably configured controller running, a Go process could run without limits on address space size and still be prevented from driving the system into thrashing. That seems like a more elegant solution, somehow.
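
As a rough sketch (the cgroup mount point, group name, and limit value are all assumptions), capping a group's actual memory use with the memory controller might look like this:

    #include <stdio.h>

    int main(void)
    {
        /* Limit the hypothetical "go-apps" group to 512MB of real memory,
         * regardless of how much address space its processes reserve. */
        FILE *f = fopen("/sys/fs/cgroup/memory/go-apps/memory.limit_in_bytes", "w");

        if (!f)
            return 1;
        fprintf(f, "%llu\n", 512ULL << 20);
        return fclose(f);
    }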

Comments (13 posted)

Security modules and ioctl()

By Jonathan Corbet
February 16, 2011
The ioctl() system call has a bad reputation for a number of reasons, most of which are related to the fact that every implemented command is, in essence, a new system call. There is no way to effectively control what is done in ioctl(), and, for many obscure drivers, no way to really even know what is going on without digging through a lot of old code. So it's not surprising that code adding new ioctl() commands tends to be scrutinized heavily. Recently it turned out that there's another reason to be nervous about ioctl() - it doesn't play well with security modules, and SELinux has been treating it incorrectly for the last couple of years.

SELinux works by matching a specific access attempt against the permissions granted to the calling process. For system calls like write(), the type of access is obvious - the process is attempting to write to an object. With ioctl(), things are not quite so clear. In past times, SELinux would attempt to deal with ioctl() calls by looking at the specific command to figure out what the process was actually trying to do; a FIBMAP command, for example (which reads a map of a file's block locations) would be allowed to proceed if the calling process had the permission to read the file's attributes.

There are a couple of problems with this approach, starting with the fact that the number of possible ioctl() commands is huge. Even without getting into obscure commands implemented by a single driver, trying to enumerate them all and determine their effects is a road to madness. But it gets worse, in that the intended behavior of a given command may not match what a specific driver actually does in response to that command. So the only way to really know what an ioctl() command will do is to figure out what driver is behind the call, and to have some knowledge of what each driver does. Simply creating this capability is not a task for sane people; maintaining it would not be a task for anybody wanting to remain sane. So security module developers were looking for a better way.

They thought they had found one when somebody realized that the command codes used by ioctl() implementations are not random numbers. They are, instead, a carefully-crafted 32-bit quantity which includes an 8-bit "type" field (approximately identifying the driver implementing the command), a driver-specific command code, a pair of read/write bits, and a size field. Using the read/write bits seemed like a great way to figure out what sort of access the ioctl() call needed without actually understanding the command. Thus, a patch to SELinux was merged for 2.6.27 which ripped out the command recognition and simply used the read/write bits in the command code to determine whether a specific call should be allowed or not.
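
A hypothetical command definition shows how those fields are packed and pulled back apart with the standard macros from <linux/ioctl.h> (the FOO_SET_BAR command and its argument structure are invented for illustration):

    #include <stdio.h>
    #include <linux/ioctl.h>

    struct foo_args { int bar; };                         /* hypothetical argument block */
    #define FOO_SET_BAR _IOW('f', 0x01, struct foo_args)  /* hypothetical command */

    int main(void)
    {
        unsigned int cmd = FOO_SET_BAR;

        printf("type 0x%02x nr %u size %u write %d read %d\n",
               _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd),
               !!(_IOC_DIR(cmd) & _IOC_WRITE), !!(_IOC_DIR(cmd) & _IOC_READ));
        return 0;
    }

Note that the write bit here says only that the kernel will read a struct foo_args from user space; it says nothing about what the driver will then do to the device.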

That change remained for well over two years until Eric Paris noticed that, in fact, it made no sense at all. Most ioctl() calls involve the passing of a data structure into or out of the kernel; that structure describes the operation to be performed or holds data returned from the kernel - or both. The size field in the command code is the size of this structure, and the permission bits describe how the structure will be accessed by the kernel. Together, that information can be used by the core ioctl() code to determine whether the calling process has the proper access rights to the memory behind the pointer passed to the kernel.

What those bits do not do, as Eric pointed out, is say anything about what the ioctl() call will do to the object identified by the file descriptor passed to the kernel. A call passing read-only data to the kernel may reformat a disk, while a call with writable data may just be querying hardware information. So using those bits to determine whether the call should proceed is unlikely to yield good results. It's an observation which seems obvious when spelled out in this way, but none of the developers working on security noticed the problem at the time.

So that code has to go - but, as of this writing, it has not been changed in the mainline kernel. There is a simple reason for that: nobody really knows what sort of logic should replace it. As discussed above, simply enumerating command codes with expected behavior is not a feasible solution either. So something else needs to be devised, but it's not clear what that will be.

Stephen Smalley pointed out one approach which was posted back in 2005. That patch required drivers (and other code implementing ioctl()) to provide a special table associating each command code with the permissions required to execute the command. The obvious objections were raised at that time: changing every driver in the system would be a pain, ioctl() implementations are already messy enough as it is, the tables would not be maintained as the driver changed, and so on. The idea was eventually dropped. Bringing it back now seems unlikely to make anybody popular, but there is probably no other way to truly track what every ioctl() command is actually doing. That knowledge resides exclusively in the implementing code, so, if we want to make use of that knowledge elsewhere, it needs to be exported somehow.

Of course, the alternative is to conclude that (1) ioctl() is a pain, and (2) security modules are a pain. Perhaps it's better to just give up and hope that discretionary access controls, along with whatever checks may be built into the driver itself, will be enough. That is, essentially, the solution we have now.

Comments (8 posted)

Hierarchical group I/O scheduling

By Jonathan Corbet
February 15, 2011
There has recently been much attention paid to the group CPU scheduling feature built into the Linux kernel. Using group scheduling, it is possible to ensure that some groups of processes get a fair share of the CPU without being crowded out by a rather larger number of CPU-intensive processes in a different group. Linux has supported this feature for some years, but it has languished in relative obscurity; it is only with recent efforts to make group scheduling "just work" that it has started to come into wider use. As it happens, the kernel has a very similar feature for managing access to block I/O devices which is also, arguably, underused. In this case, though, I/O group scheduling is not as completely implemented as CPU scheduling, but some ongoing work may change that situation.

The "completely fair queueing" (CFQ) I/O scheduler tries to divide the available bandwidth on any given device fairly between the processes which are contending for that device. "Bandwidth" is measured not in the number of bytes transferred, but the amount of time that each process gets to submit requests to the queue; in this way, the code tries to penalize [Group hierarchy] processes which create seek-heavy I/O patterns. (There is also a mode based solely on the number of I/O operations submitted, but your editor suspects it sees relatively little use). The CFQ scheduler also supports group scheduling, but in an incomplete way.

Imagine the group hierarchy shown above; here we have three control groups (plus the default root group), and four processes running within those groups. If every process were contending fully for the available I/O bandwidth, and they all had the same I/O priority, one would expect that bandwidth to be split equally between P0, Group1, and Group2; thus P0 should get twice as much I/O bandwidth as either P1 or P3. If more processes were to be added to the root, they should be able to take I/O bandwidth at the expense of the processes in the other control groups. Similarly, the creation of new control groups underneath Group1 should not affect anybody outside of that branch of the hierarchy. In current kernels, though, that is not how things work.

With the current implementation of CFQ group scheduling, the above hierarchy is transformed into something that looks like this:

[No Hierarchy]

The CFQ group scheduler currently treats all groups - including the root group - as being equal, at the same level in the hierarchy. Every group is a top-level group. This level of grouping will be adequate for a number of situations, but there will be other users who want the full hierarchical model. That is why control groups were made to be hierarchical in the first place, after all.

The hierarchical CFQ group scheduling patch set from Gui Jianfeng aims to make that feature available. These patches introduce a new cfq_entity structure which is used for the scheduling of both processes and groups; it is clearly modeled after the sched_entity structure used in the CPU scheduling code. With this in place, the I/O scheduler can just give bandwidth to the top-level cfq_entity which has run up the least "vdisktime" so far; if that entity happens to be a group, the scheduling code drops down a level and repeats the process. Sooner or later, the entity which is scheduled for I/O will be an actual process, and the scheduler can start dispatching I/O requests.
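
A rough sketch of that selection loop (loosely modeled on the description above, not on the actual patch code):

    /* Pick the entity with the least vdisktime at each level, descending
     * into groups until an actual process queue is reached. */
    struct cfq_entity {
        unsigned long long vdisktime;   /* scaled service received so far */
        int is_group;
        struct cfq_entity *children;    /* valid only when is_group is set */
        int nr_children;
    };

    static struct cfq_entity *least_vdisktime(struct cfq_entity *ents, int n)
    {
        struct cfq_entity *best = &ents[0];
        int i;

        for (i = 1; i < n; i++)
            if (ents[i].vdisktime < best->vdisktime)
                best = &ents[i];
        return best;
    }

    struct cfq_entity *cfq_select_entity(struct cfq_entity *top, int n)
    {
        struct cfq_entity *e = least_vdisktime(top, n);

        while (e->is_group)
            e = least_vdisktime(e->children, e->nr_children);
        return e;
    }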

This patch set is on its fourth revision; the previous iterations have led to significant changes. It appears that there are a few things to fix up still, but this work seems to be getting closer to being ready.

One thing is worth bearing in mind: there are two I/O bandwidth controllers in contemporary Linux kernels: the proportional bandwidth controller (built into the CFQ scheduler) and the throttling controller built into the block layer. The group scheduling changes only apply to the proportional bandwidth controller. Arguably there is less need for full group scheduling with the throttling controller, which puts absolute caps on the bandwidth available to specific processes.

Controlling I/O bandwidth has a lot of applications; providing some isolation between customers on a shared hosting service is an obvious example. But this feature may yet prove to have value on the desktop as well; many interactivity problems come down to contention for I/O bandwidth. Anybody who has tried to start an office suite while simultaneously copying a video image on the same drive understands how bad it can be. If the group I/O scheduling feature can be made to "just work" like the group CPU scheduling, we may have made another step toward a truly responsive Linux desktop.

Comments (1 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.38-rc5

Architecture-specific

Build system

Nathaniel McCallum: RFC: kdrvscan

Core kernel code

Device drivers

Documentation

NeilBrown: md road-map: 2011

Filesystems and block I/O

Memory management

Security-related

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds