Brief items
The current development kernel is 2.6.34-rc1, released on March 8. This release came a bit earlier than usual, but Linus has reserved the right to pull in a few more trees yet. "So if you feel like you sent me a pull request bit might have been over-looked, please point that out to me, but in general the merge window is over. And as promised, if you left your pull request to the last day of a two-week window, you're now going to have to wait for the 2.6.35 window." Nouveau users should note that they can't upgrade to this kernel without updating their user-space as well.
There have been no stable update releases since 18.104.22.168 on February 23.
Recently, version 2 of the SCHED_DEADLINE patch was posted. The changes reflect a number of comments which were made the first time around; among other things, there is a new implementation of the group scheduling mechanism. Perhaps most significant in this patch, though, is an early attempt at addressing priority inversion problems, where a low-priority process can, by holding shared resources, prevent a higher-priority process from running. Priority inversion is a hard problem, and, in the deadline scheduling area, it remains without a definitive solution.
In classic realtime scheduling, priority inversion is usually addressed by raising the priority of a process which is holding a resource required by a higher-priority process. But there are no priorities in deadline scheduling, so a variant of this approach is required. The new patch works by "deadline inheritance" - if a process holds a resource required by another process which has a tighter deadline, the holding process has its deadline shortened until the resource is released. It is also necessary to exempt the process from bandwidth throttling (exclusion from the CPU when the stated execution time is exceeded) during this time. That, in turn, could lead to the CPU being oversubscribed - something deadline schedulers are supposed to prevent - but the size of the problem is expected to be small.
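As a rough illustration of the deadline inheritance idea described above, the following Python sketch models a single shared resource. It is not the SCHED_DEADLINE code; the Task and Resource classes, their field names, and the handoff policy on release are all hypothetical, and a real scheduler would also have to cope with tasks that hold several resources at once.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    deadline: int                     # the task's own absolute deadline
    inherited: Optional[int] = None   # tighter deadline borrowed from a waiter
    throttle_exempt: bool = False     # True while running on a borrowed deadline

    @property
    def effective_deadline(self) -> int:
        """Deadline the scheduler would use: own or inherited, whichever is tighter."""
        if self.inherited is None:
            return self.deadline
        return min(self.deadline, self.inherited)


@dataclass
class Resource:
    holder: Optional[Task] = None
    waiters: List[Task] = field(default_factory=list)

    def _boost_holder(self) -> None:
        """Lend the holder the tightest waiter deadline, if it is tighter."""
        if self.holder is None or not self.waiters:
            return
        tightest = min(w.effective_deadline for w in self.waiters)
        if tightest < self.holder.effective_deadline:
            self.holder.inherited = tightest
            # Assumption from the text: no bandwidth throttling while boosted.
            self.holder.throttle_exempt = True

    def acquire(self, task: Task) -> bool:
        """Return True if the task got the resource, False if it must wait."""
        if self.holder is None:
            self.holder = task
            return True
        self.waiters.append(task)
        self._boost_holder()
        return False

    def release(self) -> None:
        """Drop the boost and hand the resource to the earliest-deadline waiter."""
        old = self.holder
        if old is None:
            return
        old.inherited = None
        old.throttle_exempt = False
        self.holder = None
        if self.waiters:
            nxt = min(self.waiters, key=lambda t: t.effective_deadline)
            self.waiters.remove(nxt)
            self.holder = nxt
            self._boost_holder()
```

In this model, a holder with deadline 100 that blocks a task with deadline 10 runs with an effective deadline of 10 until it calls release(), at which point its own deadline and normal throttling apply again.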
The "to do" list for this patch still has a number of entries, including less disruptive bandwidth throttling, a port to the realtime preemption tree, truly global deadline scheduling on multiprocessor systems (another hard problem), and more. The code is progressing, though, and Linux can be expected to have a proper deadline scheduler at some point in the not-too-distant future - though no deadline can be given, as the worst-case development time is still unknown.
Kernel development news
With the changesets merged since last week's summary, the 2.6.34-rc1 release contains a total of just over 6000 changesets. Some of the most significant, user-visible changes merged since last week include:
Changes visible to kernel developers include:
At "only" 6000 changesets, 2.6.34 looks like a relatively calm development cycle; both 2.6.32 and 2.6.33 had over 8000 changesets by the time the -rc1 release came out. It may be that there is less work to be done, but it may also be that some trees got caught out in the cold by Linus's decision to close the merge window early. Linus suggested that he might yet consider a few pull requests, so we might still see some new features added to this kernel; stay tuned.
A recent linux-kernel discussion, which descended into flames at times, took on the question of the stability of user-space interfaces. The proximate cause was a change in the interface for the Nouveau drivers for NVIDIA graphics hardware, but the real issues go deeper than that. Though the policy for the main kernel is that user-space interfaces live "forever", the policy in the staging tree has generally been looser. But some, including Linus Torvalds, believe that staging drivers that have been shipped by major distributions should be held to a higher standard.
As part of the just-completed 2.6.34 merge window, Torvalds pulled from the DRM tree at Dave Airlie's request, but immediately ran into problems on his Fedora 12 system:
    (II) NOUVEAU(0): [drm] nouveau interface version: 0.0.16
    (EE) NOUVEAU(0): [drm] wrong version, expecting 0.0.15
The problem stemmed from the Nouveau driver changing its interface, which required an upgrade to libdrm—an upgrade that didn't exist for Fedora 12. The Nouveau changes have been backported into the Fedora 13 2.6.33 kernel, which comes with a new libdrm, but there are no plans to put that kernel into Fedora 12. Users that stick with Fedora kernels upgraded via yum won't run into the problem as Airlie explains:
That makes it impossible to test newer kernels on Fedora 12 systems with NVIDIA graphics, though, which reduces the number of people who are able to test. In addition, there is no "forward compatibility" either - the kernel and DRM library must upgrade (or downgrade) in lockstep. Torvalds is concerned about losing testers who run Fedora 12, as well as problems for those on Fedora 13 (Rawhide right now) who might need to bisect a kernel bug - going back and forth across the interface-change barrier is not possible, or at least not easy. In his original complaint, Torvalds is characteristically blunt: "Flag days aren't acceptable."
The Nouveau drivers were only merged for 2.6.33 at Torvalds's request - or demand - and they were put into the staging tree. The staging tree configuration option clearly spells out the instability of user-space interfaces: "Please note that these drivers are under heavy development, may or may not work, and may contain userspace interfaces that most likely will be changed in the near future." So several kernel hackers were clearly confused by Torvalds's outburst. Jesse Barnes put it this way:
Yes, it sucks, but what else should the nouveau developers have done? They didn't want to push nouveau into mainline because they weren't happy with the ABI yet, but it ended up getting pushed anyway as a staging driver at your request, and now they're stuck? Sorry this whole thing is a bit of a wtf...
But Torvalds doesn't disagree that the interface needs changing; he is just unhappy with the way it was done. Because the newer libdrm is not available for Fedora 12, he can't test it:
It is not just Torvalds who can't test it, of course, so he would like to see something done that will enable Fedora users to test and bisect kernels. The Nouveau developers don't want to maintain multiple interfaces, and the Fedora (and other distribution) developers don't want to have to test multiple versions of the DRM library. As Red Hat's Nouveau developer Ben Skeggs put it: "we have no intention of keeping crusty APIs around when they aren't what we require."
Torvalds would like to see a way for the various libdrms to co-exist, preferably with the X server choosing the right one at runtime. As he notes, the server has the information and, if multiple libraries are installed, the right one is only a dlopen() away:
See what I'm saying? What I care about is that right now, it's impossible to switch kernels on a particular setup. That makes it effectively impossible to test new kernels sanely. And that really is a _technical_ problem.
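The runtime selection Torvalds describes could look something like the Python sketch below, in which ctypes.CDLL plays the role of dlopen(). This is only an illustration of the idea, not how the X server or libdrm actually work: the library file names and the version-to-library table are hypothetical, and only the interface version strings are taken from the log output quoted earlier.

```python
import ctypes

# Hypothetical table mapping the kernel's reported interface version to a
# user-space library file name. The file names are invented for illustration.
LIBRARY_BY_INTERFACE = {
    "0.0.15": "libdrm_nouveau.so.0.0.15",
    "0.0.16": "libdrm_nouveau.so.0.0.16",
}


def choose_library(kernel_interface, available=LIBRARY_BY_INTERFACE):
    """Return the library name matching the kernel's interface version."""
    try:
        return available[kernel_interface]
    except KeyError:
        raise LookupError(
            f"no user-space library installed for interface {kernel_interface}"
        ) from None


def load_library(kernel_interface):
    """Load the matching library; on POSIX systems CDLL calls dlopen()."""
    return ctypes.CDLL(choose_library(kernel_interface))
```

With both libraries installed side by side, switching kernels would then only change which entry is looked up, rather than requiring a symbolic link to be repointed by hand.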
In the end, Airlie helped him get both of the proper libraries installed on his system, with a symbolic link to (manually) choose between them. That was enough to allow testing of the kernel, so Torvalds didn't revert the Nouveau patch in question. But there is a larger question here: when should a user-space interface be allowed to change, and just how should it be done?
The Nouveau developers seem rather unhappy that Torvalds and others are trying to change their development model, at least partially because they never requested that Nouveau be merged. But Torvalds is not really pushing the Nouveau developers so much as he is pushing the distributor who shipped Nouveau to handle these kinds of problems. In his opinion, once a major distributor has shipped a library/kernel combination that worked, it is responsible for ensuring that it continues to work, especially for those who might want to run newer kernels.
The problem for testers exists because the distribution, in this case Fedora, shipped the driver before getting it into the upstream kernel, which violates the "upstream first" principle. Torvalds makes it clear that merging the code didn't cause the problem, shipping it did:
Alan Cox disagrees, even quoting Torvalds from 2004 back at himself, because the Nouveau developers are just developing the way they always have; it's not their fault that the code was shipped and is now upstream:
But the consensus, at least among those who aren't graphics driver developers, seems to be that user-space interfaces should only be phased out gradually. That gives users and distributions plenty of time to gracefully handle the interface change. That is essentially how mainline interface changes are done; even though user-space interfaces are supposed to be maintained forever, they sometimes do change—after a long deprecation period. In fact, Ingo Molnar claimed that breaking an ABI often leads to projects that either die on the vine or do not achieve the success that they could:
It's really that simple IMO. There's very few unconditional rules in OSS, but this is one of them.
Ted Ts'o sees handling interface changes gracefully as part of being a conscientious member of the community. If developers don't want to work that way, they shouldn't get their code included into distributions:
Had this occurred with a different driver, say for an obscure WiFi device, it is likely there would have been less, or no, outcry. Because X is such an important, visible part of a user's experience, as well as an essential tool for testers, breaking it is difficult to hide. Torvalds has always pushed for more testing of the latest mainline kernels, so it shouldn't come as a huge surprise that he was less than happy with what happened here.
This situation has cropped up in various guises along the way. While developers would like to believe they can control when an ABI falls under the compatibility guarantee, that really is almost never the case. Once the interface gets merged, and user space starts to use it, there will be pressure to maintain it. It makes for a more difficult development environment in some ways, but the benefit for users is large.
An earlier article examined the problem of 4K-sector drives and the reasons for their existence. In short, going to 4KB physical sectors allows drive manufacturers to increase storage density, always welcome in that competitive market. Recently, there have been a number of reports that Linux is not ready to work with these drives; kernel developer Tejun Heo even posted an extensive, worth-reading summary stating that "4 KiB logical sector support is broken in both the kernel and partitioners." As the subsequent discussion revealed, though, the truth of the matter is that we're not quite that badly prepared.
Linux is fully prepared for a change in the size of physical sectors on a storage device, and has been for a long time. The block layer was written with an avoidance of hardwired sector sizes in mind. Sector counts and offsets are indeed managed as 512-byte units at that level of the kernel, but the block layer is careful to perform all I/O in units of the correct size. So, one would hope, everything would Just Work.
But, as Tejun's document notes, "unfortunately, there are complications." These complications result from the fact that the rest of the world is not prepared to deal with anything other than 512-byte sectors, starting with the BIOS found on almost all systems. In fact, a BIOS which can boot from a 4K-sector drive is an exceedingly rare item - if, indeed, it exists at all. Fixing the BIOS is evidently harder than one might think, and there appears to be little motivation to do so. Martin Petersen, who has done much of the work around supporting these drives in Linux, noted:
The problem does not just exist at the BIOS level: bootloaders (whether they are Linux-oriented or not) are not set up to handle larger sectors; neither are partitioning tools, not to mention a wide variety of other operating systems. Something must be done to enable 4K-sector drives to work with all of this software.
That something, of course, is to interpose a mapping layer in the middle. So most 4K-sector drives will implement separate logical and physical sector sizes, with the logical size - the one presented to the host computer - remaining 512 bytes. The system can then pretend that it's dealing with the same kind of hardware it has always dealt with, and everything just works as desired.
Except that, naturally enough, there are complications. A 512-byte sector written to a 4K-sector drive will now force the drive to perform a read-modify-write cycle to avoid losing the data in the rest of the sector. That slows things down, of course, and also increases the risk of data loss should something go wrong in the middle. To avoid this kind of problem, the operating system should do transfers that are a multiple of the physical sector size whenever possible. But, to do that, it must know the physical sector size. As it happens, that information has been made available; the kernel makes use of this information internally and exports it via sysfs.
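For readers who want to look at the exported values, here is a minimal Python sketch that reads them. It assumes the attribute names used by current kernels, logical_block_size and physical_block_size under /sys/block/<device>/queue/; the function name and the sysfs_root parameter (present only so the code can be pointed at a test directory) are inventions for this example.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes for a block device.

    Assumes the queue/logical_block_size and queue/physical_block_size
    attributes exist for the device, as they do on current kernels.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical
```

Calling sector_sizes("sda") on a drive that emulates 512-byte sectors on top of 4K physical sectors would be expected to return a logical size of 512 and a physical size of 4096.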
It is not quite that simple, though. The Linux kernel can go out of its way to use the physical sector size, and to align all transfers on 4KB boundaries from the beginning of the partition. But that goes badly wrong if the partition itself is not properly aligned; in this case, every carefully-arranged 4KB block will overlap two physical sectors - hardly an optimal outcome.
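The cost of a misaligned partition can be worked out with a little arithmetic. The sketch below counts how many physical sectors a transfer touches, given the partition's starting position as a zero-based count of 512-byte logical sectors. The function names are hypothetical, and the start values of 63 and 2048 are simply example inputs.

```python
LOGICAL = 512     # logical sector size presented to the host, in bytes
PHYSICAL = 4096   # physical sector size on the media, in bytes


def physical_sectors_touched(partition_start, offset, length,
                             logical=LOGICAL, physical=PHYSICAL):
    """Count physical sectors covered by `length` bytes at byte `offset`
    within a partition that begins at logical sector `partition_start`."""
    first_byte = partition_start * logical + offset
    last_byte = first_byte + length - 1
    return last_byte // physical - first_byte // physical + 1


def needs_read_modify_write(partition_start, offset, length,
                            logical=LOGICAL, physical=PHYSICAL):
    """True if the write does not cover whole physical sectors exactly."""
    first_byte = partition_start * logical + offset
    return first_byte % physical != 0 or length % physical != 0


# A 4KB block at the start of a partition beginning at logical sector 63
# straddles two physical sectors; one beginning at sector 2048 does not.
misaligned = physical_sectors_touched(63, 0, 4096)
aligned = physical_sectors_touched(2048, 0, 4096)
```

In the misaligned case, every 4KB write forces the drive to rewrite parts of two physical sectors, which is exactly the read-modify-write behavior that aligned I/O is meant to avoid.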
As it happens, badly-aligned partitions are not just common; they are the norm. Consider an example: your editor was a lucky recipient of an Intel solid-state drive at the Kernel Summit which was quickly plugged into his system and partitioned for use. It has been a great move: git repositories on an SSD are much nicer to work with. A quick look at the partition table, though, shows this:
    Disk /dev/sda: 80.0 GB, 80026361856 bytes
    255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x5361058c

       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1              63    52452224    26226081   83  Linux
Note that fdisk, despite having been taken out of the "DOS compatibility" mode, is displaying the drive dimensions in units of heads and cylinders. Needless to say, this device has neither; even on rotating media, those numbers are entirely fictional; they are a legacy from a dark time before Linux even existed. But that legacy is still making life difficult now.
Once upon a time, it was determined that 63 (512-byte) sectors was far more than anybody would be able to fit into a single disk track. Since track-aligned I/O is faster on a rotating drive, it made sense to align partitions so that the data began at the beginning of a track. So, traditionally, the first partition on a drive begins at (logical) sector 63, the last sector of the first track. That sector holds the boot block; any filesystem stored on the partition will follow at the beginning of the next track. That placement, of course, misaligns the filesystem with regard to any physical sector size larger than 512 bytes; logical sector 64 (the first data sector in the partition) will be placed at the end of a 4K physical sector. Any subsequent partitions on the device will almost certainly be misaligned in the same way.
One might argue that the right thing to do is to simply ditch this particular practice and align partitions properly; it should not be all that hard to teach partitioning tools about physical sector sizes. This can certainly be done. The tools have been slow to catch on, but a suitably motivated system administrator can usually convince them to place partitions sensibly even now. So weird alignments should not be an insurmountable problem.
Unfortunately, there are complications. It would appear that Windows XP not only expects misaligned partitions; it actually will not function properly without them. One simply cannot run XP on a device which has been properly partitioned for 4K physical sector sizes. To cope with that, drive manufacturers have introduced an even worse hack: shifting all 512-byte logical sectors forward by one, so that logical sector 64 lands at the beginning of a physical sector. So any partitioning tool which wants to lay things out properly must know where the origin of the device actually is - and not all devices are entirely forthcoming with that information.
With luck, the off-by-one problem will go away before it becomes a big issue. As James Bottomley put it: "...fortunately very few of these have been seen in the wild and we're hopeful they can be shot before they breed." But that doesn't fix the problem with the alignment of partitions for use by XP. Later versions of Windows need not concern themselves with this problem, since they rarely coexist with XP (and Windows has never been greatly concerned about coexistence with other systems in general). Linux, though, may well be installed on the same drive as XP; that leads to differing alignment requirements for different partitions. Making that just work is not going to be fun.
Martin suggests that it might be best to just ignore the XP issue:
It may well be that there will not be a significant number of XP installations on new-generation storage devices, but failure to support XP may still create some misery in some quarters.
A related issue pointed out by Tejun is that the DOS partition format, which is still widely used, tops out at 2TB, which just does not seem all that large anymore. Using 4K logical sectors in the partition table can extend that limit as far as 16TB, but, again, that requires cooperation from the BIOS - and it still does not seem all that large. The long-term solution would appear to be moving to a partition format like GPT, but that is not likely to be an easy migration.
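The 2TB and 16TB figures follow from the width of the sector-address fields in the DOS partition table, which are 32 bits wide. A short Python calculation shows the arithmetic; the function and constant names are made up for this example, and the results are in binary units (TiB), which is how the round numbers above come out.

```python
LBA_FIELD_BITS = 32   # width of the start/length fields in a DOS partition entry
TIB = 2 ** 40         # bytes in one tebibyte


def max_addressable_bytes(logical_sector_size, lba_bits=LBA_FIELD_BITS):
    """Largest byte count expressible with an lba_bits-wide sector count."""
    return (2 ** lba_bits) * logical_sector_size


limit_512 = max_addressable_bytes(512) // TIB    # 512-byte logical sectors
limit_4k = max_addressable_bytes(4096) // TIB    # 4K logical sectors
```

Here limit_512 comes out as 2 and limit_4k as 16, matching the limits quoted above.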
In summary: Linux is not all that badly placed to support 4K-sector drives, especially when there is no need to share a drive with older operating systems. There is still work required at the tools level to make that support work optimally without the need for low-level intervention by system administrators, but that is, as they say, just a matter of a bit of programming. As these drives become more widely available, we will be able to make good use of them.
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.