Brief items
The current development kernel is 2.6.34-rc1,
released on March 8. This
release came a bit earlier than usual, but Linus has reserved the right to pull
in a few more trees yet. "So if you feel like you sent me a pull
request bit might have been
over-looked, please point that out to me, but in general the merge window
is over. And as promised, if you left your pull request to the last day of
a two-week window, you're now going to have to wait for the 2.6.35
window." Nouveau users should note that they can't upgrade to this
kernel without updating their user-space as well.
There have been no stable update releases since 2.6.32.9 on
February 23.
This is a big motivation behind our "fish" names for boards --
they're pretty unappetizing to the pr/marketing folks so they never
get mixed up with final product names and we can concentrate on
making the hardware work.
--
Brian Swetland
selinux relabels are the new fsck
--
Dave Airlie
The lack of any changelog in a patch is usually a good sign that the
patch needs a changelog.
--
Andrew Morton
In the end neither side is right. There are useful things that you
can do with either, but as everyone and his demented gerbil has
pointed out, no one has the True Security Solution. Not even
SELinux, which violates some pretty fundamental security principles
(see: "small enough to be analyzed") in actual deployment. TOMOYO
violates "non-circumventable", just in case anyone thinks I'm
picking on someone. Heck, even Smack isn't perfect, although I will
leave it to others to autoclave that puppy.
--
Casey Schaufler
If we are only talking about obligations under the GPL, sure, no
one violated copyright licenses. But what *did* happen is someone
basically said, "I want to experiment on a whole bunch of users,
but I don't want to spend the effort to do things in the right way.
I want to take short cuts; I don't want to worry about the fact
that it will be impossible to test kernels without pulling
Frankenstein combinations of patches between Fedora 13 and Fedora
12." It's much like people who drill oil in the Artic Ocean, but
use single-hulled tankers and then leave so much toxic spillage in
their wake, but then say, "hey, the regulations said what we did
was O.K. Go away; don't bother us."
--
Ted Ts'o
It was ugly enough in <compress/mm.h> (which really should be nuked
from orbit - it's the only way to be sure), but when I see it
spreading, I go into full zombie-attack mode, and want to start up
the chainsaw and run around naked.
--
Linus Torvalds
OK, we really really don't want that... simply because there just
aren't enough zombies to go around already. Last I heard, the EPA
was considering classifying them as an endangered species, only to
get stuck in a bureaucratic mess if they can be classified as a
"species" at all, or if they should be classified together with
bread mold, toxic waste and Microsoft salesmen.
--
H. Peter Anvin
Personally I think we should all get together and agree on a
framework and fix the framework to meet all of the needs and look
like a swiss army hammer driver drill thing rather than having 4
options, none of which meet all the needs, and then forcing our
uneducated users to choose between them. But, hey, we all know
that isn't going to happen so I'll just go back to happy go lucky
dream land where Linus is not running around naked with a chain
saw.
--
Eric Paris
LWN first looked at LogFS, a
new filesystem aimed at solid-state storage devices, back in 2007. It has
taken a long time, but, as of 2.6.34, LogFS will be in the mainline kernel
and available for use; let the benchmarking begin.
By Jonathan Corbet
March 10, 2010
The POSIX approach to realtime scheduling is based on priorities: the
highest-priority task gets the CPU. The research community has long since
moved on from priorities, though, and has been putting a lot of effort into deadline
scheduling instead. Deadline schedulers allow each process to provide a
"worst case execution time" and the deadline by which it must get that
time; the scheduler can then schedule all tasks so that they meet their
deadlines while refusing tasks which would cause that promise to be broken. There
are a few deadline scheduler patches in circulation, but the
SCHED_DEADLINE patch by Dario Faggioli and friends looks like the
most likely one to make it into the mainline at this time; LWN
looked at this patch back in
October.
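The admission decision described above can be illustrated with the classic utilization test for earliest-deadline-first scheduling on a single CPU. This is only a sketch under stated assumptions: tasks are periodic, each deadline equals the task's period, and the names used (Task, utilization, admit) are invented for the example. It is not taken from the SCHED_DEADLINE patch, which may well do its accounting differently.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Task:
    """A periodic task; the deadline is assumed to equal the period."""
    name: str
    wcet_us: int    # worst-case execution time per period, in microseconds
    period_us: int  # period (and relative deadline), in microseconds


def utilization(tasks):
    """Fraction of one CPU the task set needs in the worst case."""
    return sum((Fraction(t.wcet_us, t.period_us) for t in tasks), Fraction(0))


def admit(running, candidate):
    """Accept the candidate only if total utilization stays at or below 1."""
    return utilization(list(running) + [candidate]) <= 1


running = [Task("audio", 2_000, 10_000), Task("video", 15_000, 30_000)]
print(admit(running, Task("logger", 1_000, 10_000)))   # 0.2 + 0.5 + 0.1 -> True
print(admit(running, Task("encoder", 4_000, 10_000)))  # 0.2 + 0.5 + 0.4 -> False
```

Exact fractions are used instead of floating point so that a task set which adds up to exactly 100% of the CPU is not rejected by rounding error.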
Recently, version 2 of the
SCHED_DEADLINE patch was posted. The changes reflect a number
of comments which were made the first time around; among other things,
there is a new implementation of the group scheduling mechanism. Perhaps
most significant in this patch, though, is an early attempt at addressing
priority inversion problems, where a low-priority process can, by holding
shared resources, prevent a higher-priority process from running. Priority
inversion is a hard problem, and, in the deadline scheduling area, it
remains without a definitive solution.
In classic realtime scheduling, priority inversion is usually addressed by
raising the priority of a process which is holding a resource required by a
higher-priority process. But there are no priorities in deadline
scheduling, so a variant of this approach is required. The new patch works
by "deadline inheritance" - if a process holds a resource required by
another process which has a tighter deadline, the holding process has its
deadline shortened until the resource is released. It is also necessary to
exempt the process from bandwidth throttling (exclusion from the CPU when
the stated execution time is exceeded) during this time. That, in turn,
could lead to the CPU being oversubscribed - something deadline schedulers
are supposed to prevent - but the size of the problem is expected to be
small.
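A toy model may make the deadline inheritance idea more concrete. The sketch below assumes a single shared resource and absolute deadlines expressed as plain integers (smaller means more urgent); the class and attribute names are hypothetical and do not come from the patch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proc:
    name: str
    deadline: int                    # the process's own absolute deadline
    inherited: Optional[int] = None  # tighter deadline borrowed from a waiter
    throttle_exempt: bool = False    # bandwidth throttling suspended while boosted

    @property
    def effective_deadline(self):
        if self.inherited is None:
            return self.deadline
        return min(self.deadline, self.inherited)


@dataclass
class Resource:
    holder: Optional[Proc] = None
    waiters: list = field(default_factory=list)

    def _boost(self):
        """Lend the holder the tightest deadline found among the waiters."""
        holder = self.holder
        holder.inherited = None
        holder.throttle_exempt = False
        if self.waiters:
            tightest = min(w.effective_deadline for w in self.waiters)
            if tightest < holder.deadline:
                holder.inherited = tightest
                holder.throttle_exempt = True

    def acquire(self, proc):
        """Take the resource if it is free; otherwise queue and boost the holder."""
        if self.holder is None:
            self.holder = proc
            return True
        self.waiters.append(proc)
        self._boost()
        return False

    def release(self):
        """Drop the boost and hand the resource to the most urgent waiter."""
        old = self.holder
        old.inherited = None
        old.throttle_exempt = False
        self.holder = None
        if self.waiters:
            self.waiters.sort(key=lambda w: w.effective_deadline)
            self.holder = self.waiters.pop(0)
            self._boost()


lock = Resource()
slow = Proc("slow", deadline=500)
fast = Proc("fast", deadline=100)
lock.acquire(slow)
lock.acquire(fast)
print(slow.effective_deadline, slow.throttle_exempt)  # 100 True
lock.release()
print(slow.effective_deadline, lock.holder.name)      # 500 fast
```

The throttle_exempt flag is where the oversubscription risk mentioned above comes from: while it is set, the holder may run past its declared execution time.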
The "to do" list for this patch still has a number of entries, including
less disruptive bandwidth throttling, a port to the realtime preemption
tree, truly global deadline scheduling on multiprocessor systems (another hard
problem), and more. The code is progressing, though, and Linux can be
expected to have a proper deadline scheduler at some point in the
not-too-distant future - though no deadline can be given as the worst case
development time is still unknown.
Mel Gorman's series on the use of huge pages in Linux is taking a one-week
intermission, so there will be no installment this week. The fourth
installment (on huge page benchmarking) will appear next week.
Kernel development news
By Jonathan Corbet
March 10, 2010
There have been nearly 1600 non-merge changesets incorporated into the mainline
kernel since
last week's
summary; that makes a total of just over 6000 changesets for the
2.6.34-rc1 release. Some of the most significant, user-visible changes merged
since last week include:
- Signal-handling semantics have been changed so that "synchronous"
signals (SIGSEGV, for example) are delivered prior to asynchronous
signals like SIGUSR1. This fixes a problem where synchronous signal
handlers could be invoked with the wrong context, something that
apparently came up occasionally in WINE. Users are unlikely to notice
the change, but it is a slight semantics change that developers may
want to be aware of.
- A new Nouveau driver with an incompatible interface has been merged;
as of this writing, it will break all user-space code which worked
with the older API. See this article for more
information on the Nouveau changes. Nouveau also no longer needs
external firmware for NV50-based cards.
- The direct rendering layer now supports "VGA switcheroo" on systems
which provide more than one graphical processor. For most needs, a
simple, low-power GPU can be used, but the system can switch to the
more power-hungry GPU when its features are needed.
- The umount() system call supports a new
UMOUNT_NOFOLLOW flag which prevents the following of symbolic
links. Without this flag, local users who can perform unprivileged
mounts can use a symbolic link to unmount arbitrary filesystems.
- The exofs filesystem (for object storage devices) has gained support
for groups and for RAID0 striping.
- The LogFS filesystem for
solid-state storage devices has been merged.
- New drivers:
- Media: Wolfson Microelectronics WM8994 codecs, and
Broadcom Crystal HD video decoders (staging).
- Miscellaneous: Freescale MPC512x built-in DMA engines,
Andigilog aSC7621 monitoring chips,
Analog Devices ADT7411 monitoring chips,
Maxim MAX7300 GPIO expanders,
HP Processor Clocking Control interfaces,
DT3155 Digitizers (staging),
Intel SCH GPIO controllers,
Intel Langwell APB Timers,
ST-Ericsson Nomadik/Ux500 I2C controllers,
Maxim Semiconductor MAX8925 power management ICs,
Max63xx watchdog timers,
Technologic TS-72xx watchdog timers, and
Hilscher NetX based fieldbus cards.
Changes visible to kernel developers include:
- There has been a subtle change to the early boot code, wherein the
kernel will open the console device prior to switching to the root
filesystem. That eliminates problems where booting fails on a system
with an empty /dev directory because the console device
cannot be found, and eliminates the need to use devtmpfs in such
situations.
- The kprobes jump
optimization patch has been merged.
- The write_inode() method in struct super_operations
is now passed a pointer to the relevant writeback_control
structure.
- Two new helper functions - sysfs_create_files() and
sysfs_remove_files() - ease the process of creating a whole
array of attribute files.
- The show() and store() methods of struct
class_attribute have seen a prototype change: the associated
struct class_attribute pointer is now passed in. A similar
change has been made to struct sysdev_class_attribute.
- The sem lock found in struct device should no longer
be accessed directly; instead, use device_lock() and
device_unlock().
At "only" 6000 changesets, 2.6.34 looks like a relatively calm
development cycle; both 2.6.32 and 2.6.33 had over 8000 changesets by the
time the -rc1 release came out. It may be that there is less work to be
done, but it may also be that some trees got caught out in the cold by
Linus's decision to close the merge window early. Linus suggested that he
might yet consider a few pull requests, so we might still see some new
features added to this kernel; stay tuned.
By Jake Edge
March 10, 2010
A recent linux-kernel discussion, which descended into flames at times,
took on the
question of the stability of user-space interfaces. The proximate cause
was a change in the interface for the Nouveau
drivers for NVIDIA graphics hardware, but the real
issues go deeper than that. Though the policy for the main kernel is that
user-space interfaces live "forever", the policy in the staging tree has
generally been looser. But some, including Linus Torvalds, believe that
staging drivers that have been shipped by major distributions should be
held to a higher standard.
As part of the just-completed 2.6.34 merge window, Torvalds pulled from the
DRM tree at Dave Airlie's request, but immediately ran into problems on his
Fedora 12 system:
Hmm. What the hell am I supposed to do about
(II) NOUVEAU(0): [drm] nouveau interface version: 0.0.16
(EE) NOUVEAU(0): [drm] wrong version, expecting 0.0.15
The problem stemmed from the Nouveau driver changing its interface, which
required an upgrade to libdrm—an upgrade that didn't exist
for Fedora 12. The Nouveau changes have been backported into the Fedora 13
2.6.33 kernel, which comes with a new libdrm, but there are no
plans to put that kernel into Fedora 12.
Users who stick with Fedora kernels upgraded via yum won't run
into the problem, as Airlie explains:
At the moment in Fedora we deal with this for our users, we have dependencies
between userspace and kernel space and we upgrade the bits when they upgrade
the kernels, its a pain in the ass, but its what we accepted we needed to do
to get nouveau in front of people. We are currently maintain 3 nouveau APIs
across F11, F12 and F13.
That makes it impossible to test newer kernels on Fedora 12 systems with
NVIDIA graphics, though, which
reduces the number of people who are able to test. In addition, there is
no "forward compatibility" either—the kernel and DRM library must
upgrade (or downgrade) in lockstep.
Torvalds is concerned
about losing testers who run Fedora 12, as well as problems for those on
Fedora 13 (Rawhide right now)
who might need to bisect a kernel bug—going back and forth across the
interface-change barrier is not possible, or at least not easy. In his original
complaint, Torvalds is characteristically blunt: "Flag days aren't
acceptable."
The Nouveau drivers were only merged for 2.6.33 at Torvalds's
request—or demand—and they were put into the staging tree. The
staging tree configuration option clearly spells out the instability of
user-space interfaces: "Please note that these drivers are under
heavy development, may or may not work, and may contain userspace
interfaces that most likely will be changed in the near future."
So several kernel hackers were clearly confused by Torvalds's outburst.
Jesse Barnes put it this way:
Whoa, so breaking ABI in staging drivers isn't ok? Lots of other
staging drivers are shipped by distros with compatible userspaces, but I
thought the whole point of staging was to fix up ABIs before they
became mainstream and had backwards compat guarantees, meaning that
breakage was to be expected?
Yes, it sucks, but what else should the nouveau developers have done?
They didn't want to push nouveau into mainline because they weren't
happy with the ABI yet, but it ended up getting pushed anyway as a
staging driver at your request, and now they're stuck? Sorry this
whole thing is a bit of a wtf...
But Torvalds doesn't disagree that the interface needs changing; he is just
unhappy with the way it was done. Because the newer libdrm is not
available for Fedora 12, he can't test it:
I'm not going to release a kernel that I can't test. So if I can't get a
libdrm that works in my F12 environment, I will _have_ to revert that
patch that you asked me to merge.
It is not just Torvalds who can't test it, of course, so he would like to
see something done that will enable Fedora users to test and bisect
kernels. The Nouveau developers don't want to maintain multiple
interfaces, and the Fedora (and other distribution) developers don't want
to have to test multiple versions of the DRM library. As Red Hat's Nouveau
developer Ben Skeggs put it: "we have no intention of
keeping crusty APIs around when they aren't what we require."
Torvalds would like to see a way for the various libdrms to
co-exist, preferably with the X server choosing the right one at
runtime. As he notes, the server has the
information and, if multiple libraries are installed, the right one is only
a dlopen() away:
Who was the less-than-rocket-scientist that decided that the right thing
to do was to "check the kernel DRM version support, and exit with an error
if it doesn't match"?
See what I'm saying? What I care about is that right now, it's impossible
to switch kernels on a particular setup. That makes it effectively
impossible to test new kernels sanely. And that really is a _technical_
problem.
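What Torvalds is asking for, choosing the matching library at run time rather than exiting with an error, might look roughly like the following. This is a hedged sketch: the library file names in the table are invented for the example, and only the two interface versions from the error log above are taken from the discussion. Python's ctypes.CDLL stands in for dlopen() here.

```python
import ctypes

# Hypothetical mapping from kernel DRM interface version to an installed
# user-space library. The file names are made up for illustration only.
LIBDRM_BY_INTERFACE = {
    (0, 0, 15): "libdrm_nouveau-0.0.15.so",
    (0, 0, 16): "libdrm_nouveau-0.0.16.so",
}


def parse_version(text):
    """Turn a string such as '0.0.16' into a comparable tuple (0, 0, 16)."""
    return tuple(int(part) for part in text.strip().split("."))


def pick_library(kernel_interface, available=LIBDRM_BY_INTERFACE):
    """Name the library that speaks the interface the running kernel reports."""
    version = parse_version(kernel_interface)
    try:
        return available[version]
    except KeyError:
        raise LookupError(
            f"no user-space library for interface {kernel_interface}"
        ) from None


def load_library(kernel_interface):
    """Load the chosen shared object at run time, as dlopen() would."""
    return ctypes.CDLL(pick_library(kernel_interface))


print(pick_library("0.0.15"))
print(pick_library("0.0.16"))
```

With both libraries installed side by side, the same user space could then follow the kernel across the interface change in either direction, which is what bisection requires.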
In the end, Airlie helped him get both of the proper libraries installed on
his system, with a symbolic link to (manually) choose between them. That
was enough to allow testing of the kernel, thus Torvalds didn't revert the
Nouveau patch in question. But there is a larger question here: when
should a user-space interface be allowed to change, and just how should it
be done?
The Nouveau developers seem rather unhappy that Torvalds and others are
trying to change their development model, at least partially because they
never requested that Nouveau be merged. But Torvalds is not really pushing
the Nouveau developers so much as he is pushing the distributor who shipped
Nouveau to handle these kinds of problems. In his opinion, once a major
distributor has shipped a library/kernel combination that worked, it is
responsible for ensuring that it continues to work, especially for those
who might want to run newer kernels.
The problem for testers exists because the distribution, in this case
Fedora, shipped the driver before getting it into the upstream kernel,
which violates the "upstream first" principle. Torvalds makes it clear that merging the code didn't
cause the problem, shipping it did:
So the watershed moment was _never_ the "Linus merged it". The watershed
moment was always "Fedora started shipping it". That's when the problems
with a standard upstream kernel started.
Alan Cox disagrees, even quoting Torvalds
from 2004 back at himself, because the Nouveau developers are just developing the way
they always have; it's not their fault that the code was shipped and is now
upstream:
Someone who never made a commitment to stability decided to do the
logical thing. They deleted all the old broken interfaces, they cleaned
up their ioctls numbering and they tided up afterwards. I read it as the
action of someone who simply doesnt acknowledge that you have a right to
control their development and is continuing to work in the way they
intended.
But the consensus, at least among those who aren't graphics driver developers,
seems to be that user-space interfaces should only be phased out gradually.
That
gives users and distributions plenty of time to gracefully handle the
interface change. That is essentially how mainline interface changes are
done; even though user-space interfaces are supposed to be maintained
forever, they sometimes do change—after a long deprecation period.
In fact, Ingo Molnar claimed that breaking
an ABI often leads to projects that either die on the vine or do not
achieve the success that they could:
I have _never_ seen a situation where in hindsight breaking the ABI of a
widely deployed project could be considered 'good', for just about any sane
definition of 'good'.
It's really that simple IMO. There's very few unconditional rules in OSS, but
this is one of them.
Ted Ts'o sees handling interface changes gracefully as part of being a
conscientious member of the community. If developers don't want to work
that way, they shouldn't get their code included into distributions:
You say you don't want to do that? Then keep it to your self and
don't get it dropped into popular distributions like Fedora or Ubuntu.
You want a larger pool of testers? Great! The price you need to pay
for that is to be able to do some kind of of ABI versioning so that
you don't have "drop dead flag days".
Had this occurred with a different driver, say for an obscure WiFi device, it is likely there would have been less, or no, outcry. Because X
is such an important, visible part of a user's experience, as well as an
essential tool for testers, breaking it is difficult to hide. Torvalds
has always pushed for more testing of the latest mainline kernels, so it
shouldn't come as a huge surprise that he was less than happy with what
happened here.
This situation has cropped up in various guises along the way. While
developers would like to believe they can control when an ABI falls under
the compatibility guarantee, that really is almost never the case. Once
the interface gets merged, and user space starts to use it, there will be
pressure to maintain it. It makes for a more difficult development
environment in some ways, but the benefit for users is large.
By Jonathan Corbet
March 9, 2010
Almost exactly one year ago, LWN
examined the problem of 4K-sector
drives and the reasons for their existence. In short, going to 4KB
physical sectors allows drive manufacturers to increase storage density,
always welcome in that competitive market. Recently, there have been a
number of reports that Linux is not ready to work with these drives; kernel
developer Tejun Heo even posted an extensive, worth-reading summary
stating that "4 KiB logical sector support is broken in
both the kernel and partitioners." As the subsequent discussion
revealed, though, the truth of the matter is that
we're not quite that badly prepared.
Linux is fully prepared for a change in the size of physical sectors on a
storage device, and has been for a long time. The block layer was written
with an avoidance of hardwired sector sizes in mind. Sector counts and
offsets are indeed managed as 512-byte units at that level of the kernel,
but the block layer is careful to perform all I/O in units of the correct
size. So, one would hope, everything would Just Work.
But, as Tejun's document notes, "unfortunately, there are
complications." These complications result from the fact that the
rest of the world is not prepared to deal with anything other than 512-byte
sectors, starting with the BIOS found on almost all systems. In fact, a
BIOS which can boot from a 4K-sector drive is an exceedingly rare item -
if, indeed, it exists at all. Fixing the BIOS is apparently harder than one
might think, and there is evidently little motivation to do so. Martin
Petersen, who has done much of the work around supporting these drives in
Linux, noted:
Part of the hesitation to work on booting off of 4 KB lbs drives is
motivated by a general trend in the industry to move boot
functionality to SSD. There are 4 KB LBS SSDs out there but in
general the industry is sticking to ATA for local boot.
The problem does not just exist at the BIOS level: bootloaders (whether
they are Linux-oriented or not) are not set up to handle larger sectors;
neither are partitioning tools, not to mention a wide variety of other operating
systems. Something must be done to enable 4K-sector drives to work with
all of this software.
That something, of course, is to interpose a mapping layer in the middle.
So most 4K-sector drives will implement separate logical and physical
sector sizes, with the logical size - the one presented to the host
computer - remaining 512 bytes. The system can then pretend that it's
dealing with the same kind of hardware it has always dealt with, and
everything will just work as desired.
Except that, naturally enough, there are complications. A 512-byte sector
written
to a 4K-sector drive will now force the drive to perform a
read-modify-write cycle to avoid losing the data in the rest of the
sector. That slows things down, of course, and also increases the risk of
data loss should something go wrong in the middle. To avoid this kind of
problem, the operating system should do transfers that are a multiple of
the physical sector size whenever possible. But, to do that, it must know
the physical sector size. As it happens, that information has been made
available; the kernel makes use of this information internally and exports
it via sysfs.
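For those who want to look at the exported values, here is a small sketch of reading them from user space. It assumes the attribute names logical_block_size and physical_block_size in each device's queue directory, which are not named in the text above; the sysfs_root parameter is an addition for testing and is not part of any kernel interface.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys"):
    """Return (logical, physical) sector sizes in bytes for a block device."""
    queue = Path(sysfs_root) / "block" / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def preferred_io_size(device, sysfs_root="/sys"):
    """Smallest transfer size that spares the drive a read-modify-write cycle."""
    logical, physical = sector_sizes(device, sysfs_root)
    return max(logical, physical)
```

On a system with a 4K-sector drive that emulates 512-byte sectors, sector_sizes("sda") would be expected to return (512, 4096), and any write smaller than preferred_io_size("sda") risks the read-modify-write penalty.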
It is not quite that simple, though. The Linux kernel can go out of its
way to use the physical sector size, and to align all transfers on 4KB
boundaries from the beginning of the partition. But that goes badly wrong
if the partition itself is not properly aligned; in this case, every
carefully-arranged 4KB block will overlap two physical sectors - hardly an
optimal outcome.
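The overlap is easy to see with a little arithmetic. The sketch below assumes 512-byte logical sectors, 4096-byte physical sectors, a 4KB filesystem block size, and zero-based sector numbers; the function name is made up for the example.

```python
LOGICAL = 512     # bytes per logical sector
PHYSICAL = 4096   # bytes per physical sector
FS_BLOCK = 4096   # bytes per filesystem block


def physical_sectors_touched(partition_start_lba, block_index):
    """List the physical sectors covered by one filesystem block."""
    first_byte = partition_start_lba * LOGICAL + block_index * FS_BLOCK
    last_byte = first_byte + FS_BLOCK - 1
    return list(range(first_byte // PHYSICAL, last_byte // PHYSICAL + 1))


print(physical_sectors_touched(63, 0))  # [7, 8]: straddles two physical sectors
print(physical_sectors_touched(64, 0))  # [8]: fits in exactly one
```

A partition starting at sector 63 forces every 4KB block across a physical sector boundary, so each write touches two physical sectors and the drive must preserve the parts of both that were not written.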
As it happens, badly-aligned partitions are not just common; they are the
norm. Consider an example: your editor was a lucky recipient of an Intel
solid-state drive
at the Kernel Summit which was quickly plugged into his system and partitioned
for use. It has been a great move: git repositories on an SSD are much
nicer to work with. A quick look at the partition table, though, shows
this:
Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x5361058c
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63    52452224    26226081   83  Linux
Note that fdisk, despite having been taken out of the "DOS
compatibility" mode, is displaying the drive dimensions in units of heads
and cylinders. Needless to say, this device has neither; even on rotating
media, those numbers are entirely fictional; they are a legacy from a dark
time before Linux even existed. But that legacy is still making life
difficult now.
Once upon a time, it was determined that 63 (512-byte) sectors was far more
than anybody would be able to fit into a single disk track. Since
track-aligned I/O is faster on a rotating drive, it made sense to align
partitions so that the data began at the beginning of a track. So,
traditionally, the first partition on a drive begins at (logical)
sector 63, the last sector of the first track. That sector holds the
boot block; any filesystem stored on the partition will follow at the
beginning of the next track.
That placement, of course, misaligns the filesystem with
regard to any physical sector size larger than 512 bytes; logical sector 64
(the first data sector in the partition) will be placed at the end of a 4K
physical sector. Any subsequent
partitions on the device will almost certainly be misaligned in the same
way.
One might argue that the right thing to do is to simply ditch this
particular practice and align partitions properly; it should not be all
that hard to teach partitioning tools about physical sector sizes. This
can certainly be done. The tools have been slow to catch on, but a
suitably motivated system administrator can usually convince them to place
partitions sensibly even now. So weird alignments should not be an
insurmountable problem.
Unfortunately, there are complications. It would appear that Windows XP
not only expects misaligned partitions; it actually will not function
properly without them. One simply cannot run XP on a device which has been
properly partitioned for 4K physical sector sizes. To cope with that, drive
manufacturers have introduced an even worse hack: shifting all 512-byte
logical sectors forward by one, so that logical sector 64 lands at the
beginning of a physical sector. So any partitioning tool which wants to
lay things out properly must know where the origin of the device actually
is - and not all devices are entirely forthcoming with that information.
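A partitioning tool's alignment check therefore needs to take the device's origin into account. The following sketch uses the zero-based sector numbers that fdisk prints (so the traditional first partition starts at 63) and a hypothetical alignment_offset parameter giving the byte offset of the first logical sector that sits on a physical sector boundary. The value 3584 for a shifted drive is derived from the description above (7 * 512 bytes), not read from any real device.

```python
def is_aligned(start_lba, logical=512, physical=4096, alignment_offset=0):
    """True if a partition starting at start_lba begins a physical sector.

    alignment_offset is the byte offset of the first logical sector that
    lies on a physical-sector boundary; it is 0 for an unshifted drive.
    """
    return (start_lba * logical - alignment_offset) % physical == 0


def next_aligned_lba(start_lba, logical=512, physical=4096, alignment_offset=0):
    """First sector at or after start_lba that begins a physical sector."""
    lba = start_lba
    while not is_aligned(lba, logical, physical, alignment_offset):
        lba += 1
    return lba


print(is_aligned(63))                         # False on an unshifted drive
print(is_aligned(63, alignment_offset=3584))  # True on a shifted drive
print(next_aligned_lba(63))                   # 64
```

The same starting sector is thus wrong on one drive and right on another, which is why a tool that cannot learn the offset from the device has no way to lay things out properly.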
With luck, the off-by-one problem will go away before it becomes a big
issue. As James Bottomley put it:
"...fortunately very few of these have been seen in the wild and we're
hopeful they can be shot before they breed." But that doesn't fix
the problem with the alignment of partitions for use by XP. Later versions
of Windows need not concern themselves with this problem, since they rarely
coexist with XP (and Windows has never been greatly concerned about
coexistence with other systems in general). Linux, though, may well be
installed on the same drive as XP; that leads to differing alignment
requirements for different partitions. Making that just work is
not going to be fun.
Martin suggests that it might be best to
just ignore the XP issue:
With regards to XP compatibility I don't think we should go too
much out of our way to accommodate it. XP has been disowned by its
master and I think virtualization will take care of the rest.
It may well be that there will not be a significant number of XP
installations on new-generation storage devices, but failure to support XP
may still create some misery in some quarters.
A related issue pointed out by Tejun is that the DOS partition format,
which is still widely used, tops out at 2TB, which just does not seem all
that large anymore. Using 4K logical sectors in the partition table can
extend that limit as far as 16TB, but, again, that requires cooperation
from the BIOS - and it still does not seem all that large. The long-term
solution would appear to be moving to a partition format like GPT, but that
is not likely to be an easy migration.
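Both limits fall out of the 32-bit sector fields used by the DOS partition table. A quick check of the arithmetic, with sizes in binary units and helper names invented for the example:

```python
MAX_SECTORS = 2 ** 32  # DOS partition entries store sector counts in 32 bits
TIB = 2 ** 40          # bytes in one tebibyte


def max_partition_bytes(sector_size):
    """Largest size addressable with 32-bit sector numbers."""
    return MAX_SECTORS * sector_size


for sector_size in (512, 4096):
    print(sector_size, max_partition_bytes(sector_size) // TIB)
# prints "512 2" and then "4096 16"
```

Moving to 4K logical sectors multiplies the limit by eight, but the field width itself does not change.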
In summary: Linux is not all that badly placed to support 4K-sector drives,
especially when there is no need to share a drive with older operating
systems. There is still work required at the tools level to make that
support work optimally without the need for low-level intervention by
system administrators, but that is, as they say, just a matter of a bit of
programming. As these drives become more widely available, we will be able
to make good use of them.
Page editor: Jonathan Corbet