The current development kernel remains 2.6.35-rc3; Linus is on
vacation, so the flow of patches into the mainline is stalled for now.
There have been no stable updates since the releases on June 1.
So V4L1 will finally disappear from the kernel in 2.6.37! There
should be an LWN article as well so everyone (we hope) will be
informed of this in time.
-- Hans Verkuil
(does this count?)
If you decide to base your file system on some algorithms then
please use the original ones from proper academic papers. DO NOT
modify the algorithms in solitude: this is very fragile thing! All
such modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines
of proper level.
-- Edward Shishkin
I considered it a while ago when trying to work out removing the
BKL from this path. After much head banging and an overwhelming
desire to go and get drunk instead I concluded it wasn't possible
to tell by analysis.
So I ack this patch - it's the only way to find out.
-- Alan Cox
Power management under Linux is getting more complex as the kernel's
capabilities grow. It is now possible to control power use through
scheduling policies, idle state management, device states, and so on.
Unfortunately, some power management choices have performance consequences;
depending on the use to which the system is being put, those consequences
may not be welcome. So there must be a way for system administrators to
control how power management decisions are made.
Currently, that control is exercised through a number of individual system
parameters. One controls whether the scheduler tries to coalesce processes
onto a subset of the system's CPUs in the hope of letting others sleep.
Another knob tells the idle governor which sleep states it is able to use.
Yet another controls CPU frequency and voltage response. Simply knowing
about all of the available parameters is hard; keeping them all tuned
properly can be harder yet.
Len Brown has proposed the addition of an
overall control parameter for power management, to be found in
/sys/power/policy_preference. This knob would have five settings,
ranging from "maximum performance at all times" to "save as much power as
possible without actually turning the system off." With a control like
this, system administrators could control system power policy without
having to learn about all of the individual parameters involved; policy
choices would also be applied to any new power-management parameters added
in the future.
The idea was not universally loved, though. Some commenters asked for more
than five settings, but Len argued that anybody needing more complex
configurations should just continue to use the individual parameters.
Others fear that the single policy might be interpreted differently by
different drivers, leading to inconsistent results; they would rather see
the continued use of individual parameters which exactly describe the
desired behavior. The real discussion, though, cannot happen until some
actual code has been posted, if and when that happens.
Systems running PHP are naturally beset with more than the usual number of
challenges from the outset. In some cases, though, it can get even worse;
consider this story from Edward Allcutt:
For example, a common configuration for PHP web-servers includes
apache's prefork MPM, mod_php and a PHP opcode cache utilizing
shared memory. In certain failure modes, all requests serviced by
PHP result in a segfault. Enabling coredumps might lead to 10-20
coredumps per second, all attempting to write a 150-200MB core
file. This leads to the whole system becoming entirely unresponsive
for many minutes.
Edward's response to this non-fun situation was a patch limiting the number
of core dumps which can
be underway simultaneously; any dumps which would exceed the limit would
simply be skipped.
It was generally agreed that a better approach would be to limit the I/O
bandwidth of offending processes when contention gets too high. But that
approach is not entirely straightforward to implement, especially since
core dumps are considered to be special and not subject to normal bandwidth
control. So what's likely to happen instead is a variant of Edward's patch
where processes trying to dump core simply wait if too many others are
already doing the same.
The suspend blocker discussion has faded away for the time being - a
situation which has drawn few complaints. Developers are still thinking
about the underlying problems addressed by suspend blockers, though, as can
be seen from this patch from
Rafael Wysocki. Rafael is trying to solve the problem of "wakeup events"
(events requiring action which would wake a suspended device) being lost if
they show up while the system is suspending.
In Rafael's approach, there would be a new sysfs attribute called
/sys/power/wakeup_count; it would contain the number of wakeup
events seen by the system so far. Any process can read this attribute at
any time to obtain this count; privileged processes can also write a count
back to the file. There is a twist, though: if the count written to the
file does not match the count which would be read from it, the write will
fail. A write also triggers a mechanism whereby any subsequent wakeup
events will cause an attempted suspend operation to abort.
As with some other scenarios which have been posted, Rafael is assuming the
existence of a user-space power management daemon which would decide when
to suspend the system. This decision would be made when the daemon knows
that no important user-space program has work to do. Without extra help,
though, there will be a window between the time that the daemon decides to
suspend the system and when that suspend actually happens; a wakeup event
which arrives within that window could be lost, or at least improperly
delayed until after the system is resumed again. But, with the
wakeup_count mechanism, the kernel would notice when this kind of race had
happened and abort the suspend process, allowing user space to process the
event before trying again.
For this mechanism to work, the kernel must be able to count wakeup events;
that, in turn, requires sprinkling calls to a new
pm_wakeup_event() function into drivers which can generate such
events. So internally it doesn't look all that different from suspend
blockers. Some of the comments have suggested that the scheme is quite
similar to suspend blockers on the user-space side too, though Rafael
believes that it avoids the aspects of that API which generated the most
criticism. Regardless, reviews were mixed, and Android developer Arve Hjønnevåg thinks that this approach will not meet that
project's needs. So this discussion probably has more rounds to go.
Kernel development news
Kees Cook is back with another proposal for
a kernel change that would, at least in his mind, provide more security,
this time by restricting the ptrace() system call.
But, like his earlier symbolic
link patch, this one is not being particularly well-received on linux-kernel.
It has, however, sparked some discussion of a topic that seems to recur
with some frequency in that venue: stacking Linux security modules (LSMs).
Cook's patch is fairly straightforward; it creates a sysctl called
ptrace_scope that defaults to zero, which chooses the existing
behavior. If it is set to one, though, it only allows ptrace() to
be called on descendants of the tracing process. The idea is to stop a
vulnerability in one program (Pidgin, say) from being used to trace another
program (like Firefox or GPG-agent), which would allow extracting credentials or other
sensitive information. Like the previous symlink patch, it is based on a
patch that has long been in the grsecurity kernel.
As with the previous proposal, Alan Cox was quick to suggest that it be put into an LSM:
So NAK. If you want to use bits of grsecurity then please just write
yourselves a grsecurity kernel module that uses the security hooks
properly and stop messing up the core code. It's all really quite simple,
the [infrastructure] is there, so use it.
But, one problem with that plan is that LSMs do not stack. One can
have SELinux, Smack, and TOMOYO enabled in a kernel, but only
one—chosen at boot time—can be active. There have been discussions and proposals for LSM stacking (or
chaining) along the way, but nothing has ever been merged. So, two
"specialized" LSMs cannot do their separate jobs in the kernel and users
will have to choose between them.
For "full-featured" solutions, like SELinux, that isn't really a problem,
as users can find or create policies to handle their security
requirements. In addition, James Morris points
out that SELinux has a boolean, allow_ptrace, to do what
Cook is trying to do: "You don't need to write any policy, just set
it [allow_ptrace] to 0". But, for those that don't want to use
SELinux, that's no solution. As Ted Ts'o puts it:
i think we really need to have stacked LSM's, because there is a large set
of people who will never use SELinux. Every few years, I take another
look at SELinux, my head explodes with the (IMHO unneeded complexity),
and I go away again...
Yet I would really like a number of features such as this ptrace scope idea ---
which I think is a useful feature, and it may be that stacking is the only
way we can resolve this debate. The SELinux people will never believe that
their system is too complicated, and I don't like using things that are impossible
for me to understand or configure, and that doesn't seem likely to change anytime
in the near future.
Others were also favorably disposed toward combining LSMs, though the
consensus seems to be for chaining LSMs in the security core rather than
stacking, as was done with SELinux and Linux capabilities
(i.e. security/commoncap.c). In the stacking model, each LSM
is responsible for calling out to any other secondary LSMs for each
operation, whereas chaining is "just a walk over a list of
security_operations" calling each LSM's version from the core, as
Eric W. Biederman described. But it's not as easy as it might seem at
first glance, as Serge E. Hallyn, who proposed a stacking mechanism in
2004, points out:
The general answer tends to be "generic stacking doesn't work, LSMs
need to know about each other." But even for that (as evidenced by
the selinux+commoncap experience with stacking) is hairy, and more
to the point it probably does not scale when we have 5-10 small
LSMs. I.e. LSM 1 wants to prevent some action while LSM 2 requires
that action to succeed so that it can properly prevent another
action. Concrete examples are buried in the stacker discussions
on the lsm list from 2004-2005.
It seems that there may be some discussion of LSM stacking/chaining at the
security summit, as part of Cook's presentation on "widely used, but
out-of-tree" security solutions, but perhaps also in a "beer BOF" that
Hallyn is proposing.
The way forward for both of Cook's recent proposals
looks to be as an LSM and, to that end, he has posted the Yama LSM, which incorporates the
symlink protections and ptrace() limitations that he
previously posted. In addition, it adds the ability to restrict hard links
such that they cannot be created for files that are either sensitive
(e.g. setuid) or those that are not readable and writable by the link
creator. Each of these measures can be enabled separately by sysctls.
While "Yama" might make one start looking for completions of an acronym
("Yet Another ..."), it is actually named for a deity: "Yama is roughly the 'lord of
death/underworld' in Buddhist and Hindu tradition, kind of over-seeing
the rehabilitation of impure entities", Cook said. Given the number
of NAKs that his recent patch proposals have received,
calling Yama the "NAKed Access Control
system" shows a bit of a sense of humor about the situation.
DAC, MAC, RBAC, and others would now be joined by NAC if Yama gets merged.
So far, discussion of Yama has been fairly light, and without any major
complaints. While some are rather skeptical of the protections that Cook
has been proposing, they are much less likely to care if they live in an
LSM, rather than "junk randomly spewed [all] over the tree",
as Cox put it.
Once these simpler security
tasks are encapsulated into an LSM, Morris said,
the kernel hackers "can evaluate the possible need for some form of
stacking or a security library API" to allow these
measures to coexist with SELinux, AppArmor, and others. Given the fairly
broad support for the LSM approach, it would seem that Yama, or some
descendant, could make it into the mainline. Whether that translates
to some kind of general mechanism for combining LSMs in interesting ways
remains to be seen, but it should be worth watching.
The Btrfs filesystem is seen by many as the primary Linux filesystem for
the next decade or so. It brings a next-generation design and a wide range
of features (snapshots, data checksums, internal RAID, etc.) that users
have been waiting for. Despite being merged for 2.6.29, Btrfs remains an
experimental development, but some of the more adventurous distributions
are beginning to offer Btrfs installation options, and MeeGo has chosen Btrfs
as its default filesystem. So when a filesystem developer started calling Btrfs
"broken by design," people took notice.
Edward Shishkin, perhaps better known for his efforts to keep reiser4
development alive, first posted some
concerns on June 3. It seems he ran a simple test: create a new
Btrfs filesystem, then create 2048-byte files until space runs out. Others
have talked about suboptimal space efficiency in Btrfs before, but Edward
was still surprised that he was only able to use 17% of the nominal space
in the filesystem before it was reported as being full. Such poor
efficiency was, according to Edward,
evidence that Btrfs was "broken by design" and should not be used:
The first obvious point here is that we *can not* put such file
system to production. Just because it doesn't provide any
guarantees for our users regarding disk space utilization.... As
to current Btrfs code: *NOT ACK*!!! I don't think we need such
file systems.
Part of the problem comes down to the use of "inline extents" in Btrfs.
The core data structure on a Btrfs filesystem is a B-tree
which provides access to all of the objects stored in the filesystem. For
larger files, the actual file data is stored in extents, which are pointed
to from within the tree. Small extents, though, can be stored in the tree
itself, hopefully yielding both better space efficiency and better
performance. If these extents are sized inconveniently, though, they can
cause a lot of wasted space. There's only room for one 2048-byte inline
extent in a B-tree node, leaving 1800 bytes or so of unused space. That is
a lot of internal fragmentation - a lot of wasted space.
As noted in Chris Mason's response, there
are two approaches which can be taken to mitigate this kind of problem.
One is to turn off inline extents altogether; Btrfs has a
max_inline= mount option which can be used for just that purpose.
The other approach would be to allow inline extents to be split between
tree nodes so that the pieces could be sized to fill those nodes exactly.
Btrfs cannot do that, and probably will not be able to anytime soon:
I didn't put in the splitting simply because the complexity was
high while the benefits were low (in comparison with just turning
off the inline extents).
Chris also noted that most of the other variable-size items stored in
B-tree nodes - extended attributes, for example - can be split between
nodes if need be. So these items should not cause fragmentation problems;
it's mainly the inline extents which are at fault there.
But, as Edward pointed out, there's more to
the problem than inline extents. In his investigations, he's found
numerous places where groups of nearly-empty nodes exist; some were less
than 1% utilized. That, in all likelihood, is the real source of the worst
space utilization problems. To Edward, this behavior is another sign that
the algorithms used in Btrfs are all wrong and in need of a redesign.
Chris sees it a little differently, though:
The current code is clearly choosing not to merge two leaves that
it should have merged, which is just a plain old bug.
He has promised to track it down and post a fix. Between the bug fix and
turning off inline extents (or, at least, reducing their maximum size), it
is hoped that the worst space utilization problems in Btrfs will be no more.
That fix has not been posted as of this writing, so its effectiveness
cannot yet be judged. But, chances are, this is not a case of a filesystem
needing a fundamental redesign. Instead, all it needs is more extensive
testing, some performance tuning, and, inevitably, some bug fixes. The
good news is that the process seems to be working as it should be: these
problems have been found before any sort of wide-scale deployment of this
very new filesystem.
The original workqueue code found its way into the mainline without a great
deal of discussion or debate; it was a clear improvement over what came
before. Tejun Heo's concurrency-managed workqueues
(CMWQ) rework has the potential to be a significant improvement as well, but its
path toward merging has not been so smooth. The fifth iteration of the patch set
is currently under discussion. While a number of concerns have been
addressed, others have come out of the woodwork to replace them.
The CMWQ work is intended to address a number of problems with current
kernel workqueues. At the top of the list is the proliferation of kernel
threads; current workqueues can, on a large system, run the kernel out of
process IDs before user space ever gets a chance to run. Despite all these
threads, current workqueues are not particularly good at keeping the system
busy; workqueues may contain a backlog of work while the CPU sits idle.
Workqueues can also be subject to deadlocks if locking is not handled very
carefully. As a result, the kernel has grown a number of
workarounds and some competing deferred-work mechanisms.
To resolve these problems, the CMWQ code maintains a set of worker threads
on each processor; these threads are shared between workqueues, so the
system is not overrun with workqueue-specific threads. The special
scheduler class once used by CMWQ is long gone, but the code still has hooks
into the scheduler which it can use to track which worker threads are
actually executing at any given time. If all workqueue threads on a CPU
have blocked waiting on some resource, and if there is queued work to do,
the CMWQ code will kick off a new thread to work on it. The CMWQ code can
run multiple jobs from the same CPU concurrently - something the current
workqueue code will not do. In this way, the
CPU is always kept busy as long as there is work to be done.
The first complaint that came back this time was that many developers had
long since forgotten what CMWQ was all about, and Tejun had not put that
information into the patch series introduction. He made up for that with
an overview document explaining the basics
of the code. That led quickly to a new complaint: the lack of dedicated
worker threads means that it is no longer possible to change the scheduling
behavior of specific workqueues.
There were two variants of this complaint. Daniel Walker lamented the loss of the ability to change the
priority of workqueue threads from user space. Tejun has firmly denied
that this is a useful thing to be able to do, and Daniel has not, yet,
shown an example of where it would be desirable. Andrew Morton, instead,
worries about being able to change
scheduling behavior from within the kernel; that is something that at least
one driver does now. He might be willing
to let this capability go, but he's not happy about it:
Oh well. Kernel threads should not be running with RT policy
anyway. RT is a userspace feature, and whenever a kernel thread
uses RT it degrades userspace RT qos. But I expect that using RT
in kernel threads is sometimes the best tradeoff, so let's not
pretend that we're getting something for nothing here!
Tejun's reply to this concern takes a couple of forms. One is that
workqueues are intended for general-purpose asynchronous work, and that is
how almost all callers use it. It would be better, he says, to make
special mechanisms for situations where they are really needed. To that
end, he has posted a simple kthread_worker API which can be
used for the creation of special-purpose worker threads. Essentially, one
starts by setting up a kthread_worker structure:
DEFINE_KTHREAD_WORKER(worker);

/* ... or ... */

struct kthread_worker worker;
init_kthread_worker(&worker);
Then, a kernel thread should be set up using the (existing)
kthread_create() or kthread_run() utilities, but passing
a pointer to kthread_worker_fn() as the actual function to run:
struct task_struct *thread;
thread = kthread_run(kthread_worker_fn, &worker, "name" ...);
Thereafter, it's just a matter of filling in kthread_work
structures with actual work to be done and queueing them with:
bool queue_kthread_work(struct kthread_worker *worker,
struct kthread_work *work);
So far, there has been no real commentary on this patch.
The other thing which could be done is to associate attributes like
priority and CPU affinity with the work to be done instead of with the
thread doing the work. That would require expanding the workqueue API to
allow this information to be specified; the CMWQ code would then tweak
worker threads accordingly when passing jobs to them. At this point,
though, it's not clear that there is enough need for this feature to
justify the added complexity that it would require.
The CMWQ code certainly adds a bit of complexity already, though it makes
up for some of that by replacing the slow work and asynchronous function call
mechanisms. Tejun is hoping to drop it into linux-next shortly, and,
presumably, to get it merged for 2.6.36. Whether that will happen remains
to be seen; core kernel changes can be hard, and this one may not, yet,
have cleared its last hurdle.
Page editor: Jonathan Corbet