The current development kernel is 2.6.38-rc5, released on February 15. The patch
volume is dropping (a bit) as this kernel stabilizes, so there are not a lot
of new features, but there are some important bug fixes here. Details can
be found in the announcement.
Stable updates: the
18.104.22.168 (115 patches),
22.214.171.124 (176 patches), and
126.96.36.199 (272 patches!) updates are
currently in the review process; these updates can be expected once the
review period completes.
Comments (none posted)
So never _ever_ mark anything "deprecated". If you want to get rid
of something, get rid of it and fix the callers. Don't say
"somebody else should get rid of it, because it's deprecated".
And yes, next time this discussion comes up, I _will_ remove that
piece-of-sh*t. It's a disease. It's just a stupid way to say
"somebody else should deal with this problem". It's a way to make
excuses. It's crap. It was a mistake to ever take any of that to
-- Linus Torvalds
Hey, if that's what it takes to get __deprecated removed i'll bring
it up tomorrow!!
-- Ingo Molnar
Comments (7 posted)
Scott James Remnant has posted a
surprisingly detailed description
of how to use the process connector
to get process events from the kernel, combined with use of socket filters
to reduce the information flow. "As I mentioned before, the proc
connector is built on top of the generic connector and that itself is on
top of netlink so sending that subscription message also involves embedding
a message, inside a message, inside a message. If you understood
Christopher Nolan's Inception, you should do just fine."
Comments (10 posted)
Users of the MD (multiple disk or RAID) subsystem in Linux may be
interested in the MD roadmap recently posted by
maintainer Neil Brown. It discusses a number of things he has planned for
MD in quite a bit of detail; as Neil put it:
A particular need I am finding for this road map is to make
explicit the required ordering and interdependence of certain
tasks. Hopefully that will make it easier to address them in an
appropriate order, and mean that I waste less time saying "this is
too hard, I might go read some email instead".
There are a lot of enhancements in the pipeline. A bad block log would
allow RAID arrays to continue functioning in the presence of bad blocks
without needing to immediately eject the offending drive. There is a
variant on "hot replace" which would allow a new drive to be inserted
before removing the old one, thus allowing the array to continue with a
full complement of drives while the new one is being populated. Tracking
of areas which are known not to contain useful data would reduce
synchronization costs. A number of proposed enhancements to the "reshape"
functionality would make it more robust and flexible and allow operations
to be undone. A number of other changes are contemplated as well; see
Neil's post for the full list.
Comments (4 posted)
The CFS scheduler does its best to divide the available CPU time between
contending processes, keeping the CPU utilization of each about the same.
The scheduler will not, however, insist on equal utilization when there is
free CPU time available; rather than let the CPU go idle, the scheduler
will give any left-over time to processes which can make use of it. This
approach makes sense; there is little point in throttling runnable
processes when nobody else wants the CPU anyway.
Except that, sometimes, that's exactly what a system administrator may want
to do. Limiting the maximum share of CPU time that a process (or group of
processes) may consume can be desirable if those processes belong to a
customer who has only paid for a certain amount of CPU time or in
situations where it is necessary to provide strict resource-use isolation
between processes. The CFS scheduler cannot limit CPU use in that manner,
but the CFS bandwidth control patches,
posted by Paul Turner, may change that situation.
This patch adds a couple of new control files to the CPU control group
mechanism: cpu.cfs_period_us defines the period over which the
group's CPU usage is to be regulated, and cpu.cfs_quota_us
controls how much CPU time is available to the group over that period.
With these two knobs, the administrator can easily limit a group to a
certain amount of CPU time and also control the granularity with which that
limit is enforced.
Paul's patch is not the only one aimed at solving this problem; the CFS hard limits patch set from Bharata B Rao
provides nearly identical functionality. The implementation is
different, though; the hard limits patch tries to reuse some of the
bandwidth-limiting code from the realtime scheduler to impose the limits.
Paul has expressed concerns about the overhead of using this code and how
well it will work in situations where the CPU is almost fully subscribed.
These concerns appear to have carried the day - there has not been a hard
limits patch posted since early 2010. So the CFS bandwidth control patches look
like the form this functionality will take in the mainline.
Comments (3 posted)
Kernel development news
Many years ago, your editor ported a borrowed copy of the original BSD
vi editor to VMS; after all, using EDT was the sort of activity
that lost its
charm relatively quickly. DEC's implementation of C for VMS wasn't too
bad, so most of the port went reasonably well, but there was one hitch: the
code assumed that two calls to sbrk() would return
virtually contiguous chunks of memory. That was true on early BSD systems,
but not on VMS. Your editor, being a fan of elegant solutions to
programming problems, solved this one by simply allocating a massive array
at the beginning, thus ensuring that the second sbrk() call would
never happen. Needless to say, this "fix" was never sent back upstream
(the VMS uucp port hadn't been done yet in any case) and has long since
vanished from memory.
That said, your editor was recently amused by this
message on the golang-dev list indicating that the developers of the Go
language have adopted a solution of equal elegance. Go has memory
management and garbage collection built into it; the developers believe
that this feature is crucial, even in a systems-level programming
language. From the FAQ:
One of the biggest sources of bookkeeping in systems programs is
memory management. We feel it's critical to eliminate that
programmer overhead, and advances in garbage collection technology
in the last few years give us confidence that we can implement it
with low enough overhead and no significant latency.
In the process of trying to reach that goal of "low enough overhead and no
significant latency," the Go developers have made some simplifying
assumptions, one of which is that the memory being managed for a running
application comes from a single, virtually-contiguous address range. Such
assumptions can run into the same problem your editor hit with vi
- other code can allocate pieces in the middle of the range - so the Go
developers adopted the same solution: they simply allocate all the memory
they think they might need (they figured, reasonably, that 16GB should
suffice on a 64-bit system) at startup time.
That sounds like a bit of a hack, but an effort has been made to make
things work well. The memory is allocated with an mmap() call,
using PROT_NONE as the protection parameter. This call is meant
to reserve the range without actually instantiating any of the memory; when
a piece of that range is actually used by the application, the protection
is changed to make it readable and writable. At that point, a page fault
on the pages in question will cause real memory to be allocated. Thus,
while this mmap() call will bloat the virtual address size of the
process, it should not actually consume much more memory until the running
program actually needs it.
This mechanism works fine on the developers' machines, but it runs into
trouble in the real world. It is not uncommon for users to use
ulimit -v to limit the amount of virtual memory available to
any given process; the purpose is to keep applications from getting too
large and causing the entire system to thrash. When users go to the
trouble to set such limits, they tend, for some reason, to choose numbers
rather smaller than 16GB. Go applications will fail to run in such an
environment,
even though their memory use is usually far below the limit that the user
set. The problem is that ulimit -v does not restrict memory
use; it restricts the maximum virtual address space size, which is a very
different thing.
One might argue that, given what users typically want to do with
ulimit -v, it might make more sense to have it restrict
resident set size instead of virtual address space size. Making that
change now would be an ABI change, though; it would also make Linux
inconsistent with the behavior of other Unix-like systems. Restricting
resident set size is also simply harder than restricting the virtual
address space size. But even if this change could be
made, it would not help current users of Go applications, who may not
update their kernels for a long time.
One might also argue that the Go developers should dump the contiguous-heap
assumption and implement a data structure which allows allocated memory to
be scattered throughout the virtual address space. Such a change also
appears not to be in the cards, though; evidently that assumption makes
enough things easy (and fast) that they are unwilling to drop it. So some
other kind of solution will need to be found. According to the original
message, that solution will be to shift allocations for Go programs (on
64-bit systems) up to a range of memory starting at 0xf800000000.
No memory will be allocated until it is needed; the runtime will simply
assume that nobody else will take pieces of that range in between
allocations. Should that assumption prove false, the application will die.
For now, that assumption is good; the Linux kernel will not hand out memory
in that range unless the application asks for it explicitly. As with many
things that just happen to work, though, this kind of scheme could break at
any time in the future. Kernel policy could change, the C library might
begin doing surprising things, etc. That is always the hazard of relying
on accidental, undocumented behavior. For now, though, it solves the
problem and allows Go programs to run on systems where users have
restricted virtual address space sizes.
It's worth considering what a longer-term solution might look like. If one
assumes that Go will continue to need a large, virtually-contiguous heap,
then we need to find a way to make that possible. On 64-bit systems, it
should be possible; there is a lot of address space available, and the cost
of reserving unused address space should be small. The problem is that
ulimit -v is not doing exactly what users are hoping for; it
regulates the maximum amount of virtual memory an application can use, but
it has relatively little effect on how much physical memory an application
consumes. It would be nice if there were a mechanism which controlled
actual memory use - resident set sizes - instead.
As it turns out, we have such a mechanism in the memory controller. Even better, this
controller can manage whole groups of processes, meaning that an
application cannot increase its effective memory limit by forking. The
memory controller is somewhat resource-intensive to use (though work is
being done to reduce its footprint) and, like other control group-based
mechanisms, it's not set up to "just work" by default. With a bit of work,
though, the memory controller could replace ulimit -v and do
a better job as well. With a suitably configured controller running, a Go
process could run without limits on address space size and still be
prevented from driving the system into thrashing. That seems like a more
elegant solution, somehow.
Comments (13 posted)
The ioctl() system call has a bad reputation for a number of
reasons, most of which are related to the fact that every implemented
command is, in essence, a new system call. There is no way to effectively
control what is done in an ioctl() call and, for many obscure drivers, no
way to really even know what is going on without digging through a lot of
old code. So it's not surprising that code adding new ioctl()
commands tends to be scrutinized heavily. Recently it turned out that
there's another reason to be nervous about ioctl(): it doesn't
play well with security modules, and SELinux has been treating it
incorrectly for the last couple of years.
SELinux works by matching a specific access attempt against the permissions
granted to the calling process. For system calls like write(),
the type of access is obvious - the process is attempting to write to an
object. With ioctl(), things are not quite so clear. In past
times, SELinux would attempt to deal with ioctl() calls by looking
at the specific command to figure out what the process was actually trying
to do; a FIBMAP command, for example (which reads a map of a
file's block locations) would be allowed to proceed if the calling process
had the permission to read the file's attributes.
There are a couple of problems with this approach, starting with the fact
that the number of possible ioctl() commands is huge. Even
without getting into obscure commands implemented by a single driver,
trying to enumerate them all and determine their effects is a road to
madness. But it gets worse, in that the intended behavior of a given
command may not match what a specific driver actually does in response to
that command. So the only way to really know what an ioctl()
command will do is to figure out what driver is behind the call, and to
have some knowledge of what each driver does. Simply
creating this capability is not a task for sane people; maintaining it
would not be a task for anybody wanting to remain sane. So security module
developers were looking for a better way.
They thought they had found one when somebody realized that the command
codes used by ioctl() implementations are not random numbers.
They are, instead, a carefully-crafted 32-bit quantity which includes an
8-bit "type" field (approximately identifying the driver implementing the
command), a driver-specific command code, a pair of read/write bits, and a
size field. Using the read/write bits seemed like a great way to figure
out what sort of access the ioctl() call needed without actually
understanding the command. Thus, a
patch to SELinux was merged for 2.6.27 which ripped out the command
recognition and simply used the read/write bits in the command code to
determine whether a specific call should be allowed or not.
That change remained for well over two years until Eric Paris noticed that, in fact, it made no sense at
all. Most ioctl() calls involve the passing of a data structure
into or out of the kernel; that structure describes the operation to be
performed or holds data returned from the kernel - or both. The size field
in the command code is the size of this structure, and the permission bits
describe how the structure will be accessed by the kernel. Together, that
can be used by the core ioctl() code to determine whether the
calling process has the proper access rights to the memory behind the
pointer passed to the kernel.
What those bits do not do, as Eric pointed out, is say anything
about what the ioctl() call will do to the object identified by
the file descriptor passed to the kernel. A call passing read-only data to
the kernel may reformat a disk, while a call with writable data may just be
querying hardware information. So using those bits to determine whether
the call should proceed is unlikely to yield good results. It's an
observation which seems obvious when spelled out in this way, but none of
the developers working on security noticed the problem at the time.
So that code has to go - but, as of this writing, it has not been changed
in the mainline kernel. There is a simple reason for that: nobody really
knows what sort of logic should replace it. As discussed above, simply
enumerating command codes with expected behavior is not a feasible solution
either. So something else needs to be devised, but it's not clear what
that will be.
Stephen Smalley pointed out one approach
which was posted
back in 2005. That patch required drivers (and other code implementing
ioctl()) to provide a special table associating each command code
with the permissions required to execute the command. The obvious
objections were raised at that time: changing every driver in the system
would be a pain, ioctl() implementations are already messy enough
as it is, the tables would not be maintained as the driver changed, and so
on. The idea was eventually dropped. Bringing it back now seems unlikely
to make anybody popular, but there is probably no other way to truly track
what every ioctl() command is actually doing. That knowledge
resides exclusively in the implementing code, so, if we want to make use of
that knowledge elsewhere, it needs to be exported somehow.
Of course, the alternative is to conclude that (1) ioctl() is a
pain, and (2) security modules are a pain. Perhaps it's better to
just give up and hope that discretionary access controls, along with
whatever checks may be built into the driver itself, will be enough. That
is, essentially, the solution we have now.
Comments (8 posted)
There has recently been much attention paid to the group CPU scheduling
feature built into the Linux kernel. Using group scheduling, it is
possible to ensure that some groups of processes get a fair share of the
CPU without being crowded out by a rather larger number of CPU-intensive
processes in a different group. Linux has supported this feature for some
years, but it has languished in relative obscurity; it is only with recent
efforts to make group scheduling "just work" that it has started to come
into wider use. As it happens, the kernel has a very similar feature for
managing access to block I/O devices which is also, arguably, underused.
In this case, though, I/O group scheduling is not as completely implemented
as CPU scheduling, but some ongoing work may change that situation.
The "completely fair queueing" (CFQ) I/O scheduler tries to divide the
available bandwidth on any given device fairly between the processes which
are contending for that device. "Bandwidth" is measured not in the number
of bytes transferred, but the amount of time that each process gets to
submit requests to the queue; in this way, the code tries to penalize
processes which create seek-heavy I/O patterns. (There is also a mode based
solely on the number of I/O operations submitted, but your editor suspects
it sees relatively little use). The CFQ scheduler also supports group
scheduling, but in an incomplete way.
Imagine a group hierarchy containing three control
groups (plus the default root group), and four processes running within
those groups. If every process were contending fully for the available I/O
bandwidth, and they all had the same I/O priority, one would expect that
bandwidth to be split equally between
P0, Group1, and Group2; thus P0 should
get twice as much I/O bandwidth as either P1 or P3. If more
processes were to be added to the root, they should be able to take I/O
bandwidth at the expense of the processes in the other control groups.
Similarly, the creation of new control groups underneath Group1
should not affect anybody outside of that branch of the hierarchy. In
current kernels, though, that is not how things work.
With the current implementation of CFQ group scheduling, the above
hierarchy is effectively flattened.
The CFQ group scheduler currently treats all groups - including the root
group - as being equal, at the same level in the hierarchy. Every group is
a top-level group. This level of grouping will be adequate for a number of
situations, but there will be other users who want the full hierarchical
model. That is why control groups were made to be hierarchical in the
first place, after all.
The hierarchical CFQ group scheduling patch
set from Gui Jianfeng aims to make that feature available. These
patches introduce a new cfq_entity structure which is used for
the scheduling of both processes and groups; it is clearly modeled after
the sched_entity structure used in the CPU scheduling code. With
this in place, the I/O scheduler can just give bandwidth to the top-level
cfq_entity which has run up the least "vdisktime" so far;
if that entity happens to be a group, the scheduling code drops down a
level and repeats the process. Sooner or later, the entity which is
scheduled for I/O will be an actual process, and the scheduler can start
dispatching I/O requests.
This patch set is on its fourth revision; the previous iterations have led
to significant changes. It appears that there are still a few things to
fix up, but this work seems to be getting closer to being ready.
One thing is worth bearing in mind: there are two I/O bandwidth controllers
in contemporary Linux kernels: the proportional bandwidth controller (built
into the CFQ scheduler) and the throttling controller built into the block
layer. The group scheduling changes only apply to the proportional
bandwidth controller. Arguably there is less need for full group
scheduling with the throttling controller, which puts absolute caps on the
bandwidth available to specific processes.
Controlling I/O bandwidth has a lot of applications; providing some
isolation between customers on a shared hosting service is an obvious
example. But this feature may yet prove to have value on the desktop as
well; many interactivity problems come down to contention for I/O
bandwidth. Anybody who has tried to start an office suite while
simultaneously copying a video image on the same drive understands how bad
it can be. If the group I/O scheduling feature can be made to "just work"
like the group CPU scheduling, we may have made another step toward a truly
responsive Linux desktop.
Comments (1 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet