Brief items
The current development kernel is 2.6.39-rc3,
released on April 11. Linus said:
It's been another almost spookily calm week. Usually this kind of
calmness happens much later in the -rc series (during -rc7 or -rc8,
say), but I'm not going to complain. I'm just still waiting for the
other shoe to drop.
And it is possible that this really ended up being a very calm
release cycle. We certainly didn't have any big revolutionary
changes like the name lookup stuff we had last cycle. So I'm
quietly optimistic that no shoe-drop will happen.
The short-form changelog is in the
announcement, or see the
full changelog for all the details.
Stable updates: no stable updates have been released in the last
week. The 2.6.38.3 update is in the review
process as of this writing; it can be expected sometime on or after
April 14. The 2.6.33.10 and 2.6.32.37 updates are
also in the review process, with an expected release on or after
April 15.
If we were going to use the SELinux reference policy, I would
completely agree. However, looking at something that would focus
more on the SELinux privacy controls would limit the complexity.
You're correct that it might be easier to create an LSM that does
exactly what we want. Because of MeeGo policies though, that would
mean I would have to get it upstreamed first before we could use it
and that would be problematic.
--
Ryan Ware
That field will contain internal state information which is not
going to be exposed to anything outside the core code - except via
accessor functions. I'm tired of everyone fiddling in
irq_desc.status.
core_internal_state__do_not_mess_with_it is clear enough, annoying
to type and easy to grep for. Offenders will be tracked down and
slapped with stinking trouts.
--
Thomas Gleixner
As many LWN readers will have heard, long-time kernel developer David Brownell recently passed away. His contributions to the code are many, but it is clear that they were outweighed by his contributions to the community. He will be much missed.
A collection point has been set up for condolences to be passed on to David's family. People outside of our community are often not fully aware of the role a loved one plays in the community; this is a chance to let David's family know more about how many lives he touched and how valuable his work was. If you would like to share your memories of David, they may be sent to dbrownell-condolences@kernel.org; from there, they will be passed on to his family.
By Jonathan Corbet
April 12, 2011
The KVM subsystem provides native virtualization support in the Linux
kernel. To that end, it provides a virtualized CPU and access to memory,
but not a whole lot more; some other software component is needed to
provide virtual versions of all the hardware (console, disk drives, network
adapters, etc) that a kernel normally expects to find when it boots. With
KVM, a version of the
QEMU
emulator is normally used to provide that hardware. While QEMU is stable
and capable, it is not universally loved; a competitor has just come along
that may not displace QEMU, but it may claim some of its limelight.
Just over one year ago, LWN covered an extended
discussion about KVM, and, in particular, about the version of QEMU
used by KVM. At that time, there were some suggestions that QEMU should be
forked and brought into the kernel source tree; the idea was that faster
and more responsive development would result. That fork never happened,
and the idea seemed to fade away.
That idea is now back, in a rather different form, with Pekka Enberg's announcement of the "native KVM tool." In
short, this tool provides a command (called kvm) which can
substitute for QEMU - as long as nobody cares about most of the features
provided by QEMU. The native tool is able to boot a kernel which can talk
over a serial console. It lacks graphics support, networking, SMP support,
and much more, but it can get to a login prompt when run inside a terminal
emulator.
Why is such a tool interesting? There seem to be a few, not entirely
compatible reasons. Replacing QEMU is a nice idea because, as Avi Kivity
noted, "It's an ugly gooball."
The kvm code - being new and with few features - is compact,
clean, and easy to work with. Some developers have said that kvm
makes debugging (especially for early-boot problems) easier, but others
doubt that it can ever replace QEMU, with its extensive hardware emulation,
in that role. There's also talk of moving kvm toward the
paravirtualization model in the interest of getting top performance, but
there is also resistance to doing anything which would make it unable to
run native kernels.
Developers seem to like the idea of this project, and chances are that it
will go somewhere even if it never threatens to push QEMU aside. There are
a few complaints about the kvm name - QEMU already has a
kvm command and the name is hard to search for anyway - but no
alternative names seem to be in the running as of this writing. Regardless
of its name, this project may be worth watching; it's clearly the sort of
tool that people want to hack on.
Kernel development news
By Jonathan Corbet
April 12, 2011
The
Berkeley
Packet Filter (BPF) is a mechanism for the fast filtering of network
packets on their way to an application. It has its roots in BSD in the
very early 1990s, a history that was not enough to prevent the SCO Group
from claiming ownership of it. Happily, that claim proved to be as
valid as the rest of SCO's assertions, so BPF remains a part of the Linux
networking stack. A recent patch from Eric Dumazet may make BPF faster, at
least on 64-bit x86 systems.
The purpose behind BPF is to let an application specify a filtering
function to select only the network packets that it wants to see. An early
BPF user was tcpdump, which used BPF to implement the filtering behind
its complex command-line syntax. Other packet capture programs also make
use of it. On Linux, there is another interesting application of BPF: the
"socket filter" mechanism allows an application to filter incoming packets
on any type of socket with BPF. In this mode, it can function as a sort of
per-application firewall, eliminating packets before the application ever
sees them.
The original BPF distribution came in the form of a user-space library, but
the BPF interface quickly found its way into the kernel. When network
traffic is high, there is a lot of value in filtering unwanted packets
before they are copied into user space. Obviously, it is also important
that BPF filters run quickly; every bit of per-packet overhead is going to
hurt in a high-traffic situation. BPF was designed to allow a wide variety
of filters while keeping speed in mind, but that does not mean that it
cannot be made faster.
BPF defines a virtual machine which is almost Turing-machine-like in its
simplicity. There are two registers: an accumulator and an index
register. The machine also has a small scratch memory area, an implicit
array containing the packet in question, and a small set of arithmetic,
logical, and jump instructions. The accumulator is used for arithmetic
operations, while the index register provides offsets into the packet or
into the scratch memory areas. A very simple BPF program (taken from the
1993 USENIX
paper [PDF]) might be:
        ldh  [12]
        jeq  #ETHERTYPE_IP, l1, l2
    l1: ret  #TRUE
    l2: ret  #0
The first instruction loads a 16-bit quantity from offset 12 in the packet
to the accumulator; that value is the Ethernet protocol type field. It
then compares the value to see if the packet is an IP packet or not; IP
packets are accepted, while anything else is rejected. Naturally, filter
programs get more complicated quickly. Header length can vary, so the
program will have to calculate the offsets of (for example) TCP header
values; that is where the index register comes into play. Scratch memory
(which is the only place a BPF program can store to) is used when
intermediate results must be kept.
The Linux BPF implementation can be found in net/core/filter.c; it
provides "standard" BPF along with a number of Linux-specific ancillary
instructions which can test whether a packet is marked, which CPU the
filter is running on, which interface the packet arrived on, and more. It
is, at its core, a long switch statement designed to run the BPF
instructions quickly. This code has seen a number of enhancements and
speed improvements over the years, but there has not been any fundamental
change for a long time.
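To make the "long switch statement" idea concrete, here is a toy user-space interpreter that handles just enough of the machine to run the four-instruction example above. It is not the code in net/core/filter.c: the opcode names, the instruction layout, and the toy_ prefix are all invented for this sketch, and only the accumulator (not the index register or scratch memory) is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcodes named after the mnemonics in the listing (hypothetical encoding). */
enum toy_op { TOY_LDH_ABS, TOY_JEQ_K, TOY_RET_K };

struct toy_insn {
    enum toy_op op;
    uint32_t k;     /* immediate: packet offset, comparison value, or verdict */
    uint8_t jt, jf; /* instructions to skip when a jump test is true / false */
};

/* Run a filter over one packet; returns the filter's verdict (0 = reject). */
uint32_t toy_run(const struct toy_insn *prog, size_t len,
                 const uint8_t *pkt, size_t pkt_len)
{
    uint32_t acc = 0;   /* the accumulator */
    size_t pc = 0;      /* index of the next instruction */

    assert(prog != NULL && pkt != NULL);
    while (pc < len) {
        const struct toy_insn *in = &prog[pc++];

        switch (in->op) {
        case TOY_LDH_ABS:   /* load a big-endian 16-bit value from the packet */
            if ((size_t)in->k + 2 > pkt_len)
                return 0;   /* out-of-bounds load rejects the packet */
            acc = ((uint32_t)pkt[in->k] << 8) | pkt[in->k + 1];
            break;
        case TOY_JEQ_K:     /* conditional forward jump */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case TOY_RET_K:     /* finish with a verdict */
            return in->k;
        }
    }
    return 0;               /* fell off the end: reject */
}

/* The example program: accept IP (Ethernet type 0x0800), reject the rest. */
const struct toy_insn toy_accept_ip[] = {
    { TOY_LDH_ABS, 12, 0, 0 },
    { TOY_JEQ_K, 0x0800, 0, 1 },
    { TOY_RET_K, 0xffffffffu, 0, 0 },
    { TOY_RET_K, 0, 0, 0 },
};
```

Every packet pays for the loop, the instruction fetch, and the switch dispatch on each instruction; that per-instruction overhead is what any faster implementation has to attack.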
Eric Dumazet's patch is a fundamental
change: it puts a just-in-time compiler into the kernel to translate BPF
code directly into the host system's assembly code. The simplicity of the
BPF machine makes the JIT translation relatively simple; every BPF
instruction maps to a straightforward x86 instruction sequence. There are
a few assembly language helpers which help to implement the virtual
machine's semantics; the accumulator and index are just stored in the
processor's registers. The resulting program is placed in a bit of
vmalloc() space and run directly when a packet is to be tested.
A simple benchmark shows a 50ns savings for
each invocation of a simple filter - that may seem small, but, when
multiplied by the number of packets going through a system, that difference
can add up quickly.
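To put a number on "add up quickly," assume, purely for illustration, a system filtering one million packets per second (a made-up rate, not one taken from the benchmark):

```latex
50\,\text{ns/packet} \times 10^{6}\,\text{packets/s}
  = 5 \times 10^{7}\,\text{ns/s}
  = 50\,\text{ms of CPU time per second}
```

Under that assumption, the savings would come to roughly 5% of one CPU.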
The current implementation is limited to the x86-64 architecture; indeed,
that architecture is wired deeply into the code, which is littered with
hard-coded x86 instruction opcodes. Should anybody want to add a second
architecture, they will be faced with the choice of simply replicating the
whole thing (it is not huge) or trying to add a generalized opcode
generator to the existing JIT code.
An obvious question is: can this same approach be applied to iptables,
which is more heavily used than BPF? The answer may be "yes," but it might
also make more sense to bring back the nftables idea, which is built on a BPF-like
virtual machine of its own. Given that there has been some talk of using
nftables in other contexts (internal packet classification for packet
scheduling, for example), the value of a JIT-translated nftables could be
even higher. Nftables is a job for another day, though; meanwhile, we have
a proof of the concept for BPF that appears to get the job done nicely.
April 13, 2011
This article was contributed by Jens Axboe
Since the dawn of time, or for at least as long as I have been involved,
the Linux kernel has deployed a concept called "plugging" on block devices.
When I/O is queued to an empty device, that device enters a plugged
state. This means that I/O isn't immediately dispatched to the low-level
device driver; instead, it is held back by this plug. When a process is
going to wait on the I/O to finish, the device is unplugged and request
dispatching to the device driver is started. The idea behind plugging is to
allow a buildup of requests to better utilize the hardware and to allow
merging of sequential requests into one single larger request. The latter
is an especially big win on most hardware; writing or reading bigger chunks
of data at a time usually yields good improvements in bandwidth. With the
release of the 2.6.39-rc1 kernel, block device plugging was drastically
changed. Before we go into that, let's take a historical look at how plugging
has evolved.
Back in the early days, plugging a device involved global state. This was
before SMP scalability was an issue, and having global state made it easier
to handle the unplugging. If a process was about to block for I/O, any
plugged device was simply unplugged. This scheme persisted in pretty much
the same form until the early versions of the 2.6 kernel, where it began to
severely impact SMP scalability on I/O-heavy workloads.
In response to this problem, the plug state was
turned into a per-device entity in
2004. This scaled well, but now you suddenly had
no way to unplug all devices when going to sleep waiting for page I/O. This
meant that the virtual memory subsystem had to be able to unplug the
specific device that would
be servicing page I/O. A special hack was added for this:
sync_page() in struct
address_space_operations; this hook would unplug
the device of interest.
If you have a more complicated I/O setup with device
mapper or RAID components, those layers would in turn unplug any lower-level
device. The unplug event would thus percolate down the stack. Some
heuristics were also added to auto-unplug the device if a certain depth of
requests had been added, or if some period of time had passed before the
unplug event was seen. With the asymmetric nature of plugging where the
device was automatically plugged but had to be explicitly unplugged, we've
had our fair share of I/O stall bugs in the kernel. While crude, the
auto-unplug would at least ensure that we would chug along if someone
missed an unplug call after I/O submission.
With really fast devices hitting the market, once again plugging had become
a scalability problem and hacks were again added to avoid this. Essentially
we disabled plugging on solid-state devices that were able to do queueing. While
plugging originally was a good win, it was time to reevaluate things. The
asymmetric nature of the API was always ugly and a source of bugs, and the
sync_page() hook was always hated by the memory management
people. The time had come to rewrite the whole thing.
The primary use of plugging was to allow an I/O submitter to send down
multiple pieces of I/O before handing it to the device. Instead of
maintaining these I/O fragments as shared state in the device, a new
on-stack structure
was created to contain this I/O for a short period, allowing the submitter
to build up a small queue of related requests.
The state is now tracked in struct blk_plug, which is little
more than a linked list and a should_sort flag informing
blk_finish_plug() whether or not to sort this list before flushing
the I/O. We'll come back to that later.
    struct blk_plug {
        unsigned long magic;
        struct list_head list;
        unsigned int should_sort;
    };
The magic member is a temporary addition to detect uninitialized use cases;
it will eventually be removed. The new API to do this is straightforward
and simple to use:
struct blk_plug plug;
blk_start_plug(&plug);
submit_batch_of_io();
blk_finish_plug(&plug);
blk_start_plug() takes care of initializing the structure and
tracking it inside the task structure of the current process. The latter is
important to be able to automatically flush the queued I/O should the task
end up blocking between the call to blk_start_plug() and
blk_finish_plug(). If that happens, we want to ensure that pending
I/O is sent off to the devices immediately. This is important from a
performance perspective, but also to ensure that we don't deadlock. If the
task is blocking for a memory allocation, memory management reclaim could
end up wanting to free a page belonging to a request that is currently
residing on our private plug. Similarly, the caller may itself end up
waiting for some of the plugged I/O to finish. By flushing this list when
the process goes to sleep, we avoid these types of deadlocks.
If blk_start_plug() is called and the task already has a plug structure
registered, it is simply ignored. This can happen in cases where the upper
layers plug for submitting a series of I/O, and further down in the call
chain someone else does the same. I/O submitted without the knowledge of the
original plugger will thus end up on the originally assigned plug, and be
flushed whenever the original caller ends the plug by calling
blk_finish_plug(), or if some part of the call path goes to sleep or is
scheduled out.
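The registration and nesting rules described above can be modeled in user space. The sketch below is only that, a model: every toy_ name is invented here, a fixed-size array stands in for the linked list, and a global pointer stands in for the plug pointer in the task structure. It is not the block layer's code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IO 16

/* Stand-in for struct blk_plug: a small on-stack batch of pending I/O. */
struct toy_plug {
    int pending[TOY_MAX_IO];
    size_t count;
};

/* Stand-in for the plug pointer kept in the task structure. */
struct toy_plug *toy_current_plug;

/* Total number of I/Os handed to the "device" so far. */
size_t toy_dispatched;

static void toy_flush(struct toy_plug *plug)
{
    toy_dispatched += plug->count;
    plug->count = 0;
}

void toy_start_plug(struct toy_plug *plug)
{
    assert(plug != NULL);
    plug->count = 0;
    /* A nested plug is initialized but never registered. */
    if (toy_current_plug == NULL)
        toy_current_plug = plug;
}

void toy_submit_io(int io)
{
    struct toy_plug *plug = toy_current_plug;

    if (plug == NULL) {          /* no plug: dispatch at once */
        toy_dispatched++;
        return;
    }
    if (plug->count == TOY_MAX_IO)
        toy_flush(plug);         /* batch is full: make room */
    plug->pending[plug->count++] = io;
}

void toy_finish_plug(struct toy_plug *plug)
{
    toy_flush(plug);             /* a nested plug is always empty */
    if (plug == toy_current_plug)
        toy_current_plug = NULL;
}

/* What a scheduler hook would do when the task is about to sleep. */
void toy_task_blocks(void)
{
    if (toy_current_plug != NULL)
        toy_flush(toy_current_plug);
}
```

In this model, I/O submitted between a nested toy_start_plug() and toy_finish_plug() pair lands on the outer plug and stays there until the outer caller finishes or the task blocks, which mirrors the behavior described in the text.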
Since the plug state is now device agnostic, we may end up in a situation
where multiple devices have pending I/O on this plug list. These may end up
on the plug list in an interleaved fashion, potentially causing
blk_finish_plug() to grab and release the related queue locks multiple
times. To avoid this problem, a should_sort flag in the
blk_plug structure
is used to keep track of whether we have I/O belonging to more than one
distinct queue pending. If we do, the list is sorted to group identical
queues together. This scales better than grabbing and releasing the same
locks multiple times.
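The effect of sorting is easy to see in a small user-space model. In the sketch below, a request is reduced to a queue number, and the flush routine simply counts how often it would have to take a queue lock; the toy_ names and the use of qsort() are assumptions made for the example, not a description of how the kernel sorts its list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A pending request, reduced to the one field that matters here. */
struct toy_request {
    int queue_id;   /* which device queue the request belongs to */
};

static int toy_cmp_queue(const void *a, const void *b)
{
    const struct toy_request *ra = a, *rb = b;

    return (ra->queue_id > rb->queue_id) - (ra->queue_id < rb->queue_id);
}

/*
 * Flush a plug list and return how many times a queue lock had to be taken.
 * The lock is only dropped and re-taken when the queue changes.
 */
size_t toy_flush_list(struct toy_request *reqs, size_t n, int should_sort)
{
    size_t lock_grabs = 0;

    assert(reqs != NULL || n == 0);
    if (should_sort)
        qsort(reqs, n, sizeof(reqs[0]), toy_cmp_queue);

    for (size_t i = 0; i < n; i++) {
        if (i == 0 || reqs[i].queue_id != reqs[i - 1].queue_id)
            lock_grabs++;
        /* ... hand reqs[i] to its queue here ... */
    }
    return lock_grabs;
}
```

With four requests interleaved across two queues, the unsorted flush takes a lock four times in this model, while the sorted flush takes one only twice.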
With this new scheme in place, the device need no longer be notified of
unplug events. The queue unplug_fn() used to exist for this purpose
alone; it has now been removed. For most drivers it is safe to just remove
this hook and the related code. However, some drivers used plugging to
delay I/O operations in response to resource shortages. One example of that
was the SCSI
midlayer; if we failed to map a new SCSI request due to a memory shortage,
the queue was plugged to ensure that we would call back into the dispatch
functions later on. Since this mechanism no longer exists, a similar API
has been provided for such use cases. Drivers may now use blk_delay_queue()
for this:
blk_delay_queue(queue, delay_in_msecs);
The block layer will re-invoke request queueing after the specified number
of milliseconds have passed. It will be invoked from process context, just
as it would have been with the unplug event. blk_delay_queue()
honors the queue stopped state, so if blk_stop_queue() was called
before blk_delay_queue(), or if it is called after the fact but
before the delay has passed, the request handler will not be
invoked. blk_delay_queue() must only be used for conditions where
the caller doesn't necessarily know when that condition will change
states. If resources internal to the driver cause it to need to halt
operations for a while, it is more efficient to use
blk_stop_queue() and blk_start_queue() to manage those
directly.
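The interaction between a delayed re-run and the stopped state can be captured in a few lines of user-space code. This is a model of the rule as stated above, with invented toy_ names and a caller-supplied clock; it says nothing about how the block layer actually implements the delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for a request queue; all names are hypothetical. */
struct toy_queue {
    bool stopped;            /* set by toy_stop_queue() */
    bool delay_armed;        /* a delayed re-run is pending */
    unsigned long run_at_ms; /* when the delayed re-run is due */
    unsigned int runs;       /* how often the request handler ran */
};

void toy_stop_queue(struct toy_queue *q)
{
    assert(q != NULL);
    q->stopped = true;
}

/* Ask for the request handler to be re-invoked delay_ms from now. */
void toy_delay_queue(struct toy_queue *q, unsigned long now_ms,
                     unsigned long delay_ms)
{
    assert(q != NULL);
    q->delay_armed = true;
    q->run_at_ms = now_ms + delay_ms;
}

/* Timer tick: run the handler if the delay expired and the queue is live. */
void toy_tick(struct toy_queue *q, unsigned long now_ms)
{
    assert(q != NULL);
    if (!q->delay_armed || now_ms < q->run_at_ms)
        return;
    q->delay_armed = false;
    if (!q->stopped)         /* the stopped state always wins */
        q->runs++;
}
```

Whether the queue is stopped before the delay is requested or while it is pending makes no difference in this model: the check happens when the delay expires, so the handler is skipped either way.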
These changes have been merged for the 2.6.39 kernel. While a
few problems have been found (and fixed), it would appear that the plugging
changes have been integrated without greatly disturbing Linus's calm
development cycle.
By Jake Edge
April 13, 2011
The recently held Linux
Foundation Collaboration Summit (LFCS) had its traditional kernel panel
on April 6 at which
Andrew Morton, Arnd Bergmann, James Bottomley, and Thomas
Gleixner sat down to discuss the kernel with moderator Jonathan Corbet.
Several topics were covered, but the current struggles in the ARM community were clearly at
the forefront of the minds of participants and audience members alike.
Each of the kernel hackers introduced themselves, some with tongue planted
firmly in cheek, such as Bottomley with a declaration that he was on the
panel "to meet famous kernel developers", and Morton who said
he spent most of his time trying to figure out what the other kernel
hackers are doing to the memory management subsystem. Bergmann was a bit
modest about his contributions, so Gleixner pointed out that Bergmann had
done the last chunk of work required to remove the big kernel lock, which
was greeted with a big round of applause. For his part, Gleixner was a bit
surprised to find out that he manages bug reports for NANA flash (based
on a typo on the giant slides on either side of the stage), but noted that he
specialized in
"impossible tasks" like getting the realtime preemption
patches into the mainline piecewise.
There is a "high-level architectural issue" that Corbet wanted
the panel to tackle first, and that was the current problems in the ARM
world. It is "one of our more important architectures", he
said, without which we wouldn't have all these different Android phones to
play with. So, it is "discouraging to see that there is a
mess" in the ARM kernel community right now. What's the situation,
he asked, and how can we improve things?
For a long time, the problem in the ARM community was convincing
system-on-chip (SoC) and
board vendors to get their code upstream, Bergmann said, but now there is a
new problem in that they all "have their own subtrees that don't work
very well together". Each of those trees is going its own way,
which means that core and driver code gets copied "five times or twenty
times" into different SoC trees.
Corbet asked how the kernel community can do better with respect to
ARM. Gleixner noted that ARM maintainer Russell King tries to push back on
bad code coming in, "but he simply doesn't scale". There are
70 different sub-architectures and 500 different SoCs in the ARM tree, he
said. In addition, "people have been pushing sub-arch trees directly
to Linus", Bergmann said, so King does not have any control over
those. It is a consequence of the "tension between cleanliness and
time-to-market", Bottomley said.
Gleixner thinks that the larger kernel community should be providing the
ARM vendors with "proper abstractions" and that because of a
lack of a big picture view, those vendors cannot be expected to come up
with those themselves. By and large the ARM vendor community has a
different mindset that comes from other operating systems where changes to
the core code were impossible, so 500 line workarounds in drivers were the
norm. Bergmann suggested that the vendors get the code reviewed and
upstream before shipping products with that code. Morton said that
as the "price of admission" vendors need to be asked to
maintain various pieces horizontally across the ARM trees. Actually
motivating them to do that is difficult, he said.
From the audience, Wolfram Sang asked whether more code review for the ARM
patches would help. All agreed that more code review is good, but
Bottomley expressed some reservations because there are generally only a
few reviewers that a subsystem maintainer can trust to spot important
issues, so all code review is not created equal. Morton suggested a
"review economy" where one patch submitter needs to review the
code of another and vice versa. That would allow developers to justify the
time spent reviewing code to their managers. But, Bottomley said,
"collaborating with competitors" is a hard concept for
organizations that are new to open source development.
If a driver looks like one that is already in the tree, it should not be
merged, and instead someone needs to
get the developers to work with the existing driver, Bergmann said. There
is a lot of reuse of IP blocks in SoCs, but the developers aren't aware of
it because different teams are working on the different SoCs, Gleixner said.
The kernel
community needs people that can figure that out, he said. Bottomley
observed that "the first question should be: did anyone do it before and
can I 'steal' it?".
In response to an audience question about the panel's thoughts on Linaro,
Bergmann, who works with Linaro, said "I think it's great" with
a smile. He went on to say that Linaro is doing work that is closely related
to the ARM problems that had been discussed. Getting different SoC vendors
to work together is a big part of what Linaro is doing, and
"everyone suffers" if that collaboration doesn't happen.
"ARM is one of the places where it [collaboration] is needed
most", he said.
Control groups
The discussion soon shifted to control groups, with Corbet noting that they
are becoming more pervasive in the kernel, but that lots of kernel hackers
hate them. It will soon be difficult to get a distribution to boot and run
without control groups, he said, and wondered if adding them to the kernel
was the right move: "did we make a
mistake?" Gleixner said that there is nothing wrong with control
groups conceptually, "just that the code is a
horror". Bottomley lamented the code that is getting "grafted
onto the side of control groups" as each resource in the kernel that is
getting controlled requires reaching into multiple subsystems in rather
intrusive ways.
As with "everything that sucks" in the kernel, the control groups code
needs to be
cleaned up by someone who looks at it from a global perspective; that
person will have to
"reimplement it and radically modify it", Gleixner said. That
is difficult to do because it is both a technical and a political problem,
Bottomley said. The technical part is to get the interaction right,
while the political part is that it is difficult to make changes across
subsystem boundaries in the kernel.
But Morton said that he hadn't seen much in the way of specific complaints
about control groups cross his desk. Conceptually, it extends what an
operating system should do in terms of limiting resources. "If it's
messy, it's because of how it was developed" on top of a production
kernel that gets updated every three months. Bottomley said that the
problem with doing cross-subsystem work is often just a matter of
communication, but it also requires someone to take ownership and talk to
all of the affected subsystems rather than just picking the "weakest
subsystem" and getting changes in through there.
Corbet wondered if the independence of subsystems in the kernel, something
that was very helpful in allowing its development to scale, was changing.
The panel seemed to think there wasn't much of an issue there, that while
control groups crossed a lot of boundaries, naming five things like that in
the kernel would be hard to do, as Bottomley pointed out.
Twenty years ahead
With the 20-year anniversary of Linux being celebrated this year, Jon
Masters asked from the audience what things would be like 20 years from
now. Bottomley promptly replied that four-fifths of the panel would be
retired, but Gleixner expected that the 2038 bug would have brought them
all back out of retirement. Morton said that unless some kind of quantum
computer came along to make Linux obsolete, it would still be there in 20 years.
He also expected that the first thing to be done with any new quantum
computer would be to add an x86 emulation layer.
When Corbet posited that perhaps the realtime preempt code would be merged
by then, Gleixner made one of his strongest predictions yet for merging that
code: "I am planning to be done with it before I retire".
More seriously, he said that it is on a good track, he has talked to the
relevant subsystem maintainers, and is optimistic about getting it all
merged—eventually.
In 20 years, the kernel will still be supporting the existing user-space
interfaces, Corbet said. He quoted Morton from a recent kernel mailing list post: "Our hammer is kernel patches and all problems
look like nails", and wondered whether there was a problem with how
the kernel hackers developed user-space interfaces. Morton noted that the
quote was about doing more pretty printing inside the kernel, which he is
generally opposed to. It has been done in the past because it was
difficult for the kernel hackers to ship user-space code, so that it would
stay in sync with kernel changes. But perf has demonstrated that the
kernel can ship user-space code, which could be a way forward.
Gleixner noted that there was quite a bit of resistance to shipping perf,
but that it worked out pretty well as a way to "keep the strict
connection between the kernel and user space". Perf is meant to be
a simple tool to allow users to try out perf events gathering, he said, and
people are building more full-blown tools on top of perf. Having
tools shipped with the kernel allows more freedom to experiment with the
ABI, Bottomley said. Morton said that there needs to be a middle ground,
noting that Google had a patch that exported a procfs file that
contained a shell script inside.
Ingo Molnar recently pointed out that FreeBSD is getting Linux-like quality
with a much smaller development community and suggested that it was because
the user space and kernel are developed together. Corbet asked whether
Linux was holding itself back by not taking that route. Bottomley thought
that Molnar was "both right and wrong", and that FreeBSD has
an entire distribution in its kernel tree. "I hope Linux never gets
to that", he said.
From perf to control groups, FreeBSD to ARM, as usual, the panel ranged
over a number
of topics in the hour allotted. The format and participants vary from year
to year, but it is always interesting to hear what kernel developers are
thinking about issues that Linux is facing.
Page editor: Jonathan Corbet