Brief items
The current development kernel remains 2.6.34-rc3, but -rc4 can be
expected at almost any time. Quite a few patches have gone in since -rc3;
most are fixes, but there's also a 4200-file cleanup from Tejun Heo and
a new driver for Chelsio T4-based Ethernet adapters.
Stable updates:
Greg Kroah-Hartman has announced the release of four separate stable kernels:
2.6.27.46,
2.6.31.13,
2.6.32.11, and
2.6.33.2. These are fairly sizable
updates, weighing in at 45, 89, 116, and 156 patches respectively (at
review time anyway; a few patches may have been dropped). As usual, all users are
strongly encouraged to upgrade. In addition, it sounds like stable updates
for the 2.6.31 series are nearing their end, so users of that kernel should
move to .32 or .33.
4208 files changed, 3717 insertions(+), 717 deletions(-)
--
Tejun Heo casts a wide net for -rc4
I have two machines that show very different performance numbers.
After digging a little I found out that the first machine has, in
/proc/cpuinfo:
model name : Intel(R) Celeron(R) M processor 1.00GHz
while the other has:
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
and that seems to be the main difference.
Now the problem is that /proc/cpuinfo is read only. Would it be possible
to make /proc/cpuinfo writable so that I could do:
echo -n "model name : Intel(R) Core(TM)2 Quad CPU Q6600 @
2.40GHz" > /proc/cpuinfo
in the first machine and get a performance similar to the second machine?
--
Paulo Marques
It's probably buggy as hell, I don't dare try to actually boot the
crap I write.
--
Linus Torvalds
By Jake Edge
April 7, 2010
The perf tracing tool has evolved quickly. When last we looked, Tom
Zanussi had added Python and Perl
scripting to perf. Next up would seem to be perf "live mode", where perf
no longer requires two steps: record the data, then analyze. Live mode
will allow perf trace record and perf trace report to
operate via a pipe, which allows instantaneous, as well as continuously
updating (a la top), output.
So that no existing perf users need to change their scripts, Zanussi only
added the new capabilities when perf recognizes that its record output is going to
stdout or report input is coming from stdin. In that case, perf
handles the data through a pipe, and
uses special synthesized events to provide header information. This will
also allow perf to operate over the network by piping its record
output to netcat, and then reading it via netcat on another system and
piping it into report.
All of the scripts that are installed in the standard perf location
(i.e. those which are listed in perf trace -l) are automatically
able to be run in live mode:
$ perf trace syscall-counts
will run both ends of the syscall-counts script with a pipe in
between, a more usable shorthand for:
$ perf trace record syscall-counts -o - | perf trace report syscall-counts -i -
which itself is shorthand for:
$ perf record -c 1 -f -a -M -R -e raw_syscalls:sys_enter -o - | \
      perf trace -i - -s ~/libexec/perf-core/scripts/python/syscall-counts.py
Zanussi also included several sample top-style scripts that can be used
to monitor read/write or system call activity updated every three seconds.
It looks to be a very useful addition to perf, which is rapidly becoming
the "swiss army knife" of kernel monitoring.
By Jonathan Corbet
April 7, 2010
Every kernel development cycle seems to involve one set of patches which
turn out to be more trouble than had been expected. With 2.6.34, that
award should probably go to the patches found under the somewhat confusing
CONFIG_NO_BOOTMEM option.
"Bootmem" is a simple, low-level memory allocator used by the kernel during
the early parts of the bootstrap process. One might think that the kernel
does not need yet another allocator, but the memory management code used
during operation requires that much of the kernel already be functional
before it can be called. Getting to that point involves a chain of
increasingly complicated memory allocation mechanisms; on the x86
architecture, those begin with the "early_res" mechanism which takes over from
the BIOS "e820" facility. Once things get a little farther, the
architecture-independent bootmem allocator takes over, followed,
eventually, by the full buddy allocator.
Yinghai Lu came to the conclusion that things could be simplified
considerably if the bootmem stage were taken out of the picture. The
result was a series of patches which extends the use of the early_res
mechanism for long enough to bootstrap the buddy allocator. These changes
were merged for 2.6.34, but the old bootmem-based code was left behind.
The CONFIG_NO_BOOTMEM option controls which allocator is used,
with the default being to short out bootmem.
This is a significant change to the crucial and tricky early bootstrap
code, so few people were surprised when some regressions were reported
against 2.6.34-rc1. When the reports continued to arrive after -rc3,
though, the level of irritation began to grow, to the point that Linus started talking about reverting the whole
thing. Nobody seemed to dislike the objectives of the patches, but
system-killer regressions after -rc3, along with the twisted mess of
#ifdefs created by the patch and the fact that it was on by
default led to some grumpiness.
Normally, new features are expected to be configured out by default; to the
greatest extent possible, a new kernel should behave like its
predecessors when the default options are taken. In this case,
the default led to significant changes and problems. The purpose of this
option was twofold: to allow the new code
to be configured out when it proved to be problematic, and to ensure that
it was well tested in the meantime. Certainly it was successful on both
fronts, even if some of the testers proved to be not entirely willing.
As of this writing, it would appear that the worst problems have been
fixed; talk of removing the no-bootmem code has subsided. Eventually,
perhaps, all architectures will make similar changes and the bootmem code
can be removed entirely. Meanwhile, Yinghai has a new set of changes on the horizon for 2.6.35:
replacing the early_res code with the "logical memory block" allocator
currently used by some other architectures. That change looks even more
disruptive than the bootmem elimination was.
Kernel development news
By Jonathan Corbet
April 7, 2010
For some time now, your editor has asserted that, at the kernel level, the
virtualization problem is mostly solved. Much of the remaining work is in
the performance area. That said, making virtualized systems perform well
is not a small or trivial problem. One of the most interesting aspects
of this problem is in the interaction between virtualized guests and host
memory management. A couple of patch sets under discussion illustrate
where the work in this area is being done.
The transparent huge pages patch
set was discussed here back in October. This patch seeks to change how
huge pages are used by Linux applications. Most current huge page users
must be set up explicitly to use huge pages, which, in turn, must be set
aside by the system administrator ahead of time; see the recent series by Mel Gorman
for more information on how this is done. The "some assembly required"
nature of huge pages limits their use in many situations.
The transparent huge page patch, instead, works to provide huge pages to
applications without those applications even being aware that such pages
exist. When large
pages are available, applications may have their scattered pages joined
together into huge pages automatically; those pages can also be split back
apart when the need arises. When the system operates in this mode, huge
pages can be used in many more situations without the need for application
or administrator awareness. This feature turns out to be especially
beneficial when running virtualized guests; huge pages map well to how
guests tend to see and use their address spaces.
The transparent huge page patches have been working their way toward
acceptance, though it should be noted that some developers still have
complaints about this work. Andrew Morton recently pointed out a different problem with this
patch set:
It appears that these patches have only been sent to linux-mm.
Linus doesn't read linux-mm and has never seen them. I do think we
should get things squared away with him regarding the overall
intent and implementation approach before trying to go further...
[T]his is a *large* patchset, and it plays in an area where Linus
is known to have, err, opinions.
It didn't take long for Linus to join the conversation directly; after a
couple of digressions into areas not directly related to the benefits of
the transparent huge pages patch, he realized that this work was motivated
by the needs of virtualization. At that point, he lost interest:
So I thought it was a more interesting load than it was. The
virtualization "TLB miss is expensive" load I can't find it in
myself to care about. "Get a better CPU" is my answer to that one.
He went on to compare the transparent huge
page work to high memory, which, in turn, he called "a
failure". The right solution in both cases, he says, is to get a
better CPU.
It should be pointed out that high memory was a spectacularly successful
failure, extending the useful life of 32-bit systems for some years. It
still shows up in surprising places - your editor's phone is running a
high-memory-enabled kernel. So calling high memory a failure is something
like calling the floppy driver a failure; it may see little use now, but
there was a time when we were glad we had it.
Perhaps, someday, advances in processor
architecture will make transparent huge pages unnecessary as well. But,
while the alternative to high memory (64-bit processors) has been in view
for a long time, it's not at all clear what sort of processor advance might
make transparent huge pages irrelevant. So, should this code get into the
kernel, it may well become one of those failures which is heavily used for
many years.
A related topic under discussion was the recently-posted VMware balloon driver patch. A balloon driver
has an interesting task; its job is to "inflate" within a guest system,
taking up memory and making it unavailable for processes running within the
guest. The pages absorbed by the balloon can then be released back to the
host system which, presumably, has a more pressing need for them
elsewhere. Letting "air" out of the balloon makes memory available to the
guest once again.
The purpose of this driver, clearly, is to allow the host to dynamically
balance the memory needs of its guest systems. It's a bit of a blunt
instrument, but it's the best we have. But Andrew Morton questioned the need for a separate memory
control mechanism. The kernel already has a function, called
shrink_all_memory(), which can be used to force the release of
memory. This function is currently used for hibernation, but Andrew
suspects that it could be adapted to the needs of virtualization as well.
Whether that is really true remains to be seen; it seems that the bulk of
the complexity lies not with the freeing of memory but in the communication
between the guest and the hypervisor. Beyond that, the longer-term
solution is likely to be something more sophisticated than simply applying
memory pressure and watching the guest squirm until it releases enough
pages. As Dan Magenheimer put it:
Historically, all OS's had a (relatively) fixed amount of memory
and, since it was fixed in size, there was no sense wasting any of
it. In a virtualized world, OS's should be trained to be much more
flexible as one virtual machine's "waste" could/should be another
virtual machine's "want".
His answer to this problem is the transcendent memory patch, which
allows the operating system to designate memory which is available for the
taking should the need arise, but which can contain useful data in the
meantime.
This is clearly an area that needs further work. The whole point of
virtualization is to isolate guests from each other, but a more cooperative
approach to memory requires that these guests, somehow, be aware of the
level of contention for resources like memory and respond accordingly.
Like high memory and transparent huge pages, balloon drivers may eventually
be consigned to the pile of failed technologies. Until something better
comes along, though, we'll still need them.
By Jake Edge
April 7, 2010
Today's increasing bandwidth and faster networking hardware have made it
difficult for a single CPU to keep up. Multiple cores and packages have
helped matters on the transmit side, but the receive side is trickier. Tom
Herbert's receive packet
steering (RPS) patches, which we looked at back in November,
provide a way to steer packets to particular CPUs based on a hash of the
packet's protocol data. Those patches were applied to the network subsystem
tree and are bound for 2.6.35, but now Herbert is back with an enhancement
to RPS that will attempt to steer packets to the CPU on which the receiving
application is running: receive
flow steering (RFS).
RFS uses the RPS hash table to store the CPU of an application when it
calls recvmsg() or sendmsg(). Instead of picking an
arbitrary CPU based on the hash and a CPU mask optionally set by an
administrator, as RPS does, RFS tries to use the CPU where the receiving
application is running. Based on the hash calculated on the incoming packet, RFS can look
up the "proper" CPU and assign the packet there.
The RPS CPU masks, which can be set via sysfs for each device (and
queue for devices with multiple queues), represent the allowable CPUs to
assign for a packet. But dynamically changing those values introduces the
possibility of out-of-order packets. For RPS, with largely static CPU
masks, it was not necessarily a big problem. For RFS, however, multiple
threads trying to read from the same socket, while potentially bouncing
around to different CPUs, would cause the CPU value in the hash table to
change frequently, thus increasing the likelihood of out-of-order packets.
For RFS, that was considered to be a "non-starter", Herbert
said, so a different approach was required. To eliminate the out-of-order
packets, two types of hash tables are created, both indexed by the hash
calculated from the packet information. The global
rps_sock_flow_table is populated by the recvmsg() or
sendmsg() call with the CPU number where the application is running
(this is called the "desired" CPU).
Each device queue then gets a rps_dev_flow_table which contains
the most recent CPU used to handle packets for that connection (which is
called the "current" CPU). In addition, the value of the tail queue
counter for the current CPU's backlog queue is stored in the
rps_dev_flow_table entry.
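The two tables can be pictured with a small user-space sketch. This is not the code from Herbert's patch: the structure layouts, the field names, the table size, and the NO_CPU marker are all assumptions made for illustration. Only the roles of the two tables (a "desired" CPU recorded at recvmsg()/sendmsg() time, and a per-queue "current" CPU with its tail counter) come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 256u     /* assumed size; must be a power of two */
#define NO_CPU          0xffffu  /* assumed marker for "no CPU recorded" */

/* Global table: packet hash -> CPU where the application last ran ("desired"). */
struct sock_flow_table {
	uint16_t desired_cpu[FLOW_TABLE_SIZE];
};

/*
 * Per-queue table: packet hash -> CPU the kernel last used ("current"),
 * plus the tail counter of that CPU's backlog queue at enqueue time.
 */
struct dev_flow_entry {
	uint16_t current_cpu;
	uint32_t last_qtail;
};

struct dev_flow_table {
	struct dev_flow_entry flows[FLOW_TABLE_SIZE];
};

/* Start with no desired CPU recorded for any flow. */
static void sock_flow_table_init(struct sock_flow_table *t)
{
	for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
		t->desired_cpu[i] = NO_CPU;
}

/* Conceptually called from recvmsg()/sendmsg(): remember the caller's CPU. */
static void record_desired_cpu(struct sock_flow_table *t, uint32_t hash,
			       uint16_t cpu)
{
	assert(cpu != NO_CPU);
	t->desired_cpu[hash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Look up the desired CPU for an incoming packet's hash. */
static uint16_t lookup_desired_cpu(const struct sock_flow_table *t,
				   uint32_t hash)
{
	return t->desired_cpu[hash & (FLOW_TABLE_SIZE - 1)];
}
```

Because both tables are indexed by a masked hash, two flows whose hashes collide share an entry; in this sketch the later writer simply wins.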
The two CPU values are compared when deciding which CPU to process the
packet on (which is done in get_rps_cpu()). If the current CPU
(as determined from the rps_dev_flow_table hash table) is
unset (presumably for the first packet) or that CPU is offline, the desired
CPU (from rps_sock_flow_table) is used. If the two CPU values are
the same, obviously, that CPU is used. But if they are both valid CPU
numbers, but different, the backlog tail queue counter is consulted.
Backlog queues have a queue head counter that gets incremented when packets
are removed from the queue. Using that and the queue length, a queue tail
counter value can be calculated. That is what gets stored in
rps_dev_flow_table. When the kernel makes its decision about
which CPU to assign the packet to, it needs to consider both the current
(really last used by the kernel) CPU and the desired (last used by an
application for sending or receiving) CPU.
The kernel compares the current CPU's queue tail counter (as stored in the
hash table) with that CPU's queue head counter. If the tail counter is less
than or equal to the head counter,
that means that all packets that were put on the queue by this connection
have been processed. That in turn means that switching to the desired CPU
will not result in out-of-order packets.
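The decision just described can be condensed into a few lines of C. What follows is a sketch under stated assumptions, not the body of get_rps_cpu(): the names choose_cpu(), queue_tail(), struct flow_state and NO_CPU are hypothetical, the handling of an unset desired CPU is a guess, and the signed-difference comparison (which tolerates counter wraparound) is an addition that the text above does not mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_CPU 0xffffu  /* assumed marker for "unset" */

struct flow_state {
	uint16_t current_cpu;  /* CPU the kernel last used for this flow */
	uint32_t last_qtail;   /* tail counter recorded at the last enqueue */
};

/* Tail counter = packets already removed + packets still queued. */
static uint32_t queue_tail(uint32_t head, uint32_t qlen)
{
	return head + qlen;
}

/*
 * Pick the CPU for the next packet of a flow. cpu_online[] and
 * queue_head[] are indexed by CPU number.
 */
static uint16_t choose_cpu(const struct flow_state *flow, uint16_t desired_cpu,
			   const bool *cpu_online, const uint32_t *queue_head)
{
	assert(flow != NULL);
	uint16_t cur = flow->current_cpu;

	/* First packet of the flow, or the old CPU went away. */
	if (cur == NO_CPU || !cpu_online[cur])
		return desired_cpu;

	/* Assumption: with no desired CPU recorded, stay where we are. */
	if (desired_cpu == NO_CPU || desired_cpu == cur)
		return cur;

	/*
	 * tail <= head means everything this flow queued on the current
	 * CPU has been processed, so moving cannot reorder packets.
	 */
	if ((int32_t)(queue_head[cur] - flow->last_qtail) >= 0)
		return desired_cpu;

	return cur;
}
```

If both the current and the desired CPU are unset, the function returns NO_CPU; a caller would presumably fall back to the plain RPS choice in that case.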
Herbert's current patch is for TCP, but RFS should be "usable for other
flow oriented protocols". The benefit is that it can achieve better
CPU locality for the processing of the packet, both by the kernel, and the
application itself. Depending on various factors—cache hierarchy and
application are given as examples—it can and does increase the
packets per second that can be processed as well as lowering the latency
before a packet gets processed. But, interestingly, "on simple
benchmarks, we don't necessarily see improvement and sometimes see degradation".
For more complex benchmarks, the performance increase looks to be
significant. Herbert gave numbers for a netperf run where the transactions
per second went from 104K without either RFS or RPS, to 290K for the best
RPS configuration, and to 303K with RFS and RPS. A different test, with
100 threads handling an RPC-like request/response with some user-space work
being done, was even more dramatic. That test showed 103K, 174K, and 223K
respectively, but also showed a marked decrease in the latency for both
RPS and RPS + RFS.
These patches are coming from Google, which has been known to process a
few packets using the Linux kernel. If RFS is being used on production
systems at Google, that would seem to bode well for its reliability and
performance beyond just benchmarks. The patches were posted April 2 and
seem to have been generally well received, though it is too early to tell when
they might make it into the mainline. Still, it seems rather likely that we
will see them in either 2.6.35 or 2.6.36.
By Jonathan Corbet
April 6, 2010
One day, Andrew Morton was happily reading linux-kernel when he encountered
a patch fixing a minor problem with the "padata" code. Andrew, it seems,
had never heard of padata, which was merged
during the 2.6.34 merge window. So he asked: "OK, on behalf of
thousands I ask: what the heck is kernel/padata.c?"
On behalf of those same thousands, your editor set out to learn what this
new bit of core kernel code does and how to use it.
In short: padata is a
mechanism by which the kernel can farm work out to be done in parallel on
multiple CPUs while retaining the ordering of tasks. It was developed for
use with the IPsec code, which needs to be able to perform encryption and
decryption on large numbers of packets without reordering those packets.
The crypto developers made a point of writing padata in a sufficiently
general fashion that it could be put to other uses as well, but that
requires knowing that the API is there and how to use it. Unfortunately,
they made a bit less of a point of updating the documentation directory.
The first step in using padata is to set up a padata_instance
structure for overall control of how tasks are to be run:
#include <linux/padata.h>
struct padata_instance *padata_alloc(const struct cpumask *cpumask,
struct workqueue_struct *wq);
The cpumask describes which processors will be used to execute
work submitted to this instance. The workqueue wq is where the
work will actually be done; it should be a multithreaded queue, naturally.
There are functions for enabling and disabling the instance:
void padata_start(struct padata_instance *pinst);
void padata_stop(struct padata_instance *pinst);
These functions literally do nothing beyond setting or clearing the
"padata_start() was called" flag; if that flag is not set, other
functions will refuse to work. There must be some perceived value in this
functionality, but the only current padata user (crypto/pcrypt.c)
does not make use of it. So padata_start() looks like one of
those exercises in pointless bureaucracy that we all have to cope with
sometimes.
The list of CPUs to be used can be adjusted with these functions:
int padata_set_cpumask(struct padata_instance *pinst,
cpumask_var_t cpumask);
int padata_add_cpu(struct padata_instance *pinst, int cpu);
int padata_remove_cpu(struct padata_instance *pinst, int cpu);
Changing the CPU mask has the look of an expensive operation, though, so it
probably should not be done with great frequency.
Actually submitting work to the padata instance requires the creation of a
padata_priv structure:
struct padata_priv {
/* Other stuff here... */
void (*parallel)(struct padata_priv *padata);
void (*serial)(struct padata_priv *padata);
};
This structure will almost certainly be embedded within some larger
structure specific to the work to be done.
Most of its fields are private to padata, but the
structure should be zeroed at initialization time, and the
parallel() and serial() functions should be provided.
Those functions will be called in the process of getting the work done as
we will see momentarily.
The submission of work is done with:
int padata_do_parallel(struct padata_instance *pinst,
struct padata_priv *padata, int cb_cpu);
The pinst and padata structures must be set up as
described above; cb_cpu specifies which CPU will be used for the
final callback when the work is done; it must be in the current instance's
CPU mask. The return value from padata_do_parallel() is a little
strange; zero is an error return indicating that the caller forgot the
padata_start() formalities. -EBUSY means that somebody,
somewhere else is messing with the instance's CPU mask, while
-EINVAL is a complaint about cb_cpu not being in that CPU
mask. If all goes well, this function will return -EINPROGRESS,
indicating that the work is in progress.
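Since the "success" value is a negative error code and zero is a failure, callers may find it helpful to funnel the return value through a small classifier. The helper below is hypothetical (neither the enum nor classify_submit() exists in the kernel) and encodes only the four outcomes listed above. It uses the user-space <errno.h> constants so that it can be compiled outside the kernel.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names for the outcomes of padata_do_parallel(). */
enum submit_status {
	SUBMIT_QUEUED,       /* -EINPROGRESS: the work was accepted */
	SUBMIT_NOT_STARTED,  /* 0: padata_start() was never called */
	SUBMIT_MASK_BUSY,    /* -EBUSY: the CPU mask is being changed */
	SUBMIT_BAD_CB_CPU,   /* -EINVAL: cb_cpu is not in the CPU mask */
	SUBMIT_UNKNOWN       /* anything not described in the text */
};

/* Translate a padata_do_parallel() return value into one of the above. */
static enum submit_status classify_submit(int ret)
{
	switch (ret) {
	case -EINPROGRESS:
		return SUBMIT_QUEUED;
	case 0:
		return SUBMIT_NOT_STARTED;
	case -EBUSY:
		return SUBMIT_MASK_BUSY;
	case -EINVAL:
		return SUBMIT_BAD_CB_CPU;
	default:
		return SUBMIT_UNKNOWN;
	}
}
```

The point of the exercise is that the usual "if (ret) goto error;" idiom gets this function exactly backward.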
Each task submitted to padata_do_parallel() will, in turn, be
passed to exactly one call to the above-mentioned parallel()
function, on one CPU, so true parallelism is achieved by submitting
multiple tasks. The workqueue is used to actually make these calls, so
parallel() runs in process context and is allowed to sleep.
The parallel() function gets the
padata_priv structure pointer as its lone parameter; information
about the actual work to be done is probably obtained by using
container_of() to find the enclosing structure.
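The embedding pattern looks roughly like this. The sketch compiles in user space, so it carries its own stand-in for struct padata_priv (showing only the two callbacks) and a simplified container_of() macro; struct checksum_job and checksum_parallel() are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel structure; only the two callbacks are shown. */
struct padata_priv {
	void (*parallel)(struct padata_priv *padata);
	void (*serial)(struct padata_priv *padata);
};

/* Simplified user-space version of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical job description with the padata_priv embedded in it. */
struct checksum_job {
	const unsigned char *data;
	size_t len;
	unsigned int sum;
	struct padata_priv padata;
};

/* The parallel() callback: recover the job, then do the real work. */
static void checksum_parallel(struct padata_priv *padata)
{
	struct checksum_job *job =
		container_of(padata, struct checksum_job, padata);

	assert(job->data != NULL);
	job->sum = 0;
	for (size_t i = 0; i < job->len; i++)
		job->sum += job->data[i];
	/* In the kernel, padata_do_serial(padata) would be called here. */
}
```

A caller would zero the job structure, fill in data and len, point padata.parallel at checksum_parallel(), and pass &job.padata to the submission function.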
Note that parallel() has no return value; the padata subsystem
assumes that parallel() will take responsibility for the task from
this point. The work need not be completed during this call, but, if
parallel() leaves work outstanding, it should be prepared to be
called again with a new job before the previous one completes.
When a task does complete, parallel() (or whatever function actually
finishes the job) should inform padata of the fact with a call to:
void padata_do_serial(struct padata_priv *padata);
At some point in the future, padata_do_serial() will trigger a
call to the serial() function in the padata_priv
structure. That call will happen on the CPU requested in the initial call
to padata_do_parallel(); it, too, is done through the workqueue,
but with local software interrupts disabled.
Note that this call may be deferred for
a while since the padata code takes pains to ensure that tasks are completed in
the order in which they were submitted.
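The ordering guarantee can be modeled with sequence numbers: each task gets a number at submission time, and a finished task is held back until every earlier task has also finished. The single-threaded sketch below illustrates only that idea; it makes no claim about how padata implements it, and every name in it (struct reorder, reorder_submit(), reorder_complete(), log_serial()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8  /* arbitrary capacity for the sketch */

struct reorder {
	bool done[MAX_TASKS];      /* done[i]: task i finished its parallel work */
	unsigned int next_serial;  /* next sequence number to pass to serial() */
	unsigned int submitted;    /* number of sequence numbers handed out */
	void (*serial)(unsigned int seq, void *ctx);
	void *ctx;
};

/* Submission: hand out sequence numbers in order. */
static unsigned int reorder_submit(struct reorder *r)
{
	assert(r->submitted < MAX_TASKS);
	return r->submitted++;
}

/* Completion: mark the task done, then release everything now at the head. */
static void reorder_complete(struct reorder *r, unsigned int seq)
{
	assert(seq < r->submitted && !r->done[seq]);
	r->done[seq] = true;
	while (r->next_serial < r->submitted && r->done[r->next_serial]) {
		r->serial(r->next_serial, r->ctx);
		r->next_serial++;
	}
}

/* Example serial() callback: append the sequence number to a log. */
struct order_log {
	unsigned int seq[MAX_TASKS];
	unsigned int n;
};

static void log_serial(unsigned int seq, void *ctx)
{
	struct order_log *log = ctx;

	log->seq[log->n++] = seq;
}
```

If three tasks are submitted and the third finishes first, nothing is delivered until the first completes; the second's completion then releases both it and the third, in that order. That is the deferral described above.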
The one remaining function in the padata API should be called to clean up
when a padata instance is no longer needed:
void padata_free(struct padata_instance *pinst);
This function will busy-wait while any remaining tasks are completed, so it
might be best not to call it while there is work outstanding. Shutting
down the workqueue, if necessary, should be done separately.
The API as described above is what can be found in the 2.6.34-rc3 kernel.
As was seen back at the beginning of this article, padata is just coming
into more general awareness, and some developers are asking questions about
the API. So changes are possible - but, then, that is true of any internal
kernel interface.
Page editor: Jonathan Corbet