Brief items
The current development kernel is 3.3-rc2, released on January 31 - a
little later than would have ordinarily been expected. "The diffstat is
pretty flat - indicative of mostly small changes spread out. Which is what I
like seeing, and we don't always see at this point. There's some file movement
(8250-based serial and the arm mx5 -> imx merge), but otherwise really not
a lot of excitement. Good." That said, there are quite a few
changes in this prepatch; see the short-form changelog in the announcement
for details. Thirteen of those changes are reverts for patches that didn't
work out.
Stable updates: there have been no stable updates in the last week.
The 2.6.32.56, 3.0.19, and 3.2.3 stable updates are in the review
process as of this writing; they can be expected on or after
February 3.
It wouldn't be the first time lockdep & ftrace live locked the
system. Or made it so unbearably slow. Lockdep and ftrace do not
play well together. They both are very intrusive. The two remind me
of the United States congress. Where there is two parties trying to
take control of everything, but nothing ever gets done. We end up
with a grid/live lock in the country/computer.
--
Steven Rostedt
I can see some vindictive programmer doing that, while thinking
"I'll show these people who pointed out this bug in my code,
mhwhahahahaa! I'll fix their test-case while still leaving the
real problem unaddressed", but I don't think compiler people are
quite *that* evil. Yes, they are evil people who are trying to
trip us up, but still..
--
Linus Torvalds
In that way, my philosophy of ext4 is that it should be like the
Linux kernel; it's an evolutionary process and central planning is
often overrated. People contribute to ext4 for many different
reasons, and that means they optimize ext4 for their particular
workloads. Like Linus for Linux, we're not trying to architect for
"world domination" by saying, "hmm, in order to 'take out'
reiserfs4, we'd better implement features foo and bar".
--
Ted Ts'o
You're making the assumption that users are informed and
knowledgable, and all filesystem developers should know this is
simply not true. Users repeatedly demonstrate that they don't know
how filesystems work, don't understand the knobs that are
provided, don't understand what their applications do in terms of
filesystem operations and don't really understand their data
sets. Education takes time and effort, but still users make the
same mistakes over and over again.
--
Dave Chinner
Looks like there are more dragons and hidden trapdoors in the drm
release path than actual lines of code.
--
Daniel Vetter
The Linux Foundation has announced that Greg Kroah-Hartman has joined the
organization as a fellow. "In
his role as Linux Foundation Fellow, Kroah-Hartman will continue his work
as the maintainer for the Linux stable kernel branch and a variety of
subsystems while working in a fully neutral environment. He will also work
more closely with Linux Foundation members, workgroups, Labs projects, and
staff on key initiatives to advance Linux."
The deadline for requests to attend the 2012 storage, filesystem, and
memory management summit is February 5 (the event happens
April 1-2 in San Francisco). Any developers who would like to be
there and have not expressed their interest should do so in the very near
future.
Kernel development news
By Jonathan Corbet
January 31, 2012
Herbert Poetzl recently reported an
interesting performance problem. His SSD-equipped laptop could read data
at about 250MB/s with the 2.6.38 kernel, but performance dropped to
25-50MB/s on anything more recent. An order-of-magnitude performance drop
is just not the sort of benefit that most people look forward to when
upgrading their kernel, so this report quickly gained the attention of a
number of developers. The resolution of the problem turned out to be
simple, but it offers an interesting view of how high-performance disk I/O
works in the kernel.
An explanation of the problem requires just a bit of background, and, in
particular, the definition of a couple of terms. "Readahead" is the
process of speculatively reading file data into memory with the idea that
an application is likely to want it soon. Reasonable performance when
reading a file sequentially depends on proper readahead; that is the only
way to ensure that reading and consuming the data can be done in parallel.
Without readahead, applications will spend more time than necessary waiting
for data to be read from disk.
"Plugging," by contrast, is the process of stopping I/O request submissions to
the low-level device for a period of time. The motivation for plugging is
to allow a number of I/O requests to accumulate; that lets the I/O
scheduler sort them, merge adjacent requests, and apply any sort of
fairness policy that might be in effect. Without plugging, I/O requests
would tend to be smaller and more scattered across the device, reducing
performance even on solid-state disks.
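The benefit of letting requests accumulate can be sketched with a small user-space model. This is not kernel code: struct toy_req and toy_unplug() are hypothetical names invented for illustration, and the "scheduler" here only sorts by starting sector and merges requests that are exactly adjacent.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy I/O request: a run of contiguous sectors (not a kernel structure). */
struct toy_req {
    unsigned long start;   /* first sector */
    unsigned long len;     /* number of sectors */
};

/* Order requests by starting sector, as a sorting I/O scheduler might. */
int toy_req_cmp(const void *a, const void *b)
{
    const struct toy_req *x = a, *y = b;

    return (x->start > y->start) - (x->start < y->start);
}

/*
 * "Unplug" a batch: sort the held requests, then merge each request that
 * begins exactly where the previous one ends. The merged requests are
 * compacted to the front of the array; the return value is how many
 * requests would actually be sent to the device.
 */
size_t toy_unplug(struct toy_req *reqs, size_t n)
{
    size_t out = 0;

    assert(reqs != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(reqs, n, sizeof(*reqs), toy_req_cmp);
    for (size_t i = 1; i < n; i++) {
        if (reqs[out].start + reqs[out].len == reqs[i].start)
            reqs[out].len += reqs[i].len;   /* adjacent: merge */
        else
            reqs[++out] = reqs[i];          /* gap: keep as a new request */
    }
    return out + 1;
}
```

In this model, four eight-sector requests starting at sectors 16, 0, 8, and 100 go to the device as two requests: one covering sectors 0-23 and one starting at sector 100. Had each been dispatched as it arrived, the device would have seen four small, out-of-order operations.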
Now imagine that we have a process about to start reading through a
long file, as indicated by your editor's unartistic rendering here:
Once the application starts reading from the beginning of the file, the kernel
will set about filling the first readahead window (which is 128KB with
larger files) and submit I/O for the second window, so the situation will
look something like this:
Once the application reads past 128KB into the file, the data it needs will
hopefully be in memory. The readahead machinery starts up again,
initiating I/O for the window starting at 256KB; that yields a situation
that looks something like this:
This process continues indefinitely, with the kernel working to stay
ahead of the application and to have the data in memory by the time the
application gets around to reading it.
The 2.6.39 kernel saw some significant changes
to how plugging is handled, with the result that the plugging and
unplugging of queues is now explicitly managed in the I/O submission
code. So, starting with 2.6.39, the readahead code will plug the request
queue before it submits a batch of read operations, then unplug the queue
at the end. The function that handles basic buffered file I/O
(generic_file_aio_read()) also now does its own plugging. And
that is where the problems begin.
Imagine a process that is doing large (1MB) reads. As the first large read
gets into generic_file_aio_read(), that function will plug the
request queue and start working through the file pages already in memory.
When it gets to the end of the first readahead window (at 128KB), the
readahead code will be invoked as described above. But there's a problem:
the queue is still plugged by generic_file_aio_read(), which is
still working on that 1MB read request, so the I/O
operations submitted by the readahead code are not passed on to the
hardware; they just sit in the queue.
So, when the application gets to the end of the second readahead window, we
see a situation like this:
At this point, the data for the next window has been requested but never
sent to the device, so everything comes to a stop. Blocking will cause the
queue to be unplugged, allowing the readahead I/O requests to be executed at
last, but it is too late. The application will have to wait. That wait is
enough to hammer performance, even on solid-state devices.
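The sequence of events described above can be restated as a deliberately crude model. Everything in it is an assumption made for illustration: the names are hypothetical, windows have a fixed size, the first two windows are taken to be in memory already, and I/O that actually reaches the device is assumed to complete before the reader needs it. The only thing the model counts is how often the reader arrives at a window whose I/O was requested but held back by the plug.

```c
#include <assert.h>
#include <stdbool.h>

enum win_state { WIN_NOT_REQUESTED, WIN_HELD_BY_PLUG, WIN_READY };

#define MAX_WINDOWS 64

/*
 * Model one large read that spans nr_windows readahead windows and count
 * the times the reader needs a window that is still sitting behind a plug.
 */
int toy_count_stalls(int nr_windows, bool top_level_plug)
{
    enum win_state win[MAX_WINDOWS + 1] = { WIN_NOT_REQUESTED };
    int stalls = 0;

    assert(nr_windows >= 2 && nr_windows <= MAX_WINDOWS);

    win[0] = win[1] = WIN_READY;    /* assumption: already read in */

    for (int w = 1; w < nr_windows; w++) {
        /* The reader now needs window w. */
        if (win[w] == WIN_HELD_BY_PLUG) {
            stalls++;               /* reader blocks; blocking flushes the plug */
            win[w] = WIN_READY;
        }
        /*
         * Entering window w triggers readahead for window w + 1. Under a
         * top-level plug the request is only queued; otherwise it goes to
         * the device and (by assumption) finishes in time.
         */
        win[w + 1] = top_level_plug ? WIN_HELD_BY_PLUG : WIN_READY;
    }
    return stalls;
}
```

For a 1MB read over 128KB windows (eight windows), the model reports six stalls with the top-level plug in place and none without it. Real readahead is adaptive and real I/O takes time, so the numbers mean little; the point is only that each readahead batch stays stuck until the reader itself blocks.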
The fix is to simply remove the top-level
plugging in generic_file_aio_read() so that readahead-originated
requests can get through to the hardware. Developers who have been able to
reproduce the slowdown report that this patch makes the problem go away, so
this issue can be considered solved. Look for this fix to appear in a
stable kernel release sometime soon.
By Jonathan Corbet
January 31, 2012
The addition of checkpoint/restore functionality to Linux has been an
ongoing topic of discussion and development for some years now. After the
poor reception given to the in-kernel C/R
implementation at the end of 2010, that particular project seems to have
faded into the background. Instead, most of the interest seems to be in
solutions that operate mostly in user space. Depending on the approach
taken, most or all of the support needed to implement this functionality in
user space already exists. But a complete solution is not yet there.
CRIU
Cyrill Gorcunov has been working to fill in some of the gaps with a preparatory patch set for user-space
checkpointing/restore with the "CRIU" tool set. There are a number of
small additions to the kernel ABI to be found here:
- A new children entry in a thread's /proc directory
provides a list of that thread's immediate children. This information
allows a user-space checkpoint utility to find those child processes
without needing to walk through the entire process tree.
- /proc/pid/stat is extended to provide the bounds of
the process's argument and environment arrays, along with the exit
code. That allows this information to be reliably captured at
checkpoint time.
- A number of new prctl() options allow the argument and
environment arrays to be restored in a way matching what was there at
checkpoint time. The desired end result is that ps shows the
same information about a process after a checkpoint/restore cycle as
it did before.
Perhaps the most significant new feature, though, is the addition of a new
system call:
long kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);
Checkpoint/restore is meant to work as well on a tree of processes as on a
single process. One challenge in the way of meeting that goal is that some
of those processes may share resources - files, say, or, perhaps, a whole
lot more. Replicating that sharing at restore time is relatively easy; the
clone() system call provides a nice set of flags controlling the
sharing of resources. The harder part is knowing, at checkpoint time,
whether that sharing is taking place.
One way for user space to determine whether, for example, two processes are
sharing the same open file would be to query the kernel for the address of
the associated struct file and see if they are the same in both
processes. That kind of functionality sets off alarms among those concerned about
security, though; learning where data structures live in kernel space is
often an important precondition to an attack. There was talk for a while
of "obfuscating" the pointers - through an exclusive-OR with a random
value, for example - but the risk was still seen as being too high. So the
compromise is kcmp(), which simply answers the question of whether
resources found in two processes are the same or not.
kcmp() takes two process ID parameters, indicating the processes
of interest; both processes must be in the same PID namespace as the
calling process. The type parameter tells the kernel the specific
item that is being compared:
- KCMP_FILE: determines whether a file descriptor idx1
in the first process is the same as another descriptor (idx2)
in the second process.
- KCMP_FILES: compares the file descriptor arrays to see
whether the processes share all files.
- KCMP_FS: compares fs_struct structures (which hold
the current umask, working directory, namespace root, etc.).
- KCMP_IO: compares the I/O context, used mainly for block I/O
scheduling.
- KCMP_SIGHAND: compares the two processes' signal handler
arrays.
- KCMP_SYSVSEM: compares the list of undo operations associated
with SYSV semaphores.
- KCMP_VM: compares each process's address space.
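A KCMP_FILE query from user space might look like the following minimal sketch. It assumes a kernel and headers in which the call is exposed as SYS_kcmp with the type constants in <linux/kcmp.h>, and it goes through syscall() directly rather than relying on a C library wrapper. The helper name same_open_file() is hypothetical. A result of zero from kcmp() is taken to mean that the two descriptors refer to the same open file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kcmp.h>     /* KCMP_FILE */
#include <sys/syscall.h>    /* SYS_kcmp */
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if fd1 and fd2 (both in the calling process) refer to the same
 * open file, 0 if they do not, and -1 if kcmp() failed, for example because
 * the kernel lacks the call or a sandbox forbids it (see errno).
 */
int same_open_file(int fd1, int fd2)
{
    pid_t self = getpid();
    long r;

    assert(fd1 >= 0 && fd2 >= 0);

    r = syscall(SYS_kcmp, self, self, KCMP_FILE,
                (unsigned long)fd1, (unsigned long)fd2);
    if (r < 0)
        return -1;
    return r == 0;
}
```

A descriptor and its dup() share one open file, so the helper should report 1 for that pair; the two ends of a pipe are distinct open files and should yield 0. A checkpoint tool would pass two different process IDs instead of comparing within a single process.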
The return value from kcmp() is zero if the two items are equal,
one if the first item is "less" than the second, or two if the first is
"greater" than the second. The ordered comparison may seem a little
strange, especially when one looks at the implementation and sees that the
pointers are obfuscated before comparison within the kernel. The result
is, thus, an ordering that (by design) does not match the ordering of the
relevant data structures in kernel space. It turns out that even a
reshuffled (but consistent) "ordering" is useful for optimizing comparisons
in user space when large numbers of open files are present.
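Why a scrambled but consistent ordering helps can be seen in a toy model. Nothing here is the kernel's or CRIU's code: toy_kcmp() is a hypothetical stand-in that follows the 0/1/2 return convention, and the XOR with a fixed cookie merely models "some consistent scrambling" of a hidden identity. With an ordering available, handles can be sorted, after which duplicates sit next to each other and are found with O(n log n) comparisons instead of comparing every pair.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel object identity hidden from user space. */
struct toy_handle {
    unsigned long hidden_id;
};

#define TOY_COOKIE 0x5bd1e995UL   /* arbitrary; models a random obfuscation value */

/* kcmp()-style result: 0 = same, 1 = "less", 2 = "greater". */
int toy_kcmp(const struct toy_handle *a, const struct toy_handle *b)
{
    unsigned long x = a->hidden_id ^ TOY_COOKIE;
    unsigned long y = b->hidden_id ^ TOY_COOKIE;

    if (x == y)
        return 0;
    return x < y ? 1 : 2;
}

/* Adapt the 0/1/2 convention to the negative/zero/positive one qsort() expects. */
int toy_qsort_cmp(const void *a, const void *b)
{
    int r = toy_kcmp(a, b);

    return r == 0 ? 0 : (r == 1 ? -1 : 1);
}

/* Count distinct objects among n handles by sorting, then scanning neighbors. */
size_t toy_count_distinct(struct toy_handle *h, size_t n)
{
    size_t distinct;

    assert(h != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(h, n, sizeof(*h), toy_qsort_cmp);
    distinct = 1;
    for (size_t i = 1; i < n; i++)
        if (toy_kcmp(&h[i - 1], &h[i]) != 0)
            distinct++;
    return distinct;
}
```

The sorted order says nothing about where the objects live, since it follows the scrambled values, but equal objects still end up adjacent, and that is all the user-space side needs.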
This patch set has been through a few cycles of review and seems to have
addressed most of the concerns raised by reviewers. It may just find its
way in through the next merge window. Meanwhile, people who want to see
how the user-space side works can find the relevant code at criu.org.
DMTCP
CRIU is not the only user-space checkpoint/restore implementation out
there; the DMTCP (Distributed
MultiThreaded CheckPointing) project has been busy since about 2.6.9.
DMTCP differs somewhat from CRIU, though; in particular, it is able to
checkpoint groups of processes connected by sockets - even across different
machines - and it requires no changes to the kernel at all. These features
come with a couple of limitations, though.
Checkpoint/restore with DMTCP requires that the target process(es) be
started with a special script; it is not possible to checkpoint arbitrary
processes on the system. That script uses the LD_PRELOAD mechanism to
place wrappers around a number of libc and (especially) system call
implementations. As a result, DMTCP has no need to ask the kernel whether
two processes are sharing a specific resource; it has been watching the
relevant system calls and knows how the processes were created. The
disadvantage to this approach - beyond having to run checkpointable processes
in a special environment - is that, as can be seen in the table of
supported applications, not all programs can be checkpointed.
The recent 1.2.4
release improves support, though, to the point that
everything a wide range of users care about should be checkpointable. The
system has been integrated with Open
MPI and is able to respond to MPI-generated checkpoint and restore
requests. DMTCP is available with the openSUSE, Debian Testing, and Ubuntu
distributions. DMTCP may offer something good enough today for many users,
who may not need to wait for one of the other projects to be ready sometime
in the future.
By Jonathan Corbet
February 1, 2012
Developers tend to fear compiler bugs, and for good reason: such bugs can
be hard to find and hard to work around. They can leave traps in a
compiled program that spring on users at bad times. Things can get even
worse if one person's compiler bug is seen by the compiler's developer as a
feature - such issues have a tendency to never get fixed. It is possible
that just this kind of feature has turned up in GCC, with unknown impact on
the kernel.
One of the many structures used by the btrfs filesystem, defined in
fs/btrfs/ctree.h, is:
    struct btrfs_block_rsv {
        u64 size;
        u64 reserved;
        struct btrfs_space_info *space_info;
        spinlock_t lock;
        unsigned int full:1;
    };
Jan Kara recently reported that, on the
ia64 architecture, the lock field was occasionally becoming
corrupted. Some investigation revealed that GCC does a surprising
thing when the bitfield full is changed: it generates a 64-bit
read-modify-write cycle that reads both lock and full,
modifies full, then writes both fields back to memory. If
lock is modified by another processor during this operation,
that modification will be lost when lock is written back. The
chances of good things resulting from this sequence of events are quite small.
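The lost update can be replayed deterministically in a single thread by writing out the interleaving by hand. This is a toy model, not GCC's generated code: one 64-bit word stands in for the memory holding both fields, with the low 32 bits playing the role of lock and one bit of the upper half playing full. The layout and the function name are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LOCK_MASK 0xffffffffULL   /* low 32 bits: stand-in for "lock" */
#define FULL_BIT  (1ULL << 32)    /* one bit in the other half: "full" */

/*
 * Replay the problematic interleaving and return the final value of the
 * modeled lock field.
 */
uint32_t toy_lost_lock_update(void)
{
    uint64_t word = 0;                  /* lock == 0 (free), full == 0 */
    uint64_t snapshot;

    snapshot = word;                    /* CPU A: 64-bit load of both fields */
    word = (word & ~LOCK_MASK) | 1;     /* CPU B: 32-bit store, lock = 1 */
    assert((word & LOCK_MASK) == 1);    /* the lock is visibly taken here */
    word = snapshot | FULL_BIT;         /* CPU A: 64-bit store with stale lock */

    return (uint32_t)(word & LOCK_MASK);
}
```

The function returns 0: CPU B's acquisition of the lock has vanished, even though neither CPU ever meant to touch the other's field. Had CPU A's store been limited to the 32 bits containing full, the lock value would have survived.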
One can imagine that quite a bit of work was required to track down this
particular surprise. It is also not hard to imagine the dismay that
results from a conversation like this:
I've raised the issue with our GCC guys and they said to me that:
"C does not provide such guarantee, nor can you reliably lock
different structure fields with different locks if they share
naturally aligned word-size memory regions. The C++11 memory model
would guarantee this, but that's not implemented nor do you build
the kernel with a C++11 compiler."
Unsurprisingly, Linus was less than
impressed by this response. Language standards are not written for the
unique needs of kernels, he said, and can never "guarantee" the behavior
that a kernel needs:
So C/gcc has never "promised" anything in that sense, and we've
always had to make assumptions about what is reasonable code
generation. Most of the time, our assumptions are correct, simply
because it would be *stupid* for a C compiler to do anything but
what we assume it does.
But sometimes compilers do stupid things. Using 8-byte accesses to
a 4-byte entity is *stupid*, when it's not even faster, and when
the base type has been specified to be 4 bytes!
As it happens, the problem is a bit worse than non-specified behavior.
Linus suggested running a test with a
structure like:
    struct example {
        volatile int a;
        int b:1;
    };
In this case, if an assignment to b causes a write to a,
the behavior is clearly buggy: the volatile keyword makes it
explicit that a may be accessed from elsewhere. Jiri Kosina gave it a try and reported that GCC is still
generating 64-bit operations in this case. So, while the original problem
is technically compliant behavior, it almost certainly results from the same
decision-making that makes the second example go wrong.
Knowing that may give the kernel community more ammunition to flame the GCC
developers with, but it is not necessarily all that helpful. Regardless of
the source of the problem, this behavior exists in versions of the compiler
that, almost certainly, are being used outside of the development community
to build the kernel. So some sort
of workaround is likely to be necessary even if GCC's behavior is fixed.
That could be a bit of a challenge; auditing the entire kernel for 32-bit-wide
bitfield variables in structures that may be accessed concurrently will not be a
small job. But, then, nobody said that kernel development was easy.
Page editor: Jonathan Corbet