Brief items
The current development kernel is 3.4-rc5,
released on April 29. "And like
-rc4, quite a bit of the changes came in on Friday (with some more coming
in yesterday). And we haven't been calming down, quite the reverse. -rc5
has almost 50% more commits than -rc4 had. Not good." That said,
what's going in is mostly fixes; see the announcement for the short-form
changelog.
Stable updates: the 3.0.30 and 3.3.4 updates were released on April 27
with the usual set of important fixes.
I am not making up the fact that I had a nightmare last night in
which r12 (ARM IP register) was trying to kill me. I might have
spent too long staring at kernel disassembly over the weekend.
-- Jon Masters
Indeed, my goal was "less bonkers" rather than "not bonkers". A
"not bonkers" description remains a long-term aspiration rather
than a short-term goal for the moment.
-- Paul "moderately bonkers" McKenney
What is it with all these Linuses these days? There's a Linus at
google too. Some day I will get myself my own broadsword, and run
around screaming "There can be only one".
I used to be _special_ dammit. Snif.
-- Linus Torvalds (Thanks to Nicolas Pitre)
For those who would like more information on how to use the Linux perf
subsystem, there is an extensive tutorial posted by Google, written by
Stephane Eranian. It probably
merits a bookmark for anybody wanting to learn how to do interesting things
with perf.
Kernel development news
By Jonathan Corbet
May 2, 2012
Memory management is a notoriously tricky task, though the underlying
objective is quite clear: look into the future and ensure that the pages
that will be needed by applications are in memory. Unfortunately, existing
crystal ball peripherals tend not to work very well; they also usually
require proprietary drivers. So the kernel is stuck with a set of
heuristics that try to guess future needs based on recent behavior.
Adjusting those heuristics is always a bit of a challenge; it is easy to
put in changes that will break obscure workloads years in the future. But
that doesn't stop developers from trying.
A core part of the kernel's memory management subsystem is a pair of lists
called the "active" and "inactive" lists. The active list contains
anonymous and file-backed pages that are thought (by the kernel) to be in
active use by
some process on the system. The inactive list, instead, contains pages
that the kernel thinks might not be in use. When active pages are
considered for eviction,
they are first moved to the inactive list and unmapped from the address
space of the process(es) using them. Thus, once a page moves to the inactive
list, any attempt to reference it will generate a page fault; this "soft
fault" will cause the page to be moved back to the active list. Pages that
sit in the inactive list for long enough are eventually removed from the
list and evicted from memory entirely.
One could think of the inactive list as a sort of probational status for
pages that the kernel isn't sure are worth keeping. Pages can get there from
the active list as described above, but there's another way to inactive
status as well: file-backed pages, when they are faulted in, are placed in the
inactive list. It is quite common that a process will only access a
file's contents once; requiring a second access before moving file-backed
pages to the active list lets the kernel get rid of single-use data
relatively quickly.
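The life cycle described above can be sketched as a small state machine. This is only an illustration for a single file-backed page: the enum and function names are invented here, and the kernel's real bookkeeping (page flags, per-zone LRU lists) is considerably more involved.

```c
#include <assert.h>

/* Hypothetical, simplified page states; not the kernel's data structures. */
enum page_state { PAGE_EVICTED, PAGE_INACTIVE, PAGE_ACTIVE };

/* An access moves a page one step toward the active list. */
enum page_state on_access(enum page_state s)
{
    switch (s) {
    case PAGE_EVICTED:  return PAGE_INACTIVE; /* faulted in: starts out inactive */
    case PAGE_INACTIVE: return PAGE_ACTIVE;   /* soft fault on the second access */
    case PAGE_ACTIVE:   return PAGE_ACTIVE;   /* already active, nothing to do */
    }
    assert(0 && "unknown page state");
    return s;
}

/* Reclaim pressure moves a page one step toward eviction. */
enum page_state on_reclaim(enum page_state s)
{
    switch (s) {
    case PAGE_ACTIVE:   return PAGE_INACTIVE; /* deactivated and unmapped */
    case PAGE_INACTIVE: return PAGE_EVICTED;  /* sat unreferenced for too long */
    case PAGE_EVICTED:  return PAGE_EVICTED;  /* nothing left to reclaim */
    }
    assert(0 && "unknown page state");
    return s;
}
```

A page that is read once and never touched again goes from evicted to inactive and back to evicted without ever reaching the active list, which is the single-use behavior the text describes.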
Splitting memory into two pools in this manner leads to an immediate policy
decision: how big should each list be? A very large inactive list gives
pages a long time to be referenced before being evicted; that can reduce
the number of pages kicked out of memory only to be read back in shortly
thereafter. But a large inactive list comes at the cost of a smaller active
list; that can slow down the system as a whole by causing lots of soft page
faults for data that's already in memory. So, as is the case with many
memory management decisions, regulating the relative sizes of the two lists
is a balancing act.
The way that balancing is done in current kernels is relatively
straightforward: the
active list is not allowed to grow larger than the inactive list. Johannes
Weiner has concluded that this heuristic is too simple and insufficiently
adaptive, so he has come up with a proposal for
a replacement. In short, Johannes wants to make the system more
flexible by tracking how long evicted pages stay out of memory before
being faulted back in.
Doing so requires some significant changes to the kernel's page-tracking
infrastructure. Currently, when a page is removed from the inactive list
and evicted from memory, the kernel
simply forgets about it; that clearly will not do if the kernel is to try
to track how long the page remains out of memory. The
page cache is tracked via a radix tree; the
kernel's radix tree implementation already has a concept of "exceptional
entries" that is used to track tmpfs pages while they are swapped out.
Johannes's patch extends this mechanism to store "shadow" entries for evicted
pages, providing the needed long-term record-keeping for those pages.
What goes into those shadow entries is a representation of the time the page was
evicted. That time can be thought of as a counter of removals from the
inactive list; it is represented as an atomic_t variable called
workingset_time. Every time a page is removed from the inactive
list, either to evict it or to activate it, workingset_time is
incremented by one. When a page is evicted, the current value of
workingset_time is stored in its associated shadow entry. This
time, thus, can be thought of as a sort of sequence counter for memory
management events.
If and when that page is faulted back in, the difference between the
current workingset_time and the value in the shadow entry gives a
count of how many pages were removed from the inactive list while that page
was out of memory. In the language of Johannes's patch, this difference is
called the "refault distance." The observation at the core of this patch
set is that, if a page returns to memory with a refault distance of
R, its eviction and refaulting would have been avoided had the
inactive list been R pages longer. R is thus a sort of
metric describing how much longer the inactive list should be made to avoid
a particular page fault.
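The counter and the refault distance can be modeled in a few lines of C. This is a sketch under stated assumptions rather than code from the patch: a plain uint32_t stands in for the atomic_t, the shadow entry is reduced to the stored counter value, the function names are invented, and the counter is assumed to be incremented before its value is recorded.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the patch's atomic_t workingset_time; a plain counter is
 * enough for a single-threaded illustration. */
static uint32_t workingset_time;

/* Hypothetical shadow entry: only the counter value seen at eviction. */
struct shadow_entry { uint32_t eviction_time; };

/* A page leaves the inactive list because it was activated. */
void page_activated(void)
{
    workingset_time++;
}

/* A page leaves the inactive list because it was evicted; the current
 * counter value is kept in its shadow entry (ordering is an assumption). */
struct shadow_entry page_evicted(void)
{
    struct shadow_entry s;

    workingset_time++;
    s.eviction_time = workingset_time;
    return s;
}

/* Refault distance R: removals from the inactive list since the eviction.
 * Unsigned subtraction keeps the result correct if the counter wraps. */
uint32_t refault_distance(struct shadow_entry s)
{
    return workingset_time - s.eviction_time;
}
```

If a page is evicted and three other pages then leave the inactive list before it is faulted back in, refault_distance() yields 3: the inactive list would have needed three more slots to keep that page in memory.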
Given that number, one has to decide how it should be used. The algorithm
used in Johannes's patch is simple: if R is less than the length of
the active list, one page will be moved from the active to the inactive
list. That shortens the active list by one entry and places the
formerly-active page on the inactive list immediately next to the page that
was just refaulted in (which, as described above, goes onto the inactive
list until a second access occurs). If the formerly-active page is still
needed, it
will be reactivated in short order. If, instead, the working set is
shifting toward a new set of pages, the refaulted page may be activated
instead, taking the other page's place. Either way, it is hoped, the
kernel will do a better job of keeping the right pages active. Meanwhile,
the inactive list gets slightly longer in the hope of avoiding refaults in
the near future.
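The balancing rule itself is small enough to model directly. The sketch below tracks only list lengths; struct lru_sizes and handle_refault() are hypothetical names chosen for illustration, and the real patch operates on actual pages rather than counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list lengths, counted in pages. */
struct lru_sizes {
    size_t active;
    size_t inactive;
};

/* Apply the rule to one refault with distance R.
 * Returns 1 if an active page was moved to the inactive list, 0 otherwise. */
int handle_refault(struct lru_sizes *lru, size_t refault_distance)
{
    assert(lru != NULL);

    /* The refaulted page itself goes onto the inactive list. */
    lru->inactive++;

    if (refault_distance < lru->active) {
        /* Deactivate one page: the active list shrinks by one entry and
         * the inactive list grows by one more. */
        lru->active--;
        lru->inactive++;
        return 1;
    }
    return 0;
}
```

With 100 active pages, a refault at distance 40 deactivates one page; a refault at distance 500 leaves the active list alone, since R is not less than the length of the active list.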
How well all of this works is not yet clear: Johannes has not posted any
benchmark results for any sort of workload. This is early-stage work at
this point, a long way from acceptance into a mainline kernel release. So
it could evolve significantly or fade away entirely. But more
sophisticated balancing between the active and inactive lists seems like an
idea whose time may be coming.
By Jonathan Corbet
May 1, 2012
Migrating a running container from one physical host to another is a tricky
job on a number of levels. Things get even harder if, as is likely, the
container has active network connections to processes outside of that
container. It is natural to want those connections to follow the container
to its new host, preferably without the remote end even noticing that
something has changed, but the Linux networking stack was not written with
this kind of move in mind. Even so, it appears that transparent relocation
of network connections, in the form of Pavel Emelyanov's
TCP connection repair patches, will be
supported in the 3.5 kernel.
The first step in moving a TCP connection is to gather all of the
information possible about its current state. Much of that information is
available from user space now; by digging around in /proc and
/sys, one can determine the address and port of the remote end,
the sizes of the send and receive queues, TCP sequence numbers, and a
number of parameters
negotiated between the two end points. There are still a few things that
user space will need to obtain, though, before it can finish the job; that
requires some additional support from the kernel.
With Pavel's patch, that support is available to suitably privileged
processes.
To dig into the internals of an active network connection, user space must
put the associated socket into a new "repair mode." That is done with the
setsockopt() system call, using the new TCP_REPAIR
option. Changing a socket's repair mode status requires the
CAP_NET_ADMIN capability; the socket must also either be closed or
in the "established" state. Once the socket is in repair mode, it can be
manipulated in a number of ways.
One of those is to read the contents of the send and receive queues. The
send queue contains data that has not yet been successfully transmitted to
the remote end; that data needs to move with the connection so it can be
transmitted from the new location. The receive queue, instead, contains
data received from the remote end that has not yet been consumed by the
application being moved; that data, too, should move so it will be waiting
on the new host when the application gets around to reading it. Obtaining
the contents of these queues is done with a two-step sequence:
(1) call setsockopt(TCP_REPAIR_QUEUE) with either
TCP_RECV_QUEUE or TCP_SEND_QUEUE, then (2) call
recvmsg() to read the contents of the selected queue.
It turns out there is only one other important piece of information that
cannot already be obtained from user space: the maximum value of the MSS
(maximum segment size) negotiated between the two endpoints at connection
setup time. To make this value available, Pavel's patch changes the
semantics of the TCP_MAXSEG socket option (for
getsockopt()) when the connection is
in repair mode: it returns the maximal "clamp" MSS value rather than the
currently active value.
Finally, if a connection is closed while it is in the repair mode, it is
simply deleted with no notification to the remote end. No FIN or RST
packets will be sent, so the remote side will have no idea that things have
changed.
Then there is the matter of establishing the connection on the new host.
That is done by creating a new socket and putting it immediately into the
repair mode. The socket can then be bound to the proper port number; a
number of the usual checks for port numbers are suspended when the socket
is in repair mode. The TCP_REPAIR_QUEUE setsockopt()
call comes into play again, but this time sendmsg() is used to
restore the contents of the send and receive queues.
Another important task is to restore the send and receive sequence numbers.
These numbers are normally generated randomly when the connection is
established, but that cannot be done when a connection is being moved.
These numbers can be set with yet another call to setsockopt(),
this time with the TCP_QUEUE_SEQ option. This operation applies
to whichever queue was previously selected with TCP_REPAIR_QUEUE,
so the refilling of a queue's content and the setting of its sequence
number are best done at the same time.
A few negotiated parameters also need to be restored so that the two ends
will remain in agreement with each other; these include the MSS clamp
described above, along with the active maximum segment size, the window
size, and whether the selective acknowledgment and timestamp features can
be used. One last setsockopt() option,
TCP_REPAIR_OPTIONS, has been added to make it possible to set
these parameters from user space.
Once the socket has been restored to a state approximating that which
existed on the old host, it's time to put it into operation. When
connect() is called on a socket in repair mode, much of the
current setup and negotiation code is shorted out; instead, the connection
goes directly to the "established" state without any communication from the
remote end. As a final step, when the socket is taken out of the repair
mode, a window probe is sent to restart traffic
between the two ends; at that point, the socket can resume normal operation
on the new host.
These patches have been through a few revisions over a number of months;
with version 4, networking maintainer David Miller accepted them into net-next. From there,
those changes will almost certainly hit the mainline during the 3.5 merge
window. The TCP connection repair patches do not represent a complete
solution to the problem of checkpointing and restoring containers, but they
are an important step in that direction.
By Jonathan Corbet
April 30, 2012
One of the few hard rules of kernel development is that breaking the
user-space binary interface is not acceptable. If there is user-space code
that depends on specific behavior, that behavior must be maintained
regardless of how inconvenient that may be. But what is to be done if two
different programs depend on mutually-incompatible behaviors, so that it is
seemingly impossible to keep them both working? The answer may be to
violate another rule by putting an ugly hack into the kernel—or to do
something rather more tricky.
The "autofs" protocol is used to communicate between the kernel and an
automounter daemon. It allows the automounter to set up special virtual
filesystems that, when referenced by user space, can be replaced by a
remote-mounted real filesystem. Much of this protocol is implemented with
ioctl() calls on a special autofs device, but it also makes use of
pipes between the kernel and user space when specific filesystems are
mounted.
This protocol is certainly part of the kernel ABI, so its components have
been defined with some care. One of the key elements of the autofs
protocol is the autofs_v5_packet structure, which is sent from the
kernel to user space via a pipe; it is used, among other things, to report
that a filesystem has been idle for some time and should be unmounted.
This structure looks like:
    struct autofs_v5_packet {
        struct autofs_packet_hdr hdr;
        autofs_wqt_t wait_queue_token;
        __u32 dev;
        __u64 ino;
        __u32 uid;
        __u32 gid;
        __u32 pid;
        __u32 tgid;
        __u32 len;
        char name[NAME_MAX+1];
    };
The size of every field is precisely defined, so this structure should look
the same on both 32- and 64-bit systems. And it does, except for one tiny
little problem. The size of the structure as defined is 300 bytes, which
is not divisible by eight. So if two of these structures were to be placed
contiguously in memory, the 64-bit ino field would have to be
misaligned in one of them. To avoid this problem, the compiler will, on
64-bit systems, round the size of the structure up to a multiple of eight,
adding four bytes of padding at the end.
So sizeof() on struct autofs_v5_packet will return 300 on
a 32-bit system, and 304 on a 64-bit system.
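The two sizes can be reproduced on any host by forcing the alignment of the 64-bit field. What follows is a user-space sketch, not the kernel's definition: the fixed-width types are stand-ins for the kernel's, the header is assumed to hold two 32-bit integers and the wait-queue token to be 32 bits wide, and the aligned typedefs rely on a GCC/Clang attribute.

```c
#include <assert.h>
#include <limits.h>   /* NAME_MAX */
#include <stddef.h>
#include <stdint.h>

#ifndef NAME_MAX
#define NAME_MAX 255
#endif

/* Assumed layout of the packet header: two 32-bit integers. */
struct pkt_hdr {
    int32_t proto_version;
    int32_t type;
};

/* 64-bit integers as aligned by a 32-bit x86 ABI (4) and a 64-bit ABI (8). */
typedef uint64_t u64_align4 __attribute__((aligned(4)));
typedef uint64_t u64_align8 __attribute__((aligned(8)));

/* Same fields as the structure in the text; only the alignment of
 * ino differs between the two instantiations. */
#define DEFINE_PACKET(tag, u64_type)        \
    struct tag {                            \
        struct pkt_hdr hdr;                 \
        uint32_t wait_queue_token;          \
        uint32_t dev;                       \
        u64_type ino;                       \
        uint32_t uid, gid, pid, tgid;       \
        uint32_t len;                       \
        char name[NAME_MAX + 1];            \
    }

DEFINE_PACKET(packet_abi32, u64_align4);
DEFINE_PACKET(packet_abi64, u64_align8);

/* Bytes of padding the 64-bit layout adds at the end of the structure. */
size_t tail_padding(void)
{
    return sizeof(struct packet_abi64) - sizeof(struct packet_abi32);
}
```

Every field sits at the same offset in both variants; the only difference is the four bytes of padding at the end, which is why the two layouts differ in sizeof() alone.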
That disparity is not a problem most of the time, but there is an
exception. Automounting is one of the many tasks being assimilated by the
systemd daemon. When systemd reads one of the above structures from the
kernel, it checks the size of what it read against its idea of the size of
the structure to ensure that everything is
operating as it should be. That check works just fine, as long as systemd
and the kernel agree on that size. And normally they do,
but there is an exception: if systemd is running as a 32-bit process on a
64-bit kernel, it will get a 304-byte structure when it is expecting 300
bytes. At that point, systemd concludes that something has gone wrong and
gives up.
In February, Ian Kent merged a
patch to deal with this problem. One could be forgiven for calling the
solution hacky: on 64-bit systems, the kernel's automount code will
subtract four from the size of that structure if (and only if) it is
talking with a user-space client running in 32-bit mode. This patch makes
systemd work in this situation; it was merged for 3.3-rc5 and fast-tracked
into the various stable kernel releases. Everybody then lived happily ever
after.
...except they didn't. It seems that the automount program from
the autofs-tools package, which is still in use on a great many systems,
had run into this problem a number of years ago. At that time, the
autofs-tools developers decided to work around the problem in user space.
So, if automount determines that it is running in 32-bit mode on a 64-bit
kernel (Linus has little respect for how
that determination is done, incidentally), it will correct its idea of what
the structure size should be. If the kernel messes with that size, the
automount "fix" no longer works, so Ian's patch fixes systemd at the cost of
breaking automount.
So we are now in a situation where two deployed programs have different
ideas of how the autofs protocol should work. On pure 32- or 64-bit
systems, both programs work just fine, but, depending on which kernel is
being run, one or the other of the two will break in the 32-on-64
configuration. If Ian's patch remains, some users will be most unhappy,
but reverting it will upset other users. It is, in other words, a somewhat
unfortunate situation.
Unfortunate, but not necessarily unrecoverable. One possible way to fix
things can be seen in this patch from
Michael Tokarev. In short, this patch looks at the name of the current
command (current->comm) and compares it against
"automount". If the currently-running program is called
"automount," the structure-size tweak is not applied and things work
again. For any other program (including systemd), the previous fix
remains. So things are fixed at the expense of having the kernel ABI
depend on the name of the running program. At best, this solution can be
described as "inelegant." At worst, there may be some other, unknown
program with a different name that breaks in the same way automount does;
any such program will remain broken with this fix in place.
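The combined effect of Ian's fix and Michael's exception can be modeled as a single decision function. This is a user-space illustration with invented names, not the kernel code: comm stands in for current->comm, and is_compat for "a 32-bit task on a 64-bit kernel."

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Packet sizes discussed in the text. */
enum { PACKET_SIZE_NATIVE = 304, PACKET_SIZE_COMPAT = 300 };

/* How many bytes a 64-bit kernel would write to the daemon's pipe.
 * comm:      name of the calling program (current->comm in the kernel)
 * is_compat: nonzero when the caller is a 32-bit task */
size_t autofs_packet_size(const char *comm, int is_compat)
{
    assert(comm != NULL);

    if (!is_compat)
        return PACKET_SIZE_NATIVE;   /* 64-bit caller: no adjustment */

    /* automount already compensates in user space, so leave it alone. */
    if (strcmp(comm, "automount") == 0)
        return PACKET_SIZE_NATIVE;

    /* Everything else, systemd included, gets the trimmed size. */
    return PACKET_SIZE_COMPAT;
}
```

The weakness is visible in the signature: the result depends on a string that any program can share or lack, so a 32-bit daemon named anything other than "automount" that works around the padding on its own would still receive 300 bytes.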
Still, Linus has conceded that "it's
probably what we have to go with." But he preferred to look for
a less kludgy and more robust solution. One possibility was for the kernel
to look at the
size of the read() operation that would obtain the
autofs_v5_packet
structure prior to writing that structure; if that size is either 300 or
304, the kernel could give the
calling program the size it is expecting. The problem here is that
the read() operation is hidden behind the pipe, so the autofs
code does not actually have access to the size of the buffer provided by
user space.
So Linus came up with a different solution, the concept of "packetized pipes". A packetized pipe
resembles the normal kind with a couple of exceptions: each
write() is kept in a separate buffer, and a read()
consumes an entire buffer, even if the size of the read is smaller than the
amount of data in the buffer. With a packetized pipe, the kernel can
always just write the larger (304-byte) structure size; if user space is
only trying to read 300 bytes, then it will get what it expects and be
happy. So there is no need for special hacks in the kernel, just a
slightly different type of pipe dynamics. Following a suggestion from Alan
Cox, Linus made an open with O_DIRECT turn on the packetized
behavior, so user space can create such pipes if need be.
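From user space, the behavior can be observed with pipe2() and O_DIRECT on a kernel that has this feature. The function below is a small demonstration under that assumption; its name and the choice of a one-byte second packet are arbitrary.

```c
#define _GNU_SOURCE   /* for pipe2() and O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one packet of `wrote` bytes followed by a one-byte packet, then
 * read twice with a buffer of `want` bytes. The two read() results are
 * stored in *first and *second.
 * Returns 0 on success, -1 if the pipe could not be set up or filled. */
int packet_pipe_demo(size_t wrote, size_t want, ssize_t *first, ssize_t *second)
{
    char out[512] = { 0 };
    char in[512];
    int fds[2];

    assert(wrote <= sizeof(out) && want <= sizeof(in));

    if (pipe2(fds, O_DIRECT) < 0)
        return -1;   /* e.g. a kernel without packetized pipes */

    if (write(fds[1], out, wrote) != (ssize_t)wrote ||
        write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    *first = read(fds[0], in, want);    /* consumes the whole first packet */
    *second = read(fds[0], in, want);   /* starts at the second packet */

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```

Calling packet_pipe_demo(304, 300, &a, &b) on a packetized pipe should leave 300 in a and 1 in b: the four unread bytes of the first packet are discarded rather than carried over. On an ordinary pipe the second read would instead return those four bytes together with the byte that followed them.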
After a couple of false starts, Linus got this patch working and merged it
just prior to the 3.4-rc5 release. So the 3.4 kernel should work fine for
either automount or systemd.
The kernel community got a bit lucky here; it was possible for a suitably
clever and motivated developer to figure out a way to give both programs
what they expect and make the system work for everybody. The next time
this kind of problem arises, the solution may not be so simple.
Maintaining ABI stability is not always easy or fun, but it is necessary to
keep the system viable in the long term.
Page editor: Jonathan Corbet