Kernel development

Brief items

Kernel release status

The current development kernel is 3.17-rc7, released on September 28 instead of the final 3.17 release that Linus had wanted to do. "It's not that anything particularly scary happened, but quite frankly, things just didn't calm down as I hoped for. And while my travel schedule would have made it really nice had I been able to just do a shorter-than-usual release, 'convenience' isn't really part of the release criteria. Oh well."

Stable updates: there have been no stable updates in the last week, and none are in the review process as of this writing.

Quotes of the week

Basically, [checkpatch is] OK as a tool that draws ones attention to a subset of code that might or might not be worth looking at; it gives both false positives and false negatives and as long as there's not too much of those, it's useful. Provided that it does *not* replace one's taste. I.e. "hey, it points to these lines; let's take a look, some of those might be worth some attention", rather than "The Most Holy Oracle Has Spoken; Do What It Says".
Al Viro

There's literally 1 million checkpatch 'failures' in the kernel source code, do we really want to clean them all up, waste reviewer and maintainer bandwidth and pretend that those 1 million extra cleanup commits are just as valuable as the 'other' work that goes on in the kernel?
Ingo Molnar

Some time wasted on that, but I learnt a valuable debugging technique:

    #undef EINVAL
    #define EINVAL __LINE__
Hugh Dickins

Kernel development news

Foo over UDP

By Jonathan Corbet
October 1, 2014
Tunneling protocols are increasingly important in modern networking setups. By tying distant networks together, they enable the creation of virtual private networks, access to otherwise-firewalled ports, and more. Tunneling can happen at multiple levels in the networking stack; SSH tunnels are implemented over TCP, while protocols like GRE and IPIP work directly at the IP level. Increasingly, though, there is interest in implementing tunneling inside the UDP protocol. The "foo over UDP" (FOU) patch set from Tom Herbert, which has been pulled into the net-next tree for 3.18, implements UDP-level tunneling in a generic manner.

Why UDP? Just about any network interface out there has hardware support for UDP at this point, handling details like checksumming. UDP adds just enough information (port numbers, in particular) to make the routing of encapsulated packets easy. UDP can also be made to work with mechanisms like receive-side scaling (RSS) and equal-cost multipath routing (ECMP) to improve performance in highly connected settings. The advantages of UDP tunneling are enough that some developers think it's going to become nearly ubiquitous in the coming years.

Packet encapsulation and tunneling over UDP is a relatively straightforward concept to understand. Suppose a simple TCP packet is presented to the tunneling interface:

[TCP packet]

This packet has the usual IP and TCP headers, followed by the data the user wishes to send. The encapsulation process does something like this:

[Encapsulated TCP packet]

At this point, the packet looks like a UDP packet that happens to have a TCP packet buried within it. The system can now transmit it to the destination as an ordinary UDP packet; at the receiving end, the extra headers will be stripped off and the original packet will be fed into the network stack.
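The encapsulation step can be sketched in user space. What follows is a minimal illustration, not the kernel's FOU code: it prepends an eight-byte UDP header to an already-formed inner packet and strips it again on the other side. The helper names (fou_encap(), fou_decap()) are made up for this example, the outer IP header is assumed to be added elsewhere, and the UDP checksum is simply left at zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A UDP header is four 16-bit big-endian fields: sport, dport, length, checksum. */
#define UDP_HDR_LEN 8

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/*
 * Hypothetical helper: wrap an inner packet (its own IP header included)
 * in a UDP header. Returns the total length written to out, or 0 if the
 * result would not fit. The checksum field is left at zero.
 */
static size_t fou_encap(const uint8_t *inner, size_t inner_len,
                        uint16_t sport, uint16_t dport,
                        uint8_t *out, size_t out_cap)
{
    size_t total = UDP_HDR_LEN + inner_len;

    assert(inner != NULL && out != NULL);
    if (total > out_cap || total > UINT16_MAX)
        return 0;

    put_be16(out + 0, sport);
    put_be16(out + 2, dport);
    put_be16(out + 4, (uint16_t)total);  /* UDP length covers header + payload */
    put_be16(out + 6, 0);
    memcpy(out + UDP_HDR_LEN, inner, inner_len);
    return total;
}

/*
 * Hypothetical helper: strip the UDP header again. Returns a pointer to
 * the inner packet and stores its length, or NULL if the header is bogus.
 */
static const uint8_t *fou_decap(const uint8_t *pkt, size_t len, size_t *inner_len)
{
    size_t udp_len;

    if (len < UDP_HDR_LEN)
        return NULL;
    udp_len = ((size_t)pkt[4] << 8) | pkt[5];
    if (udp_len < UDP_HDR_LEN || udp_len > len)
        return NULL;
    *inner_len = udp_len - UDP_HDR_LEN;
    return pkt + UDP_HDR_LEN;
}
```

The receiving side needs no knowledge of the inner protocol to remove the header; it only needs to know, from the port configuration, what to hand the result to.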

Configuring a FOU tunnel will typically be a two-step process. The transmit and receive sides have been separated, a feature which, among other things, allows asymmetric setups should anybody want them. On the receive side, configuration is really just a matter of setting up a UDP port to be the recipient of encapsulated packets. The new "fou" subcommand is intended for this purpose:

    ip fou add port 5555 ipproto 4

This command sets aside port 5555, saying that packets arriving there will carry IP protocol 4, which is IPv4-in-IP encapsulation (IPIP). Packets received on that port will have the encapsulation removed; they will then be fed back into the network stack for delivery to the real destination.

Things are a little more complicated on the transmit side, since the destination address must be provided and transmission needs to work with existing encapsulation protocols. A typical command might look like:

    ip link add name tun1 type ipip \
       remote 192.168.1.1 local 192.168.1.2 ttl 225 \
       encap fou encap-sport auto encap-dport 5555

This command will set up a new virtual interface (tun1) configured for IPIP encapsulation. The source port for packets is left to the network stack to decide, but the destination port will be 5555. Of course, one has to get the encapsulation protocol to actually use this interface. At the moment, support for doing that has been added to the IPIP, SIT (which tunnels IPv6 traffic over IPv4 networks) and GRE (used for virtual private networks) protocols.

Some numbers posted in the patch set show significant performance increases for the SIT and IPIP protocols; performance with GRE was roughly equivalent to the no-FOU case. So this feature has a clear potential to speed things up by taking advantage of existing optimizations around UDP transmission and receipt. The nice thing is that no special hardware support is required; current hardware can handle UDP just fine. So it is a simple solution that will work on existing systems — and it should be available in the 3.18 kernel release.

How implementation details become ABI: a case study

By Jonathan Corbet
October 1, 2014
One of the final changes that went into the mainline kernel repository before the 3.17-rc7 release was this fix from Mikhail Efremov. It affects some low-level code within the virtual filesystem layer that manages name changes in the dentry structure — the structure that handles the mapping between file names and in-kernel inode structures. How that change came to be necessary makes a good lesson in how unintended behaviors can become part of the kernel's ABI over time.

The problem

The addition of the renameat2() system call in the 3.15 development cycle brought with it a subtle, unintended change in behavior that is best illustrated by example. On a system running (a hopefully updated version of) Bash, one can type a sequence like the following:

    $ cd /tmp
    $ touch foo bar
    $ exec 42<bar
    $ ls -l /proc/self/fd/42
    lr-x------ 1 corbet lwn 64 Sep 29 13:01 /proc/self/fd/42 -> /tmp/bar

The exec command causes the shell to open bar as file descriptor 42. The output of the ls shows that, indeed, bar is open on that file descriptor. What should happen, though, after the following sequence of commands?

    $ mv foo bar
    $ ls -l /proc/self/fd/42

On a kernel prior to 3.15, the output will look like this:

    lr-x------ 1 corbet lwn 64 Sep 29 13:01 /proc/self/fd/42 -> /tmp/bar (deleted)

On later kernels, instead, the result is:

    lr-x------ 1 corbet lwn 64 Sep 29 15:00 /proc/self/fd/42 -> /tmp/foo (deleted)

When a file with open descriptors is deleted, the actual file remains in existence until all of those file descriptors are closed. On kernels prior to 3.15, the name associated with that deleted file will be the name it had when it was deleted. In newer kernels, instead, a file may, if it is deleted via a rename operation, end up appearing to have the name of the file that was renamed on top of it.
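The same experiment can be run from a C program rather than from the shell. The sketch below is one way to do it, assuming a Linux system with /proc mounted; the function name name_after_rename() is invented for this example, and it will create (and clean up) files named foo and bar in whatever directory it is given.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create "foo" and "bar" in dir, hold bar open, rename foo on top of bar,
 * then ask /proc what the kernel calls the still-open (deleted) file.
 * Returns the length of the link text stored in buf, or -1 on error.
 */
static ssize_t name_after_rename(const char *dir, char *buf, size_t cap)
{
    char foo[4096], bar[4096], link[64];
    int fd, tmp;
    ssize_t n;

    assert(cap > 1);
    snprintf(foo, sizeof(foo), "%s/foo", dir);
    snprintf(bar, sizeof(bar), "%s/bar", dir);

    tmp = open(foo, O_CREAT | O_WRONLY, 0600);  /* touch foo */
    if (tmp < 0)
        return -1;
    close(tmp);

    fd = open(bar, O_CREAT | O_RDONLY, 0600);   /* touch bar; exec N<bar */
    if (fd < 0)
        return -1;

    if (rename(foo, bar) < 0) {                 /* mv foo bar */
        close(fd);
        return -1;
    }

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, buf, cap - 1);
    if (n >= 0)
        buf[n] = '\0';

    close(fd);
    unlink(bar);    /* clean up the file that now carries the name "bar" */
    return n;
}
```

Printing the contents of buf after a successful call shows whichever name the running kernel reports for the deleted file, so the result differs between kernels in the way described above.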

This change may appear to be nearly irrelevant; who is going to care about the apparent name of a deleted file that is no longer accessible via the filesystem? But it seems that there are scripts out there that do care. One case was outlined by Mikhail in his patch posting: the package update utility on ALT Linux will replace (via a rename) the executable for a running daemon process, then try to find existing processes that are running the older version of the executable. But renaming the new version of the executable on top of the previous one causes any process running the old version to appear to be running something else, so the upgrade process fails. Piotr Karbowski, who appears to have been the first to report the bug, stated that it made his system unusable. This behavior change did, in fact, cause systems to break.

The cause

Understanding this bug requires delving into struct dentry and a somewhat obscure function called switch_names() that handles rename operations. The dentry structure, since it is charged with name mapping, must contain the file name of interest. But that name can be stored in two different ways. If the length of the name is less than DNAME_INLINE_LEN (a value between 32 and 40, chosen for optimal structure alignment), that name will be stored within the dentry structure itself, in the d_iname array. Otherwise, the d_name.name field will point to an externally-allocated string.
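A simplified user-space model may make the two storage modes easier to picture. Everything here (struct model_dentry, MODEL_INLINE_LEN, and the helper functions) is a stand-in invented for illustration; the real structure has many more fields and its own allocation rules. The one detail borrowed from the kernel is that the name pointer refers to the inline array when the name is short, so comparing the two tells which mode is in use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_INLINE_LEN 32             /* stand-in for DNAME_INLINE_LEN */

/* Simplified model of the name-related parts of struct dentry. */
struct model_dentry {
    const char *name;                   /* models d_name.name */
    unsigned int len;                   /* models d_name.len */
    char iname[MODEL_INLINE_LEN];       /* models d_iname */
};

/* True when the name lives outside of the structure. */
static int model_name_external(const struct model_dentry *d)
{
    return d->name != d->iname;
}

/* Store a name: inline if it fits, in a heap copy otherwise. 0 on success. */
static int model_set_name(struct model_dentry *d, const char *s)
{
    size_t len;

    assert(d != NULL && s != NULL);
    len = strlen(s);
    if (len < MODEL_INLINE_LEN) {
        memcpy(d->iname, s, len + 1);
        d->name = d->iname;
    } else {
        char *copy = malloc(len + 1);

        if (!copy)
            return -1;
        memcpy(copy, s, len + 1);
        d->name = copy;
    }
    d->len = (unsigned int)len;
    return 0;
}

/* Free an external name, if any, and leave the model with an empty name. */
static void model_release_name(struct model_dentry *d)
{
    if (model_name_external(d))
        free((void *)d->name);
    d->iname[0] = '\0';
    d->name = d->iname;
    d->len = 0;
}
```

Note that, because the name pointer may refer to the structure's own storage, a model_dentry cannot be copied by plain assignment without fixing up that pointer afterward.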

The switch_names() function is defined like this:

    void switch_names(struct dentry *dentry, struct dentry *target);

Its job is to cause dentry to have the name currently associated with target. When moving names around, switch_names() must clearly pay attention to whether internal or external names are being used. Since there are two dentry structures to work with, there are four possible combinations. If both names are allocated externally, life is easy:

    swap(target->d_name.name, dentry->d_name.name);

One might wonder why the two names are exchanged in this way, since the stated purpose is only to affect dentry. The swap is done because target is about to disappear anyway, so its "name" is not really seen to matter anymore. Swapping allows this code to (1) avoid memory allocations, which, given how deep it is running within the VFS layer, is useful, and (2) not worry about freeing the old name associated with dentry, since that will now happen when target is freed anyway.

If, instead, both names are internal, memory allocations are not a concern. Prior to 3.15, the code for that case looked like this:

    memcpy(dentry->d_iname, target->d_name.name, target->d_name.len + 1);
    dentry->d_name.len = target->d_name.len;

This operation would leave both dentry structures appearing to have the same name — different behavior than that seen with the external-names case. Again, since target is expected to be destroyed soon, that difference should not really matter. It did start to matter, though, when the cross-rename feature (allowing the names of two files to be atomically swapped) was added in 3.15. In that case, the two names should be switched, as was done in the external case. So, in current kernels, the code looks like this:

    unsigned int i;

    for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
        swap(((long *) &dentry->d_iname)[i], ((long *) &target->d_iname)[i]);
    }

This funny-looking loop swaps the two inline names one long-sized word at a time, without the need for a name-sized temporary buffer or extra copying. (For completeness, the mixed internal/external cases also swap the names.)

As far as everybody could tell, the above code was correct; it made the internal-name case behave like the external-name case did. But if (1) one file is renamed on top of another, and (2) both files have short names, the user-visible behavior of the system changes, and that change caused programs to break. Breaking things in that way goes against one of the fundamental rules of kernel development, so some sort of fix needs to be put into place.
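The difference between the two both-internal behaviors can be shown with a small user-space model. This is only a sketch: struct short_name and the three functions are invented for the example, and the swap goes through a temporary buffer with memcpy() rather than the kernel's long-at-a-time loop, purely to stay within portable C.

```c
#include <assert.h>
#include <string.h>

#define MODEL_INLINE_LEN 32             /* stand-in for DNAME_INLINE_LEN */

/* Just the inline name and its length; everything else is omitted. */
struct short_name {
    char iname[MODEL_INLINE_LEN];
    unsigned int len;
};

static void name_init(struct short_name *n, const char *s)
{
    assert(strlen(s) < MODEL_INLINE_LEN);
    memset(n->iname, 0, sizeof(n->iname));
    strcpy(n->iname, s);
    n->len = (unsigned int)strlen(s);
}

/* Pre-3.15 behavior: dentry takes target's name; target keeps its own. */
static void names_copy(struct short_name *dentry, const struct short_name *target)
{
    memcpy(dentry->iname, target->iname, target->len + 1);
    dentry->len = target->len;
}

/* 3.15 behavior: the two names (and lengths) trade places. */
static void names_swap(struct short_name *dentry, struct short_name *target)
{
    char tmp[MODEL_INLINE_LEN];
    unsigned int tlen = dentry->len;

    memcpy(tmp, dentry->iname, sizeof(tmp));
    memcpy(dentry->iname, target->iname, sizeof(tmp));
    memcpy(target->iname, tmp, sizeof(tmp));
    dentry->len = target->len;
    target->len = tlen;
}
```

For "mv foo bar", dentry is the file being moved and target is the file being overwritten. With names_copy() the overwritten file still reads as "bar" afterward; with names_swap() it reads as "foo", which is the change that /proc exposed.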

The fix

Mikhail's original patch added a flag ("exchange") to switch_names(). If that flag is set (as would be the case for an atomic file swap operation), the 3.15 behavior holds; otherwise, the code would revert to the previous behavior for the both-internal case (names would still be swapped in the other cases). This patch was initially rejected; Linus called it "too ugly to live":

Yes, we had that hack before, but we didn't make it conditional. It historically was more of a "it's easier to just memcpy the name" than switch things around. Then that became accidental semantics, and that's all normal. But then when we make this explicit and intentional, I really think we should do it *right*, and either switch() the names around or just copy it.

Having a "switch_names()" function that *neither* switches *nor* copies, and giving it an argument to decide which, but not even do it *right*? That's just too f*cking disgusting for words.

A proper solution would, thus, cause the "just copy the name to the new dentry" behavior to happen on rename operations in all cases where an explicit swap has not been requested, even those which were not handled that way in the past. Implementing that behavior runs into a problem, though: in the case where both names are external, it may be necessary to allocate memory for a copy of the target file name. Such an allocation would have to be handled in atomic context and would slow down code that needs to run quickly. So a simple solution is not readily available.

Thus, the developers decided to merge a version of Mikhail's patch for 3.17, even if they don't like it. The patch has changed somewhat, since Al Viro took the opportunity to clean up the surrounding code. But that code will probably not last beyond the 3.17 release.

What is likely to happen, instead, is a variant of this patch from Al. It adds a reference count to external names, allowing those names to be "copied" by just incrementing the count. Actually freeing the name must be done conditionally based on the results of a decrement-and-test operation. There are some additional complications; the name may be accessed under read-copy-update (RCU) rules, for example, so the actual freeing must happen in an RCU callback. But the idea is simple enough and, since few places actually manipulate the names in dentry structures, the implementation is relatively small.
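The reference-counting idea can be sketched in user space with C11 atomics. This is not Al's patch; struct counted_name and its helpers are hypothetical names chosen for the example, and the RCU-deferred freeing that the kernel would need is replaced by an immediate free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a reference-counted external name. */
struct counted_name {
    atomic_int count;
    char name[];                /* flexible array member holds the string */
};

/* Allocate a name with a single reference held by the caller. */
static struct counted_name *name_alloc(const char *s)
{
    struct counted_name *n;
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    n = malloc(sizeof(*n) + len + 1);
    if (!n)
        return NULL;
    atomic_init(&n->count, 1);
    memcpy(n->name, s, len + 1);
    return n;
}

/* "Copy" a name by taking another reference; no allocation is needed. */
static struct counted_name *name_get(struct counted_name *n)
{
    atomic_fetch_add(&n->count, 1);
    return n;
}

/*
 * Drop a reference; the last one out frees the name. Returns 1 if the
 * name was freed. In the kernel the free would be deferred to an RCU
 * callback; this model frees immediately.
 */
static int name_put(struct counted_name *n)
{
    if (atomic_fetch_sub(&n->count, 1) == 1) {
        free(n);
        return 1;
    }
    return 0;
}
```

With something like this in place, giving a second dentry the same external name is a matter of calling name_get(), which neither allocates nor sleeps.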

Still, that is a larger change than anybody would like to see go into 3.17 at this point in the development cycle. So reference-counted external names in dentry structures will have to wait until 3.18. Meanwhile, Mikhail's fix has gone in for 3.17 and been marked for the stable updates, so the old behavior will return in the near future. This behavior was accidental and never documented, and the kernel developers seemingly believe that any code relying on it was poorly written to begin with. But, all of that notwithstanding, that behavior has become a part of the kernel ABI, so those developers will preserve it even if they don't like it.

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds