Fixing the unfixable autofs ABI
The "autofs" protocol is used to communicate between the kernel and an automounter daemon. It allows the automounter to set up special virtual filesystems that, when referenced by user space, can be replaced by a remote-mounted real filesystem. Much of this protocol is implemented with ioctl() calls on a special autofs device, but it also makes use of pipes between the kernel and user space when specific filesystems are mounted.
This protocol is certainly part of the kernel ABI, so its components have been defined with some care. One of the key elements of the autofs protocol is the autofs_v5_packet structure, which is sent from the kernel to user space via a pipe; it is used, among other things, to report that a filesystem has been idle for some time and should be unmounted. This structure looks like:
struct autofs_v5_packet { struct autofs_packet_hdr hdr; autofs_wqt_t wait_queue_token; __u32 dev; __u64 ino; __u32 uid; __u32 gid; __u32 pid; __u32 tgid; __u32 len; char name[NAME_MAX+1]; };
The size of every field is precisely defined, so this structure should look the same on both 32- and 64-bit systems. And it does, except for one tiny little problem. The size of the structure as defined is 300 bytes, which is not divisible by eight. So if two of these structures were to be placed contiguously in memory, the 64-bit ino field would have to be misaligned in one of them. To avoid this problem, the compiler will, on 64-bit systems, round the size of the structure up to a multiple of eight, adding four bytes of padding at the end. So sizeof() on struct autofs_v5_packet will return 300 on a 32-bit system, and 304 on a 64-bit system.
That disparity is not a problem most of the time, but there is an exception. Automounting is one of the many tasks being assimilated by the systemd daemon. When systemd reads one of the above structures from the kernel, it checks the size of what it read against its idea of the size of the structure to ensure that everything is operating as it should be. That check works just fine, as long as systemd and the kernel agree on that size. And normally they do, but there is an exception: if systemd is running as a 32-bit process on a 64-bit kernel, it will get a 304-byte structure when it is expecting 300 bytes. At that point, systemd concludes that something has gone wrong and gives up.
In February, Ian Kent merged a patch to deal with this problem. One could be forgiven for calling the solution hacky: on 64-bit systems, the kernel's automount code will subtract four from the size of that structure if (and only if) it is talking with a user-space client running in 32-bit mode. This patch makes systemd work in this situation; it was merged for 3.3-rc5 and fast-tracked into the various stable kernel releases. Everybody then lived happily ever after.
...except they didn't. It seems that the automount program from the autofs-tools package, which is still in use on a great many systems, had run into this problem a number of years ago. At that time, the autofs-tools developers decided to work around the problem in user space. So, if automount determines that it is running in 32-bit mode on a 64-bit kernel (Linus has little respect for how that determination is done, incidentally), it will correct its idea of what the structure size should be. If the kernel messes with that size, the automount "fix" no longer works, so Ian's patch fixes systemd at the cost of breaking automount.
So we are now in a situation where two deployed programs have different ideas of how the autofs protocol should work. On pure 32- or 64-bit systems, both programs work just fine, but, depending on which kernel is being run, one or the other of the two will break in the 32-on-64 configuration. If Ian's patch remains, some users will be most unhappy, but reverting it will upset other users. It is, in other words, a somewhat unfortunate situation.
Unfortunate, but not necessarily unrecoverable. One possible way to fix things can be seen in this patch from Michael Tokarev. In short, this patch looks at the name of the current command (current->comm) and compares it against "automount". If the currently-running program is called "automount," the structure-size tweak is not applied and things work again. For any other program (including systemd), the previous fix remains. So things are fixed at the expense of having the kernel ABI depend on the name of the running program. At best, this solution can be described as "inelegant." At worst, there may be some other, unknown program with a different name that breaks in the same way automount does; any such program will remain broken with this fix in place.
Still, Linus has conceded that "it's
probably what we have to go with
". But he preferred to look for
a less kludgy and more robust solution. One possibility was for the kernel
to look at the
size of the read() operation that would obtain the
autofs_v5_packet
structure prior to writing that structure; if that size is either 300 or
304, the kernel could give the
calling program the size it is expecting. The problem here is that
the read() operation is hidden behind the pipe, so the autofs
code does not actually have access to the size of the buffer provided by
user space.
So Linus came up with a different solution, the concept of "packetized pipes". A packetized pipe resembles the normal kind with a couple of exceptions: each write() is kept in a separate buffer, and a read() consumes an entire buffer, even if the size of the read is smaller than the amount of data in the buffer. With a packetized pipe, the kernel can always just write the larger (304-byte) structure size; if user space is only trying to read 300 bytes, then it will get what it expects and be happy. So there is no need for special hacks in the kernel, just a slightly different type of pipe dynamics. Following a suggestion from Alan Cox, Linus made an open with O_DIRECT turn on the packetized behavior, so user space can create such pipes if need be.
After a couple of false starts, Linus got this patch working and merged it just prior to the 3.4-rc5 release. So the 3.4 kernel should work fine for either automount or systemd.
The kernel community got a bit lucky here; it was possible for a suitably
clever and motivated developer to figure out a way to give both programs
what they expect and make the system work for everybody. The next time
this kind of problem arises, the solution may not be so simple.
Maintaining ABI stability is not always easy or fun, but it is necessary to
keep the system viable in the long term.
Index entries for this article | |
---|---|
Kernel | Development model/User-space ABI |
Kernel | Filesystems/autofs |
Posted Apr 30, 2012 17:39 UTC (Mon)
by kilpatds (subscriber, #29339)
[Link] (9 responses)
Posted Apr 30, 2012 17:44 UTC (Mon)
by corbet (editor, #1)
[Link]
There has been some talk in the discussion of a version 6 of the autofs protocol that would address a number of shortcomings, but there's no particular hurry to get that into place.
Posted May 3, 2012 14:18 UTC (Thu)
by stevem (subscriber, #1512)
[Link] (7 responses)
Posted May 3, 2012 15:12 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (6 responses)
So, what do you define as natural alignment?
Posted May 3, 2012 15:28 UTC (Thu)
by stevem (subscriber, #1512)
[Link] (4 responses)
64-bit types
and it's worked for me. It *might* not be the best layout for caching performance, but for API structures like ioctls I think it's a clean way to go.
Alternatively, add explicit padding in the structures to make them alignment-neutral. Much less clean, but better than having differently-sized structs depending on architecture.
Posted May 3, 2012 15:41 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (3 responses)
That doesn't fix this structure, as I understand it. The total size of the elements is 300 bytes, but the whole struct must be aligned to 8 bytes on 64 byte architectures, but only needs alignment 4 on the i386 ABI. Consider this much simpler structure: On the i386 ABI, sizeof(struct broken) is 12. On the x86-64 ABI, it's 16.
Posted May 3, 2012 15:50 UTC (Thu)
by stevem (subscriber, #1512)
[Link]
Posted May 3, 2012 18:46 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Yes, but if you add padding fields you can ensure the same alignment on all reasonable architectures: This way sizeof(struct fixed) is 16 bytes on i386 and x86-64. Of course, that doesn't help with the structure from this article, because the smaller i386 alignment was written into the ABI. IMHO, any structures exposed through the ABI should be naturally aligned and marked as "packed", with explicit padding fields where necessary; alignment should not be left to the compiler where ABI compatibility is a concern.
Posted May 3, 2012 19:10 UTC (Thu)
by jzbiciak (guest, #5246)
[Link]
That would generally be the best idea, but as you stated, it doesn't fix this structure, because the i386 ABI doesn't require natural alignment for 64-bit types, as per the definition you gave in your other comment. The reason I asked is that i386's ABI is perverse in requiring natural alignment for some, but not all data types. So, I wasn't sure if the concept of "natural" had different meaning in the context of i386's ABI. I tend to agree, and do this myself when I care about the actual layout of a struct. Ok, I usually forego 'packed' since it isn't portable. I instead make the struct a multiple of the largest data type, and force fields onto natural boundaries for their types. This practice is especially necessary when you have any "wire formats" or "on disk format." You can almost think of packets crossing the kernel/user-space boundary as having "wire format" semantics in this case, with the minor difference that pointers are more meaningful in user-space exchanges than they usually are over, say, a TCP link.
Posted May 3, 2012 18:30 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link]
"Natural alignment" is a term of art meaning that the address (or in this case, offset) of the data is a multiple of its size. It can be used independently of the target, and for any size, though it's usually applied to powers of two. For example, on most platforms with multiple page sizes, pages must be naturally aligned; you can't construct a 1MB hugepage from an arbitrary span of 256 4kB pages, but instead have to use a region aligned to a multiple of 1MB.
In this case it would mean aligning 64-bit fields to an offset which is a multiple of eight bytes, even on i386 where only 32-bit alignment is required.
Posted Apr 30, 2012 17:46 UTC (Mon)
by pr1268 (guest, #24648)
[Link] (10 responses)
I think autofs and systemd should get their s**t together. While I think it's proper and appropriate for the kernel to avoid breaking userspace, I firmly believe it's also not the kernel's job to fix broken userspace! </rant>
Posted Apr 30, 2012 18:19 UTC (Mon)
by drag (guest, #31333)
[Link] (5 responses)
Plus I don't understand why it's so terrible that systemd checks to make sure that the data it's getting from the kernel is the correct size.
Posted Apr 30, 2012 21:16 UTC (Mon)
by man_ls (guest, #15091)
[Link] (3 responses)
Posted Apr 30, 2012 21:19 UTC (Mon)
by felixfix (subscriber, #242)
[Link]
Posted May 1, 2012 8:03 UTC (Tue)
by mjt (guest, #1515)
[Link] (1 responses)
Now, reading just 300 bytes (size of that packet on 32bit userspace) instead of 304 bytes which a 64bit kernel sent to userspace does the trick, but the pipe buffer now holds 4 bytes, which will be read by userspace as a NEW packet. Even if new packet actually arrives, for the userspace the result will be complete garbage, since the packet boundary got lost.
At any rate, no userspace trick will fix already released versions of userspace applications, which is the whole point. Yes it could be made differently, but what's done is done and we don't want to break it.
Posted May 1, 2012 18:34 UTC (Tue)
by dlang (guest, #313)
[Link]
Posted May 4, 2012 17:44 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
Here's a closer analogy: John has 12 steps in his house. He likes to walk down them with his eyes closed and count the steps. Mary invites John over to her house and he resists, saying "I know where everything is in my house and I'm afraid I would get lost in your house." Mary assures him that her house is just like his in every relevant detail, and on that basis John accepts. But Mary has 13 steps and John, walking down with his eyes closed and counting to 12, falls and breaks his arm. Should Mary pay for his medical bills?
Did Mary break her promise? Is having one more step in the staircase is a relevant detail? Is a person entitled to walk down stairs with his eyes closed?
We typically say you can move a program from a 32 bit computer to a 64 bit one without change because the environment is the same in every relevant detail. We also say you can move a program from kernel release N to N+1 for the same reason. Automount and Systemd users are relying on some interpretation of those promises.
Posted Apr 30, 2012 18:26 UTC (Mon)
by vrfy (guest, #13362)
[Link]
Adding runtime matches for architectures in userland tools to find out what to expect from a running kernel interface is just not acceptable these days, and people should stop doing that. We need to fix what's broken, and not paper over it.
It's surely not the kernel's job to fix userland tools, but it's even less the job of userspace to work around obviously non-working kernel interfaces, when they should be fixed instead.
Posted Apr 30, 2012 18:27 UTC (Mon)
by arjan (subscriber, #36785)
[Link]
Posted May 1, 2012 23:59 UTC (Tue)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Posted May 2, 2012 5:02 UTC (Wed)
by dlang (guest, #313)
[Link]
In opensource software there is almost never a reason to do this, and never without discussing the problem first.
That said, I have no problem with the sanity check that systemd is doing. They are not working around anything, they are just checking that what they are getting has a chance of being what they expect to get. They know the size of the object that they are getting, and so they only get that much data. fetching more would potentially cause other grief.
Posted Apr 30, 2012 18:48 UTC (Mon)
by arnd (subscriber, #8866)
[Link]
Posted Apr 30, 2012 20:26 UTC (Mon)
by bdale (subscriber, #6829)
[Link] (12 responses)
Posted Apr 30, 2012 20:47 UTC (Mon)
by ernstp (guest, #13694)
[Link]
Posted May 1, 2012 9:14 UTC (Tue)
by etrusco (guest, #4227)
[Link] (10 responses)
Posted May 1, 2012 10:05 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (7 responses)
Posted May 1, 2012 10:29 UTC (Tue)
by Fowl (subscriber, #65667)
[Link]
Posted May 1, 2012 15:09 UTC (Tue)
by bdale (subscriber, #6829)
[Link] (2 responses)
Posted May 4, 2012 14:37 UTC (Fri)
by bronson (subscriber, #4806)
[Link] (1 responses)
Posted May 10, 2012 12:36 UTC (Thu)
by nye (subscriber, #51576)
[Link]
aptitude install linux-image-amd64
Actually, it might even be the default these days if you install the i686 version of Debian on a 64-bit machine.
Posted May 1, 2012 16:45 UTC (Tue)
by drag (guest, #31333)
[Link]
It's a very good option.
Posted May 11, 2012 13:21 UTC (Fri)
by WolfWings (subscriber, #56790)
[Link]
Especially since a LOT of the time you _need_ a 32-bit userland to deal with a lot of binary packages out there or to get stuff running under WINE properly, it's easier for the "end user" images to stick to 32-bit userland even on 64-bit platforms. And even if you have a 64-bit userland you'll find it littered with various 32-bit-compat libraries.
Servers? Those are a different story, but many of those folks don't even need 32-bit compatibility so they can end up with a cruft-free pure-64-bit userland without all the 'compat' crap bolted in. So it reduces the size of the true-64-bit install media too, since there's no 32-bit cruft now.
Posted May 16, 2012 23:04 UTC (Wed)
by steffen780 (guest, #68142)
[Link]
Posted May 1, 2012 17:14 UTC (Tue)
by jzbiciak (guest, #5246)
[Link] (1 responses)
In any case, that effectively enshrined the bug as a new facet of the Linux kernel ABI.
Posted May 1, 2012 17:21 UTC (Tue)
by raven667 (subscriber, #5198)
[Link]
Posted Apr 30, 2012 21:09 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
1) O_DIRECT doesn't what this patch says it should mean. O_DIRECT means "no buffering", not "weird read(2) games". O_DIRECT on a pipe ought to mean something like "block write(2) until someone blocks in read(2), then memcpy directly to the reader's buffer", and with this patch, O_DIRECT can never mean that for pipes.
2) This "packetized pipe" system looks like an invitation for data loss. It's one thing to create a pipe mode that preserves message boundaries (like NT's PIPE_TYPE_MESSAGE named pipes), but the right behavior should be to cause read(2) to fail unless the reader gets the whole packet. (That's how NT message pipes work.) Because of the silent data loss issue, these packetized pipes aren't truly general-purpose.
In general, this approach seems like an overly general solution to a very particular problem.
Posted May 1, 2012 2:46 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (3 responses)
O_DIRECT typically has significant alignment restrictions, and this new semantic has some sort of alignment restrictions (if you squint a bit).
O_SYNC is also available for pipes and the man page say "Any writes on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware", and if we understand "the memory in the reading process" to be "the underlying hardware", then your suggested feature (which I like) would fit to some degree with O_SYNC.
The data-loss issue sounds problematic, but we seem to have lived with it with SOCK_DGRAM sockets so I suspect we will survive with pipes. Hopefully the FIONREAD ioctl will provide the number of bytes that can be read, so a well written program can avoid losing data.
Posted May 1, 2012 7:26 UTC (Tue)
by butlerm (subscriber, #13312)
[Link] (2 responses)
Posted May 3, 2012 19:30 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (1 responses)
I suppose changing the type of pipe to userspace would be an even greater ABI change.
Posted May 10, 2012 5:53 UTC (Thu)
by kevinm (guest, #69913)
[Link]
I think it was mentioned somewhere in the thread that an SOCK_SEQPACKET socket would have been a better design, but hindsight is always an exact science.
Posted Apr 30, 2012 23:18 UTC (Mon)
by renox (guest, #23785)
[Link] (1 responses)
Posted May 1, 2012 15:40 UTC (Tue)
by juliank (guest, #45896)
[Link]
Posted Apr 30, 2012 23:59 UTC (Mon)
by ewan (guest, #5533)
[Link] (5 responses)
Posted May 1, 2012 0:36 UTC (Tue)
by hamjudo (guest, #363)
[Link] (4 responses)
The history of operating systems is filled with examples of bad design decisions that weren't caught in time, and thus got stuck in the interface definition.
Posted May 1, 2012 0:55 UTC (Tue)
by corbet (editor, #1)
[Link] (3 responses)
Posted May 1, 2012 15:08 UTC (Tue)
by ewan (guest, #5533)
[Link] (2 responses)
Posted May 1, 2012 15:38 UTC (Tue)
by hamjudo (guest, #363)
[Link]
Nothing, but common sense, prevents users from writing programs that use this feature.
Posted May 1, 2012 15:41 UTC (Tue)
by juliank (guest, #45896)
[Link]
Posted May 7, 2012 11:23 UTC (Mon)
by mjthayer (guest, #39183)
[Link] (4 responses)
Posted May 7, 2012 14:04 UTC (Mon)
by dgm (subscriber, #49227)
[Link] (1 responses)
Posted May 7, 2012 14:24 UTC (Mon)
by mjthayer (guest, #39183)
[Link]
Posted May 10, 2012 12:41 UTC (Thu)
by slashdot (guest, #22014)
[Link] (1 responses)
Posted May 11, 2012 16:49 UTC (Fri)
by amonnet (guest, #54852)
[Link]
Posted May 16, 2012 10:03 UTC (Wed)
by ldo (guest, #40946)
[Link] (3 responses)
This is why alignment on greater than one-byte boundaries should never be included in any interchange format or protocol, because invariably the architectural considerations that make them look like a good idea at one end in one situation will end up looking stupid at the other end or in another situation. And because it’s an interchange format, you cannot just simply change it without breaking compatibility.
Alignment only makes sense for internal program in-memory structures, which can be changed with a simple recompile and restart. Interchange formats should always be defined with no alignment padding. End of story.
Posted May 16, 2012 10:33 UTC (Wed)
by amonnet (guest, #54852)
[Link] (2 responses)
Posted May 16, 2012 10:50 UTC (Wed)
by amonnet (guest, #54852)
[Link]
Posted May 16, 2012 12:15 UTC (Wed)
by paulj (subscriber, #341)
[Link]
Fixing the unfixable autofs ABI
Because it's way too late to add "packed," that would have done nothing to fix the current users.
Packed
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
32-bit types
16
8
Fixing the unfixable autofs ABI
struct broken
{
__u64 f1;
__u32 f2;
};
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
struct fixed
{
__u64 f1;
__u32 f2;
__u32 pad1;
};
Fixing the unfixable autofs ABI
IMHO, any structures exposed through the ABI should be naturally aligned and marked as "packed", with explicit padding fields where necessary; alignment should not be left to the compiler where ABI compatibility is a concern.
Fixing the unfixable autofs ABI
Fixing broken userspace is NOT the kernel's job
Fixing broken userspace is NOT the kernel's job
But why check only for exactly 300 bytes? It would be understandable if systemd would check that the size of the data is >=300. That way there would be room to add padding or even new fields at the end, if needed.
Magic numbers
Magic numbers
Magic numbers
Magic numbers
That's the wrong analogy for this debate. I'm sure you can think of a 100 cases where users got burned by a design decision in a user space program 5 years ago, and a kernel change today could compensate for it, but the consensus is the users should suffer.
Fixing broken userspace is NOT the kernel's job
Fixing broken userspace is NOT the kernel's job
http://lists.freedesktop.org/archives/systemd-devel/2011-...
Fixing broken userspace is NOT the kernel's job
Fixing broken userspace is NOT the kernel's job
Fixing broken userspace is NOT the kernel's job
not all 32 bit compat, just x86
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
As I see it systemd simply "discovered" a bug in the kernel and the maintainers decided a general solution in the kernel was the best. And why do you say it was "fast-tracked"? 5 moths don't seem that rushed to me - but of course it's a bit unpleasant that they failed to notice that the other obvious user of the API had a different "fix" during all that time.
Nobody can know everything, but...
Why? Seems like a perfectly reasonable situation.
No bloated 64bit pointers but 4gb per process.
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Nobody can know everything, but...
Packetized pipes seem like a bad idea.
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Discarding data isn't so much of a problem as knowing when data has been discarded. recvmsg(2) is nice for that. You generally can't have message per read semantics for arbitrary size messages in any case because the kernel buffers must store an entire message to avoid unusual complexities. (Ask the SCTP people about the fun problems you get trying to implement a point-to-multipoint socket that supports arbitrary size messages.)Fixing the unfixable autofs ABI
Considering that reality, all you really need is some reliable way to know whether your message has been truncated. Sockets have MSG_PEEK / MSG_TRUNC for that, so I guess the question is why wouldn't it be practical to make recvmsg(2) work on a packetized pipe so that you can do the same thing?
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
This is a two part kludge. Part one is adding the "unusual" behavior to pipes. Part two, is making it so that the kernel side of the autofs code will set the flag in the right situations.Fixing the unfixable autofs ABI
Almost. The kernel side always turns on packetized behavior so that things work as expected with all known user-space binaries.
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
It is off unless, something turns on O_DIRECT. The kernel side of the autofs interface turns on O_DIRECT. As far as I know, it isn't turned on anywhere else in the kernel (yet).Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
Fixing the unfixable autofs ABI
The Only Natural Alignment Is Byte Alignment
The Only Natural Alignment Is Byte Alignment
Anyway, it's not as if 16, 32, 64, .. are just some random numbers.
The Only Natural Alignment Is Byte Alignment
The Only Natural Alignment Is Byte Alignment