
A kernel change breaks GlusterFS

Posted Mar 28, 2013 10:55 UTC (Thu) by mkerrisk (subscriber, #1978)
In reply to: A kernel change breaks GlusterFS by zlynx
Parent article: A kernel change breaks GlusterFS

> But there *was* no ABI change. A 64-bit value continued to hold 64 bits.

The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. One can argue that it was a questionable decision, but given the problem they were trying to solve, and the constraints on how much information they could pass in the cookie sent over NFSv3, it wasn't a completely insane thing to do, given the observed kernel behavior.

> This is exactly like C programmers who complain when their undefined behavior changes.

The analogy doesn't really hold. For C, there is a very carefully defined standard that thoroughly specifies behavior and notes the cases where behavior is undefined. For much of the kernel API, there is nothing like such precise documentation/specification. This leaves user-space programmers trying to make guesses about what is or is not permissible, and that is exactly the hole that the Gluster folk fell into. And as noted by another commenter, the Samba folk fell into the same hole. The fact that two independent groups fell into the same hole is quite telling, in my view.

> Back to kernel examples, if some user-space program begins relying on the number of columns when parsing /proc files, is that the kernel's problem? No. That is just a badly written program.

I think that's a weak example to support your argument, because the advice that one should parse /proc defensively is reasonably well known. And don't get me wrong, your argument is reasonable, but I think it's far from definitive.

Returning to my point about documented standards versus "the behavior as (it appears to be) implemented"... The Linux kernel violates standards in a number of places, and when it comes down to contradictions between documented behavior (man pages and standards) versus existing implementation, Linus always firmly plumps for the latter (unless the existing behavior is causing actual pain to user space).

And take a look at the EPOLLWAKEUP example referred to in the article. In that case, the problem was that a program was setting random bits in the epoll API that formerly had no meaning. The application had *no* good reason to set those bits, because they had absolutely no effect (and unfortunately there was no kernel check to give an error when that was done). When someone tried to give those bits a meaning, the application broke. The response was not to say: user space, go fix your stupid application; that would just inflict pain on thousands of users as their binaries break. Instead the response was: we'll need to modify this kernel patch in such a way that it does not break user space (and that change _decreased_ the usability of the kernel feature that was added). The argument in that case that the kernel should change was much weaker than the argument would have been for accommodating GlusterFS, if the GlusterFS problem had actually been detected in time.

You can't have it both ways. Linus pretty consistently goes one way, and I can see his point (though I've disagreed with some specific cases in the past).



A kernel change breaks GlusterFS

Posted Mar 28, 2013 17:54 UTC (Thu) by jimparis (guest, #38647) [Link]

> The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. One can argue that it was a questionable decision, but given the problem they were trying to solve, and the constraints on how much information they could pass in the cookie sent over NFSv3, it wasn't a completely insane thing to do, given the observed kernel behavior.

There are two things they could have done that would have made it less insane:

- Ask the kernel developers if it's OK to assume high bits are zero

- Verify the assumption with an assert (instead of missing files and going into infinite loops)

A kernel change breaks GlusterFS

Posted Mar 29, 2013 17:24 UTC (Fri) by giraffedata (guest, #1954) [Link]

> The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter.

According to what you wrote, they did more than that. They looked at the man page, which is the closest Linux comes to an ABI specification. They noted the language in there, which could have said "continuation cookie" or something, but actually uses the word "offset." This suggests that it is a very old interface from days before we understood layering the way we do now and that it is (or at least has to emulate) a byte offset within the stored representation of the directory. They probably noticed that the C type of the member is the one used for file offsets. That means only in a huge directory could it have nonzero upper bits.

It's not clear to me how they disposed of the possibility of a huge directory, but that's probably beside the point.

But I have to say I assume the developers knew they were taking a risk. Common sense would have told them that modern filesystems, especially the ones that haven't been invented yet, don't have such simple implementations of directories, and their developers might well use the upper 32 bits freely. They decided to trade future breakage for the present benefits that truncating the offset gave them. We tend to remember those tradeoffs differently after payment becomes due.

A kernel change breaks GlusterFS

Posted Apr 6, 2013 16:56 UTC (Sat) by jra (subscriber, #55261) [Link] (1 responses)

For Samba at least it's certainly just a bug in our code.

It's interesting how it happened though. The cookie returned from telldir() is defined as a 'long', not a fixed length type. Back in the day on simpler filesystems, this used to be the index into the directory.

We were lazy and just naturally assumed it would always be such, and back in the 32-bit days a long fit into a 32-bit DOS protocol search directory field, so we just stuffed it in there.

Modern CIFS/SMB/SMB[2|3] use a last-filename character string to restart a search, not an integer index, so newer clients never run into this problem. It's only old DOS clients that use the 32-bit index protocol request. So when the underlying systems went from 32-bit to 64-bit, we mostly didn't notice (everyone was running Windows rather than DOS, plus the kernel still didn't use the top 32 bits of the now 64-bit long returned from telldir()).

Then the kernel changed, the top 32 bits started to be used, and old DOS clients broke. Oops. As it's old DOS clients, we still haven't fixed this (there really aren't that many still out there; if there were, the NAS vendors would have been screaming for a fix long before now).

So what happened? Ignorance, laziness, and errors on our part - mostly. However, if the telldir() interface had returned a deliberately opaque struct as a cookie, instead of an integer that was expected to be an index, then we probably wouldn't have made this error.

So I'd argue internal semantics of telldir() changed (from index to cookie), but the ABI didn't (the cookie was just hidden inside what used to be the index).

Still, I don't think the kernel should change. We just need to fix our crappy userspace code and learn our lesson in future :-).

A kernel change breaks GlusterFS

Posted Apr 12, 2013 13:07 UTC (Fri) by meuh (guest, #22042) [Link]

> For Samba at least it's certainly just a bug in our code.

Or it's a bug in the API: why not use an off_t type?

> It's interesting how it happened though. The cookie returned from telldir() is defined as a 'long', not a fixed length type. Back in the day on simpler filesystems, this used to be the index into the directory.

telldir() returns a long, but readdir() returns a struct dirent that, under Linux (see readdir(3)), holds an off_t d_off field. d_off might be 64 bits wide even on a 32-bit system with Large File Support (compiled with -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE), while long is still 32 bits wide. So after the ext4 cookie extension, one might be surprised to find that d_off could hold a value bigger than the one returned by telldir(3).

Fortunately, readdir(3) is implemented using getdents(2), whose d_off field is an unsigned long on all configurations.

You might want to read my other comment regarding extending to 64 bits something that's going to be a 32-bit value anyway.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds