LWN: Comments on "A kernel change breaks GlusterFS" http://lwn.net/Articles/544298/ This is a special feed containing comments posted to the individual LWN article titled "A kernel change breaks GlusterFS". hourly 2 A kernel change breaks GlusterFS http://lwn.net/Articles/546273/rss 2013-04-06T16:56:28+00:00 jra <div class="FormattedComment"> For Samba at least it's certainly just a bug in our code.<br> <p> It's interesting how it happened though. The cookie returned from telldir() is defined as a 'long', not a fixed length type. Back in the day on simpler filesystems, this used to be the index into the directory.<br> <p> We were lazy and just naturally assumed it would always be such, and back in the 32-bit days a long fit into a 32-bit DOS protocol search directory field, so we just stuffed it in there.<br> <p> Modern CIFS/SMB/SMB[2|3] use a last-filename character string to restart a search, not an integer index, so newer clients never run into this problem. It's only old DOS clients that use the 32-bit index protocol request. So when the underlying systems went from 32-bit to 64-bit, we mostly didn't notice (everyone was running Windows rather than DOS, plus the kernel still didn't use the top 32-bits of the now 64-bit long return from telldir()).<br> <p> Then the kernel changed, the top 32-bit started to be used and old DOS clients broke. Oops. As it's old DOS clients we still haven't fixed this (there really aren't that many still out there, if there were the NAS vendors would have been screaming for a fix long before now).<br> <p> So what happened ? Ignorance, lazyness and errors on our part - mostly. However, if the telldir() interface had returned a deliberately opaque struct as a cookie instead of an integer that was expected to be an index, then we probably wouldn't have made this error.<br> <p> So I'd argue internal semantics of telldir() changed (from index to cookie), but the ABI didn't (the cookie was just hidden inside what used to be the index).<br> <p> Still, I don't think the kernel should change. We just need to fix our crappy userspace code and learn our lesson in future :-).<br> <p> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/546080/rss 2013-04-05T06:20:53+00:00 mkerrisk <div class="FormattedComment"> <font class="QuotedText">&gt; Btw, the patch has a "Patch-updated-by" and "blame me" description. So Fan Yong shouldn't be blamed alone...</font><br> <p> Bernd,<br> <p> It was not my intention to assign blame, and I'm sorry if the article gave that impression. The change that Fan Yong (or something like it) made was obviously needed. And, at some level, I could imagine that the FS developers didn't see this as an ABI change. My point was a general one: even things that may not look like ABI changes may in fact be so, and developers as a group perhaps need to be even more vigilant about that possibility. (The message queue change is a similar kind of case. One could easily have thought of it as *not* being an ABI change, but it was.)<br> <p> Thanks,<br> <p> Michael<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/546019/rss 2013-04-04T19:23:29+00:00 aakef <div class="FormattedComment"> Btw, the patch has a "Patch-updated-by" and "blame me" description. So Fan Yong shouldn't be blamed alone...<br> <p> <p> Bernd<br> </div> The tree makes noise, but we need a test to listen for it http://lwn.net/Articles/545312/rss 2013-03-30T17:30:11+00:00 davecb <div class="FormattedComment"> Many years ago I was on the ABI stability team at Sun, and we actively tried to test all the "must be true" assertions in library interfaces like this. The easiest were changes in the nature of parameters: the hardest were changes in the meanings of otherwise unchanged parameters.<br> <p> Ones like this we would have considered either "uninterpreted opaque cookie" (good) or "must be zero" (bad). Finding the latter usually resulted in version-number changes, weird backward-compatibility tests to see if anyone had stolen the upper half for anything and debates about how to change the documentation.<br> <p> As a side effect, we tried to write a static test or shared-library test we could run against any apps that were being compatibility-tested by their vendors, and we recorded the difference in our porting database, so that if another vendor or an old SunOS system used it, it would get fixed in a port. <br> <p> The latter, IMHO, was the really valuable part: if Hockey-PUX had the bug, we'd refrain from reproducing it on applications ported to Solaris. Linux rarely has such bugs: other vendors (including ourselves) weren't as fortunate.<br> <p> --dave<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/545222/rss 2013-03-29T17:24:22+00:00 giraffedata <blockquote> The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. </blockquote> <p> According to what you wrote, they did more than that. They looked at the man page, which is the closest Linux comes to an ABI specification. They noted the language in there, which could have said "continuation cookie" or something, but actually uses the word "offset." This suggests that it is a very old interface from days before we understood layering the way we do now and that it is (or at least has to emulate) a byte offset within the stored representation of the directory. They probably noticed that the C type of the member is the one used for file offsets. That means only in a huge directory could it have nonzero upper bits. <p> It's not clear to me how they disposed of the possibility of a huge directory, but that's probably beside the point. <p> But I have to say I assume the developers knew they were taking a risk. Common sense would have told them that modern filesystems, especially the ones that haven't been invented yet, don't have such simple implementations of directories and their developers might well use the upper 32 bits freely. They decided to trade the future breakage for all the present benefits truncating the offset gave them . We tend to remember those tradeoffs differently after payment becomes due. A kernel change breaks GlusterFS http://lwn.net/Articles/545126/rss 2013-03-29T09:48:28+00:00 dvdeug <div class="FormattedComment"> Sure, I get that. But if you get an opaque id that takes 64 bits, and you don't preserve all 64 bits, you've broken the API, not the kernel. Don't mumble things about offsets and your constraints; it's your fault. At the very least, it never would have got into an infinite loop if they'd checked their assumption on every offset they got from the kernel.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/545096/rss 2013-03-29T05:19:36+00:00 alankila <div class="FormattedComment"> Yes, but apparently there is no space in the NFSv3 protocol used to encode the information anywhere else. This is based on the words of the article, not knowledge. Even after the problem is known, the extra bits still go somewhere into the readdir() cookie, only now they're stored in some different way. Eww.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/545039/rss 2013-03-28T21:02:07+00:00 Cyberax <div class="FormattedComment"> Larger cookie size. 64 bits are not really enough for a lots of uses.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/545028/rss 2013-03-28T19:52:50+00:00 dlang <div class="FormattedComment"> Why, it already has a field for the cookie, what benefit would a second field provide?<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/545023/rss 2013-03-28T19:35:45+00:00 xorbe <div class="FormattedComment"> This whole mess is one big hack. Sounds like NFS should have a defined spare field just for cookie usage.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544996/rss 2013-03-28T17:54:24+00:00 jimparis <div class="FormattedComment"> <font class="QuotedText">&gt; The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. One can argue that it was a questionable decision, but given the problem they were trying to solve, and the constraints on how much information they could pass in the cookie sent over NFSv3, it wasn't a completely insane thing to do, given the observed kernel behavior.</font><br> <p> There are two things they could have done that would have made it less insane:<br> <p> - Ask the kernel developers if it's OK to assume high bits are zero<br> <p> - Verify the assumption with an assert (instead of missing files and going into infinite loops)<br> <p> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544965/rss 2013-03-28T16:29:19+00:00 Cyberax <div class="FormattedComment"> Use "last returned filename" as a cookie. Easy and simple - Amazon does this for S3 and it works splendidly.<br> </div> The ext4 change 'breaks' Samba too http://lwn.net/Articles/544896/rss 2013-03-28T13:34:00+00:00 bfields <p>Once handed out to an NFS client, you don't know when the client may use a cookie; could be much later, could be after we've rebooted. So that internal mapping would have to be kept forever, on disk (to survive reboots). <p>You could do it, but I think you'd quickly start feeling like you were doing a ton of work that the filesystem should really be doing for you. <p>SMB might make this easier, I don't know. <p>It might be possible to extend the NFS protocol to use some kind of directory-read pointer with a more limited lifetime. But that won't solve anybody's problem today, at least not until we get the IETF a time machine. (Four days left for someone to publish an RFC addressing the paradoxes inherent to time-travelling standards processes.) A kernel change breaks GlusterFS http://lwn.net/Articles/544850/rss 2013-03-28T10:55:20+00:00 mkerrisk <blockquote> <font color="purple"> But there *was* no ABI change. A 64-bit value continued to hold 64 bits. </font> </blockquote> <p> The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. One can argue that it was a questionable decision, but given the problem they were trying to solve, and the constraints on how much information they could pass in the cookie sent over NFSv3, it wasn't a completely insane thing to do, given the observed kernel behavior. <blockquote> <font color="purple"> This is exactly like C programmers who complain when their undefined behavior changes. </font> </blockquote> <p> The analogy doesn't really hold. For C, there is a very carefully defined standard that thoroughly specifies behavior and notes the cases where behavior is undefined. For much of the kernel API, there is nothing like such precise documentation/specification. This leaves user-space programmers trying to make guesses about what is or is not permissible, and that is exactly the hole that the Gluster folk fell into. And as noted by another commenter, the Samba folk fell into the same hole. The fact that <i>two</i> independent groups fell into the same hole is quite telling, in my view. <blockquote> <font color="purple"> Back to kernel examples, if some user space program begins relying on the number of columns when parsing the proc files is that the kernel's problem? No. That is just a badly written program. </font> </blockquote> <p> I think that's a weak example to support your argument, because the advice that one should parse /proc defensively is reasonably well known. And don't get me wrong, your argument is reasonable, but I think it's far from definitive. <p> Returning to my point about documented standards versus "the behavior as (it appears to be) implemented"... The Linux kernel violates standards in a number of places, and when it comes down to contradictions between documented behavior (man pages and standards) versus existing implementation, Linus always firmly plumps for the latter (unless the existing behavior is causing actual pain to user space). <p> And take a look at the EPOLLWAKEUP example referred to in the article. In that case, the problem was that a program was setting random bits in the epoll API that formerly had no meaning. The application had *no* good reason to set those bits, because they had absolutely no effect (and unfortunately there was no kernel check to give an error when that was done). When someone tried to give those bits a meaning, the application broke. The response was not to say: user-space, go fix your stupid application; that would just inflict pain on thousands of users as their binaries break. Instead the response was: we'll need to modify this kernel patch in such a way that it does not break user-space (and that changed _decreased_ the usability of the kernel feature that was added). The argument in that case that the kernel should change was much weaker than the argument would have been for accommodating GlusterFS, if the GlusterFS problem had actually been detected in time <p> You can't have it both ways. Linus pretty consistently goes one way, and I can see his point (though I've disagreed with some specific cases in the past). A kernel change breaks GlusterFS http://lwn.net/Articles/544855/rss 2013-03-28T10:32:26+00:00 paulj <div class="FormattedComment"> Sure. There's 2 trains of thought going through my head:<br> <p> a) How to get a stable order over fixed-length IDs (dir size limited to the ID space then).<br> <p> b) Assuming you can't get that, how would you design sane APIs that still allowed iteration of directories?<br> <p> For b, it's names. However, I still don't understand why 'a' isn't possible. See my comment on the companion ext4 article about using a ring for the ID space.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544845/rss 2013-03-28T09:52:52+00:00 dlang <div class="FormattedComment"> Sure, except that there are those pesky legacy constraints.<br> <p> The latest NFS protocol specifies that the token passed over the wire is a 64 bit value (earlier versions specified a 32 bit value)<br> <p> Other network filesystems have similar specifications.<br> <p> In fact, from a little googling, it looks like the POSIX spec says that these cookies are of type 'long'. this makes changing it to something like a variable length string like you would need to use the filename extremely hard.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544835/rss 2013-03-28T09:20:15+00:00 paulj <div class="FormattedComment"> So really it'd be much better (ignoring legacy constraints) to use the filename as the handle for any readdir-like directory iteration operation?<br> </div> The ext4 change 'breaks' Samba too http://lwn.net/Articles/544816/rss 2013-03-28T07:54:28+00:00 Lumag <div class="FormattedComment"> Why can't GlusterFS also have an internal mapping between new cookies and some random bitstring given as a part of NFS' readdir() ?<br> </div> The ext4 change 'breaks' Samba too http://lwn.net/Articles/544772/rss 2013-03-28T02:43:18+00:00 abartlet <div class="FormattedComment"> It is interesting to see this article, because this same change (from the description) also 'breaks' Samba.<br> <p> We see the same thing, and have had to blacklist a number of our testsuites from our automated testing, because when an older diretory searching/enumeration protocol is used, we mapped the readdir() cookie onto a 4 byte element in that protocol. These tests in turn started spinning in an infinite loop due to the truncation.<br> <p> Now, the fix for us is presumably to have some other in-memory mapping from 64-bit back to 32-bit for the cookies Samba clients have obtained. Naturally we also would need to avoid a DoS by allocation of an infinite number of mapped cookies, and other challenges, which along with the real-world use case for the interface being old DOS clients, means we haven't got around to it yet. <br> <p> I only mention this to indicate that this was noticed beyond GlusterFS. <br> <p> Andrew Bartlett<br> Samba Team<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544762/rss 2013-03-28T00:57:22+00:00 ringerc <div class="FormattedComment"> I must agree there. Relying on certain parts of a field being unset when they're not explicitly documented as available for use is unwise. You can do it, but if it breaks you get to keep the pieces.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544759/rss 2013-03-28T00:35:35+00:00 bfields <p>Traditionally the cookie is a byte-offset into some linear representation of the directory. If you do that, then in practice a "64 bit" cookie in practice probably isn't going to exceed 2^32 until you have millions of entries. <p>ext4 is unusual in that it uses a hash of the filename as the cookie--so the 64-bit cookies are effectively random 64-bit sequences, and really do use the high bits. <p>(Well, OK, actually they're limited to 63 bits, but you get the idea.) A kernel change breaks GlusterFS http://lwn.net/Articles/544749/rss 2013-03-27T23:43:11+00:00 bronson <div class="FormattedComment"> Ha, I read the articles in the wrong order. Last paragraph of the companion article: <a href="https://lwn.net/Articles/544520/">https://lwn.net/Articles/544520/</a><br> <p> Sounds like XFS also uses 32 bit cookies but uses some additional memory to ensure it doesn't suffer the collision problem.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544747/rss 2013-03-27T23:37:22+00:00 rvolgers <div class="FormattedComment"> Wow, talk about piling ugly hacks upon ugly hacks. A well-explained hack is still a hack!<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544746/rss 2013-03-27T23:32:05+00:00 bronson <div class="FormattedComment"> I was wondering that myself. What singles ext4 out for breakage? Hoping someone knows.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544743/rss 2013-03-27T23:06:01+00:00 zlynx <div class="FormattedComment"> But there *was* no ABI change. A 64-bit value continued to hold 64 bits.<br> <p> This is exactly like C programmers who complain when their undefined behavior changes. Like reading a[-1] or depending on signed integer wrap or reading a double through an integer pointer.<br> <p> It never should have worked. It never did work except on your one compiler and machine combination. And changing it isn't the compiler's problem.<br> <p> Back to kernel examples, if some user space program begins relying on the number of columns when parsing the proc files is that the kernel's problem? No. That is just a badly written program.<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544739/rss 2013-03-27T22:25:02+00:00 nix <div class="FormattedComment"> Given Jeff Darcy's last review comment... I don't think it's going in until that bug is fixed. :)<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544734/rss 2013-03-27T22:06:36+00:00 jengelh <div class="FormattedComment"> If glusterfs has no problem with 64-bit cookies that XFS uses, why would gluster suddenly have an issue with 64-bit cookies from ext4? Or is it that the gluster on-disk format is dependent upon the fstype?<br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544726/rss 2013-03-27T21:38:55+00:00 dvdeug <div class="FormattedComment"> I see their problem, but isn't it quite problematic to assume a 64-bit index only stores data in the bottom 32 bits? <br> </div> A kernel change breaks GlusterFS http://lwn.net/Articles/544720/rss 2013-03-27T21:14:57+00:00 bfields There's a proposed solution for glusterfs (with an excellent changelog) at <a href="http://review.gluster.org/#change,4711">http://review.gluster.org/#change,4711</a>.