
Throwing one away

By Jonathan Corbet
September 19, 2012
One of the (many) classic essays in The Mythical Man-Month by Frederick Brooks is titled "Plan to throw one away." Our first solution to a complex software development problem, he says, is not going to be fit for the intended purpose. So we will end up dumping it and starting over. The free software development process is, as a whole, pretty good at the "throwing it away" task; some would argue that we're too good at it. But there are times when throwing one away is hard; the current discussion around control groups in the kernel shows how such a situation can come about.

What Brooks actually said (in the original edition) was:

In most projects, the first system built is barely usable. It may be too slow, too big, awkward to use, or all three. There is no alternative but to start again, smarting but smarter, and build a redesigned version in which these problems are solved. The discard and redesign may be done in one lump, or it may be done piece-by-piece. But all large-system experience shows that it will be done.

One could argue that free software development has taken this advice to heart. In most projects of any significant size, proposed changes are subjected to multiple rounds of review, testing, and improvement. Often, a significant patch set will go through enough fundamental changes that it bears little resemblance to its initial version. In cases like this, the new subsystem has, in a sense, been thrown away and redesigned.

In some cases it's even more explicit. The 2.2 kernel, initially, lacked support for an up-and-coming new bus called USB. Quite a bit of work had gone into the development of a candidate USB subsystem which, most people assumed, would be merged sometime soon. Instead, in May 1999, Linus looked at the code and decided to start over; the 2.2.7 kernel included a shiny new USB subsystem that nobody had ever seen before. That code incorporated lessons learned from the earlier attempts and was a better solution — but even that version was eventually thrown away and replaced.

Brooks talks about the need for "pilot plant" implementations to turn up the problems in the initial implementation. Arguably we have those in the form of testing releases, development trees, and, perhaps most usefully, early patches shipped by distributors. As our ability to test for performance regressions grows, we should be able to do much of our throwing-away before problems in early implementations are inflicted upon users. For example, the 3.6 kernel was able to avoid a 20% regression in PostgreSQL performance thanks to pre-release testing.

But there are times when the problem is so large and so poorly understood that the only way to gain successful "pilot plant" experience is to ship the best implementation we can come up with and hope that things can be fixed up later. As long as the problems are internal, this fixing can often be done without creating trouble for users. Indeed, the history of most software projects (free and otherwise) can be seen as an exercise in shipping inferior code, then reimplementing things to be slightly less inferior and starting over again. The Linux systems we run today, in many ways, look like those of ten years or so ago, but a great deal of code has been replaced in the years since those systems were shipped.

But what happens when the API design is part of the problem? User interfaces are hard to design and, when they turn out to be wrong, they can be hard to fix. It turns out that users don't like it when things change on them; they like it even less if their programs and scripts break in the process. As a result, developers at all levels of the stack work hard to avoid the creation of incompatible changes at the user-visible levels. It is usually better to live with one's mistakes than to push the cost of fixing them onto the user community.

Sometimes, though, those mistakes are an impediment to the creation of a proper solution. As an example, consider the control groups (cgroups) mechanism within the kernel. Control groups were first added to the 2.6.24 kernel (January, 2008) as a piece of the solution to the "containers" problem; indeed, they were initially called "process containers." They have since become one of the most deeply maligned parts of the kernel, to the point that some developers routinely threaten to rip them out when nobody is looking. But the functionality provided by control groups is useful and increasingly necessary, so it's not surprising that developers are trying to identify and fix the problems that have been exposed in the current ("pilot") control group implementation.

As can be seen in this cgroup TODO list posted by Tejun Heo, a lot of those problems are internal in nature. Fixing them will require a lot of changes to kernel code, but users should not notice that anything has changed at all. But there are some issues that cannot be hidden from users. In particular: (1) the cgroup design allows for multiple hierarchies, with different controllers (modules that apply policies to groups) working with possibly different views of the process tree, and (2) the implementation of process hierarchies is inconsistent from one controller to the next.

Multiple hierarchies seemed like an interesting feature at the outset; why should the CPU usage controller be forced to work with the same view of the process tree as, say, the memory usage controller? But the result is a more complicated implementation that makes it nearly impossible for controllers to coordinate with each other. The block I/O bandwidth controller and the memory usage controller really need to share a view of which control group "owns" each page in the system, but that cannot be done if those two controllers are working with separate trees of control groups. The hierarchy implementation issues also make coordination difficult while greatly complicating the lives of system administrators who need to try to figure out what behavior is actually implemented by each controller. It is a mess that leads to inefficient implementations and administrative hassles.
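
To make the multiple-hierarchy point concrete, here is a minimal sketch of how an administrator ends up with two controllers seeing entirely different trees; the mount points and group names are illustrative assumptions, not anything the cgroup code prescribes (the program needs root, and the mount points must already exist):

    /* Two separate cgroup (v1) hierarchies: each mount picks its own
     * controller, so "cpu" and "memory" get independent trees of groups. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Hierarchy #1: only the cpu controller. */
        if (mount("cgroup", "/sys/fs/cgroup/cpu", "cgroup", 0, "cpu"))
            perror("mount cpu");

        /* Hierarchy #2: only the memory controller, a separate tree. */
        if (mount("cgroup", "/sys/fs/cgroup/memory", "cgroup", 0, "memory"))
            perror("mount memory");

        /* Groups created in one hierarchy simply do not exist in the other. */
        mkdir("/sys/fs/cgroup/cpu/webserver", 0755);
        mkdir("/sys/fs/cgroup/memory/batch", 0755);
        return 0;
    }

A process can then be charged to "webserver" for CPU purposes while its memory use is accounted to "batch" (or to no group at all), which is exactly the coordination problem described above.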

How does one fix a problem like this? The obvious answer is to force the use of a single control group hierarchy and to fix the controllers to implement their policies over hierarchies in a consistent manner. But both of those are significant, user-visible API and behavioral changes. And, once again, a user whose system has just broken tends to be less than appreciative of how much better the implementation is.

In the past, operating system vendors have often had to face issues like this. They have responded by saving up all the big changes for a major system update; users learned to expect things to break over such updates. Perhaps the definitive example was the transition from "Solaris 1" (usually known as SunOS 4 in those days) to Solaris 2, which switched the entire system from a BSD-derived base to one from AT&T Unix. Needless to say, lots of things broke in the process. In the Linux world, this kind of transition still happens with enterprise distributions; RHEL7 will have a great many incompatible changes from RHEL6. But community distributions tend not to work that way.

More to the point, the components that make up a distribution are typically not managed that way. Nobody in the kernel community wants to go back to the pre-2.6 days when major features only got to users after a multi-year delay. So, if problems like those described above are going to be fixed in the kernel, the kernel developers will have to figure out a way to do it in the regular, approximately 80-day development cycle.

In this case, the plan seems to be to prod users with warnings of upcoming changes while trying to determine if anybody really has difficulties with them. So, systems where multiple cgroup hierarchies are in use will emit warnings to the effect that the feature is deprecated and inviting email from anybody who objects. Similar warnings will be put into specific controllers whose behavior is expected to change. Consider the memory controller; as Tejun put it: "memcg asked itself the existential question of to be hierarchical or not and then got confused and decided to become both." The plan is to get distributors to carry a patch warning users of the non-hierarchical mode and asking them to make their needs known if the change will truly be a problem for them. In a sense, the distributors are being asked to run a pilot for the new cgroup API.

It is possible that the community got lucky this time around; the features that need to be removed or changed are not likely to be heavily used. In other cases, there is simply no alternative to retaining the older, mistaken design; the telldir() system call, which imposes heavy implementation costs on filesystems, is a good example. We can never preserve our ability to "throw one away" in all situations. But, as a whole, the free software community has managed to incorporate Brooks's advice nicely. We throw away huge quantities of code all the time, and we are better off for it.



Throwing one away

Posted Sep 20, 2012 2:34 UTC (Thu) by bronson (subscriber, #4806) [Link]

Right on. Too bad this cleanup must happen in the kernel rather than pre-merge but better late than never!

"prod users with warnings" means printk and maybe the occasional kconfig nag?

Throwing one away

Posted Sep 20, 2012 5:51 UTC (Thu) by josh (subscriber, #17465) [Link]

Regarding telldir, why not emulate it entirely in userspace rather than forcing filesystems to do it?

Throwing one away

Posted Sep 20, 2012 6:35 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I keep hearing how bad telldir is, but that doesn't make it true.

You really need something like it to implement NFS - or any network filesystem.

And really, it isn't that hard to design a good directory structure which provides stable telldir addresses. Unfortunately it is easier to design one that doesn't.

(I agree that it is difficult to design a good directory structure which is format-compatible with ext2 and which provides stable telldir cookies, but that isn't really telldir's fault).

(which doesn't answer your suggestion about user-space emulation except to observe that it wouldn't help for NFS).

Throwing one away

Posted Sep 20, 2012 14:56 UTC (Thu) by josh (subscriber, #17465) [Link]

Why do you need telldir to implement NFS?

Throwing one away

Posted Sep 20, 2012 15:55 UTC (Thu) by khim (subscriber, #9252) [Link]

Because NFS is a completely stateless protocol. This means that the server should handle telldir correctly even if it crashed and was restarted, for example.

This is circular logic: you need telldir(2) to implement telldir(2), and this cannot be done entirely in userspace. Duh. Of course. But why is NFS designed this way in the first place?

Throwing one away

Posted Sep 20, 2012 16:24 UTC (Thu) by ballombe (subscriber, #9523) [Link]

> But why is NFS designed this way in the first place?

You never had your $HOME hosted on an original NFS server, did you? The ability to recover from a crash without impacting clients too much was very important when crashes were a daily event.

Throwing one away

Posted Sep 20, 2012 22:12 UTC (Thu) by horen (subscriber, #2514) [Link]

> You never had your $HOME hosted on an original NFS server, did you?

I remember the move from SunOS/3.2.x to SunOS/4.x, and the problems implementing their automounter. $HOME was always NFS-mounted from a remote server, as were /usr/local hierarchies with $ARCH-specific binaries, libraries, and config files.

> Perhaps the definitive example was the transition from "Solaris 1" (usually known as SunOS 4 in those days) to Solaris 2, which switched the entire system from a BSD-derived base to one from AT&T Unix.

Prior to that, SMCC's move from the Motorola 68000 to their own SPARC was an even more difficult -- if not devastating -- move, one which paved the way for the subsequent SunOS->Solaris2 betrayal.

Throwing one away

Posted Sep 20, 2012 18:02 UTC (Thu) by nix (subscriber, #2304) [Link]

It's much more fundamental than that. To implement the READDIR request, or anything like it in a stateless protocol (ignoring here hypothetical protocols which can only return the whole directory in one go, since they would have obviously terrible performance), you must be able to ask for 'the next bit of the directory from where I was readdir()ing last' even though the server has no idea where you might be in the directory because the protocol is stateless. NFS does this by handing you a cookie with each READDIR response which corresponds in some way the spec does not specify to the location of that file in a readdir() response, so you can pass it back in the next request and get the next filename.

And that cookie is, of course, an (encoded) return value of telldir().

Worse yet, because NFS is stateless, *any* nonpersistent telldir() cookies are going to fail with NFS: the pathological case of someone who does a telldir() and then a seekdir() much much later in a different call to opendir() is downright *common* if you've got an NFS server looking at that filesystem.

Which means that every NFS-exportable fs (any serious FS, period) needs a way to encode positions in all directories, stably, into a 32-bit number, with vaguely reasonable things happening to those positions even when the directories change. It's a complete pig on a lot of filesystems: they sometimes need whole extra data structures just to make this one system call work. But, as Neil says, something like it does indeed seem to be essential if you want stateless network filesystems of any sort to work. I wish there was a better way, but I wish for a lot of impossible things.
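
For reference, here is a minimal userspace sketch of the interface in question (the directory and the printing are just illustrative); the cookie an NFS server hands out has to behave like the long that telldir() returns here, only with much stronger persistence guarantees:

    /* telldir() hands back an opaque long; seekdir() must later resume the
     * stream at that position.  The thread's point is that an NFS server
     * effectively has to honour such cookies long after the fact, across
     * unrelated opens of the directory. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir(".");
        if (!dir) { perror("opendir"); return 1; }

        struct dirent *de = readdir(dir);        /* consume one entry */
        long cookie = telldir(dir);              /* opaque position after it */
        printf("first: %s cookie: %ld\n", de ? de->d_name : "(empty)", cookie);

        seekdir(dir, cookie);                    /* resume from the saved cookie */
        while ((de = readdir(dir)) != NULL)
            printf("resumed: %s\n", de->d_name);

        closedir(dir);
        return 0;
    }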

Throwing one away

Posted Sep 20, 2012 20:59 UTC (Thu) by khim (subscriber, #9252) [Link]

Ignoring here hypothetical protocols which can only return the whole directory in one go, since they would have obviously terrible performance.

I'm not so sure. You'll only have "obviously terrible performance" if you have thousands (or maybe millions?) of files in one directory. When NFS was initially developed you had horrible performance in such a case no matter what (typically you got O(N²) behavior, so thousands of files were not acceptable for that reason alone), and later versions are not completely stateless, so in the end the whole exercise only created grief for a lot of people without any actual upside.

But it's obviously too late to try to fix it.

Throwing one away

Posted Sep 20, 2012 21:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

There is a better way, and it's incredibly simple - just remember the last returned file name. Then send it back to the server.

I.e. use the returned value for the cookie itself.

Throwing one away

Posted Sep 20, 2012 21:22 UTC (Thu) by bronson (subscriber, #4806) [Link]

What if someone renames or deletes the file?

Throwing one away

Posted Sep 20, 2012 21:34 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Then you use the next filename. Amazon S3 does this, and it works just fine.
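
A minimal sketch of that name-as-cursor scheme, assuming the server can enumerate the directory in some stable sorted order; the hard-coded name list, batch size, and helper below are made up purely for illustration:

    /* The client remembers the last name it received and sends it back; the
     * server resumes at the first name that sorts after it, so a deleted or
     * renamed cursor name simply falls through to its successor. */
    #include <stdio.h>
    #include <string.h>

    /* names[] stands in for the directory and must stay in sorted order. */
    static const char *names[] = { "Makefile", "docs", "main.c", "zebra.txt" };
    static const size_t nnames = sizeof(names) / sizeof(names[0]);

    /* Return up to 'max' entries that sort strictly after 'cursor' ("" = start). */
    static size_t read_batch(const char *cursor, const char **out, size_t max)
    {
        size_t n = 0;
        for (size_t i = 0; i < nnames && n < max; i++)
            if (strcmp(names[i], cursor) > 0)
                out[n++] = names[i];
        return n;
    }

    int main(void)
    {
        const char *batch[2];
        char cursor[256] = "";                  /* what the client sends back */

        for (;;) {
            size_t n = read_batch(cursor, batch, 2);
            if (n == 0)
                break;                          /* directory exhausted */
            for (size_t i = 0; i < n; i++)
                puts(batch[i]);
            /* Even if this exact name disappears before the next request,
             * resuming "after" it still lands on the right successor. */
            snprintf(cursor, sizeof(cursor), "%s", batch[n - 1]);
        }
        return 0;
    }

The catch, as the next comment points out, is that the filesystem has to be able to produce that stable ordering in the first place.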

Throwing one away

Posted Sep 20, 2012 22:00 UTC (Thu) by neilbrown (subscriber, #359) [Link]

While this can certainly work "fine", it still imposes on the filesystem. It requires that the filesystem store a lexically-ordered index of each directory. So you would have to design (part of) your filesystem to match this interface.

Nothing wrong with that, except that we already have a much more broadly used interface which you can design your filesystem to match.

There was at one stage a proposal before the NFSv4 working group to allow the lookup key for directories to be either an opaque fixed-sized blob or a directory entry name. The filesystem would somehow indicate which it wanted (this came from Hans Reiser, if I remember correctly). Unfortunately it never went anywhere.

http://marc.info/?l=ietf-nfsv4&m=112545905820867&w=1

Throwing one away

Posted Sep 21, 2012 15:57 UTC (Fri) by faramir (subscriber, #2327) [Link]

I'm not sure why you need a lexically-ordered index if you use filenames as the location token. It would seem to me that what you need is a consistent ordering rather than a lexical one. I'm assuming here that we are talking about a directory which is static and we just want to make sure that we see every filename at some point.

BTW, are there any standards which speak to what one should see in the presence of directory changes? Personally, the most I would hope for in the dynamic directory case is that a program would eventually see every file that existed at opendir() time before readdir() returned no more files.

Throwing one away

Posted Sep 21, 2012 18:00 UTC (Fri) by knobunc (subscriber, #4678) [Link]

You need a lexically ordered one in case your "last-file" cookie refers to something that is no longer present. Otherwise what do you return?

Throwing one away

Posted Sep 21, 2012 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

You don't need a lexical order, just some stable ordering.

Throwing one away

Posted Sep 21, 2012 20:15 UTC (Fri) by knobunc (subscriber, #4678) [Link]

Agreed... but given that you need an ordering to allow you to locate where a missing arbitrary string needs to go, it must rely on an ordering property present in any arbitrary string. Which leaves you with little other than a lexical ordering.

Throwing one away

Posted Sep 21, 2012 20:49 UTC (Fri) by neilbrown (subscriber, #359) [Link]

Yes - the ordering doesn't need to be strictly lexical, but it does need to be stable.

Why would you have an ordering that isn't stable? The obvious answer is that a hash table with chaining is commonly used and does not reliably provide a stable order (as collisions tend to be ordered based on creation order).

My preferred mechanism for directory indexing is a hash table with internal chaining. It easily provides a reliable fixed size telldir cookie, and performance should be quite good until the number of directory entries gets to about half the cookie space.
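
A toy illustration of why a hash-table-based directory yields a stable, fixed-size cookie; linear probing stands in here for the internal chaining Neil describes, and the table size, hash function, and names are all made up:

    /* Each name is stored at a slot found by probing from its hash; the slot
     * index never changes while the entry exists, so it can be handed out as
     * the readdir cookie.  Enumeration just walks the slots in index order. */
    #include <stdio.h>
    #include <string.h>

    #define SLOTS 16
    static const char *table[SLOTS];           /* NULL = empty slot */

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % SLOTS;
    }

    static int insert(const char *name)        /* returns the stable slot index */
    {
        unsigned i = hash(name);
        for (unsigned n = 0; n < SLOTS; n++, i = (i + 1) % SLOTS)
            if (!table[i]) { table[i] = name; return i; }
        return -1;                             /* table full */
    }

    /* readdir-style step: return the next used slot at or after *cookie. */
    static const char *next_entry(unsigned *cookie)
    {
        while (*cookie < SLOTS) {
            unsigned i = (*cookie)++;
            if (table[i])
                return table[i];
        }
        return NULL;
    }

    int main(void)
    {
        insert("Makefile"); insert("main.c"); insert("docs");

        unsigned cookie = 0;                   /* as a toy telldir() would return */
        const char *name;
        while ((name = next_entry(&cookie)) != NULL)
            printf("%s (next cookie %u)\n", name, cookie);
        return 0;
    }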

Throwing one away

Posted Sep 21, 2012 20:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> are there any standards which speak to what one should see in the presence of directory changes?

http://pubs.opengroup.org/onlinepubs/009695399/functions/...

says

> If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir_r() returns an entry for that file is unspecified.

Which presumably implies that if a file is not added or removed after opendir, then readdir will return it precisely once. That is certainly how I understand it.

Throwing one away

Posted Sep 21, 2012 23:06 UTC (Fri) by nix (subscriber, #2304) [Link]

I don't know. POSIX says that readdir() returns "an ordered sequence of all the directory entries in a particular directory", which surely implies that if a file is removed during readdir() you cannot skip other files -- but it never says, nor by my reading implies, that you cannot repeat some of them in this situation. This means that if you're occasionally forced to restart a directory because you can't figure out where the requester was, so be it -- though it is probably best if that requires multiple things to go wrong.

(e.g. in the scheme I sketched above, if a readdir requester's random-cookie/filename->DIR* entry was expired by the server and the name the readdir requester passed was missing from the directory, readdir would simply start passing back filenames from the start of the directory over again. This does mean that under extreme load, so the cookies kept expiring, and seriously extreme modification rates of a sufficiently huge directory, so that at least one name was deleted while being readdir()ed and while its cookie expired on every pass through the directory, readdir() might never come to an end -- but that's possible under extreme modification rates anyway, even on local filesystems, and this is a pathological case that's unlikely to occur in practice. To be honest, given that bugs in which filenames were persistently omitted if they were in the wrong place in the directory persisted in the BSDs for something like thirty years, it seems that programs' actual requirements of readdir() are rather less extreme than the guarantees!)

Throwing one away

Posted Sep 27, 2012 19:09 UTC (Thu) by cras (guest, #7000) [Link]

rename()s during readdir() can cause the renamed files to be returned twice or never. I'm not entirely sure if other non-renamed files can be returned twice. Deleting files (or at least rename()ing to another directory) can cause some files to be skipped with some filesystems (unfortunately I didn't keep more details). Dovecot's Maildir syncing code has some comments related to this. https://lkml.org/lkml/2004/10/24/230 also.

Throwing one away

Posted Oct 5, 2012 18:56 UTC (Fri) by foom (subscriber, #14868) [Link]

Ooh, according to that lkml thread, it *is* possible to get a consistent state of a directory, on Linux. You just need to allocate a buffer big enough for the whole directory to fit in. There doesn't appear to be a way to figure out how big that needs to be ahead of time, but at least you could iterate, doubling the buffer each time:

>> 3. the application could be offered an interface for atomic directory
>> reads that requires the application to provide sufficient memory in a
>> single contiguous buffer (making it thread-safe in the same go).
>
>Actually, you can do this today, if you use the underlying
>sys_getdents64 system call. But the application would have to
>allocate potentially a very large amount of userspace memory.
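
A minimal sketch of that doubling loop, calling sys_getdents64 directly; the starting buffer size, the probe re-read used to check whether everything fit, and the record-walking code are all illustrative assumptions rather than anything from the thread:

    /* Grow the buffer until the whole directory comes back from a single
     * getdents64() call. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct linux_dirent64 {            /* kernel record layout for getdents64() */
        uint64_t       d_ino;
        int64_t        d_off;
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
    };

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        size_t bufsize = 64 * 1024;    /* starting guess; doubled as needed */
        char *buf = NULL;
        ssize_t nread;
        char probe[sizeof(struct linux_dirent64) + NAME_MAX + 1];

        for (;;) {
            char *tmp = realloc(buf, bufsize);
            if (!tmp) { perror("realloc"); free(buf); return 1; }
            buf = tmp;

            lseek(fd, 0, SEEK_SET);    /* each attempt re-reads from the start */
            nread = syscall(SYS_getdents64, fd, buf, bufsize);
            if (nread < 0) { perror("getdents64"); free(buf); return 1; }

            /* If a follow-up call returns 0, the first call already consumed
             * the whole directory in one go; otherwise grow and retry. */
            if (syscall(SYS_getdents64, fd, probe, sizeof(probe)) == 0)
                break;
            bufsize *= 2;
        }

        /* Walk the packed records obtained from the single read. */
        for (ssize_t off = 0; off < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            puts(d->d_name);
            off += d->d_reclen;
        }
        free(buf);
        close(fd);
        return 0;
    }

Of course this is only "atomic" in the sense that the kernel fills the buffer in a single system call; nothing stops the directory from changing between retries.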

Throwing one away

Posted Sep 20, 2012 21:54 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

What happens if the filesystem doesn't store the files in a nice simple list, but instead stores them in some sort of tree format, and the tree can be re-ordered significantly between one call and the next?

For simple filesystems you get simple answers; for complex filesystems things get really messy.

Throwing one away

Posted Sep 20, 2012 20:38 UTC (Thu) by bfields (subscriber, #19510) [Link]

NFS is a completely stateless protocol.

NFSv4 isn't stateless, and in practice, though NFSv3 itself may have been stateless, it was usually run alongside NLM.

And I don't see why you couldn't in theory add a little more state to the protocol and make seekdir/telldir work. Whether that would actually be a practical solution to anyone's problem, I don't know....

Throwing one away

Posted Sep 20, 2012 21:43 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I was always amused by the term "stateless" as applied to NFS, because I always thought the content of files was "state"...

Obviously "state" and "stateless" are relative terms which we need to be careful with.

NFSv4 certainly has a lot of state, but for much of this state (not including the files!) the server is allowed to drop the state on a reboot. NFSv4 has a whole sub-protocol for recovering that state which essentially involves the server saying "I forgot everything, tell me what you know" and the clients saying "I had this file locked and this one open etc etc". I.e. the clients also store the state and feed it back to the server (Bruce of course knows all of this).

Were "current directory pointer" to be part of the "state" of an open file (when that file was a directory) ... how would the client reinstate that state when the server lost it? It would need a stable cookie!

I think the NFSv4 protocol does mention the possibility of the server saving some of its state to "stable storage" - there are times (particularly relating to extended partial network partitions) where that is needed and so the cost would be justified. (a bit like /var/lib/nfs/sm).
I suspect that storing directory offsets (after every readdir call!) to stable storage would be less than ideal for performance.

Throwing one away

Posted Sep 21, 2012 15:17 UTC (Fri) by bfields (subscriber, #19510) [Link]

Yeah, so I had only the two obvious ideas:

1. Directory-modifying operations could be blocked during the grace period, during which clients could reclaim their previous cursors. (Is that enough to help?)

2. The existence of a directory-open might make it practical to keep readdir cursors in stable storage (since now you have to remember a limited amount of state for open directories, as opposed to remembering every cookie you've ever handed out.)

I dunno.

Throwing one away

Posted Sep 21, 2012 22:57 UTC (Fri) by nix (subscriber, #2304) [Link]

Your first option brings back horrible memories of servers locking solid, blocking on downed NFS servers. Anything that blocks fs operations is dancing with the deadlock fairy. (I'd say that if someone modifies a directory on which a cookie is held, that cookie is invalidated. It's not like there are meaningful guarantees for readdir() on directories under modification even in the local case.)

I don't see why the client doesn't just remember 'the last filename we readdir()ed' and hand that back, probably along with a crude, random, non-stable cookie generated by the server in *its* last response in the same opendir() loop. I suppose it would make the readdir request a bit bigger... but in the common case of readdir()s following each other in quick succession it need be no slower: the server could keep a bit of extremely temporary local-only state (basically a readdir() handle and the random cookie we sent back with the last response) for recently received readdir() requests, compare the incoming request with the random cookie and filename, and now we know the next name to use, expiring the pair when we hit the end of the directory. If we expire the cookie/filename pair (which should be rare enough), we just need to opendir() the directory and readdir() through it until we find the filename again.

What am I missing? (Other than the fact that we can't change NFS because it's already there and doesn't work this way.)

Throwing one away

Posted Sep 21, 2012 22:40 UTC (Fri) by nix (subscriber, #2304) [Link]

I was thinking mostly of 'core' NFSv3. NLM/lockd was always a bolt-on: suggesting that a stateful subsystem be used for something like readdir would never have flown. (A bit silly, really, because readdir(), locally, is about as stateful as anything can get.)

Throwing one away

Posted Sep 20, 2012 18:00 UTC (Thu) by Fats (subscriber, #14882) [Link]

"RHEL7 will have a great many incompatible changes from RHEL6"

Except for the time when the thread implementation changed, I don't see many incompatibilities between RHEL versions. At our workplace we have a bunch of programs compiled on RHEL3 that still run on newer versions if the proper compatibility lib is provided, most of the time older versions of the shared libs.

Throwing one away

Posted Oct 7, 2012 19:25 UTC (Sun) by oak (guest, #2786) [Link]

> The plan is to get distributors to carry a patch warning users of the non-hierarchical mode

Hierarchical mode could also have increasingly better performance compared to non-hierarchical mode so that any remaining users of that would have additional motivation to migrate.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds