RCU and open file accounting

David Miller has been making great progress in his port of the Linux kernel to Sun's new "Niagara" (SPARC) CPU architecture. He has run into one little problem, however:

I just wanted to report that I am hitting the "VFS: file-max limit xxx reached" problem quite easily on my 32-cpu Niagara machine with 16GB of ram with current 2.6.x GIT. It seems far too easy to get a box into this state due to SLAB fragmentation and RCU. And once you get a machine into this state it is totally unusable.

Our test case is usually a "make -j8192" kernel build along with a parallel bootstrap of gcc. That puts about 256 processes on each cpu's runqueue, I doubt ksoftirqd can run much at all.

The file limit problem was last discussed here in October, when it delayed the release of the 2.6.14 kernel. A fix merged at that time made the problem harder to trigger, but, as David's experience shows, the problem has not been solved altogether. One might argue that a relatively small number of users run the sort of workload that David is playing with. But the point remains: with current kernels, including the upcoming 2.6.16 release, it is possible for a suitably-written program to run the open file count to its maximum, thus denying any sort of service to other users. This seems like a problem which one might want to fix.

One piece of the puzzle here is the way that the open file count is managed. Currently, that count is decremented in the slab destructor set up for file structures. This method works, but it can cause the decrement to be delayed by an arbitrary amount of time, with the result that the open file count overstates the number of files which are actually held open by processes in the system. Moving that operation out of the slab destructor can help to keep the count more in sync with reality.
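The effect of late accounting can be illustrated with a toy model. This is a minimal sketch in Python rather than kernel code; the class, its fields, and the tiny limit of four files are all hypothetical and chosen only to show how a count that is decremented late can hit the limit even though nothing is actually open.

```python
# Toy model, not kernel code: every name here is hypothetical.
class FileTable:
    """Tracks an open-file count with either lazy or eager accounting."""

    def __init__(self, file_max, eager):
        self.file_max = file_max      # stands in for the file-max limit
        self.eager = eager            # True: decrement as soon as a file is released
        self.nr_files = 0             # the accounted open-file count
        self.awaiting_destructor = 0  # released objects whose destructor has not run

    def open(self):
        if self.nr_files >= self.file_max:
            raise OSError("file-max limit reached")
        self.nr_files += 1

    def close(self):
        if self.eager:
            self.nr_files -= 1
        else:
            self.awaiting_destructor += 1  # the count stays inflated for now

    def run_destructors(self):
        """Stand-in for the allocator finally destroying the released objects."""
        if not self.eager:
            self.nr_files -= self.awaiting_destructor
        self.awaiting_destructor = 0


def opens_before_failure(eager, file_max=4, attempts=10):
    """Open and immediately close a file repeatedly; count the successful opens."""
    table = FileTable(file_max, eager)
    done = 0
    for _ in range(attempts):
        try:
            table.open()
        except OSError:
            break
        table.close()
        done += 1
    return done
```

With lazy accounting the loop stalls once the limit has been consumed by files that are already closed; with eager accounting it never comes near the limit.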

The core of the problem, however, is the use of the read-copy-update (RCU) mechanism for management of file structures. When a file is closed, the task of freeing the structure is queued in RCU. Using RCU lets the kernel ensure that the structure is not freed while references to it remain, but without the sort of locking overhead that comes with other techniques. As a result, performance is measurably improved on SMP systems.
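The idea of deferring the free until no reference can remain can be sketched as follows. This is a deliberately simplified Python model, not the kernel's implementation: the `ToyRCU` class is hypothetical, and it tracks quiescent states per callback, whereas the real code batches callbacks into grace periods. The method names `call_rcu` and `quiescent_state` are borrowed loosely from RCU terminology for readability.

```python
# Simplified model of RCU-style deferred freeing; all names are illustrative.
class ToyRCU:
    def __init__(self, n_cpus):
        self.n_cpus = n_cpus
        # Each entry is [callback, set of CPUs that have not yet quiesced].
        self.pending = []

    def call_rcu(self, callback):
        """Queue a callback to run once every CPU has passed a quiescent state."""
        self.pending.append([callback, set(range(self.n_cpus))])

    def quiescent_state(self, cpu):
        """Record that `cpu` holds no references obtained before this point."""
        still_pending = []
        for callback, waiting in self.pending:
            waiting.discard(cpu)
            if waiting:
                still_pending.append([callback, waiting])
            else:
                callback()  # safe now: no CPU can still be using the object
        self.pending = still_pending
```

A callback queued on a two-CPU system runs only after both CPUs have reported a quiescent state; until then the structure it would free stays allocated.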

When there is a lot of opening and closing of files going on (such as, say, when a wild-eyed developer starts an 8192-process kernel build), the RCU callback queue can get quite long. By the time that the RCU code decides that the system has quiesced and it is safe to invoke the RCU callbacks, the queue might have thousands of entries. Working through the entire callback queue led to latency problems elsewhere in the system, so 2.6.14 included a patch which put an upper limit on the number of callbacks which would be processed in any single iteration.
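A per-iteration limit of this kind might look like the following sketch. It is a user-space illustration in Python, not the kernel's code; the function name `process_callbacks` is hypothetical, and `maxbatch` is used here simply as a parameter name for the limit.

```python
from collections import deque


def process_callbacks(queue, maxbatch):
    """Run at most `maxbatch` callbacks from the front of `queue`.

    Returns the number of callbacks actually run; anything left in the
    queue waits for a later iteration.
    """
    ran = 0
    while queue and ran < maxbatch:
        callback = queue.popleft()
        callback()
        ran += 1
    return ran
```

Each call does a bounded amount of work, which is what keeps latency under control; the price is that a long queue takes several iterations to drain.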

The limit helped with the latency problem. But, if the generation of RCU callbacks continues at a high rate, the length of the callback queue can only grow. Every entry in the queue represents memory which could be returned to the system, but which has not yet been made available. So, as the queue grows, memory gets fragmented and the system heads towards the dreaded out-of-memory state.
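The arithmetic behind that growth is simple enough to check with a few lines of Python. The function below is a hypothetical back-of-the-envelope model: it assumes a fixed number of callbacks arrives per tick and that one batch is processed per tick, neither of which is claimed to match real kernel behavior.

```python
def backlog_after(ticks, arrivals_per_tick, maxbatch):
    """Queue length after `ticks` rounds of enqueueing and batch-limited processing."""
    backlog = 0
    for _ in range(ticks):
        backlog += arrivals_per_tick        # new callbacks queued this tick
        backlog -= min(backlog, maxbatch)   # at most one batch is processed
    return backlog
```

Under these assumptions, whenever arrivals exceed the batch limit the queue grows by the difference on every tick: 50 arrivals against a limit of 10 leaves 4000 callbacks queued after 100 ticks, while 8 arrivals against the same limit leaves none.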

An attempt at a solution can be found in this patch by Dipankar Sarma, which has been sitting in the -mm tree for a while. Dipankar's patch puts a configurable upper limit on the number of RCU callbacks which will be processed in any single batch; that allows system administrators to tune the batch size to their particular needs. On a server which is dealing with a large number of file requests, and on which latency is not a crucial issue, the batch size can be set to a large number.

The patch also adds a high-water limit. If the length of the RCU callback queue ever exceeds that limit, the RCU code will (1) set the batch limit to infinity (or the integer representation thereof) and (2) send out an inter-processor interrupt forcing every CPU on the system to schedule. The combination of these actions will cause the system to work through the entire RCU queue at the soonest possible time. Once the queue length goes below a low-water limit, the old batch limit will be restored.
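The switching behavior can be modeled in a few lines. The sketch below is a Python simulation under stated assumptions, not the patch itself: the class and function names, the watermark values of 100 and 20, and the one-batch-per-tick model are all hypothetical, and the forced reschedule is represented only by a counter.

```python
import sys


class AdaptiveBatchLimit:
    """Toy model of a batch limit that is lifted while the queue is too long."""

    def __init__(self, batch, high_water, low_water):
        self.default_batch = batch
        self.batch = batch
        self.high_water = high_water
        self.low_water = low_water
        self.forced_reschedules = 0  # stand-in for the inter-processor interrupt

    def on_enqueue(self, queue_len):
        if queue_len > self.high_water and self.batch != sys.maxsize:
            self.batch = sys.maxsize      # "infinity", as an integer
            self.forced_reschedules += 1

    def on_processed(self, queue_len):
        if queue_len < self.low_water:
            self.batch = self.default_batch  # restore the old limit


def simulate(arrivals, batch=10, high_water=100, low_water=20):
    """Return (queue length after each tick, limit object) for per-tick arrivals."""
    limit = AdaptiveBatchLimit(batch, high_water, low_water)
    queue_len = 0
    history = []
    for arriving in arrivals:
        queue_len += arriving
        limit.on_enqueue(queue_len)
        queue_len -= min(queue_len, limit.batch)
        limit.on_processed(queue_len)
        history.append(queue_len)
    return history, limit
```

In this model, three ticks of 50 arrivals push the queue past the high-water mark on the third tick; the limit is lifted, the whole queue is drained, and the original limit of 10 is back in force by the time the next tick begins.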

It is, in other words, a somewhat unsubtle approach; the system is given a kick in the rear and told to go clean up its mess. But, it seems, that is exactly what the system needs at such a time. The cleanup task can only be deferred for so long; the work eventually needs to be done regardless. David has reported that the patches fix the problem on his Niagara system, and suggests that they should be merged into 2.6.16. It is a fairly significant patch to merge at this late point in the cycle, but there seems to be a reasonably high level of confidence in its stability. So, chances are that it will be included as a preferable alternative to shipping 2.6.16 with a known problem.


RCU and open file accounting

Posted Mar 9, 2006 15:45 UTC (Thu) by swiftone (guest, #17420)

It is a fairly significant patch to merge at this late point in the cycle,

I thought the 2.6 series is currently at an "ongoing" cycle, with no 2.7 planned. That would mean that the only way for significant changes to show up would be to do exactly this -- percolate through the -mm tree and get into the mainline kernel.

If I have a flawed understanding of the current kernel design process, please let me know.

RCU and open file accounting

Posted Mar 9, 2006 16:23 UTC (Thu) by copsewood (subscriber, #199)

If 2.6.16 can't be destabilised in this way, the alternative would probably be for this patch to go into 2.6.17.

In the cycle towards 2.6.16

Posted Mar 9, 2006 16:36 UTC (Thu) by AnswerGuy (guest, #1256)

... I have to suspect that Jon was speaking to the upcoming 2.6.16 release.

There's been a concerted effort to keep the larger patches in the early builds so it would have been more appropriate for a 2.6.15.{1,2,3} ... and not for 2.6.16-rc6 ... for example.

Anyway, it could wait until 2.6.16.1 ... there aren't that many users with hundreds of CPUs and over 10GiB of RAM that need to worry about make -j8192 running out of steam.

JimD

In the cycle towards 2.6.16

Posted Mar 9, 2006 17:36 UTC (Thu) by madscientist (subscriber, #16861)

I'm quite confident that there's no way a patch like this would ever be considered for a bugfix release like 2.6.16.1. It's far too big and intrusive, and, honestly, not severe enough to justify the risk.

It'll either squeak into 2.6.16, or be deferred until 2.6.17.

In the cycle towards 2.6.16

Posted Mar 9, 2006 19:05 UTC (Thu) by PaulMcKenney (subscriber, #9624)

Looks to me like it actually did go into the 2.6.16 stream.

RCU and open file accounting (minor correction)

Posted Mar 9, 2006 19:36 UTC (Thu) by dipankar (subscriber, #7820)

Batch limiting has been experimented with since 2.6.1, and a throttling patch was published in July 2004. The throttling patch introduced the boot parameter rcu.maxbatch, which limits the maximum number of finished callbacks executed in one batch (default 10). IIRC, that was merged in Aug 2004. The 2.6.14 patch increased the default to 10000 and introduced a forced reschedule based on queue length. The latest patch automatically swings between a batch limit of 10 and no limit at all when the system needs a kick in the rear end.

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds