What's holding up 2.6.14: two difficult bugs
The first problem can be demonstrated by a program which does nothing but open and close a file in a tight loop:

    #include <fcntl.h>
    #include <unistd.h>
    int main(void) {
        while (1) {
            int fd = open("/dev/null", O_RDONLY);
            close(fd);
        }
    }
After some 50,000 iterations, the open fails with a "too many open files in system" message. This behavior can be problematic in more realistic situations; it evidently can cause highly-parallel kernel builds to fail, and it also exposes the system to local denial of service attacks. So it is worth tracking down.
The kernel places a limit on the number of files which are allowed to be open simultaneously. That limit is not normally expected to include files which have been closed, however. The problem, as it turns out, is a virtual filesystem scalability patch which was merged in September. That patch eliminates some locking around file structures in the kernel, and, to that end, defers certain tasks (such as file cleanup) to the read-copy-update mechanism. For this particular case, file structures corresponding to closed files are building up in the RCU callback list, and RCU is not getting around to freeing them in time.
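In outline, the deferred cleanup looks something like the following sketch, which mirrors the approach taken in fs/file_table.c; field and function names are approximate, not a verbatim copy of the patch:

    #include <linux/fs.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Sketch: instead of freeing a file structure as soon as the last
     * reference is dropped, queue it as an RCU callback.  The actual
     * freeing only happens after a grace period has elapsed. */
    static void file_free_rcu(struct rcu_head *head)
    {
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
        kmem_cache_free(filp_cachep, f);
    }

    static inline void file_free(struct file *f)
    {
        call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
    }

Every close() thus leaves a file structure sitting on a per-CPU callback list until the next grace period completes.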
Initially, it was thought that the culprit was another patch which put a limit on the processing of the RCU callback lists. Those lists can get quite long, and lengthy callback processing was causing latency problems elsewhere in the kernel. So a "batch size" of ten was imposed; after ten callbacks have been processed, the RCU subsystem defers the rest until later. It seemed that this limit was causing the freeing of file structures to languish. Raising the batch limit to 10,000 seemed to improve the situation, so Linus merged a patch to that effect.
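In simplified form, the batch limit works roughly like this (a sketch of per-CPU callback processing; names and details are approximate):

    /* Process at most 'maxbatch' ready callbacks, then defer the rest
     * of the list to a later pass. */
    static void rcu_do_batch(struct rcu_data *rdp)
    {
        struct rcu_head *list = rdp->donelist;
        int count = 0;

        while (list) {
            struct rcu_head *next = list->next;
            list->func(list);               /* e.g. free a file structure */
            list = next;
            if (++count >= maxbatch)        /* was 10, raised to 10,000 */
                break;
        }
        rdp->donelist = list;               /* leftovers wait for the next pass */
    }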
But, in fact, the higher batch limit did not solve the problem for real. RCU callbacks cannot be called immediately after being queued. They must, instead, wait until every processor on the system has scheduled at least once. This "quiescence" requirement is the kernel's way of ensuring that no references to the freed structure remain; it's a key part of how RCU works. If a process chews through file structures quickly enough, they will accumulate while the kernel waits for the grace period to run out, and no changes to the batch limits will help. The only way to be able to process those callbacks - and free the associated structures - is to force every processor to schedule.
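The reader side shows why that wait cannot be shortened. A lockless file lookup runs inside an RCU read-side critical section, and any structure it finds must remain valid until such sections have finished on every CPU. Illustrative code only (a hypothetical wrapper, not the exact fget() implementation):

    struct file *lookup_fd(struct fdtable *fdt, unsigned int fd)
    {
        struct file *file;

        rcu_read_lock();
        file = rcu_dereference(fdt->fd[fd]);   /* lockless lookup, no spinlock held */
        if (file)
            get_file(file);                    /* take a real reference before use */
        rcu_read_unlock();
        return file;
    }

Only once every CPU has passed through a quiescent state - that is, has scheduled - can the queued callbacks run and the memory actually be freed.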
A couple of patches have been posted in an attempt to deal with this problem. One of them simply changes the way file structures are accounted for - they are removed from the count of open files when the RCU callback is queued, rather than when it is executed. This patch stops programs from running into the maximum open file limit, but does nothing to stop the growth of the RCU callback queues. So the patch which was merged instead, from Eric Dumazet, keeps track of the length of the callback list. Should that list grow too long (where "too long" is wired at 10,000 entries), a reschedule is forced so that the callbacks can be processed. This patch appears to have dealt with the problem well enough to allow 2.6.14 to come out, though more refinement may be required afterward.
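The merged fix is, in essence, a small addition to call_rcu(); a sketch with approximate field names, not the patch verbatim:

    void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
    {
        unsigned long flags;
        struct rcu_data *rdp;

        head->func = func;
        head->next = NULL;
        local_irq_save(flags);
        rdp = &__get_cpu_var(rcu_data);
        *rdp->nxttail = head;               /* append to this CPU's queue */
        rdp->nxttail = &head->next;
        if (unlikely(++rdp->count > 10000))
            set_need_resched();             /* push this CPU toward a quiescent state */
        local_irq_restore(flags);
    }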
Unfortunately for those who are waiting for 2.6.14, another problem turned up. Some 64-bit architectures which lack I/O memory management units must be very careful in setting up DMA areas. A number of devices can only reliably deal with 32-bit DMA addresses, so DMA areas must be allocated in the lower part of memory. To that end, the x86-64 and ia64 architectures use a mechanism called the "software I/O translation buffer", or swiotlb. It is simply a large chunk of low memory, allocated at boot time, which is used as a bounce buffer for DMA operations involving 64-bit-challenged devices.
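The bounce-buffer idea itself is simple; conceptually it looks like the sketch below, which uses hypothetical helper names rather than the kernel's actual swiotlb API:

    /* If the buffer a driver wants to use for DMA lies above the 32-bit
     * boundary, stage the data through a slot carved out of the
     * preallocated low-memory pool and hand the device that address. */
    unsigned long map_with_bounce(void *buf, unsigned long len, int to_device)
    {
        unsigned long phys = virt_to_phys(buf);
        void *slot;

        if (phys + len <= 0xffffffffUL)
            return phys;                      /* device can reach it directly */

        slot = swiotlb_reserve_slot(len);     /* hypothetical: take a slot from the pool */
        if (to_device)
            memcpy(slot, buf, len);           /* bounce the data through low memory */
        return virt_to_phys(slot);
    }

None of this works, of course, if the pool itself was not allocated in low memory.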
It was noted that the 2.6.14-rc4 kernel can allocate the swiotlb area in high memory, which defeats the entire purpose. This revelation led to a long discussion of how swiotlb memory should be allocated. It turns out that there is no easy way of finding the low memory on the system. Once upon a time, that memory would belong to CPU 0, but on some contemporary NUMA systems, the low memory might be elsewhere. So the real solution appears to be to iterate through all CPUs on the system, try to allocate from each of them, and test whether the resulting memory is within the DMAable range. If not, the memory is freed and the next processor is tried. A couple of patches implementing this approach are circulating; none has been merged as of this writing.
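The circulating patches follow roughly this pattern. The sketch below is not the actual patch code: the article describes iterating over CPUs, while this illustration walks NUMA nodes with the boot-time allocator, which amounts to the same trial-and-error idea:

    void * __init swiotlb_alloc_low(unsigned long size)
    {
        int node;

        for_each_online_node(node) {
            void *buf = alloc_bootmem_node(NODE_DATA(node), size);
            if (!buf)
                continue;
            if (virt_to_phys(buf) + size <= 0xffffffffUL)
                return buf;                /* within the 32-bit DMAable range */
            /* Too high: give it back and try the next node. */
            free_bootmem_node(NODE_DATA(node), virt_to_phys(buf), size);
        }
        return NULL;                       /* no node could supply low memory */
    }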
| Index entries for this article | |
|---|---|
| Kernel | Read-copy-update |
| Kernel | swiotlb |
What's holding up 2.6.14: two difficult bugs
Posted Oct 27, 2005 12:20 UTC (Thu) by markryde (guest, #33361)

Hello,
> it evidently can cause highly-parallel kernel builds
Sorry for my ignorance: can someone explain in a few sentences what highly-parallel kernel builds are?
-- R

What's holding up 2.6.14: two difficult bugs
Posted Oct 27, 2005 16:27 UTC (Thu) by klossner (subscriber, #30046)

Building the kernel on a multi-CPU system with `make -j' so that many processors are used, each compiling a few source files.

This incident
Posted Oct 31, 2005 15:16 UTC (Mon) by Baylink (guest, #755)

Is going to crop up in arguments about "software engineering" WRT the kernel. I'm not sure what part it will play, or what the right answer will be, but I know we'll hear about it in the trade press in the next 6 months, probably as a quote from some "analyst". We ought to give a little thought to capsulizing what the answer will be.