
KS2007: Scalability

By Jonathan Corbet
September 9, 2007
LWN.net Kernel Summit 2007 coverage

Scalability was the subject of a day-2 session which focused primarily on small details. For example, there was concern about how well the slab allocator scales on NUMA systems; there was talk of doing more performance testing until it was pointed out that slab's days are numbered. Chances are it will be removed in an upcoming kernel release in favor of the SLUB allocator.

The SLOB allocator, it was noted, has been thoroughly reworked to help improve its performance on small systems.

There was a discussion on I/O performance. On a multi-CPU system, there is a cache penalty to be paid whenever the CPU which submits the operation is different from the CPU which handles the completion of that operation. It is hard to see how to solve this problem in the general case. One could take pains to submit operations on the CPU where completion is expected to be handled, but that really just moves the cost around. On sufficiently smart hardware, one can try to direct completion interrupts to the CPU which submitted the operation. Beyond that, there were not a whole lot of ideas going around.
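The simplest static form of that steering can already be done from user space by writing a CPU mask to /proc/irq/<N>/smp_affinity. The sketch below is not from the summit discussion; the IRQ number (19) and the mask (0x1, meaning CPU 0) are purely illustrative assumptions for whatever device is of interest.

    /*
     * Minimal sketch: pin a device's completion interrupt to one CPU by
     * writing a hex CPU bitmask to /proc/irq/<N>/smp_affinity.  The IRQ
     * number and mask below are illustrative assumptions, not real values.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path = "/proc/irq/19/smp_affinity";  /* assumed IRQ */
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        /* Hex CPU bitmask: 0x1 keeps completions on CPU 0, where the
           application is assumed to be submitting its I/O. */
        fprintf(f, "1\n");
        return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
    }

Combined with pinning the submitting process to the same CPU (with taskset or sched_setaffinity()), this keeps submission and completion within one cache domain, at the cost of flexibility when the load shifts.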

Dave Chinner raised a separate issue: it seems that booting a system with 20,000 block devices can be just a little slow. He's seeing this problem with installations running a mere 300TB of storage. Asynchronous scanning is seen as at least part of the solution. There is a related problem: it turns out that, with so many drives, finding a specific one can be a bit challenging. There is not much that the kernel can do about that one, though.

There is, it seems, a need to rework the direct I/O code to separate the memory setup from the I/O submission paths. There are places in the kernel which would like to submit direct I/O requests, but which are not working with user-space buffers. Currently the code is sufficiently intermixed that the direct I/O handling code cannot be used with buffers in kernel space. Another issue has to do with locking when both buffered and direct I/O are being performed on the same file. It is, apparently, difficult to keep the file's contents coherent in such situations.
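To make the proposed split concrete, here is a rough sketch using entirely hypothetical names: one setup path for user-space buffers, one for kernel buffers, and a common submission path that neither needs to care about. None of these functions or types exist in the kernel; they only illustrate the shape of such a separation.

    /*
     * Hypothetical sketch of separating memory setup from I/O submission.
     * Nothing here corresponds to real kernel interfaces.
     */
    #include <stddef.h>

    struct dio_mem {
            void   *buf;      /* buffer start (user or kernel address) */
            size_t  len;      /* length in bytes                       */
            int     is_user;  /* were user pages pinned?               */
    };

    /* Memory setup for a user-space buffer: pin the pages, record them. */
    int dio_setup_user(struct dio_mem *m, void *ubuf, size_t len)
    {
            m->buf = ubuf;
            m->len = len;
            m->is_user = 1;   /* real code would pin the user pages here */
            return 0;
    }

    /* Memory setup for a kernel buffer: nothing to pin, just describe it. */
    int dio_setup_kernel(struct dio_mem *m, void *kbuf, size_t len)
    {
            m->buf = kbuf;
            m->len = len;
            m->is_user = 0;
            return 0;
    }

    /* Common submission path, indifferent to where the memory came from. */
    int dio_submit(const struct dio_mem *m, long long offset, int is_write)
    {
            (void)m; (void)offset; (void)is_write;
            /* real code would build and queue block I/O requests here */
            return 0;
    }

The point of the split is that only the setup functions need to know whether the buffer came from user space; everything downstream is shared.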

There was also some talk about giving the filesystems more flexibility in how they handle writeout of dirty pages from memory. By handling writeout in the right order, it should be possible to improve performance. Increased interest in execute-in-place support was expressed; as the capacity of flash-based devices grows, it will make more sense to store system images there and run directly from the device.

On the small-systems side of scalability, the big issue remains the overall footprint of the kernel. That footprint is growing, of course, which presents a challenge for embedded deployments. Better ways of quantifying kernel size would be helpful.



KS2007: Scalability

Posted Sep 11, 2007 6:36 UTC (Tue) by sbsiddha (subscriber, #38593)

About I/O scalability on a multi-CPU system: migrating I/O submission to the completion CPU is not actually moving the cost around. If most of the submission work is moved to the completion CPU, it will help minimize access to remote cachelines (which happens in the timer, slab, and SCSI layers of the kernel), and most of the remote accesses will now be local. There will still be some remote cache references while migrating the I/O, but those will be relatively small per I/O. A simple and dumb I/O migration experiment gave good performance results on a heavily loaded system. Patches and results are at http://lkml.org/lkml/2007/7/27/414

It will be difficult, however, to make these patches generally acceptable without regressing performance for common workloads. Hopefully future I/O hardware will solve some of these issues, but we are looking to see whether there are simple enhancements and heuristics that we can exploit in the current generation of hardware.

KS2007: Scalability

Posted Sep 11, 2007 18:54 UTC (Tue) by Nick (guest, #15060)

It was tricky to get into exact details of what was happening here.

One issue is data going over the interconnect on NUMA systems -- in this case, obviously you cannot avoid actually sending the page data over RAM. Basically we really have to make sure userspace does the right thing.

Another issue is which CPU you should do the pagecache writeout from. In this case you do want to do it on the same node that most of the pages are located on (rather than where the device is, because it's a question of which choice would require touching more data structures).

The problem you describe is different again, and it applies not only to NUMA but also to SMP. Basically, I gather that what you are doing is trying to hand control of the block layer over to the completing CPU at the point that will result in the fewest cache misses. We didn't really discuss this in detail, but yes, some of the points that were raised included the upcoming hardware, and also the fact that networking might have similar concerns, so it might be good to work on them together.

I still hope to see continued work on your ideas, and I don't think they were shot down at all (if I remember correctly).

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds