
KS2007: Mini-summit reports

By Jonathan Corbet
September 6, 2007
LWN.net Kernel Summit 2007 coverage

As the kernel (along with its development community) gets larger, the practice of holding subsystem-specific "mini-summits" is getting more popular. Getting the large group together is valuable, but the mini-summits often get more real work done without having the baggage associated with the full kernel summit. It is good to share the results from these summits with the larger group, so a slot for mini-summit reports has become traditional. The 2007 kernel summit was no exception.

The first reporter was Len Brown, who talked about the power management meeting held just before the Ottawa Linux Symposium. Perhaps the topic of most interest was suspend-to-RAM (STR) - one of those issues which never seems to go away. Making video adapters suspend (and, crucially, resume) remains an ongoing challenge, but there are, says Len, "signs of life." In particular, there are hopes of making things work well with Intel adapters thanks to the efforts of Keith Packard.

Certain types of ATA disks remain problematic. There are patches in 2.6.23 which help, but they might not actually be turned on. With these sorts of holes, says Len, it is "mind-boggling" that distributors try to support STR at all. On the other hand, people are starting to concern themselves with how quickly suspend and resume work; that can only be a good sign.

Andrew Morton broke in to ask who is the maintainer for STR. The answer: "the community." Another question about whether there is a design for STR in Linux went unanswered. The conclusion, according to Len, is that STR development is seriously short of developers; somebody needs to put some real resources into it.

Moving to hibernation (suspend to disk), Len noted that everybody hates the freezer code. He also pointed to the active user community around the out-of-tree TuxOnIce (formerly Suspend2) project. The fact that such popular code remains out of the mainline indicates a failure in our development processes. It was noted that TuxOnIce is characterized by much friendlier user support; problem reports are answered nicely, and the TuxOnIce developers will work with users to chase down driver-related problems. This is not generally true of the in-kernel hibernate code. Linus jumped in to point out that the current hibernation maintainer, Rafael Wysocki, has done a lot of work to improve the in-kernel code.

It was suggested that one thing which is really needed is a good document for driver writers on how to support suspend and resume. Then the discussion was cut off to move things forward.

Len's last topic was the cpufreq code. It would seem that the use of multiple idle states tends to break certain enterprise network management code. If processors shut down when idle, the amount of idle CPU time approaches zero and the management system concludes that the system is overused. This, evidently, is a more serious problem than one might think. On the embedded side, there is a lot of interest in out-of-tree dynamic power management frameworks which can assign a whole set of operating modes (not just CPU frequencies) to each process. When a process is scheduled in, the hardware is put into the required mode to support it. Two embedded power policy management frameworks are currently under development (Power Policy Manager and Open Hardware Manager), but both are still at a relatively early stage.
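The idle-accounting problem can be illustrated with a minimal sketch. It assumes, hypothetically, a management tool that derives utilization from cumulative busy and idle tick counters, and a CPU that stops accumulating idle ticks while it is shut down. The function name and the numbers are illustrative only and are not taken from any real product.

```python
def utilization(busy_ticks, idle_ticks):
    """Fraction of accounted time spent busy, as a naive monitor might compute it."""
    total = busy_ticks + idle_ticks
    if total == 0:
        return 0.0
    return busy_ticks / total

# A mostly idle interval in which all idle time is accounted for.
accounted = utilization(busy_ticks=10, idle_ticks=990)

# The same workload, but (by assumption) the CPU was shut down for most of
# the interval and only 5 idle ticks were ever recorded.
unaccounted = utilization(busy_ticks=10, idle_ticks=5)

print(f"idle time accounted:   {accounted:.0%}")
print(f"idle time unaccounted: {unaccounted:.0%}")
```

The workload is identical in both cases; only the recorded idle time differs, which is enough to make a lightly loaded machine look heavily used.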

Ted Ts'o talked very briefly about the filesystem and storage workshop held last February. The filesystems side of that gathering was covered on LWN, so there is little to add here. On the storage side, a few issues were raised briefly:

  • The current multipath implementation, which works at the BIO layer, has a number of problems. The developers are reaching the conclusion that multipath simply cannot be done at that level; instead, it needs to be pushed down to the block layer request level.

  • There is a lot of industry pressure toward disks with 4Kb hardware sectors. We will, eventually, need to efficiently handle such disks. There is also increasing interest in hybrid drives, which combine rotating storage with a flash-based cache. Patches for hybrid drives are in the works now.

  • Longer-term, it seems that the drive manufacturers are finding their options limited by sector-based addressing. So "object-based storage" devices are on the horizon, though several years away still. These drives, for all practical purposes, implement the filesystem themselves on the disk, exporting an inode-like object-based interface to the host. Supporting such hardware will require big changes to the block, RAID, and filesystem layers - but we have some time.
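To make the sector-size item in the list above more concrete, here is a minimal sketch of why larger hardware sectors need careful handling. It assumes a drive with 4096-byte physical sectors that still accepts requests addressed in 512-byte units; the function name and the constants are hypothetical, not part of any kernel interface. A write that does not cover whole physical sectors would force the drive to read, modify, and rewrite a sector.

```python
PHYSICAL_SECTOR = 4096  # bytes; assumed physical sector size
LOGICAL_SECTOR = 512    # bytes; traditional addressing unit

def needs_read_modify_write(lba, count,
                            physical=PHYSICAL_SECTOR,
                            logical=LOGICAL_SECTOR):
    """Return True if a write of `count` logical sectors starting at `lba`
    does not begin and end on physical sector boundaries."""
    start = lba * logical
    end = start + count * logical
    return start % physical != 0 or end % physical != 0

# A 4096-byte write starting at logical sector 63 straddles two physical
# sectors; the same write starting at logical sector 64 is aligned.
misaligned = needs_read_modify_write(lba=63, count=8)
aligned = needs_read_modify_write(lba=64, count=8)
```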

Martin Bligh reported from the virtual memory summit held just before the kernel summit. There is, he says, a lot of stuff "hanging over from last year." One problem is that realistic VM benchmarks are hard to find, making it difficult to tell whether VM patches are really an improvement or not.

There was some talk of NUMA replication - making copies of pages in the page cache on different nodes of a NUMA system for performance reasons. The conclusion was that replication should not be done in the general case. There may be some mechanism which allows it to be enabled in specific cases with some sort of user-space policy interface.

The remaining anti-fragmentation code will be merged, finally. There also appears to be a desire to work toward supporting larger pages in the page cache. Just increasing the page size will not do the trick, as the internal fragmentation costs will be too high. So some sort of mechanism which allows for variable-sized pages in the cache is called for. Christoph Lameter has posted a variable page size patch, but it has run into resistance. Among other things, it does not support fallback to smaller pages when large allocations are not available and it does not support mmap(). Christoph will be doing another pass over this patch to address some of these problems.
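The trade-off here can be sketched with a little arithmetic. The example below is only an illustration under stated assumptions: the function names, the page sizes, and the free-block table are hypothetical and do not describe Christoph's patch. The first function computes the space wasted in the last page when a file is cached with a given page size; the second shows the kind of fallback to smaller pages that the posted patch is said to lack.

```python
def internal_waste(file_size, page_size):
    """Bytes left unused in the last page when caching file_size bytes."""
    return (-file_size) % page_size

def allocate_with_fallback(free_blocks, sizes):
    """Try each page size from largest to smallest and take the first one
    that has a free block; free_blocks maps size -> number of free blocks.
    Returns the size allocated, or None if nothing is available."""
    for size in sorted(sizes, reverse=True):
        if free_blocks.get(size, 0) > 0:
            free_blocks[size] -= 1
            return size
    return None

# A 5000-byte file wastes far more space with a uniformly large page size.
waste_small = internal_waste(5000, 4096)
waste_large = internal_waste(5000, 65536)

# With no 64KB blocks free, the allocation falls back to a 4KB page.
free = {65536: 0, 4096: 3}
chosen = allocate_with_fallback(free, [4096, 65536])
```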

Memory controllers with containers were discussed. It was decided that Balbir Singh's memory controller patch is the right way to go, despite some concerns over its complexity and overhead.

Other decisions include removing the discontiguous memory option and settling on the "sparsemem" model instead. The slab allocator (recently replaced by SLUB) will be removed after a few remaining problems (/proc/slabinfo, for example) are dealt with. There will be a mechanism by which the kernel can inform user space that the system is under memory pressure, enabling large applications to know when freeing up some caches might be a good idea. The venerable DMA memory zone will go away, replaced by a more flexible way of allocating memory which meets specific requirements.

Finally, Avi Kivity discussed the virtualization summit. That group, consisting of representatives of almost all of the free and commercial virtualization alternatives, decided to focus on the guest side of the virtualization problem as a way of keeping peace in the room. Decisions were made to cooperate on virtio and the paravirt_ops interface.

Other topics covered included finding a way to present the characteristics of NUMA systems to guests, improving paging performance through page hinting, and preparing for upcoming hardware advances.



KS2007: Mini-summit reports

Posted Sep 6, 2007 9:44 UTC (Thu) by jengelh (subscriber, #33263) [Link]

>So "object-based storage" devices are on the horizon, though several years away still. These drives, for all practical purposes, implement the filesystem themselves on the disk.

Today, I can choose what filesystem I want, and if performance sucks, it is the filesystem to blame. I can choose to replace the filesystem if I so desire. Now they want a hardware-based filesystem. Suppose it sucks as much as VFAT. Statistically, there is always at least one manufacturer who does not get things right. Then I would be stuck with a slow FAT. Software-based filesystems (e.g. in the kernel) would probably continue to exist, but would suffer from the hardware filesystem. And exchanging harddisks is quite pricey compared to re-mkfs'ing. That is a no-thanks for now.

KS2007: Mini-summit reports

Posted Sep 7, 2007 12:15 UTC (Fri) by liljencrantz (subscriber, #28458) [Link]

The article hints that what is in the works is not a complete hardware file system interface, only the parts related to reading/writing separate files. Anything else would be silly, since it would be impossible to implement the wildly varying file system semantics implemented by e.g. Unix and Windows.

KS2007: Mini-summit reports

Posted Sep 11, 2007 5:25 UTC (Tue) by gdt (subscriber, #6284) [Link]

The point of object-based storage isn't to make a single disk more efficient. It is to make a set of virtualised disks more efficient.

4Kb

Posted Sep 6, 2007 10:13 UTC (Thu) by epa (subscriber, #39769) [Link]

Don't disks already have 4Kb hardware sectors? 4096 bits is 512 bytes.

4Kb

Posted Sep 6, 2007 14:43 UTC (Thu) by rjbell4 (guest, #35764) [Link]

The author means 4KB, or 4096 bytes.

Simon, is that you?

Posted Sep 14, 2007 1:54 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

It would seem that the use of multiple idle states tends to break certain enterprise network management code. If processors shut down when idle, the amount of idle CPU time approaches zero and the management system concludes that the system is overused.

The BOFH strikes again! Bwahahahaha!!!!

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds