As the kernel (along with its development community) grows, the
practice of holding subsystem-specific "mini-summits" is becoming more
popular. Bringing the large group together is valuable, but the mini-summits
often get more real work done without the baggage associated with
the full kernel summit. It is good to share the results from these
summits with the larger group, so a slot for mini-summit reports has become
traditional. The 2007 kernel summit was no exception.
The first reporter was Len Brown, who talked about the power management
meeting held just before the Ottawa Linux Symposium. Perhaps the topic of
most interest was suspend-to-RAM (STR) - one of those issues which never
seems to go away. Making video adapters suspend (and, crucially, resume)
remains an ongoing challenge, but there are, says Len, "signs of life." In
particular, there are hopes of making things work well with Intel adapters
thanks to the efforts of Keith Packard.
Certain types of ATA disks remain problematic. There are patches in 2.6.23
which help, but they might not actually be turned on. With these sorts of
holes, says Len, it is "mind-boggling" that distributors try to support STR
at all. On the other hand, people are starting to concern themselves with
how quickly suspend and resume work; that can only be a good sign.
Andrew Morton broke in to ask who is the maintainer for STR. The answer:
"the community." Another question about whether there is a design for STR
in Linux went unanswered. The conclusion, according to Len, is that STR
development is seriously short of developers; somebody needs to put some
real resources into it.
Moving to hibernation (suspend to disk), Len noted that everybody hates the
freezer code. He also pointed to the active user community around the
out-of-tree TuxOnIce (formerly Suspend2) project. The fact that such
popular code remains out of the
mainline indicates a failure in our development processes. It was noted
that TuxOnIce is characterized by much friendlier user support; problem
reports are answered nicely, and the TuxOnIce developers will work with
users to chase down driver-related problems. This is not generally true of
the in-kernel hibernate code.
Linus jumped in to point out that the current hibernation maintainer,
Rafael Wysocki, has done a lot of work to improve the in-kernel code.
It was suggested that one thing which is really needed is a good document
for driver writers on how to support suspend and resume. Then the
discussion was cut off to move things forward.
Len's last topic was the cpuidle code. It would seem that the use of
multiple idle states tends to break certain enterprise network management
code. If processors shut down when idle, the amount of idle CPU time
approaches zero and the management system concludes that the system is
overused. This, evidently, is a more serious problem than one might think.
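The arithmetic behind this misreading can be sketched in a few lines. This is only an illustration: the report does not say how any particular management product computes load, so the `utilization()` helper and the tick counts below are assumptions made for the example.

```python
def utilization(busy_ticks, idle_ticks):
    """Fraction of *accounted* CPU time that was spent busy."""
    total = busy_ticks + idle_ticks
    if total == 0:
        return 0.0
    return busy_ticks / total


# Hypothetical sampling interval: 100 ticks of wall time, 10 of them busy.
wall_ticks = 100
busy_ticks = 10

# Case 1: every idle tick is accounted for, so the system looks 10% busy.
accounted = utilization(busy_ticks, wall_ticks - busy_ticks)

# Case 2 (assumed): the processor was shut down for most of the idle
# period and only one idle tick was recorded. The same workload now
# appears to keep the system over 90% busy.
unaccounted = utilization(busy_ticks, 1)

print(f"idle accounted:   {accounted:.0%}")
print(f"idle unaccounted: {unaccounted:.0%}")
```

The workload is identical in both cases; only the recorded idle time differs, which is enough to make a lightly loaded system look saturated.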
On the embedded side, there is a lot of interest in out-of-tree dynamic
power management frameworks which can assign a whole set of operating modes
(not just CPU frequencies) to each process. When a process is scheduled
in, the hardware is put into the required mode to support it. There are
two embedded power policy management frameworks currently under development,
Power Policy Manager and Open Hardware Manager, but both are in relatively
early states of development.
Ted Ts'o talked very briefly about the filesystem and storage workshop held
last February. The filesystems side of that gathering was covered on LWN so there is
little to add here. On the storage side, there were a few issues which
were raised briefly:
- The current multipath implementation, which works at the BIO layer,
has a number of problems. The developers are reaching the conclusion
that multipath simply cannot be done at that level; instead, it needs
to be pushed down to the block layer request level.
- There is a lot of industry pressure toward disks with 4KB hardware
sectors. We will, eventually, need to efficiently handle such disks.
There is also increasing interest in hybrid drives, which combine
rotating storage with a flash-based cache. Patches for hybrid drives
are in the works now.
- Longer-term, it seems that the drive manufacturers are finding their
options limited by sector-based addressing. So "object-based storage"
devices are on the horizon, though several years away still. These
drives, for all practical purposes, implement the filesystem
themselves on the disk, exporting an inode-like object-based interface
to the host. Supporting such hardware will require big changes to the
block, RAID, and filesystem layers - but we have some time.
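As an aside on the 4KB-sector item above, one reason such disks need care is that requests expressed in 512-byte units may not line up with the larger hardware sector. The sketch below is not taken from any kernel code; the function names are hypothetical, and it assumes a drive which accepts 512-byte requests but stores data in 4096-byte sectors, so that a misaligned write forces a read-modify-write cycle.

```python
PHYSICAL_SECTOR = 4096  # assumed hardware sector size, in bytes


def needs_read_modify_write(offset, length, physical=PHYSICAL_SECTOR):
    """True if a write does not cover whole hardware sectors."""
    return offset % physical != 0 or length % physical != 0


def align_request(offset, length, physical=PHYSICAL_SECTOR):
    """Expand a byte range to the enclosing hardware-sector boundaries."""
    start = (offset // physical) * physical
    end = -(-(offset + length) // physical) * physical  # round up
    return start, end - start


# A single 512-byte write at offset 512 touches only part of a sector.
print(needs_read_modify_write(512, 512))    # misaligned
print(needs_read_modify_write(4096, 8192))  # covers whole sectors
print(align_request(512, 512))
```

Handling these disks efficiently is largely a matter of making sure that most requests fall into the second category.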
Martin Bligh reported from the virtual memory summit held just before the
kernel summit. There is, he says, a lot of stuff "hanging over from last
year." One problem is that realistic VM benchmarks are hard to find,
making it difficult to tell whether VM patches are really an improvement or
not.
There was some talk of NUMA replication - making copies of pages in the
page cache on different nodes of a NUMA system for performance reasons.
The conclusion was that replication should not be done in the general
case. There may be some mechanism which allows it to be enabled in
specific cases with some sort of user-space policy interface.
The remaining anti-fragmentation code will be merged, finally. There also
appears to be a desire to work toward supporting larger pages in the page
cache. Just increasing the page size will not do the trick, as the
internal fragmentation costs will be too high. So some sort of mechanism
which allows for variable-sized pages in the cache is called for.
Christoph Lameter has posted a variable page size patch, but it has run
into resistance. Among other
things, it does not support fallback to smaller pages when large
allocations are not available and it does not support mmap().
Christoph will be doing another pass over this patch to address some of
these problems.
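The internal fragmentation concern mentioned above is easy to see with a little arithmetic. The following is a rough sketch only: the file sizes are made up for illustration, and `wasted_bytes()` is a hypothetical helper which simply assumes that each cached file occupies a whole number of fixed-size pages.

```python
def wasted_bytes(file_size, page_size):
    """Bytes left unused in the last page when caching one file."""
    if file_size == 0:
        return 0
    pages = -(-file_size // page_size)  # round up to whole pages
    return pages * page_size - file_size


# Hypothetical mix of small and medium files, sizes in bytes.
file_sizes = [100, 5000, 70000]

waste_4k = sum(wasted_bytes(size, 4096) for size in file_sizes)
waste_64k = sum(wasted_bytes(size, 65536) for size in file_sizes)

print("wasted with 4KB pages: ", waste_4k)
print("wasted with 64KB pages:", waste_64k)
```

With a uniformly larger page size, small files dominate the waste, which is why a scheme that can pick the page size per file (and fall back to small pages) looks more attractive than simply raising the base page size.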
Memory controllers with containers were discussed. It was decided that
Balbir Singh's memory controller patch is the right way to go, despite some
concerns over its complexity and overhead.
Other decisions include removing the discontiguous memory option, settling
on the "sparsemem" model instead. The slab allocator (recently replaced by
SLUB) will be removed after a few remaining problems (/proc/slabinfo, for
example) are dealt with. There will be a mechanism by which the kernel can
inform user space
that the system is under memory pressure, enabling large applications to
know when freeing up some caches might be a good idea. The venerable DMA
memory zone will go away, replaced by a more flexible way of allocating
memory which meets specific requirements.
Finally, Avi Kivity discussed the virtualization summit. That group,
consisting of representatives of almost all of the free and commercial
virtualization alternatives, decided to focus on the guest side of the
virtualization problem as a way of keeping peace in the room. Decisions
were made to cooperate on virtio and the paravirt_ops interface.
Other topics covered included finding a way to present the characteristics
of NUMA systems to guests, improving paging performance through page
hinting, and preparing for upcoming hardware advances.