Brief items
The current 2.6 prepatch is 2.6.23-rc7,
released by Linus on
September 19. It contains a fair number of fixes and the return of
the sk98lin driver. This should be the last prepatch before the final
2.6.23 release.
See the long-format changelog for the details.
The current -mm release is 2.6.23-rc6-mm1. Recent changes
to -mm include a patch to disable the timerfd() system call (to
give time to work out what the API should actually be), randomization of
the brk() system call on i386 and x86_64 systems, and lots of
fixes.
Kernel development news
It took me over two solid days to get this lot compiling and
booting on a few boxes. This required around ninety fixup patches
and patch droppings. There are several bugs in here which I know
of (details below) and presumably many more which I don't know of.
I have to say that this just isn't working any more.
--
Andrew Morton launches
2.6.23-rc6-mm1
Linux developers firmly guard their independence and don't often
follow our advice.
--
Richard Stallman
Yes, I realize that there's a lot of insane people out
there. However, we generally don't do kernel design decisions based
on them. But we can pat the insane users on the head and say "we
won't guarantee it works, but if you eat your prozac, and don't
bother us, go do your stupid things".
--
Linus Torvalds
By Jonathan Corbet
September 19, 2007
Most of the core virtual memory subsystem developers met for a mini-summit
just before the 2007 Kernel Summit in Cambridge. They came away feeling
that they had resolved a number of VM scalability problems. Subsequent
discussions have made it clear that, perhaps, this conclusion was a bit
premature. They may well have resolved things, but it is not clear that
everybody came to the same resolution.
All of the issues at hand relate to scalability in one way or another.
While the virtual memory subsystem has been through a great many changes
aimed at making it work well on contemporary systems, one key aspect of how
it works has remained essentially unchanged since the beginning: the
4096-byte (on most architectures) page size. Over that time, the amount of
memory installed on a typical system has grown by about three orders of
magnitude - that's 1000 times more pages that the kernel must manage and
1000 times more page faults which must be handled.
Since it does not appear that this trend will stop soon, there is a clear
scalability problem which must be managed.
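To put rough numbers on that growth, here is a minimal sketch that counts the pages needed to cover a given amount of memory. The helper name pages_for() and the memory sizes in the note below are illustrative assumptions, not figures from the discussion.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of pages needed to cover mem_bytes,
 * given a page size of (1 << page_shift) bytes.
 */
static uint64_t pages_for(uint64_t mem_bytes, unsigned int page_shift)
{
    uint64_t page_size = (uint64_t)1 << page_shift;

    /* Round up so that a partial page still counts as one page. */
    return (mem_bytes + page_size - 1) >> page_shift;
}
```

With 4096-byte pages (a shift of 12), 4MB of memory is 1024 pages while 4GB is 1,048,576 pages; a shift of 16 (64KB pages) brings the 4GB case back down to 65,536.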
This problem is complicated by the way that Linux tends to fragment its
memory. Almost all memory allocations are done in units of a single page,
with the result that system RAM tends to get scattered into large numbers
of single-page chunks. The kernel's memory allocator tries to keep larger
groups of pages together, but there are limits to how successful it can be.
The file
/proc/buddyinfo can be illustrative here; on a system which has
been running and busy for a while, the number of higher-order (larger)
pages, as shown in the rightmost columns, will be very small.
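For those who want to look at those numbers programmatically, the following is a sketch of how one line of /proc/buddyinfo might be parsed in user space. It assumes the usual "Node N, zone NAME count count ..." layout, where column i is the number of free blocks of 2^i pages; the function names and the MAX_ORDERS bound are made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDERS 16   /* assumption: generous upper bound on columns */

/*
 * Parse one line of /proc/buddyinfo into counts[]; counts[i] is the
 * number of free blocks of 2^i pages. Returns the number of columns
 * read, or -1 if the line does not look like a buddyinfo line.
 */
static int parse_buddyinfo_line(const char *line,
                                unsigned long counts[MAX_ORDERS])
{
    int node, consumed = 0, n = 0;
    char zone[32];
    const char *p;

    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    p = line + consumed;
    while (n < MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)       /* no more numbers on this line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Total free pages held in blocks of order min_order or larger. */
static unsigned long free_pages_at_or_above(const unsigned long *counts,
                                            int n, int min_order)
{
    unsigned long total = 0;
    int i;

    if (min_order < 0)
        min_order = 0;
    for (i = min_order; i < n; i++)
        total += counts[i] << i;
    return total;
}
```

Feeding each line of the file through parse_buddyinfo_line() and comparing free_pages_at_or_above() for order 0 against, say, order 4 gives a quick feel for how much of the free memory is actually available in larger chunks.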
The main response to memory fragmentation has been to avoid higher-order
allocations at almost any cost. There are very few places in the kernel
where allocations of multiple contiguous pages are done. This approach has
worked for some time, but avoiding larger allocations does not always make
the need for such allocations go away. In fact, there are many things
which could benefit from larger contiguous memory areas, including:
- Applications which use large amounts of memory will be working with
large numbers of pages. The translation lookaside buffer (TLB) in the
CPU, which speeds virtual address lookups, is generally relatively
small, to the point that large applications run up a lot of
time-consuming TLB misses. Larger pages require fewer TLB entries,
and will thus result in faster execution. The hugetlbfs extension was
created for just this purpose, but it is a specialized mechanism used
by few applications, and it does not do anything special to make large
contiguous memory regions easier for the kernel to find.
- I/O operations can work better with larger contiguous chunks of data
to work with. Users trying to use "jumbo frames" (extra-large
packets) on high-performance network adapters have been experiencing
problems for a while. Many devices are limited in the number of
scatter/gather entries they support for a single operation, so small
buffers limit the overall I/O operation size. Disk devices are
pushing toward larger sector sizes which would best be supported by
larger contiguous buffers within the kernel.
- Filesystems are feeling pressure to use larger block sizes for a
number of performance reasons. This
message from David Chinner provides an excellent explanation of
why filesystems benefit from larger blocks.
But it is hard (on Linux) for a filesystem to work
with block sizes larger than the page size; XFS does it, but the
resulting code is seen as non-optimal and is not as fast as it could
be. Most other filesystems do not even try; as a result, an ext3
filesystem created on a system with 8192-byte pages cannot be mounted
on a system with smaller pages.
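The arithmetic behind the first two items is simple enough to sketch. The helpers below are hypothetical, and the TLB and scatter/gather sizes mentioned in the note are assumed round numbers rather than the specifications of any particular hardware.

```c
#include <stdint.h>

/* Address range covered by a fully populated TLB (the "TLB reach"). */
static uint64_t tlb_reach(uint64_t tlb_entries, uint64_t page_bytes)
{
    return tlb_entries * page_bytes;
}

/*
 * Largest single I/O operation a device can perform when each
 * scatter/gather entry points at one physically contiguous buffer
 * of seg_bytes.
 */
static uint64_t max_io_bytes(uint64_t sg_entries, uint64_t seg_bytes)
{
    return sg_entries * seg_bytes;
}
```

Assuming a 64-entry TLB, 4KB pages cover only 256KB of address space, while 2MB pages cover 128MB. Similarly, a device assumed to be limited to 128 scatter/gather entries can move 512KB per operation with 4KB buffers, but 8MB with 64KB buffers.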
None of these issues are a surprise; developers have seen them coming for
some time. So there are a number of potential solutions waiting in the
wings. What is lacking is a consensus on which solution is the best way to
go.
One piece of the puzzle may be Mel Gorman's fragmentation avoidance work,
which has been discussed here more than once. Mel's patches seek to
separate allocations which can be moved in physical memory from those which
cannot. When movable allocations are grouped together, the kernel can,
when necessary, create higher-order groups of pages by relocating
allocations which are in the way. Some of Mel's work is in 2.6.23; more
may be merged for 2.6.24. The lumpy reclaim patches, also in
2.6.23, encourage the creation of large blocks by targeting adjacent pages
when memory is being reclaimed.
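Why grouping matters can be seen in a toy model. This is not Mel's code, and the names are invented for illustration: it simply treats memory as an array of page states and counts the aligned blocks which hold no unmovable page, and which could thus be freed by relocating or reclaiming whatever is in the way.

```c
#include <stddef.h>

enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

/*
 * Count the naturally aligned blocks of (1 << order) pages which are
 * either free already or could be emptied by moving movable pages
 * elsewhere, i.e. blocks containing no unmovable page.
 */
static size_t recoverable_blocks(const enum page_state *pages,
                                 size_t npages, unsigned int order)
{
    size_t block = (size_t)1 << order;
    size_t start, i, count = 0;

    for (start = 0; start + block <= npages; start += block) {
        int ok = 1;

        for (i = start; i < start + block; i++) {
            if (pages[i] == PAGE_UNMOVABLE) {
                ok = 0;     /* one pinned page spoils the block */
                break;
            }
        }
        count += ok;
    }
    return count;
}
```

In an eight-page region with two unmovable pages, spreading those two pages across both four-page blocks leaves no recoverable order-2 block; placing them next to each other leaves one.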
The immediate cause for the current discussion is a new version of
Christoph Lameter's large block
size patches. Christoph has filled in the largest remaining gap in
that patch set by implementing mmap() support. This code enables
the page cache to manage chunks of file data larger than a single page
which, in turn, addresses many of the I/O and filesystem issues. Christoph
has given a long list of
reasons why this patch should be merged, but agreement is not
universal.
At the top of the list of objections would appear to be the fact that the
large block size patches require the availability of higher-order pages to
work; there is no fallback if memory becomes sufficiently fragmented that
those allocations are not available. So a system which has filesystems
using larger block sizes will fall apart in the absence of large,
contiguous blocks of memory - and, as we have seen, that is not an uncommon
situation on Linux systems. The fragmentation avoidance patches can
improve the situation quite a bit, but there is no guarantee that
fragmentation will not occur, either as a result of the wrong workload or a
deliberate attack. So, if this patch set is merged, some developers want
it to include a loud warning to discourage users (and distributors) from
actually expecting it to work.
An alternative is Nick Piggin's fsblock work. People like to
complain about the buffer head layer in current kernels, but that layer has
a purpose: it tracks the mapping between page cache blocks and the
associated physical disk sectors. The fsblock patch replaces the buffer
head code with a new implementation with the goals of better performance
and cleaner abstractions.
One of the things fsblock can do is support large blocks for filesystems.
The current patch does not use higher-order allocations to implement this
support; instead, large blocks are made virtually contiguous in the
vmalloc() space through a call to vmap() - a technique
used by XFS now. The advantage of using vmap() is that the
filesystem code can see large, contiguous blocks without the need for
physical adjacency, so fragmentation is not an issue.
On the other hand, using vmap() is quite slow, the address space
available for vmap() on 32-bit systems is small enough to cause
problems, and using vmap() does nothing to help at the I/O level.
So Nick plans to extend fsblock to implement large blocks with contiguous
allocations, but with a fallback to vmap() when large allocations
are not available. In theory, this approach should be the best of both
worlds, giving the benefits of large blocks without unseemly explosions in
the presence of fragmentation. Says Nick:
However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and
falling back to vmap if they cannot be satisfied.
From the conversation, it seems that a number of developers see fsblock as
the future. But it is not something for the near future. The patch is
big, intrusive, and scary, which will slow its progress (and memory
management patches have a tendency to merge at a glacial pace to begin
with). It lacks the opportunistic large block feature. Only the Minix
filesystem has been updated to use fsblock, and that patch was rather
large. Everybody (including Nick) anticipates that more complex
filesystems - those with features like journaling - will present surprises
and require changes of unknown size. Fsblock is not a near-term solution.
One recently-posted patch
from Christoph could help fill in some of the gaps. His "virtual compound
page" patch allows kernel code to request a large, contiguous allocation;
that request will be satisfied with physically contiguous memory if
possible. If that memory is not available, virtually contiguous memory
will be returned instead. Beyond providing opportunistic large block
allocation for fsblock, this feature could conceivably be used in a
number of places where vmalloc() is called now, resulting in
better performance when memory is not overly fragmented.
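The "try contiguous first, fall back to virtual" pattern can be sketched in user space as follows. Nothing here is taken from Christoph's or Nick's patches; the types, the allocator hooks, and the malloc()-based stand-ins are all assumptions made for the sake of a runnable example.

```c
#include <stddef.h>
#include <stdlib.h>

enum backing { BACKING_NONE, BACKING_PHYS_CONTIG, BACKING_VIRT_CONTIG };

struct big_buf {
    void *addr;
    enum backing backing;
};

/* Hypothetical allocator hook; returns NULL on failure. */
typedef void *(*alloc_fn)(size_t bytes);

/*
 * Ask for physically contiguous memory first; if that fails (as it
 * may on a fragmented system), settle for virtually contiguous memory.
 */
static struct big_buf alloc_with_fallback(size_t bytes, alloc_fn contig,
                                          alloc_fn virt)
{
    struct big_buf buf = { NULL, BACKING_NONE };

    buf.addr = contig(bytes);
    if (buf.addr) {
        buf.backing = BACKING_PHYS_CONTIG;
        return buf;
    }
    buf.addr = virt(bytes);
    if (buf.addr)
        buf.backing = BACKING_VIRT_CONTIG;
    return buf;
}

/* Stand-ins: a "fragmented" system where contiguous allocation always
 * fails, and a fallback which simply uses malloc(). */
static void *contig_always_fails(size_t bytes) { (void)bytes; return NULL; }
static void *virt_from_malloc(size_t bytes) { return malloc(bytes); }
```

The caller gets a usable buffer either way, and can check the backing field when it cares about the difference, as a block driver building a scatter/gather list would.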
Meanwhile, Andrea Arcangeli has been relatively quiet for some time, but one should
not forget that he is the author of much of the VM code in the kernel now.
He advocates a different approach entirely:
From my part I am really convinced the only sane way to approach
the VM scalability and larger-physically contiguous pages problem
is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh for
2.4).
The CONFIG_PAGE_SHIFT patch
is a rework of an old idea: separate the size of a page as seen by the
operating system from the hardware's notion of the page size. Hardware
pages can be clustered together to create larger software pages which, in
turn, become the basic unit of memory management. If all pages in the
system were, say, 64KB in length, a 64KB buffer would be a single-page
allocation with no fragmentation issues at all.
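In rough terms, the relationship between hardware and software pages looks like the sketch below. The macro and function names are invented for this example and are not those used in the CONFIG_PAGE_SHIFT patch; the clustering order of 4 is an assumption chosen to produce the 64KB pages mentioned above.

```c
#define HW_PAGE_SHIFT    12   /* 4KB hardware pages */
#define CLUSTER_ORDER    4    /* assumption: 16 hardware pages per software page */
#define SOFT_PAGE_SHIFT  (HW_PAGE_SHIFT + CLUSTER_ORDER)
#define SOFT_PAGE_SIZE   (1UL << SOFT_PAGE_SHIFT)

/* Hardware pages which back a single software page. */
static unsigned long hw_pages_per_soft_page(void)
{
    return 1UL << CLUSTER_ORDER;
}

/* Software pages needed to hold a buffer of the given size. */
static unsigned long soft_pages_needed(unsigned long bytes)
{
    return (bytes + SOFT_PAGE_SIZE - 1) >> SOFT_PAGE_SHIFT;
}
```

With these values, SOFT_PAGE_SIZE is 65536 and a 64KB buffer needs exactly one software page, backed by 16 hardware pages.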
If the system is to go to larger pages, creating them in software is about
the only option. Most processors support more than one hardware page size,
but the smallest of the larger page sizes tend to be too large for general
use. For example, i386 processors have no page sizes between 4KB and 2MB.
Clustering pages in software enables the use of more reasonable page sizes
and creates the flexibility needed to optimize the page size for the
expected load on the system. This approach will make large block support
easy, and it will help with the I/O performance issues as well. Page
clustering is not helpful for TLB pressure problems, but there is little to
be done there in any sort of general way.
The biggest problem, perhaps, with page clustering is that it replaces
external fragmentation with internal fragmentation. A 64KB page will, when
used as the page cache for a 1KB file, waste 63KB of memory. There are
provisions in Andrea's patch for splitting large pages to handle this
situation; Andrea claims that this splitting will not lead to the same sort
of fragmentation seen on current systems, but he has not, yet, convinced
the others of this fact.
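The cost of internal fragmentation is easy to compute. The helper below is a hypothetical illustration which ignores any page splitting: it reports how much of the last page-cache page is left unused for a file of a given size.

```c
/*
 * Bytes left unused in the final page when a file of file_bytes is
 * cached in pages of page_bytes. An empty file occupies no pages and
 * so wastes nothing.
 */
static unsigned long page_cache_waste(unsigned long file_bytes,
                                      unsigned long page_bytes)
{
    unsigned long tail = file_bytes % page_bytes;

    return tail ? page_bytes - tail : 0;
}
```

For a 1KB file, page_cache_waste() returns 64512 bytes (63KB) with 64KB pages, against 3072 bytes (3KB) with 4KB pages.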
Conclusions from this discussion are hard to come by; at one point Mel
Gorman asked: "Are we going to agree
on some sort of plan or are we just going to handwave ourselves to
death?" Linus has just called the whole
discussion "idiotic". What may happen is that the large block size
patches go in - with warnings - as a way of keeping a small subset of users
happy and providing more information about the problem space. Memory
management hacking requires a certain amount of black-magic handwaving in
the best of times; there is no reason to believe that the waving of hands
is going to slow down anytime soon.
By Jonathan Corbet
September 19, 2007
Dynamic kernel tracing remains high on the wishlists presented by many
Linux users. While much work has been done to create a powerful tracing
capability, very little of that work has found its way into the mainline.
The recent posting of one small piece of infrastructure may help to change
that situation, though.
The piece in question is the trace layer posted by David
Wilder. Its purpose is to make it easy for a tracing application to get
things set up in the kernel and allow the user to control the tracing
process. To that end, it provides an internal kernel API and a set of
control files in the debugfs filesystem.
On the kernel side, a tracing module would set things up with a call to:
    #include <linux/trace.h>

    struct trace_info *trace_setup(const char *root, const char *name,
                                   u32 buf_size, u32 buf_nr, u32 flags);
Here, root is the name of the root directory which will appear in
debugfs, name is the name of the control directory within
root, buf_size and buf_nr describe the size and
number of relay buffers to be created, and flags controls various
channel options. The TRACE_GLOBAL_CHANNEL flag says that a single
set of relay channels (as opposed to per-CPU channels) should be used;
TRACE_FLIGHT_CHANNEL turns on the "flight recorder" mode where
relay buffer overruns result in the overwriting of old data, and
TRACE_DISABLE_STATE disables control of the channel via debugfs.
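The difference between the normal and "flight recorder" behaviors can be shown with a toy ring buffer. This is a user-space model only; it is not how relay is implemented, and the structure, the names, and the four-slot size are assumptions made for the example.

```c
#include <stddef.h>

#define NSLOTS 4    /* assumption: tiny buffer so overruns are easy to see */

struct toy_buf {
    int slots[NSLOTS];
    size_t head;            /* index of the oldest record */
    size_t count;           /* records currently stored */
    unsigned long dropped;  /* records lost to buffer-full conditions */
    int flight;             /* nonzero: overwrite old data on overrun */
};

static void toy_write(struct toy_buf *b, int rec)
{
    if (b->count == NSLOTS) {
        if (!b->flight) {
            b->dropped++;   /* normal mode: the new record is lost */
            return;
        }
        /* flight recorder mode: discard the oldest record instead */
        b->head = (b->head + 1) % NSLOTS;
        b->count--;
    }
    b->slots[(b->head + b->count) % NSLOTS] = rec;
    b->count++;
}

static int toy_oldest(const struct toy_buf *b)
{
    return b->slots[b->head];
}
```

Writing records 1 through 6 into a normal-mode buffer leaves 1 as the oldest record with two records dropped; in flight recorder mode nothing is counted as dropped and the oldest surviving record is 3.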
The return value (if all goes well) will be a pointer to a
trace_info structure for the channel. This structure has a number
of fields, but the one which will be of most interest outside of the trace
code itself will be rchan, which is a pointer to the relay channel
associated with this trace point.
When actual tracing is to begin, the kernel module should make a call to:
    int trace_start(struct trace_info *trace);
The return value follows the "zero or a negative error value" convention.
Tracing is turned off with:
    int trace_stop(struct trace_info *trace);
When the tracing module is done, it should shut down the trace with:
    void trace_cleanup(struct trace_info *trace);
Note that none of these entry points have anything to do with the placement
or activation of trace points or the creation of trace data. All of that
must be done separately by the trace module. So a typical module will,
after calling trace_start(), set up one or more kprobes or
activate a static kernel marker. The probe function attached to the trace
points should do something like this:
    rcu_read_lock();
    if (trace_running(trace)) {
        /* Format trace data and output via relay */
    }
    rcu_read_unlock();
Additionally, if the TRACE_GLOBAL_CHANNEL flag has been set, the
probe function should protect access to the relay channel with a spinlock.
This protection may also be necessary in situations where an interrupt
handler might be traced.
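The overall lifecycle (set up, start, probe, stop) can be modeled in user space as shown here. To be clear, this is a mock: the mock_ functions are stand-ins invented for this example, they do not share signatures with the real entry points, and the RCU and spinlock protection described above is omitted entirely.

```c
enum mock_state { MOCK_SETUP, MOCK_RUNNING, MOCK_STOPPED };

struct mock_trace {
    enum mock_state state;
    unsigned long records;  /* records "written" by the probe */
};

static void mock_trace_setup(struct mock_trace *t)
{
    t->state = MOCK_SETUP;
    t->records = 0;
}

/* Both follow the "zero or a negative error value" convention; this
 * mock never fails. */
static int mock_trace_start(struct mock_trace *t)
{
    t->state = MOCK_RUNNING;
    return 0;
}

static int mock_trace_stop(struct mock_trace *t)
{
    t->state = MOCK_STOPPED;
    return 0;
}

/* The probe only emits data while tracing is running; a real probe
 * would format a record and hand it to relay here. */
static void mock_probe(struct mock_trace *t)
{
    if (t->state == MOCK_RUNNING)
        t->records++;
}
```

The point of the model is that the probe may fire at any time, including before tracing has been started or after it has been stopped, so the running check inside the probe is what decides whether any data is generated.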
In user space, the trace information will show up under
/debug/root/name, where debug is the debugfs mount point,
and root and name are the directory names passed to
trace_setup(). The file state can be read to get the
current tracing state; an application can write start or
stop to this file to turn tracing on or off. The file
trace0 is the relay channel where tracing data can be read; on SMP
systems with per-CPU channels there will be additional files
(trace1...) for additional processors. The file dropped
can be read to see how many trace records (if any) have been dropped due to
buffer-full conditions.
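A user-space controller could drive these files with ordinary file I/O. The sketch below assumes the layout described above; the helper names are hypothetical, and whether the state file expects a trailing newline after the command is not something the posting's description settles, so none is written here.

```c
#include <stdio.h>

/*
 * Build "<debugfs>/<root>/<name>/<file>" into buf. Returns the length
 * of the path, or -1 if it did not fit.
 */
static int trace_file_path(char *buf, size_t len, const char *debugfs,
                           const char *root, const char *name,
                           const char *file)
{
    int n = snprintf(buf, len, "%s/%s/%s/%s", debugfs, root, name, file);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write a command ("start" or "stop") to a state file; returns 0 on
 * success or -1 on failure. */
static int trace_write_state(const char *state_path, const char *cmd)
{
    FILE *f = fopen(state_path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(cmd, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

With debugfs mounted on /debug and a trace set up with (hypothetical) root "mytrace" and name "chan0", the state file would be /debug/mytrace/chan0/state; passing that path and "start" to trace_write_state() would turn tracing on.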
All told, it is not a particularly complicated bit of code.
Perhaps the most significant feature of this patch is that it is part of the
infrastructure created and used by the SystemTap project. Getting this
code into the mainline will make it that much easier for distributors to
provide well-supported tracing facilities to their users. And that, in
turn, should make users happy and give analysts one less thing to complain
about.
By Jonathan Corbet
September 17, 2007
The final 2.6.23 kernel release is getting closer. At this point, it would
be more than surprising to see any additional API changes find their way
into this release, so it should be safe to summarize the changes which have
been made.
As always, a cumulative record of API changes can be found in the LWN 2.6 API changes page.
Page editor: Jonathan Corbet