Brief items
The current 2.6 prepatch is 2.6.10-rc3,
announced by Linus on December 3.
Changes since -rc2 include the un-deprecation of
MODULE_PARM() (it
is generating too many warnings, and the fixes will not be merged before
2.6.10), a new major number (180) for the "ub" USB storage driver, some x86
single-stepping fixes, a large number of "sparse" annotations, the
token-based memory management fix, a memory
technology device (and JFFS2) update, a frame buffer device update, some
user-mode Linux patches, some page allocator tuning, and a few architecture
updates. See
the long-format changelog for
the details.
Linus's BitKeeper repository currently holds a DVB update and a set of bug
fixes. Very few patches are being accepted currently as the kernel hackers
try to stabilize things for the final 2.6.10 release.
Andrew Morton has released no -mm patches over the last week.
The current bugfix patch from Alan Cox is 2.6.9-ac13.
The current 2.4 prepatch remains 2.4.29-pre1; Marcelo has released
no prepatches since November 25.
Comments (2 posted)
Kernel development news
That's the problem with C++; it is far too easy to misuse. And with a
project as big as the Linux Kernel, and with as many contributors as
the Linux Kernel, at the end of the day, it's all about damage
control. If we depend on peer review to understand whether or not a
patch is busted, it is rather important that something as simple as
a = b + c;
does what we think it does, and not something else because someone has
overloaded the '+' operator. Or God help us, as I have mentioned
earlier, the comma operator.
-- Ted Ts'o
Comments (27 posted)
James Morris has
posted an
article on recent SELinux performance improvements. "
The use of
RCU solved a serious scalability problem with the AVC, thanks to the work
of Kaigai and the RCU developers. It seems likely to be a useful approach
for dealing with similar problems, and specifically with some of the
SELinux networking code as mentioned."
Comments (none posted)
Beyond doubt, many LWN readers have been concerned with how page fault
performance might be improved on their 512-processor desktop systems.
Christoph Lameter has been working on the answer for some months now;
his
page fault scalability patches are reaching a point where they will likely
be considered for inclusion after 2.6.10 comes out. This patch is an
interesting example of the kind of changes which must be made to support
large numbers of processors.
One of the virtual memory subsystem's core data structures is struct
mm_struct. This structure tracks the virtual address space used by
one or more processes. It contains pointers to the page tables, to the
virtual memory area (VMA) structures, and more. Processes typically have
their own struct mm_struct, but threads which share memory also
share the same mm_struct.
Access to this structure is serialized by two mechanisms. A semaphore
(mmap_sem) controls access to the mm_struct structure
itself, and a spinlock (page_table_lock) guards the page tables.
When the status of a page must be changed in the page tables, the kernel
must first take the page_table_lock to avoid creating confusion
with the other processors on the system. When he looked at the scalability
of the kernel's page fault handling code, Christoph identified this lock as
a problem. When many processors are trying to simultaneously make changes
to a single set of page tables, they end up spending a lot of time
busy-waiting for the page table lock. Improving the performance of this
code thus requires reducing the use of that lock.
The first step in this process is a patch which causes the VM
subsystem to hold page_table_lock for shorter periods of time.
The lock is dropped for portions of the code which have no need of it, and
later reacquired if needed. It is a fairly straightforward exercise in
lock breaking which helps scalability, but does not solve the whole
problem.
The core of the patch is a set of atomic page
table entry functions which can modify individual PTEs with no locking
required. Rather than acquiring page_table_lock, making a PTE
change, then dropping the lock, the kernel can simply make a call to:
int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address,
pte_t *ptep, pte_t oldval, pte_t newval);
This function uses the cmpxchg instruction (or whatever variant or
emulation may be available, depending on the architecture) to compare the
page table entry pointed to by ptep with oldval; if the
two match, the entry is set to newval and oldval is
returned. If the two do not match, the current thread lost a race with
another processor which changed the PTE first; in that case, the PTE is
not modified further and the function returns zero. Kernel code which uses
cmpxchg typically will retry a modification when this sort of race
occurs; Christoph's code, instead, is able to assume that the competing
thread did the same thing as the one it raced against: marked the page as
being present in memory. So no retries are needed.
With that change, pages can be brought into the working set and made
available without having to take the page_table_lock - except for
one last place. The mm_struct structure contains two fields
(rss and anon_rss) which track the total number of
in-memory pages referenced
by this address space (the "resident set size"). When a page is brought in
(or forced out), these fields must be incremented or decremented
accordingly. Access to rss and anon_rss is controlled by
page_table_lock. Getting rid of that last use of the lock has
required a surprising amount of work on Christoph's part.
The first implementation turned the RSS fields into atomic_t
variables, so that they could be operated on without locking. This solution
worked, but it had some shortcomings: (1) they could only be
32-bit variables, since not all architectures support 64-bit atomic types,
(2) the atomic operations are still relatively expensive,
and (3) having all processors on the system updating a single pair of
variables caused a great deal of cache line bouncing, which hurt
performance.
The next attempt was called "sloppy_rss." Essentially, the sloppy approach
retains the old unsigned long type for rss and
anon_rss, and simply
updates them without the lock. The result is incorrect RSS
values, but Christoph noted that the errors tended not to exceed 1%. This
approach is faster than using atomic operations. The incorrect values
bugged some developers, however, and the cache bouncing problem remained.
Another approach which was to do away with the RSS counters entirely, on the
theory that these values were not actually needed very often. When an attempt
to query the resident set size was made (generally by reading files in
/proc from user space), the kernel would scan through the
process's page tables and count the number of resident pages. This idea
did not get very far, however; the cost of querying RSS values was simply
too high.
The current approach was suggested by Linus
last month. A new set of counters is added to the task structure;
when a thread brings a page into memory, that thread's counters are
incremented accordingly. When a real RSS value is needed, the per-thread
values are summed to yield the answer. So querying the RSS still requires
a loop, but iterating through a list of tasks is much faster than walking
an entire set of page tables. This algorithm avoids locking issues (since
each thread takes care of its own page fault accounting and does not
contend with others); it also minimizes
the cache line problems. The "split RSS" approach still requires
rss and anon_rss counters in the mm_struct
itself; they are used to track
pages brought in by threads which have since exited, and they are decremented
when pages are forced out. This change also requires that RCU be used when
freeing the mm_struct structure to ensure that no other processor
is still trying to calculate an RSS value.
The current version of the patch has convinced
Linus, so expect it to go in at some point. The biggest roadblock, at
this point, may be that the four-level page table patch is at the front of
the queue for 2.6.11. That patch currently conflicts with Christoph's
work, and, in general, has made it hard for other VM work to get done.
Once the four-level patch goes into the mainline, however, things should
stabilize somewhat - at least, from the point of view of hackers working on
other VM-related patches.
Comments (1 posted)
It was asked recently: is the 2.6.10 release coming sometime soon? Andrew
Morton
replied that the
latter part of December looked like when it might happen. He also noted
that he is trying to produce a higher-quality release this time around:
We need to be be achieving higher-quality major releases than we
did in 2.6.8 and 2.6.9. Really the only tool we have to ensure
this is longer stabilisation periods.
Andrew also noted that getting people to test anything other than the final
releases is hard, with the result that many bugs are only reported after a
new "stable" kernel is out. If things don't get better, says Andrew, it
may be necessary to start doing point releases (e.g. 2.6.10.1) for the
final stabilization steps. Alternatively, the kernel developers could
switch to a new sort of even/odd scheme, so that 2.6.11 would be a new
features release, and 2.6.12 would be bug fixes only.
Much of the discussion, however, centered around regression testing. If
only there were more automated testing, the reasoning goes, fewer bugs
would make it into final kernel releases. This wish may eventually come
true, but, for now, it appears that regression testing is not as helpful as
many would like.
OSDL has pointed out that it
runs a whole set of tests every day. The problem, they say, is getting
people to actually look at the results. It may be that not enough people
know about OSDL's work, and, for that reason, the output is not being
used. But it also may be that the testing results are simply not that
useful.
Consider this posting from Andrew Morton on
regression testing:
However I have my doubts about how useful it will end up being.
These test suites don't seem to pick up many regressions.... We
simply get far better coverage testing by releasing code, because
of all the wild, whacky and weird things which people do with their
computers. Bless them.
The test suites, it seems, are not testing for the right things. One could
argue that the test suites simply have not, yet, been developed to the
point where they are performing comprehensive testing of the kernel. This
gap could be slowly filled in by having kernel bug fixes be accompanied by
new tests which verify that the bug remains fixed. Much of the code in the
kernel, however, is hardware-specific, and that code is where a lot of bugs
tend to be found. Hardware-specific code can only be tested in the
presence of the hardware in question. Outfitting a testing lab with even a
fraction of the hardware supported by Linux would be a massively expensive
undertaking.
So the wider Linux community is likely to remain the testing lab of last
resort for the kernel; the community as a whole, after all, does have all
that hardware. And the truth of the matter is that helping with testing is
part of the cost of free software (and of the proprietary variety as
well). So the best results might be had by trying to get more widespread
testing earlier in the process. Getting Linus to distinguish between
intermediate and release candidate kernels might help in that regard. If
that can't be done, then, perhaps, going with point releases may be
required.
Comments (8 posted)
Certain parts of the kernel, it seems, can be tweaked forever; I/O
schedulers would count as one of those parts. Linux has three of them
currently (plus a no-op scheduler), and its block I/O performance is
generally quite good. But that doesn't mean it can't be improved.
Jens Axboe recently decided to do some more hacking on his "completely fair
queueing" (CFQ) scheduler; the result is the new time-sliced CFQ scheduler, which has since
seen a second third fourth revision. The CFQ scheduler has always
tried to divide the bandwidth of each block device fairly among the
processes performing I/O to that device; the time-sliced version goes
further by giving each process exclusive access to the device for a period
of time.
In particular, the time-sliced scheduler picks a process, and dispatches
only that process's requests to the device for some tens of milliseconds.
The device is allowed to go idle for a few milliseconds if all of the
selected process's requests have been satisfied, with the idea that
the process may generate more requests within that window. If those
requests don't come, that process's time slice ends. Later revisions of
the patch check to see whether the given process is actually likely to run
within the idle window, and preempt the slice immediately if the answer is
"no."
Jens claims some very good results for the
new scheduler. The bandwidth numbers are nearly as good as those obtained
with the anticipatory scheduler (AS), while the maximum latency is much
less. These results may not be surprising; Jens has borrowed code from AS, and the idle window
has a similar effect to the brief I/O stalls used by AS to improve read
bandwidth.
As the I/O schedulers poach the best ideas from each other,
they may well become more alike.
The use of time slices may also improve the locality of
accesses to the drive, reducing the amount of time lost to seeks.
The new CFQ scheduler has spawned a low-key debate over which scheduler
should be used by default. The default scheduler currently is AS, but some
people (Andrea Arcangeli in particular) are
saying that it should be CFQ instead. SUSE apparently already makes CFQ
the default scheduler for its enterprise kernel. Andrew Morton is unsure;
AS still seems to be better for desktop systems and IDE disks. Even so, he
is ready to consider a change in the
default scheduler:
That being said, yeah, once we get the time-sliced-CFQ happening,
it should probably be made the default, at least until AS gets
fixed up. We need to run the numbers and settle on that.
The AS scheduler has already seen one improvement: a fix for a bug that
caused horrible performance for processes doing direct writes. Expect
other changes as AS hacker Nick Piggin works at improving its performance.
However this friendly competition turns out, better disk I/O performance
for Linux users will be part of it.
Comments (4 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Architecture-specific
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>