Brief items
The current 2.6 prepatch is 2.6.10-rc3, announced by Linus on December 3. Changes since -rc2 include the un-deprecation of MODULE_PARM() (it is generating too many warnings, and the fixes will not be merged before 2.6.10), a new major number (180) for the "ub" USB storage driver, some x86 single-stepping fixes, a large number of "sparse" annotations, the token-based memory management fix, a memory technology device (and JFFS2) update, a frame buffer device update, some user-mode Linux patches, some page allocator tuning, and a few architecture updates. See the long-format changelog for the details.
Linus's BitKeeper repository currently holds a DVB update and a set of bug fixes. Very few patches are being accepted currently as the kernel hackers try to stabilize things for the final 2.6.10 release.
Andrew Morton has released no -mm patches over the last week.
The current bugfix patch from Alan Cox is 2.6.9-ac13.
The current 2.4 prepatch remains 2.4.29-pre1; Marcelo has released no prepatches since November 25.
Kernel development news
a = b + c; does what we think it does, and not something else because someone has overloaded the '+' operator. Or God help us, as I have mentioned earlier, the comma operator.
-- Ted Ts'o
An article has been posted on recent SELinux performance improvements. "The use of RCU solved a serious scalability problem with the AVC, thanks to the work of Kaigai and the RCU developers. It seems likely to be a useful approach for dealing with similar problems, and specifically with some of the SELinux networking code as mentioned."
Christoph's page fault scalability patches are reaching a point where they will likely be considered for inclusion after 2.6.10 comes out. This patch is an interesting example of the kind of changes which must be made to support large numbers of processors.
One of the virtual memory subsystem's core data structures is struct mm_struct. This structure tracks the virtual address space used by one or more processes. It contains pointers to the page tables, to the virtual memory area (VMA) structures, and more. Processes typically have their own struct mm_struct, but threads which share memory also share the same mm_struct.
Access to this structure is serialized by two mechanisms. A semaphore (mmap_sem) controls access to the mm_struct structure itself, and a spinlock (page_table_lock) guards the page tables. When the status of a page must be changed in the page tables, the kernel must first take the page_table_lock to avoid creating confusion with the other processors on the system. When he looked at the scalability of the kernel's page fault handling code, Christoph identified this lock as a problem. When many processors are trying to simultaneously make changes to a single set of page tables, they end up spending a lot of time busy-waiting for the page table lock. Improving the performance of this code thus requires reducing the use of that lock.
The first step in this process is a patch which causes the VM subsystem to hold page_table_lock for shorter periods of time. The lock is dropped for portions of the code which have no need of it, and later reacquired if needed. It is a fairly straightforward exercise in lock breaking which helps scalability, but does not solve the whole problem.
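The lock-breaking pattern described above can be sketched in user space. This is only an illustration, not code from the patch: the spinlock, prepare_page_sketch(), and handle_fault_sketch() are all hypothetical names, and the "page table" is a single integer slot. The point it shows is that once the lock has been dropped, the state it protected must be rechecked after the lock is taken again.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for page_table_lock: a test-and-set spinlock. */
static atomic_flag ptl_sketch = ATOMIC_FLAG_INIT;

static void ptl_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&ptl_sketch, memory_order_acquire))
        ;  /* busy-wait until the holder clears the flag */
}

static void ptl_unlock(void)
{
    atomic_flag_clear_explicit(&ptl_sketch, memory_order_release);
}

/* Stand-in for the slow part of a fault (allocating and zeroing a page). */
static unsigned long prepare_page_sketch(void)
{
    return 0x1000UL;
}

/* Returns 1 if this call filled the slot, 0 if it was already populated. */
static int handle_fault_sketch(unsigned long *slot)
{
    ptl_lock();
    int empty = (*slot == 0);
    ptl_unlock();              /* lock dropped: the next step does not need it */
    if (!empty)
        return 0;

    unsigned long page = prepare_page_sketch();

    ptl_lock();                /* reacquired only to publish the result */
    int installed = 0;
    if (*slot == 0) {          /* recheck: the slot may have changed meanwhile */
        *slot = page;
        installed = 1;
    }
    ptl_unlock();
    return installed;
}
```

The lock is still taken twice per fault here, which is why this step alone helps scalability without removing the contention.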
The core of the patch is a set of atomic page table entry functions which can modify individual PTEs with no locking required. Rather than acquiring page_table_lock, making a PTE change, then dropping the lock, the kernel can simply make a call to:
int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval);
This function uses the cmpxchg instruction (or whatever variant or emulation may be available, depending on the architecture) to compare the page table entry pointed to by ptep with oldval; if the two match, the entry is set to newval and a nonzero value is returned. If the two do not match, the current thread lost a race with another processor which changed the PTE first; in that case, the PTE is not modified further and the function returns zero. Kernel code which uses cmpxchg typically will retry a modification when this sort of race occurs; Christoph's code, instead, is able to assume that the competing thread did the same thing as the one it raced against: marked the page as being present in memory. So no retries are needed.
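A compare-and-exchange update of this kind can be approximated in user space with C11 atomics. The sketch below is an assumption-laden illustration rather than the kernel's implementation: pte_sketch_t, PTE_PRESENT, the 12-bit shift, and both function names are invented here, and the vma and address arguments of the real prototype are left out. In this sketch the helper returns nonzero when the exchange succeeds and zero when another thread got there first.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry. */
typedef struct {
    _Atomic uint64_t val;
} pte_sketch_t;

#define PTE_PRESENT 0x1ULL   /* assumed "present" bit for this sketch */

/*
 * Replace *ptep with newval only if it still holds oldval.
 * Returns nonzero on success, zero if the entry had already changed.
 */
static int pte_cmpxchg_sketch(pte_sketch_t *ptep, uint64_t oldval, uint64_t newval)
{
    uint64_t expected = oldval;
    return atomic_compare_exchange_strong(&ptep->val, &expected, newval);
}

/*
 * Fault path: map a frame into an empty entry without taking any lock.
 * A zero return means another thread populated the entry first; the
 * caller does not retry, since the page is present either way.
 */
static int install_page_sketch(pte_sketch_t *ptep, uint64_t frame)
{
    return pte_cmpxchg_sketch(ptep, 0, (frame << 12) | PTE_PRESENT);
}
```

A caller that loses the race would simply release the page it had prepared and return, which is the "no retries" behavior described above.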
With that change, pages can be brought into the working set and made available without having to take the page_table_lock - except for one last place. The mm_struct structure contains two fields (rss and anon_rss) which track the total number of in-memory pages referenced by this address space (the "resident set size"). When a page is brought in (or forced out), these fields must be incremented or decremented accordingly. Access to rss and anon_rss is controlled by page_table_lock. Getting rid of that last use of the lock has required a surprising amount of work on Christoph's part.
The first implementation turned the RSS fields into atomic_t variables, so that they could be operated on without locking. This solution worked, but it had some shortcomings: (1) they could only be 32-bit variables, since not all architectures support 64-bit atomic types, (2) the atomic operations are still relatively expensive, and (3) having all processors on the system updating a single pair of variables caused a great deal of cache line bouncing, which hurt performance.
The next attempt was called "sloppy_rss." Essentially, the sloppy approach retains the old unsigned long type for rss and anon_rss, and simply updates them without the lock. The result is incorrect RSS values, but Christoph noted that the errors tended not to exceed 1%. This approach is faster than using atomic operations. The incorrect values bugged some developers, however, and the cache bouncing problem remained.
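The difference between the atomic and "sloppy" counters comes down to how the increment is performed. The following user-space sketch uses hypothetical structure and function names; C11 atomic_int stands in for the kernel's atomic_t, which it only resembles.

```c
#include <stdatomic.h>

/* First attempt: the counter is an atomic type (32 bits, like atomic_t). */
struct mm_atomic_sketch {
    atomic_int rss;
};

/* "Sloppy" attempt: a plain unsigned long, updated with no lock at all. */
struct mm_sloppy_sketch {
    unsigned long rss;
};

/* Indivisible read-modify-write: never loses an update, but costs more. */
static void rss_inc_atomic(struct mm_atomic_sketch *mm)
{
    atomic_fetch_add(&mm->rss, 1);
}

/*
 * Plain load, add, store. If two processors interleave these steps,
 * one of the increments is lost, which is where the error comes from.
 */
static void rss_inc_sloppy(struct mm_sloppy_sketch *mm)
{
    mm->rss++;
}
```

Run from a single thread the two produce the same totals; they only diverge under concurrent updates. In both variants every processor still writes to the same memory location, so neither addresses the cache line bouncing.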
Another approach was to do away with the RSS counters entirely, on the theory that these values were not actually needed very often. When an attempt to query the resident set size was made (generally by reading files in /proc from user space), the kernel would scan through the process's page tables and count the number of resident pages. This idea did not get very far, however; the cost of querying RSS values was simply too high.
The current approach was suggested by Linus last month. A new set of counters is added to the task structure; when a thread brings a page into memory, that thread's counters are incremented accordingly. When a real RSS value is needed, the per-thread values are summed to yield the answer. So querying the RSS still requires a loop, but iterating through a list of tasks is much faster than walking an entire set of page tables. This algorithm avoids locking issues (since each thread takes care of its own page fault accounting and does not contend with others); it also minimizes the cache line problems. The "split RSS" approach still requires rss and anon_rss counters in the mm_struct itself; they are used to track pages brought in by threads which have since exited, and they are decremented when pages are forced out. This change also requires that RCU be used when freeing the mm_struct structure to ensure that no other processor is still trying to calculate an RSS value.
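The accounting scheme just described can be modeled with a small sketch. All names here (task_sketch, mm_sketch, and the helper functions) are hypothetical, only a single counter is tracked rather than both rss and anon_rss, and the locking and RCU protection a real implementation would need are omitted entirely.

```c
#include <stddef.h>

/* Per-thread counter: only the owning thread ever writes it. */
struct task_sketch {
    long rss;                   /* pages this thread has faulted in */
    struct task_sketch *next;   /* next thread sharing the address space */
};

/* Shared counter: pages from exited threads, minus pages forced out. */
struct mm_sketch {
    long rss;                   /* signed: page-outs can push it below zero */
    struct task_sketch *tasks;
};

static void task_fault_in(struct task_sketch *t)
{
    t->rss++;
}

static void mm_page_out(struct mm_sketch *mm)
{
    mm->rss--;
}

/* On exit, unlink the thread and fold its count into the shared counter. */
static void task_exit_sketch(struct mm_sketch *mm, struct task_sketch *t)
{
    struct task_sketch **pp = &mm->tasks;

    while (*pp && *pp != t)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = t->next;
        mm->rss += t->rss;
        t->rss = 0;
        t->next = NULL;
    }
}

/* Query: the shared counter plus every live thread's private counter. */
static long mm_total_rss(const struct mm_sketch *mm)
{
    long total = mm->rss;

    for (const struct task_sketch *t = mm->tasks; t; t = t->next)
        total += t->rss;
    return total;
}
```

The shared counter is signed in this sketch because page-outs are charged to it even when the page was originally counted in a thread's private counter; only the sum is meaningful.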
The current version of the patch has convinced Linus, so expect it to go in at some point. The biggest roadblock, at this point, may be that the four-level page table patch is at the front of the queue for 2.6.11. That patch currently conflicts with Christoph's work, and, in general, has made it hard for other VM work to get done. Once the four-level patch goes into the mainline, however, things should stabilize somewhat - at least, from the point of view of hackers working on other VM-related patches.
Andrew Morton replied that the latter part of December looked like when it might happen. He also noted that he is trying to produce a higher-quality release this time around:
Andrew also noted that getting people to test anything other than the final releases is hard, with the result that many bugs are only reported after a new "stable" kernel is out. If things don't get better, says Andrew, it may be necessary to start doing point releases (e.g. 2.6.10.1) for the final stabilization steps. Alternatively, the kernel developers could switch to a new sort of even/odd scheme, so that 2.6.11 would be a new features release, and 2.6.12 would be bug fixes only.
Much of the discussion, however, centered around regression testing. If only there were more automated testing, the reasoning goes, fewer bugs would make it into final kernel releases. This wish may eventually come true, but, for now, it appears that regression testing is not as helpful as many would like.
OSDL has pointed out that it runs a whole set of tests every day. The problem, they say, is getting people to actually look at the results. It may be that not enough people know about OSDL's work, and, for that reason, the output is not being used. But it also may be that the testing results are simply not that useful.
Consider this posting from Andrew Morton on regression testing:
The test suites, it seems, are not testing for the right things. One could argue that the test suites simply have not, yet, been developed to the point where they are performing comprehensive testing of the kernel. This gap could be slowly filled in by having kernel bug fixes be accompanied by new tests which verify that the bug remains fixed. Much of the code in the kernel, however, is hardware-specific, and that code is where a lot of bugs tend to be found. Hardware-specific code can only be tested in the presence of the hardware in question. Outfitting a testing lab with even a fraction of the hardware supported by Linux would be a massively expensive undertaking.
So the wider Linux community is likely to remain the testing lab of last resort for the kernel; the community as a whole, after all, does have all that hardware. And the truth of the matter is that helping with testing is part of the cost of free software (and of the proprietary variety as well). So the best results might be had by trying to get more widespread testing earlier in the process. Getting Linus to distinguish between intermediate and release candidate kernels might help in that regard. If that can't be done, then, perhaps, going with point releases may be required.
Jens Axboe recently decided to do some more hacking on his "completely fair queueing" (CFQ) scheduler; the result is the new time-sliced CFQ scheduler, which has since seen second, third, and fourth revisions. The CFQ scheduler has always tried to divide the bandwidth of each block device fairly among the processes performing I/O to that device; the time-sliced version goes further by giving each process exclusive access to the device for a period of time.
In particular, the time-sliced scheduler picks a process, and dispatches only that process's requests to the device for some tens of milliseconds. The device is allowed to go idle for a few milliseconds if all of the selected process's requests have been satisfied, with the idea that the process may generate more requests within that window. If those requests don't come, that process's time slice ends. Later revisions of the patch check to see whether the given process is actually likely to run within the idle window, and preempt the slice immediately if the answer is "no."
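The per-slice decision described above can be written down as a small state function. This is a sketch of the idea only, with invented names and parameters; it is not taken from the CFQ patch, and the actual slice and idle-window lengths are not specified beyond "tens of milliseconds" and "a few milliseconds."

```c
/* What the scheduler does next for the process that owns the slice. */
enum slice_action {
    SLICE_DISPATCH,   /* send the owner's next request to the device */
    SLICE_IDLE,       /* leave the device idle, waiting for more requests */
    SLICE_EXPIRE      /* end the slice and pick another process */
};

/* Hypothetical bookkeeping for the current slice. */
struct slice_sketch {
    unsigned pending;        /* queued requests from the slice owner */
    unsigned slice_left_ms;  /* time remaining in the slice */
    unsigned idle_ms;        /* time spent idle since the queue drained */
};

static enum slice_action slice_next(const struct slice_sketch *s,
                                    unsigned idle_window_ms,
                                    int owner_likely_to_run)
{
    if (s->slice_left_ms == 0)
        return SLICE_EXPIRE;          /* time slice used up */
    if (s->pending > 0)
        return SLICE_DISPATCH;        /* only the owner's requests are sent */
    if (!owner_likely_to_run)
        return SLICE_EXPIRE;          /* later revisions: skip a pointless wait */
    if (s->idle_ms < idle_window_ms)
        return SLICE_IDLE;            /* give the owner a chance to submit more */
    return SLICE_EXPIRE;              /* idle window passed with no new requests */
}
```

With an assumed 4 ms idle window, an owner with an empty queue and 1 ms of idling so far would be allowed to keep the device idle; after 4 ms its slice would end.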
Jens claims some very good results for the new scheduler. The bandwidth numbers are nearly as good as those obtained with the anticipatory scheduler (AS), while the maximum latency is much less. These results may not be surprising; Jens has borrowed code from AS, and the idle window has a similar effect to the brief I/O stalls used by AS to improve read bandwidth. As the I/O schedulers poach the best ideas from each other, they may well become more alike. The use of time slices may also improve the locality of accesses to the drive, reducing the amount of time lost to seeks.
The new CFQ scheduler has spawned a low-key debate over which scheduler should be used by default. The default scheduler currently is AS, but some people (Andrea Arcangeli in particular) are saying that it should be CFQ instead. SUSE apparently already makes CFQ the default scheduler for its enterprise kernel. Andrew Morton is unsure; AS still seems to be better for desktop systems and IDE disks. Even so, he is ready to consider a change in the default scheduler:
The AS scheduler has already seen one improvement: a fix for a bug that caused horrible performance for processes doing direct writes. Expect other changes as AS hacker Nick Piggin works at improving its performance. However this friendly competition turns out, better disk I/O performance for Linux users will be part of it.
Page editor: Jonathan Corbet
Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds