Brief items
The current 2.6 development kernel is 2.6.27-rc1, released by Linus on
July 28. Some 8100 changesets were merged during the 2.6.27 merge
window; see the article below for a summary. Highlights for 2.6.27 will
include lots of new drivers (including the gspca webcam drivers), support
for hardware data integrity checking in the block layer, support for
checkpointing and restoring of virtual machines in Xen, the ftrace tracing
framework, mmiotrace, the tracehook patches, delayed allocation in ext4,
the UBIFS filesystem, multiqueue networking, kexec jump, the extension of a
number of system calls for safer user-space programming, the lockless page
cache (see below), and much more. See the short-form changelog for details,
or the long-form changelog for lots of details.
As of this writing, no patches have been merged into the mainline
repository since the 2.6.27-rc1 release.
The current stable 2.6 kernel remains 2.6.26; there have not yet
been any updates to this kernel, though the word is that the pile of
patches for such an update is growing.
2.6.25.13 was released on
July 28 with a number of networking-related fixes, some of which
appear to address severe problems. 2.6.25.12, with a long list of
fixes, was released on July 24.
Kernel development news
Ok, so now that I've insulted you and your pets (they're ugly!),
show me wrong, and then call me a d*ckhead. ("Linus - you're a
d*ckhead, and you didn't understand the problem, so you're a
_stupid_ d*ckhead. And my pet may be ugly, but yours _smells_
bad!").
Or say "Uh, yeah, we're morons, and here's the much better patch, and we
won't do that again".
--
Linus Torvalds
Amazing! Your code, once plugged into the kernel proper, booted
fine on 5 different x86 testsystems, it booted fine an allyesconfig
kernel with MAXSMP and NR_CPUS=4096, it booted fine on allnoconfig
as well (and allmodconfig and on a good number of randconfigs as
well)....
[B]ecause v1 of your code was so frustratingly and mind-blowingly
stable in testing (breaking a long track record of v1 patches in
this area of kernel), and because the perfect patch does not exist
by definition, i thought i'd mention that after a long search i
found and fixed a serious showstopper bug in your code: you used
"1ul" in your macros, instead of the more proper "1UL" style. The
ratio between the use of 1ul versus 1UL is 1:30 in the tree, so
your choice of integer literals type suffix capitalization was
deemed un-Linuxish, and was fixed up for good.
--
Ingo Molnar
In anycase, it sounds like Tux3 is using many similar ideas. I
think you are on the right track. I will add one big note of
caution, drawing from my experience implementing HAMMER, because I
think you are going to hit a lot of the same issues.
I spent 9 months designing HAMMER and 9 months implementing it.
During the course of implementing it I wound up throwing away
probably 80% of the original design outright.
--
Matthew Dillon. The whole thread is an interesting read in filesystem design.
The pure size of the -rc's _is_ making me a bit nervous,
though. Sure, it means that we are good at merging it all, but I
have to say that I sometimes wonder if we don't merge too much in
one go, and even our current (fairly short) release cycle is
actually too big.
Anyway, that's a discussion for some other event.
--
Linus Torvalds
I seem to be hearing a lot of silence over support for SSD devices.
I have this vague worry that there will be a large rollout of SSD
hardware and Linux will be found to have pants-around-ankles.
--
Andrew Morton
By Jonathan Corbet
July 29, 2008
The 2.6.27 merge window closed with the 2.6.27-rc1 release on
July 28. Some 8100 changesets were merged this time around, making
2.6.27 another busy development cycle. A number of interesting things went
in since last week's update; the most significant changes visible to Linux
users include:
- There are new drivers for ILI9320 LCD controller chips,
Cobalt server LCD frame buffers,
SH7760/SH7763 integrated LCD controllers,
NXP pca9532 LED controllers,
Philips PCA955x I2C LED controllers,
WMI-based hotkeys on HP laptops,
Maxim MAX73xx I2C port expanders,
Micronas DRX3975D/DRX3977D DVB-T demodulators,
DvbWorld 2102 DVB-S USB2.0 receivers,
MaxLinear MxL5007T silicon tuners,
Renesas SH7763 evaluation boards,
Renesas Solutions AP-325RXA boards,
Renesas R0P7785LC0011RL boards, and
Atmel integrated touchscreens.
Also added is "mISDN," a new, modular ISDN driver intended to replace
older code for a number of ISDN cards. Support for using mISDN
drivers remotely via an IP tunnel has been added.
- The Palm T|X handheld computer is now supported.
- The tmpfs filesystem has gained support for asynchronous I/O.
- The hugetlbfs mechanism can now support multiple huge page sizes.
There is a new directory (/sys/kernel/hugepages) with
information on huge page allocations. The x86 (64-bit) architecture
now supports 1GB pages; PowerPC can go to 16GB.
- Most system calls which create file descriptors can now accept a set
of flags; this change allows the race-free establishment of close-on-exec
semantics, requesting non-blocking opens, and more. Developers
wanting to use this capability will have to wait for a version of
glibc which adds the requisite interfaces.
- The unmaintained v850 architecture has been removed.
- The kexec jump patch set,
which uses the kexec mechanism as an alternative way of implementing
suspend-to-disk, has been merged.
- The omfs filesystem has been merged.
- /proc now has a file (called syscall) for each
process; when read, it displays the process's current system call and
the supplied arguments.
- Linux users hoping to upgrade their systems in the near future will be
glad to know that
a series of patches designed to make the kernel scale to 4096
processors has been merged.
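The file-descriptor flags item in the list above can be illustrated from user space. What follows is a minimal sketch, assuming a C library that already exposes pipe2() (one of the flag-taking calls); the helper names make_cloexec_pipe() and fd_is_cloexec() are hypothetical and exist only for this example.

```c
#define _GNU_SOURCE           /* asks glibc to declare pipe2() */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create a pipe whose two descriptors are close-on-exec from the moment
 * they exist, so there is no window in which another thread could fork()
 * and exec() with the descriptors still inheritable.
 * Returns 0 on success, -1 on error (errno is set by pipe2()).
 */
static int make_cloexec_pipe(int fds[2])
{
    return pipe2(fds, O_CLOEXEC);
}

/* Report whether close-on-exec is set on fd: 1, 0, or -1 on error. */
static int fd_is_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

The older idiom, pipe() followed by fcntl(fd, F_SETFD, FD_CLOEXEC), leaves a gap between the two calls; passing the flag at creation time closes that gap.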
Changes visible to kernel developers include:
- The tracehook mechanism for defining static trace points (described in
this article) has been
merged, along with a number of trace points in the core kernel.
- A new, lockless form of get_user_pages() has been added:
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
Details of this interface can be found in this article, with the one
note that early versions were called fast_gup() instead.
(See also the related lockless page cache work,
which was also merged).
- The long-debated mmu-notifiers patch has been merged. The notifiers
allow external memory management units (as may be seen in some
graphics cards or in virtualized guests) to be told about decisions
made by the core memory management code.
- There is a new framework for debugging boot-time memory
initialization; there's also "a few basic defensive measures" intended
to prevent difficult-to-debug boot problems.
- The new function:
int object_is_on_stack(void *obj);
returns a true value if the pointed-to object is on the current kernel
stack.
- There is a new macro for issuing warnings:
WARN(condition, format, ...);
It's much like WARN_ON() in that it will produce a full oops
listing; the difference is the added printk()-style format
string and arguments.
- A new helper function:
int flush_work(struct work_struct *work);
waits for the specific workqueue job work to finish
executing.
- dma_mapping_error() and pci_dma_mapping_error() have
new prototypes:
int dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
int pci_dma_mapping_error(struct pci_dev *hwdev, dma_addr_t dma_addr);
In each case, they have gained a new argument specifying which device
the mapping is being done for.
- There are a couple of new radix tree functions:
unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
void ***results,
unsigned long first_index,
unsigned int max_items);
unsigned int radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root,
void ***results,
unsigned long first_index,
unsigned int max_items,
unsigned int tag);
They are useful for looking up multiple items in a single call.
- Slab cache constructors no longer have a pointer to the cache itself
as an argument; they now take a single void * pointer to
the object itself.
- The long list of Video4Linux2 ioctl() callbacks has been
moved into its own structure (struct v4l2_ioctl_ops) which is
pointed to by the ioctl_ops member of struct
video_device.
Now begins the long task of finding and fixing all the bugs in all this new
code. If the usual pattern holds, that process will take about two months,
suggesting that we can expect 2.6.27 sometime in October.
By Jonathan Corbet
July 29, 2008
One of the biggest problems in kernel development is dealing with
concurrency. In a system where more than one thing can be happening at
once, one must always take care to keep multiple threads of control from
interfering with each other and corrupting the system as a whole. In the
same way that two roads become more dangerous when they intersect,
connecting two or more processors to the same memory greatly increases
their potential for the creation of mayhem.
Travelers to the US are often amused (or irritated) by the often-favored
solution to roadway concurrency: putting in traffic lights. Such a light
will indeed (if observed) eliminate the potential for a number of
unpleasant race conditions within intersections, but at a performance cost:
traffic going through the intersection must often stop and wait. This
solution also scales poorly; as more roads (or lanes with different
destinations) feed into the same intersection, each of them experiences
more red-light time.
In kernel programming, the first tool for controlling concurrency - locks
in various forms - is directly analogous to traffic lights. It is not
coincidental that the name for a common locking primitive (semaphore)
matches the name for a traffic light (semaforo) in a number of
Latin-derived languages. Locks enforce exclusive access to a kernel
resource in the same way that a traffic light enforces exclusive access to
an intersection, and with many of the same costs. When too many processors
end up waiting at the same lock, the performance of the system as a whole
can suffer significantly.
There are two common approaches to mitigating scalability problems with
locks. For many years after the 2.0 kernel came out, these problems were
addressed through the creation of more locks, each controlling a smaller
resource. Lock proliferation is effective, in that it reduces the chance
that two processors will be trying to acquire the same lock at the same
time. Since it works so well, this approach has led to the creation of
thousands of locks in the Linux kernel.
Proliferation has its limits, though. Adding locks increases complexity;
in particular, with more locks, the chances of creating occasional deadlock
situations increase. Deadlocks can be avoided through the careful
observation of rules on the acquisition of locks, and the order in which
they are acquired in particular. But nobody will ever be able to sort out
- and document - the proper relative locking order for thousands of locks.
So kernel developers must make do with rules for some of the most important
locks and the vigilance of the lockdep tool to find any remaining problems.
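The lock-ordering rule can be sketched in user space. This is only an illustration, assuming POSIX mutexes and an arbitrary global order (ascending address); the kernel relies on documented per-lock rules rather than this trick, and lock_pair() and unlock_pair() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Acquire two distinct mutexes in one fixed global order (lowest address
 * first). If every caller follows the same rule, two threads can never
 * each hold one lock while waiting for the other.
 */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != b);
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;

        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Release order does not matter for deadlock avoidance. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

Calling lock_pair(&x, &y) in one thread and lock_pair(&y, &x) in another takes the locks in the same underlying order, which is the property the ordering rules are meant to guarantee.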
The other problem with lock proliferation is harder to get around, though.
The acquisition of a lock requires writing a value to a location in shared
memory. As each processor acquires a lock, it must change that value,
which causes that processor to acquire exclusive access to the cache line
holding the lock variable. The cache lines for heavily-used locks will fly
around the system in a way that badly hurts performance, even if no
processor ever has to wait for another to release the lock. Adding more
locks will not fix this problem; instead, it will just create more bouncing
cache lines and make things worse.
So, as the number of processors grows, the path to continued scalability
must not include the wholesale creation of new locks; indeed, it requires
the removal of locks in the most performance-critical paths. And that is
what this whole long-winded introduction leads up to: the 2.6.27 kernel
will include some changes by Nick Piggin which implement lockless operation in some
important parts of the virtual memory subsystem. And those, in turn, will
lead to faster operation on multiprocessor systems.
The first of these changes is a new function for obtaining direct access to
user-space pages from the kernel:
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
This function works much like get_user_pages(), but, in exchange
for some limits on its operation, it is able to do its job without
acquiring the mmap semaphore; that, in turn, can lead to a 10% performance
boost on "a threaded database workload." The details of how this function
works were covered here last
March (though the function was called fast_gup() back then),
so we'll not repeat that discussion here.
The other big change is a set of patches which Nick has been carrying for
quite some time: the lockless page cache. The page cache holds in-memory
copies of pages from files on disk; its purpose is to improve performance
by minimizing disk I/O. Looking up pages in the page cache is a common
activity; it happens as a result of file I/O, page faults, and more. So it
needs to be fast. In 2.6.26 kernels, each mapping (each connection between
the page cache and a specific file in a filesystem somewhere) has its own
lock. So processors will not normally contend for the locks unless they
are operating on the same file. But locks for commonly-accessed files
(shared libraries, for example) are likely to be frequently bounced between
processors.
Most page cache operations are lookups - read-only operations which make no
changes. In the lookup operation, the lock protects a few aspects of the
task, including:
- A given page within the mapping must be looked up in the mapping's
radix tree to find its
location in memory (if any).
- If the page is resident in the page cache, it must have its reference
count increased so that it will not be evicted before the code
performing the lookup has done whatever it needs to do.
The radix tree, itself, is a complicated data structure; it must be
protected from modification while the lookup is being performed. For
certain, performance-critical parts of the radix-tree code, that protection
is done through (1) some rules on what can be called when, and
(2) the use of read-copy-update (RCU). As a result, the radix tree
lookup can be done in a lockless manner.
There is still a problem, though: a given page may be evicted from the page
cache (or simply moved) between the two lookup steps listed above. Should that
happen, the second step will increment the reference count for a page which
now belongs to a different mapping, and return an incorrect pointer. The
kernel developers have, through lots of experience over many years, learned
that system crashes resulting from data corruption are quite hard on
throughput. So true scalability requires that this kind of scenario be
avoided; thus the per-mapping lock, which prevents page cache changes from
being made until the reference count has been properly updated.
Nick made an interesting observation here: it actually doesn't matter if
the wrong reference count gets incremented as long as one ensures that the
specific page mapping is still valid afterward. The result is a new,
low-level page cache function:
int page_cache_get_speculative(struct page *page);
If the given page has a reference count of zero, then the page has
been removed from the page cache; in that case this function returns zero
and the reference count will not be changed. If the reference count is
non-zero, though, it will be increased and a non-zero value will be
returned.
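The behavior described above amounts to an "increment unless zero" operation. Here is a minimal user-space sketch, assuming C11 atomics in place of the kernel's own atomic operations; struct fake_page, get_speculative() and put_ref() are hypothetical names for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct page: only the reference count matters. */
struct fake_page {
    atomic_int refcount;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns 1 after incrementing, or 0 (count untouched) if it was zero.
 */
static int get_speculative(struct fake_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return 1;
    }
    return 0;
}

/* Drop a reference previously obtained with get_speculative(). */
static void put_ref(struct fake_page *page)
{
    int old = atomic_fetch_sub(&page->refcount, 1);

    assert(old > 0);
}
```

The compare-and-swap loop is what keeps a count of zero at zero: a plain atomic increment could not tell "was zero" apart from "was non-zero" until after the damage was done.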
Incrementing a page's reference count will prevent that page from being
evicted or moved until the count goes back to zero. So kernel code which
has incremented a specific page's reference count will thereby ensure that the page
stays in its current state. In the page cache case, the code can obtain a
speculative reference to a page found in a mapping's radix tree. But it
does not, yet, know whether it actually got a reference to the page it was
looking for - something may have happened between the radix tree lookup and
the obtaining of the reference. So it must check - after the reference has
been acquired - to be sure that it has the right page. If not, it releases
the reference and tries again. Eventually it will either pin down the right page
or verify that the relevant part of the file is not resident in memory.
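The look-up, pin, re-check, retry sequence can be sketched as follows. This is a user-space illustration only, assuming C11 atomics and a single slot standing in for the radix tree; all names (fake_page, page_slot, find_get_page_sketch()) are hypothetical, and the sketch assumes the page structures themselves stay allocated for the life of the program.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_page {
    atomic_int refcount;
};

/* One slot of a hypothetical mapping: the resident page, or NULL. */
typedef _Atomic(struct fake_page *) page_slot;

/* Increment the count unless it is zero; returns 1 on success, else 0. */
static int get_speculative(struct fake_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return 1;
    }
    return 0;
}

static void put_ref(struct fake_page *page)
{
    int old = atomic_fetch_sub(&page->refcount, 1);

    assert(old > 0);
}

/* Return a pinned page from the slot, or NULL if nothing is resident. */
static struct fake_page *find_get_page_sketch(page_slot *slot)
{
    for (;;) {
        struct fake_page *page = atomic_load(slot);

        if (page == NULL)
            return NULL;        /* not resident in the cache */
        if (!get_speculative(page))
            continue;           /* page is on its way out; look again */
        if (page == atomic_load(slot))
            return page;        /* slot unchanged: the reference is good */
        put_ref(page);          /* slot changed under us; drop and retry */
    }
}
```

The second load of the slot is the "check after the reference has been acquired" step: once the count is held, the page cannot change identity, so a matching slot proves the reference is to the right page.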
Lockless operation forces a bit more care on the part of the page reclaim
code, which is trying to get a page's reference count down to zero so that
it can remove the page. Since there is no locking around the reference
count now, the reclaim code must set it to zero while checking, in an
atomic manner, that nobody else has incremented it. That is the purpose
of the atomic_cmpxchg() function, which will only perform the
operation if it does not collide with another processor. Since
page_cache_get_speculative() will not increment the reference
count if it is zero, the reclaim code knows that, by getting that count to
zero, it now has exclusive control of the page.
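The reclaim side can be sketched in the same user-space style. This example assumes C11's atomic_compare_exchange_strong() as a stand-in for the kernel's atomic_cmpxchg(); freeze_refs() and struct fake_page are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>

struct fake_page {
    atomic_int refcount;
};

/*
 * Try to move the count from exactly 'expected' to zero in one atomic
 * step. Returns 1 if that happened (the caller now owns the page), or 0
 * if somebody else holds a reference and the count was left alone.
 */
static int freeze_refs(struct fake_page *page, int expected)
{
    assert(expected > 0);
    return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}
```

Because a speculative reference is never taken on a zero count, a successful freeze cannot be undone by a concurrent lookup; a failed one simply means the page is still in use and reclaim should move on.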
The end result of all this is that a set of locking operations has been
removed from the core of the page cache, improving the scalability of that
code. There is, of course, a cost, in the form of trickier code with a
more complex set of rules which must be followed. Chances are that we will
see more of this kind of code, though, as the number of processors in our
systems increases.
By Jake Edge
July 30, 2008
Kernel wireless maintainer John Linville outlined the past, present, and future
of the Linux wireless stack on the first day of this year's Ottawa Linux Symposium. In
his presentation, he ranged from early efforts, which were "a sore
spot for Linux" to the future where it is likely that Linux will have
support for some features before "that other OS". Along the
way, he looked at various issues that wireless support in Linux faces,
including vendor participation, suspend and resume, and regulatory issues.
Linville has been the maintainer of Linux wireless for two and a half years, since
being recruited into the job by David Miller and Jeff Garzik. When he took
over, wireless support was in disarray, as there were competing stacks to
support different hardware. Users were faced with lots of pain in getting
things working when "they just want their hardware to work"
said Linville. Since that time, things have greatly changed.
The original wireless hardware was what is called "Full MAC hardware",
where the implementation of the wireless protocols was handled by the
hardware, generally in firmware. The drivers made these devices appear to
be regular wired ethernet devices, though they did require some special
configuration for SSID and the like. Because the hardware would enforce
various regulatory requirements, vendors would generally work with the
community in order to support the hardware.
All of that changed with the advent of "Soft MAC hardware"—which
Linville likened to winmodems—where the CPU implements most of the
protocol. It is a cheaper solution for vendors, but it requires an 802.11
stack for the kernel. The ieee80211 drivers came along to support
the Intel Centrino wireless hardware, but they only supported those few
devices. Johannes Berg added the ieee80211softmac driver that
added some additional hardware support, but it was a kludgy solution.
Since then, Linville said, folks have realized that it was "sort of a
mistake to go down that road".
Enter the Devicescape stack. It was a feature rich 802.11 stack for Linux
that was popular with developers. After some locking and SMP problems were
resolved, it was merged into 2.6.22 as the mac80211 driver. Once
that happened, wireless drivers
started using it, to the point where Linville showed a chart of the current
drivers, almost all of which use mac80211. "It's been a boon
to us to pick up the mac80211 code."
One notable driver that does not use mac80211 is the libertas
driver for the OLPC. Unlike most other current devices, it is a Full MAC
device with special requirements. It has support for power saving modes
that do not yet exist in mac80211. Because it is a mesh-networking
device that still participates in forwarding network traffic when the
system is powered down, it has needs that are not yet supported.
Drivers in progress was the next topic Linville addressed. Several of
these are in need of developers to work on them, specifically for the Airgo
chipset and Atmel USB chipset. The TI chipset drivers have had some
questions raised about the reverse engineering process and may require a
legal vetting similar to what the SFLC did for ath5k. Marvell is
sponsoring development of a mac80211 based driver for its
hardware. This driver may also support 802.11n which allows for greater range
and higher speeds than current-generation 802.11.
Using data from LWN, Linville looked at the activity level of the wireless
development in Linux. He was amazed to note "how much of the 2.6.26
kernel came through this laptop". Using his Signed-off-by as a
proxy for wireless LAN commits, he noted 4.3-5.6% of the kernel commits in
the last three releases (.24 through .26) were for wireless. In each
kernel, wireless was either the fourth or fifth highest number of commits.
The compat-wireless-2.6 project is aimed at supporting newer hardware in
older kernels. Because folks are wary of running kernel.org kernels or
their distribution supports an older kernel—but they want to run with the
latest hardware—the project backports wireless drivers to kernels as
old as 2.6.21. It is a set of scripts and patches that build against the
user's kernel. Unfortunately, the project may not last much longer as the
multiqueue changes that have been merged for 2.6.27 may change the drivers
enough that they will be infeasible to backport.
At the top of the list for new features is removal of the wireless
extensions in favor of the new cfg80211 mechanism. According to
Linville, "nobody likes wireless extensions, and nobody likes the existing
tools". The wireless extensions have vague semantics, can have
problems with race conditions, and because they are implemented by
ioctl() calls, they encourage duplication of code in multiple
drivers. cfg80211 will bring a much cleaner API along with
fixing some existing bugs like the 31 character limit for SSIDs.
Access point (AP) mode is another feature that is coming. Typically, APs
use similar or identical hardware to that in wireless MACs. For Soft MAC
hardware, all that is needed is support on the CPU side for AP mode, which
is coming for mac80211. Mesh networking, which has been
popularized by the OLPC project, is also coming to mac80211.
Cozybit has provided an implementation which will allow Linux to have a
feature unavailable for Windows.
Areas that need attention, but are not yet being worked on, were next on
Linville's agenda.
Suspend and resume support is "flawed for mac80211
due to connection management issues". Because mac80211 is
unaware of suspend and resume, drivers must work around it by de-registering
and re-registering with it, which can be slow. Adding support for suspend
and resume is on the list, as is supporting power saving modes.
Linville went on to discuss three big issues that are largely outside of
the control of the wireless hackers: firmware licensing, vendor participation,
and regulatory concerns. Because drivers for Windows come with the
firmware in the driver, many hardware vendors do not license the firmware
blob separately. This means that it is unclear what can be done with those
blobs. Certain vendors—Intel and Ralink were specifically called
out—provide liberal licenses for their firmware. Users are
encouraged to "vote with your dollars" by purchasing devices
that either do not require firmware or that have a clear, free software
friendly license.
Another consideration when deciding which vendors to support is whether
they are engaged with the community. For the most part, all vendors but
Broadcom are working with the wireless hackers by providing documentation
and/or source code. Some are even providing
dedicated developers to work on Linux drivers—Intel was the first,
but both Atheros (which just released a driver for its ath9k
hardware) and Marvell have also begun doing that.
Government regulations about what can and cannot be done in the unlicensed
frequencies used by wireless are a concern that is frequently cited by
vendors when refusing to work with the community. Unfortunately, their
concerns are not completely without merit as hardware vendors are expected
to ensure compliance with the regulations. "Non-compliance could be
a huge loss" for those companies. As Linville points out, though,
most vendors find a way to support Linux drivers.
In answer to a question, Linville said that most WiMAX and 3G wireless
devices are Full MAC designs, so there should be little or no regulatory
concern, which, in turn, means that Linux support should not be much of a
problem—at least until Soft MAC devices come along. Overall, Linux
wireless has come a long way, but there is lots still to do. One gets the
sense that the wireless team is up to the task.
Patches and updates
Build system
- Sam Ravnborg: kbuild.
(July 28, 2008)
Page editor: Jonathan Corbet