Kernel development
Brief items
Kernel release status
There have been no kernel releases over the last week. The 2.6.23 merge window remains open, and patches are flooding into the mainline repository; see the article below for a summary.
Kernel development news
Merged for 2.6.23
Some 2600 changesets have been merged into the mainline kernel repository since last week's summary. The shape of 2.6.23 is now becoming clearer; this kernel will include:
- New drivers for Dallas DS1682 elapsed time recorder chips, PMC-Sierra
MSP71xx i2c controllers, Renesas M66592 USB peripheral controllers,
Renesas R8A66597 USB host controllers, OTi-6858 USB-to-RS232 bridge
controllers, Samsung S3C24xx SoC USB device controllers, Intel iop32x,
iop33x, and iop13xx DMA engines, Xilinx SystemACE compact flash
interfaces, BCM1250 dual UART devices, OMAP24xx multichannel SPI
controllers, Atmel AVR32 AT32AP700x real-time clocks, ST M41T80 and ST
M48T59 real-time clocks, Dallas DS1216 real-time clocks, TI OMAP
framebuffers, display controllers, and LCD controllers (along with
support for a number of panels), Atmel AT32AP700X watchdog devices,
IBM z/VM virtual card readers and punches, and Afatech AF9005
demodulators.
- After years of work, the core Xen i386 implementation has been
merged. Xen is finally a part of the mainline kernel. (Anybody who
is tempted to believe that predictions found in LWN are worth anything
may be amused by Dave Jones
poking fun at a suggestion, published in 2004, that Xen could be
merged sometime soon).
- The fallocate()
system call has been merged, but without the deallocation options.
- The developmental ext4 filesystem has gained a number of new features,
including fallocate() support, nanosecond timestamps, and
support for directories containing more than 65,000 other directories.
- The new "macvlan" driver allows the system administrator to create
virtual interfaces mapped to and from specific MAC addresses.
- A number of virtual drivers for Sun logical domains (on the SPARC64
architecture) have been added. LDOM CPU hotplug support has also been
added.
- The bsg code - a new generic SCSI device driver based on the block
layer - has been merged.
- IPv4 multipath cached routing support has been dropped; this code
never did work very well, and never got out of the experimental
state.
- Basic, experimental support for PPP over L2TP sockets has been added.
- A device model extension (marked experimental) can export a laptop's
desktop management information (DMI) data through sysfs. This will
allow distributors to load just the drivers needed for a specific
laptop instead of the "load them all and let the hardware sort them
out" technique which is often used now.
- The highly experimental "USB persist" feature attempts to maintain the
state of USB devices when they lose power. The driving motivation
behind this patch is to be able to suspend a system containing
filesystems on USB storage and still have those filesystems mounted
and working at resume time.
- As scheduled, the speedstep-centrino CPU governor has been removed in
favor of the acpi-cpufreq code.
- The XFS filesystem now has a "stream of files" concept which allows it
to place related files (a series of frames in a video stream, for
example) contiguously on disk.
- The AFS filesystem now has file locking support.
- The raw block driver has been un-deprecated since it appears it will
not be going away anytime soon.
- The O_CLOEXEC
open flag has been added.
- There is a new clone() flag - CLONE_NEWUSER - which
creates a new user namespace for the process; it is intended for use
with container systems.
- The long-debated memory
fragmentation avoidance patches have been merged at last; the
associated lumpy reclaim
code has been merged as well.
- The kernel virtual machine (KVM) code can now support SMP guests.
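Of the user-visible additions listed above, O_CLOEXEC is perhaps the easiest to demonstrate from user space. The flag asks open() to mark the new file descriptor close-on-exec at creation time, which avoids the window between open() and a later fcntl(F_SETFD) call in a multithreaded program. What follows is a minimal sketch, assuming a C library whose headers define O_CLOEXEC; the helper names open_cloexec() and has_cloexec() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open path read-only with close-on-exec set
 * atomically by open() itself. Returns the descriptor, or -1 on error
 * (errno is set by open()). */
int open_cloexec(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: report whether close-on-exec is set on fd.
 * Returns 1 if set, 0 if not, -1 if fcntl() fails. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor returned by open_cloexec() should report the flag as set, while one obtained from a plain open() call should not.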
Changes visible to kernel developers include:
- Much of the x86 startup code has been rewritten in C. There should be
little in the way of changes for anybody who does not actually get
into the code, but, for those folks, the new version should be much
easier to work with.
- There is a new rtnetlink API for managing software network devices.
- The networking core can now work with devices which have more than one
transmit queue. This is a feature which was needed to properly
support some wireless devices.
- The sysfs core has been significantly rewritten to weaken the
connection between sysfs entries and internal kobjects. The new code
should make life easier for driver writers who will have fewer object
lifecycle issues to worry about.
- The never-used enable_wake() PCI driver method has been
removed.
- Drivers wanting to get the revision ID from the PCI config space
should now just use the value found in the new revision
member of the pci_dev structure. All in-tree drivers have
been changed to use this new approach.
- The SCSI layer has picked up a couple of scatter/gather accessor
functions - scsi_dma_map() and scsi_dma_unmap() - in
preparation for chained scatter/gather lists and bidirectional
requests. Most drivers in the kernel have been updated to use these
functions.
- The idr code has a couple of new helper functions:
idr_for_each() and idr_remove_all().
- Much of the kernel build system has been converted over to
"menuconfig" objects, making it easy to turn whole groups of options
on or off at once.
- sys_ioctl() is no longer exported to modules.
- The page table helper functions ptep_establish(),
ptep_test_and_clear_dirty()
and ptep_clear_flush_dirty() have been removed - they had no
in-kernel users.
- Kernel threads are non-freezable by default; any kernel thread which
should be frozen for a suspend-to-disk operation must now call
set_freezable() to arrange for that to happen.
- The SLUB allocator is now the default.
- The new function is_owner_or_cap(inode) tests for access
permission based on the current fsuid and capabilities; it replaces
the open-coded test previously found in several filesystems.
- There is a new utility function:
    char *kstrndup(const char *s, size_t max, gfp_t gfp);
This function duplicates a string along the lines of the user-space strndup().
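For readers unfamiliar with strndup(), the semantics can be sketched in user space. This is not the kernel implementation: malloc() stands in for the kernel allocator, there is no gfp argument, and the name dup_at_most() is invented for the example. The NULL-input behavior shown here is also an assumption of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* User-space illustration of strndup()-style duplication: copy at most
 * max characters of s into a freshly allocated, NUL-terminated buffer.
 * Returns NULL if s is NULL or the allocation fails. */
char *dup_at_most(const char *s, size_t max)
{
    size_t len;
    char *buf;

    if (s == NULL)
        return NULL;

    len = strnlen(s, max);          /* never reads past max bytes */
    buf = malloc(len + 1);          /* room for the terminator */
    if (buf != NULL) {
        memcpy(buf, s, len);
        buf[len] = '\0';
        assert(strlen(buf) <= max); /* the copy is never longer than max */
    }
    return buf;
}
```

Calling dup_at_most("kernel", 3) yields a three-character copy; the caller owns the result and must free() it.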
It's worth noting a couple of things which will not be in 2.6.23. The first is the process containers patch, which is not quite considered to be ready yet. Some other features (notably CFS group scheduling) are waiting for process containers, so chances are good that this code will be in shape for merging by 2.6.24.
The other big omission is the x86_64 clockevents, dynamic tick, and high-resolution timers code. This patch is considered by its authors to be ready (and your editor has been running it without ill effect), but, after the troubles caused by the integration of the i386 version of this code in 2.6.21, there is a desire felt by some developers to go a bit more slowly and carefully. The result was a somewhat unhappy discussion on the mailing lists and a plan to better split these patches so they can be carefully reviewed for the next development cycle.
USB device authorization
Universal serial bus (USB) devices do not normally have much of a security model associated with them. If a user is able to plug a USB device into the system, said system assumes that the device is properly authorized to be there. There are situations where the connection of a USB device causes people to worry; the usual scenario is the fear of corporate secrets being copied onto some sort of USB storage device and carried out of the building. In general, in situations where such fears run strong, the response has involved (attempted) bans of USB devices or simply filling the USB ports of accessible computers with glue.
Wireless USB changes the situation slightly. This protocol allows USB devices to operate remotely, without that pesky cable to trip over; it can be thought of as occupying a niche similar to that of Bluetooth. While a typical laptop user might be expected to notice an attacker plugging a normal USB keyboard into their system, said attacker could attempt to connect a wireless USB keyboard without coming near. Clearly, some sort of security layer is required. The wireless USB specification has anticipated this need; it provides for a whole series of acronym-laden techniques for ensuring that (1) hosts and devices authenticate themselves to each other, and (2) wireless USB communications are sufficiently well encrypted that they cannot be eavesdropped upon.
Iñaky Perez-Gonzalez is working on wireless USB support for Linux. He has come to the conclusion that the grungy details of wireless USB authentication belong in user space; the kernel cannot, on its own, keep track of which devices are known to the system and are allowed to connect. It is, however, up to the kernel to implement the authorization part of the equation: a wireless USB device which is not authorized should not be able to perform any sort of exchange with the host system. Iñaky's response to the authorization problem is this set of patches to the USB subsystem.
These patches add three new flags to the usb_device structure: wusb, authorized, and authenticated. The first indicates that a device is wireless, and the last (which is not yet used) indicates that the device has passed authentication. In the middle is the authorized flag which indicates whether it is OK to talk to the device. If the device is not authorized, the kernel will not even read its configuration to find the endpoints it provides; the only thing that can happen at that point is authentication. To that end, various points in the USB stack are changed to check the authorized flag before allowing access to a USB device.
User space is brought into the picture by way of the usual device-attach announcement and the creation of an associated sysfs tree. The sysfs directories for USB devices gain a new authorized attribute which corresponds to the internal flag; user space can enable access to the device by writing a non-zero value to that attribute. That infrastructure is all that is required for some sort of user-space daemon to notice the arrival of a new wireless USB device, check its database of known devices, possibly pop up some sort of pairing dialog to the user, and implement a decision on whether the device should be allowed to connect or not.
Iñaky has taken things a step further by realizing that this authorization mechanism need not be limited to wireless devices; it can, in fact, be used to allow some sort of management code to pass judgment on any USB device. There is a set of per-host authorized_default flags which can be configured by the administrator; simply setting the default to zero with no other action will disallow the connection of any new devices, whether wired or not.
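From user space, flipping either flag comes down to writing a short string to a sysfs attribute. A minimal sketch of such a helper follows; the function name set_usb_flag() is invented here, and the exact sysfs paths at which the authorized and authorized_default attributes appear are not spelled out above, so the caller is assumed to know them.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write "1" (allow) or "0" (disallow) to a
 * sysfs-style attribute file such as a device's "authorized" or a
 * host's "authorized_default". The path is supplied by the caller.
 * Returns 0 on success, -1 on failure. */
int set_usb_flag(const char *attr_path, int allow)
{
    FILE *f;
    int ok;

    assert(attr_path != NULL);
    f = fopen(attr_path, "w");
    if (f == NULL)
        return -1;

    ok = fputs(allow ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)     /* write errors may only surface at close */
        ok = 0;
    return ok ? 0 : -1;
}
```

A daemon implementing the scheme described above might call set_usb_flag() with 0 on each host's authorized_default attribute at startup, then with 1 on the authorized attribute of each device it decides to accept.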
A more complex implementation might allow only certain types of devices to connect. Keyboards and mice might be acceptable, but anything which could remove data from a system - storage devices or printers, say - would be disallowed. Or storage devices could be allowed, but only if they contain some sort of properly signed authorization certificate which can be verified by the host system. There are a number of interesting possibilities. The resulting security will be less than that which could be had by filling in the ports or simply configuring USB out of the system entirely, but it might be just what is needed at some sites.
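A class-based policy of the sort described above could be as simple as the following sketch. The class codes are the standard USB-IF values (0x03 for human interface devices, 0x07 for printers, 0x08 for mass storage); everything else, including the function name policy_allows_device() and the rule that every interface of a composite device must pass, is an assumption of this example. How the daemon learns a device's interface classes is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard USB class codes used by this hypothetical policy. */
enum {
    USB_CLASS_HID          = 0x03, /* keyboards, mice */
    USB_CLASS_PRINTER      = 0x07,
    USB_CLASS_MASS_STORAGE = 0x08
};

/* Allow a device only if it exposes at least one interface and every
 * interface is a human interface device. Anything which could carry
 * data away (storage, printers) or which is unknown is refused. */
bool policy_allows_device(const uint8_t *classes, size_t n)
{
    size_t i;

    assert(classes != NULL || n == 0);
    if (n == 0)
        return false;   /* nothing known about the device: refuse */

    for (i = 0; i < n; i++) {
        if (classes[i] != USB_CLASS_HID)
            return false;
    }
    return true;
}
```

Requiring every interface to pass keeps a composite device from hiding a storage interface behind a keyboard.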
Overall, it's a relatively simple patch set which adds some interesting capabilities. Much of the hard work - authentication and encryption setup - remains, but that's a job for user space. Iñaky has asked that this code be merged for 2.6.23; it's just a bit late, though, for a relatively untested (in the wider world) chunk of code to slip through the merge window. 2.6.24 seems more likely.
Yet another approach to software suspend
Back in early 2006, there was an ongoing, energetic debate over the future of the software suspend (to disk) code - a situation which remains true to this day. In the middle of it all, Andrew Morton had jumped in with a suggestion for a different approach:
Eighteen months later, it looks like we might just get that "suspend3" in the form of the kexec jump patch, posted by Ying Huang.
Ying's patch builds on the existing kdump facility. The purpose of kdump is to provide safe and useful crash dumps in situations where the state of the operating system is uncertain. If the system panics it is nice to be able to save its current state for post-mortem debugging. It is important, however, that the buggy kernel - which is now in an untrustworthy state - not be used to do dangerous things like write crash dump data to disk. To avoid that situation, a small "dump kernel" is placed in a reserved area of memory where, most of the time, it lurks unnoticed and unneeded. Should a panic occur, a kexec() call is made to transfer control to the dump kernel, which will be able to start up in a known state. As long as the dump kernel stays within its reserved area of memory, it will be able to write the rest of the system state to disk (or wherever) in a relatively safe way.
What Andrew recognized last year is that suspend-to-disk (which is slowly being rebranded "hibernation") does essentially the same thing: system activity is stopped and the current system state is written to disk. If the dump kernel could read that state back into memory and return to the original kernel, it would be able to hibernate (and resume) the system. An implementation along these lines would have the advantage of unifying much of the kdump and hibernation code, thus concentrating development effort and generally simplifying things. Plus it would be a way to eliminate the current code, which, despite many years' tenure in the mainline, remains somewhat unloved.
The current patch does not do all of that; it is really just the first step: making it possible to jump from the secondary kernel back into the original kernel. The code is relatively simple, though it does rely on much of the existing infrastructure to properly suspend and power down all devices in the system for the jump in either direction. So if device drivers are interfering with hibernation now, that problem will still exist in a kexec-based implementation. But much of the other hibernation code, including the much-maligned process freezer, would be unneeded and could be removed.
There are a few details to take care of before one can take a hatchet to the current hibernation code, though. Powering down devices between the two kernels is not really necessary or desirable; they just need to go into a quiet "hibernate" state. A kdump kernel needs to be placed in reserved memory from the beginning; trying to load it at panic time would be far too late. A kernel used for hibernation, instead, need not occupy system memory all the time, so some sort of on-demand secondary kernel loading is needed. The actual task of saving and restoring the system image is yet to be implemented - that can all be done easily in user space, however, with very little in the way of kernel support. Making the resume process fast enough will take some work - users might take a dim view of having to wait for two kernels to boot before getting their system back. And so on.
So, in other words, nobody should be holding their breath for kexec-based hibernation in the near future. But the initial response to this approach was mostly positive; there seems to be a lot of interest in simply starting over in this area. Some of that enthusiasm might fade as work progresses and it turns out that, even with a new approach, hibernation is still a difficult and somewhat grungy problem. So only time will tell if this code will develop into a better hibernation implementation.
Page editor: Jonathan Corbet
