Brief items
The 2.6.29 merge window is open, so there is no development kernel
release as of this writing. Quite a bit of work has been merged for
2.6.29; see the separate article below for details.
The current stable 2.6 kernel is 2.6.28, released by Linus on
December 24. Some of the
highlights of this kernel are the addition of the GEM GPU memory manager, the removal of the "experimental" label from the ext4 filesystem,
scalability improvements in memory management
via the reworked vmap() and pageout scalability patches, the move of the -staging drivers into the mainline,
and much more. See the excellent KernelNewbies
summary for lots more details about 2.6.28. Says Linus: "In fact,
even _if_ you have friends or family, leave them to their endless
toil over that christmas ham or turkey, and during the night, when they're
asleep, you can give them that magical present of a newly updated
computer. When they wake up tomorrow morning, tell them how you saw Santa
crawl down the chimney with his USB stick in hand, updating the OS of all
good boys and girls."
Kernel development news
The software design moral: Everything is shit and will attempt to
kill you when you're not looking.
--
Matthew Garrett
I don't believe "auto-destroy my music collection" is a sane
default.
--
Alan Cox
BTW, the current influx of higher-complexity filesystems certainly
worries me a little.
--
Christoph Hellwig
Can you post the patch, so that we can see if we can find some
silly error that we can ridicule you over?
--
Linus Torvalds (Thanks to Jeff
Schroeder)
There's a lot of stuff here, as can be seen by the final diffstat
number:
779 files changed, 472695 insertions(+), 26479 deletions(-)
and yes, it's all crap :)
--
Greg Kroah-Hartman
I will just note wryly that it used to be that I could compile 0.9x
kernels on a 40 MHz 386 machine in 10 minutes. Some 15 years
later, it still takes roughly the same amount of time to compile a
kernel, even though computers have gotten vastly faster since then.
Something seems wrong with that....
--
Ted Ts'o
By Jonathan Corbet
January 7, 2009
As of this writing, some 6500 non-merge changesets have been accepted for the
2.6.29 development cycle. There is the usual set of new device drivers,
combined with a number of important core kernel changes.
User-visible changes include:
- New drivers for SH-2A FPU based SH7201 processors,
Palm T|X, T5 and LifeDrive audio devices,
Gumstix Overo audio devices,
Marvell Zylonite audio devices,
Wolfson Micro TWL4030, UDA134x, WM8350 AudioPlus, and WM8728 codecs,
Texas Instruments SDP3430 audio devices,
OMAP3 Pandora audio devices,
Intel G45 integrated HDMI audio codecs,
Broadcom BCM50610 network PHYs,
LSI ET1011C PHYs,
KS8695 Ethernet devices,
SMSC LAN9420 PCI Ethernet adapters,
SMSC LAN911x and LAN921x embedded Ethernet controllers,
Solarflare 10Xpress SFT9001 network controllers,
Atheros AR9285 chipsets,
Solos ADSL2+ PCI Multiport cards,
Nuvoton W90X900 CPUs,
LG ATSC lgdt3304 video capture devices,
Sharp s921 ISDB-T devices,
ST Microelectronics STB6100 silicon tuners and STB0899 multistandard
frontend devices,
ST STV06XX-based cameras,
TDA8261 8PSK/QPSK tuners,
OmniVision ov772x cameras,
Conexant CX24113/CX24128 tuners,
Texas Instruments TVP514x video decoders,
OMAP2 camera devices (as seen in Nokia Internet tablets),
NXP TEA5764 I2C FM radio devices,
Chelsio T3 ASIC based iSCSI adapters,
Wolfson Microelectronics WM8350 power management units,
Dialog DA9030 battery chargers,
DaVinci DM355 EVM microcontrollers,
Intel 5400 (Seaburg) memory controller chipsets,
Walkera WK-0701 RC transmitters,
Wacom W8001 penabled serial touchscreens,
Dialog Semiconductor DA9034 touchscreens,
TSC2007 based touchscreens,
PXA930 trackball mice, and
PXA930/PXA935 enhanced rotary controllers.
- A number of new drivers have also entered the kernel via the staging
tree; these include drivers for Sensoray 2250/2251 video capture
devices, Airgo AGNX00 wireless chips, a wide variety of data
acquisition devices via the Comedi framework, ASUS laptop OLED
displays, Ralink 2860 and 2870 wireless interfaces ("This is the
Ralink RT2860 driver from the company that does horrible things like
reading a config file from /etc."),
RealTek RTL8187SE Wireless LAN NICs,
HD44780 or KS-0074 parallel port LCD panels,
ServerEngines BladeEngine (EC 3210) network interfaces,
Princeton Instruments USB cameras,
Mimio Xi interactive whiteboards,
the openPOWERLINK network stack,
Frontier Tranzport and Alphatrack devices, and
several families of Meilhaus data acquisition boards.
Also added, seemingly without help from Google, is a set of drivers
for the Android platform, including
support for the /dev/binder IPC mechanism, timed GPIO
operations, the RAM buffer console,
a special "low memory killer" device,
and the logger device.
Remember that "staging" drivers are not
considered to be up to normal kernel code quality standards; they are
merged in the hope that developers will help to make them better.
Quite a few improvements to these drivers were merged via the staging
tree this time around, so this tree may be working as intended.
- The long-deprecated eepro100 driver has finally been removed; the
e100 driver should be used instead.
- The SCSI layer has acquired support for Fibre Channel over Ethernet
(FCoE) devices.
- The GEM layer used for memory management in graphical processor unit
(GPU) driver code has seen a number of improvements. The big news in
this area, though, is that the kernel mode setting code has finally
been merged. This change paves the way for the removal of a great
deal of scary user-space code, better support for features like fast
user switching, and the ability to run the X server without root
privilege. Kernel mode setting is still in an early state, though,
and most people will not want to enable it unless they are sure they
have a properly-prepared user space.
- Support for HP iPAQ h5000 systems,
Samsung S3C64XX series based systems,
and Pandora game consoles has been
added to the ARM architecture code.
- The SuperH architecture has gained support for the ftrace tracing
framework.
- There is a new no_file_caps= boot option which can be used to
disable file capabilities on kernels which have that feature enabled.
From the changelog: "This allows distributions to ship a kernel
with file capabilities compiled in, without forcing users to use (and
understand and trust) them."
- The CIFS filesystem supports a new forcemand mount option;
when present, it causes CIFS to use mandatory locks rather than
POSIX-style advisory locks.
- The CUBIC 2.3 TCP congestion control algorithm and the "backward
congestion notification" feature are now supported in the
networking layer.
- The network code has support for the "deficit round robin" packet
scheduling algorithm, said to produce highly fair scheduling with
minimal cost.
- A vast set of network namespace patches has been merged. The
namespace hackers have, so far, refrained from saying that this
feature is ready for general use, but it must be getting closer.
- The devpts filesystem now supports the creation of multiple instances
in different namespaces.
- The wireless regulatory domain code has been extended to provide 802.11d support.
- The Tree RCU patch set,
which should provide improved scalability on systems with "more than a
few hundred CPUs," has been merged.
- Users of huge pages can now look in /proc/pid/smaps
for a new KernelPageSize value giving the actual size of the
pages in use. Among other things, this information can be used to
verify that a process is actually using large pages where expected.
- The eCryptfs filesystem now supports the encrypting of file names as
well as their contents.
- The FUSE user-space filesystem mechanism can now support
ioctl() and poll() calls.
- Support for unlabeled networks and hosts has been added to the SMACK
security module.
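The KernelPageSize check mentioned in the list above can be sketched as a small user-space parser. This is an illustration only: the helper names are hypothetical, and the assumed line layout ("KernelPageSize: <value> kB") should be checked against the smaps output of the running kernel.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: if 'line' is a "KernelPageSize:" line from
 * /proc/<pid>/smaps, store the size in kB in *kb and return 1.
 * Return 0 for any other line. The "<name>: <value> kB" layout is
 * an assumption about the smaps format.
 */
int parse_kernel_page_size_kb(const char *line, long *kb)
{
	static const char key[] = "KernelPageSize:";

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return 0;
	return sscanf(line + sizeof(key) - 1, "%ld", kb) == 1;
}

/*
 * Scan an smaps-style stream and return the largest KernelPageSize
 * seen (in kB), or 0 if the field never appears.
 */
long max_kernel_page_size_kb(FILE *fp)
{
	char line[256];
	long kb, max = 0;

	while (fgets(line, sizeof(line), fp))
		if (parse_kernel_page_size_kb(line, &kb) && kb > max)
			max = kb;
	return max;
}
```

A caller would open /proc/self/smaps (or the file for another process) with fopen() and pass the stream to max_kernel_page_size_kb(); a result larger than the base page size suggests that at least one mapping is backed by huge pages.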
Changes visible to kernel developers include:
- There is a new synchronous hash interface called "shash." It
simplifies the use of synchronous hash operations while allowing the
same tfm to be used simultaneously in different threads. All in-tree
users have been switched to the new API.
- The massive task credentials
patch set has been merged. This code reorganizes the handling of
process credentials (user ID, capabilities, etc.). One of the
immediate implications of this change is that direct references to
credential-oriented fields in the task structure need to be changed;
for example, current->user->uid becomes
current_uid(). See Documentation/credentials.txt for a
description of the new API.
- The ftrace code has seen a lot of internal changes. The function
tracing feature has seen a number of improvements, and the developers
have added
mechanisms to profile the behavior of if statements,
provide function call graphs,
obtain user-space stack traces, and
follow CPU power-state transitions.
- Most of the callback functions/methods associated with the
net_device structure have been moved out of that structure
and into the new struct net_device_ops. In-tree drivers
have been converted to the new API.
- The priv field has been removed from struct
net_device; drivers should use netdev_priv() instead.
- The generic PHY layer now has power management support. To that end,
two new methods - suspend() and resume() - have been
added to struct phy_driver.
- The networking layer now supports large receive offload (or
"generic receive offload") operation.
- The NAPI API has been cleaned up somewhat; in particular, functions
like netif_rx_schedule(), netif_rx_schedule_prep(),
and netif_rx_complete() have lost the unneeded struct
net_device parameter.
- The hrtimer code has been simplified with the removal of variable
modes for callback functions. All processing is now done in hardirq
context.
- A new set of LSM hooks has been added; these support pathname-based
security operations. With the merging of these hooks, one major
obstacle to the inclusion of security modules like AppArmor and TOMOYO
has been removed.
- The kernel will now refuse to build with GCC 4.1.0 or 4.1.1; those
versions have unfortunate bugs which prevent the building of a working
kernel. Versions 3.0 and 3.1 have also been deemed to be too old and
will not be supported in 2.6.29.
- Video4Linux drivers now use a separate v4l2_file_operations
structure to hold their VFS-like callbacks. The prototypes of a
number of these functions have been changed to remove the
inode argument.
- Video4Linux2 has also acquired a new "subdevice" concept, meant to
reflect the fact that video "devices" tend to be, in reality, a set of
cooperating devices. See the new
document for a description of how this mechanism works.
- Two new functions - stop_machine_create() and
stop_machine_destroy() - allow the independent creation of
the threads used by stop_machine(). That, in turn, lets
those threads be created before trying to actually stop the machine,
making that operation more resistant to failure.
- The poll() file operation is now allowed to sleep; see this article for more
information on this change.
- The CPU mask mechanism, used to represent sets of processors in the
system, is in the middle of being massively reworked. The problem is
that CPU masks were often put on the stack, but, as the number of
processors grows, the stack lacks room for the mask. The new API is designed to
get these masks off the stack, and to guard against anybody ever
trying to put one back. See this
posting by Rusty Russell for details on this work.
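The net_device_ops change in the list above follows a common kernel pattern: per-object function pointers move into one shared, constant table of operations. The stand-alone sketch below shows the shape of that pattern in user-space C. Every name in it (toy_netdev, toy_netdev_ops, and so on) is a hypothetical stand-in, not one of the kernel's actual definitions.

```c
#include <stddef.h>

struct toy_netdev;

/* One table of callbacks, shared by every device of a given driver. */
struct toy_netdev_ops {
	int (*open)(struct toy_netdev *dev);
	int (*stop)(struct toy_netdev *dev);
	int (*start_xmit)(struct toy_netdev *dev, size_t len);
};

/* The device carries a single pointer instead of one per callback. */
struct toy_netdev {
	const struct toy_netdev_ops *ops;
	int up;			/* nonzero once open() has succeeded */
	size_t tx_bytes;	/* bytes accepted for transmission */
};

static int toy_open(struct toy_netdev *dev)
{
	dev->up = 1;
	return 0;
}

static int toy_stop(struct toy_netdev *dev)
{
	dev->up = 0;
	return 0;
}

static int toy_start_xmit(struct toy_netdev *dev, size_t len)
{
	if (!dev->up)
		return -1;	/* refuse to transmit on a closed device */
	dev->tx_bytes += len;
	return 0;
}

/* Constant, so it can be shared and placed in read-only memory. */
static const struct toy_netdev_ops toy_ops = {
	.open       = toy_open,
	.stop       = toy_stop,
	.start_xmit = toy_start_xmit,
};

/* Callers go through the table rather than through the device itself. */
int toy_xmit(struct toy_netdev *dev, size_t len)
{
	return dev->ops->start_xmit(dev, len);
}
```

A driver written this way sets the ops pointer once at setup time; the per-device structure shrinks, and the callback table can be declared const.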
The merge window opened on December 28; if the usual two-week pattern holds,
changes should be accepted through January 11. Tune in next week for
an update on the final patches merged for the 2.6.29 kernel.
By Jake Edge
January 7, 2009
Using an out-of-tree kernel patch has several downsides but, as long as the
patch is maintained and updated with the kernel, it is workable. If the
developers lose interest—or funding—it suddenly becomes a much
bigger problem for users. That scenario may be about to play out for users
of the grsecurity tool as a recent release
comes with a warning that it could be the last.
Users of grsecurity are, unsurprisingly, worried about the future of the
security tool, but calls for its inclusion in the mainline are not likely
to be successful. Over time, largely because of the efforts of others
outside of the grsecurity project,
various pieces of grsecurity (and the associated PaX project) have been added to the
kernel. But, there are a number of reasons that the full grsecurity patch
is not in the mainline; the most basic is that the developers seem
unwilling to follow, or uninterested in following, the normal path to inclusion.
The grsecurity patch implements a number of security features that are
useful, particularly for web servers or servers that provide shell access
to untrusted users. One of the major features is role-based
access control (RBAC), which is an alternative to the traditional UNIX
discretionary
access control (DAC) or the more recent mandatory
access control (MAC) provided by SELinux and Smack. The aim of RBAC is to
create a
"least privilege" system, where users and processes have only the minimum
necessary privilege to accomplish their task. grsecurity also includes
hardening of the chroot() system call, to eliminate privilege
escalation and other vulnerabilities from within a "chroot jail". In
addition, there
are a number of other miscellaneous features like auditing and restricting
/proc information, all of which are listed on the grsecurity
features page.
Another major component of grsecurity is the PaX code, which restricts
memory use
so that various exploits, such as buffer overflows and other code execution
vulnerabilities, are blunted or eliminated. It does this by making data
pages non-executable using—or emulating—the "no execute" (or
NX) bit. To guard against code injection, PaX restricts mprotect() as well, so that pages cannot be
both writable and executable. PaX also
adds much
more aggressive address space layout randomization (ASLR) than is currently
used by Linux. PaX is developed separately from grsecurity, by the
anonymous "PaX Team", then incorporated into grsecurity by developer
Brad Spengler.
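The mprotect() restriction described above amounts to a rule that no mapping may be writable and executable at the same time. A minimal sketch of such a check follows, written as ordinary user-space C. The function name wx_request_allowed() is hypothetical, and this is not PaX's implementation, which enforces its policy inside the kernel and handles more cases than this.

```c
#include <sys/mman.h>

/*
 * Hypothetical policy check: return 1 if a protection request is
 * acceptable under a "never writable and executable together" rule,
 * 0 if it should be refused.
 */
int wx_request_allowed(int prot)
{
	return !((prot & PROT_WRITE) && (prot & PROT_EXEC));
}
```

Under a rule like this, a program that writes code into a buffer and then tries to make it executable must give up write access first, which closes off the simplest form of code injection.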
The project has been around for a long time; grsecurity started in 2001,
while PaX began in 2000. There are numerous satisfied users and grsecurity
has been used in distributions such as NetSecL and Hardened Gentoo, but it
has never made it into the mainline.
Gabor Micsko recently posted a request on
linux-kernel for Linus Torvalds to reconsider grsecurity:
The common opinion of the developers of grsecurity, PaX and their users
is that acceptance of the code into the kernel would be the best
solution for saving the project, beside finding another long-term
sponsor.
Torvalds replied that much of what was in
grsecurity and PaX was "insane and very annoying and invasive
code." He then went on to explain some of the history:
The apparent inability (and perhaps more importantly - total
unwilling[n]ess) from the PaX team to be able to see what makes sense in a
long-term general kernel and what does not, and split things up and try to
push the sensible things up (and know which things are too ugly or too
specialized to make sense), caused many PaX features to never be merged.
Much of it did get merged over the years (mostly because some people spent
the time to separate things out), but no, we're not going to suddenly
start merging code like that just because the project is in trouble. None
of the basic issues have been solved.
A perfect example of the unwillingness to work with the kernel hackers is
embodied in the decision not to
implement RBAC as a Linux Security Module (LSM). For better or worse,
LSM is the mechanism used to implement access control in the kernel.
Conceptually, it is a good fit for the grsecurity RBAC code. It might
require additional LSM hooks, but working on getting those hooks added is
the right approach. There was some uncertainty about LSM at one time, but
it clearly is the way forward today.
There may also be an issue with the PaX code, in that anonymous
contributions to the kernel are not allowed. Presumably Spengler, or some
other interested hacker, could sign off on that code, but it cannot be
accepted directly from "PaX Team".
To the extent grsecurity and PaX have been proposed for inclusion, they
have always been presented as a single monolithic patch. There has never
been an attempt to break the patch up into logical chunks that can be
accepted or rejected on their individual merits. So far, that has not
occurred even after the project lost its sponsor. But waiting until the
last minute is not going to work. As Robert Hancock puts it:
Saying to the kernel developers "here, throw this huge blob of code into
your kernel because otherwise we're taking our ball and going home" is not
how it works.
If there is value in the existing code, interested users and developers
need to work within the kernel process to get it accepted. To do that, one
must identify the useful pieces and proceed from there. Valdis Kletnieks suggests:
Probably the best way to proceed would be for the stakeholders to come to some
agreement on which parts are the "sane stuff" (which could be an interesting
food fight), split those parts out, and submit them for inclusion as standalone
separate patches.
This is yet another example of the perils of out-of-tree code. By all
accounts, there are satisfied grsecurity users who may well be left behind
if the grsecurity project fails to find sponsors by the end of March. They
can, of course, continue running the grsecurity-enhanced kernels they
currently have, but may not be able to take advantage of upcoming kernel
advances.
Perhaps the stakeholders will gather together and continue updating
grsecurity for newer kernels, but that still leaves the underlying
problem. They would be better served spending at least part of their time
working with the kernel hackers to get as much of grsecurity and PaX
as possible merged into the mainline.
By Jonathan Corbet
January 7, 2009
The Btrfs filesystem has been under development for the last year or so;
for much of that time, it has been widely regarded as the most likely "next
generation filesystem" for Linux. But, before it can claim that title,
Btrfs must stabilize and find its way into the mainline kernel. Btrfs
developer Chris Mason has been saying for a while that he thinks the code
will come together more quickly if it is merged relatively soon, even if it
is not yet truly ready for production use. General experience with kernel
development tends to support this position: in-tree code gets more review,
testing, and fixes than out-of-tree code. So the development community as
a whole has been reasonably supportive of a relatively early Btrfs merge.
In our last Btrfs episode,
Andrew Morton suggested that a 2.6.29 merge be targeted.
Chris would like that to happen; to that end, he has posted a version of Btrfs for
consideration. Unsurprisingly, that posting has already increased the
amount of attention being paid to this code, with the result that Chris
quickly got a list of things to fix. Most of those have now been
addressed, but there are a few remaining issues which could still impede
the merging of Btrfs in this development cycle. This article will look at
the potential roadblocks.
One of those is the user-space API. Btrfs brings with it a whole set of
new ioctl() calls, none of which have been seriously reviewed or
even documented. These calls perform functions like creating snapshots,
initiating defragmentation, creating or resizing subvolumes, adding devices
to the volume set, etc. Interestingly, there has been no real complaint
about the volume-management features of Btrfs in general. But the
interface to features like that needs close scrutiny; normally, user-space
APIs cannot be broken once they are merged into the mainline. There has
been some talk of making an exception for Btrfs, since there is little
chance of systems becoming dependent on a specific interface before Btrfs
is production-ready.
Still, once distributions start shipping Btrfs tools - to help testers if
nothing else - an API change would cause pain. Any potential for this kind
of pain would make API changes very hard to do. So Linux may well end up
being stuck with the early Btrfs API. Given that at least one developer thinks that this API needs a serious rework,
this issue could turn out to be a serious roadblock indeed.
Then, there is the issue of the special-purpose locking primitives used in
Btrfs. To understand this discussion, it's worth looking at the locking
function used within Btrfs:
int btrfs_tree_lock(struct extent_buffer *eb)
{
	int i;

	if (mutex_trylock(&eb->mutex))
		return 0;
	for (i = 0; i < 512; i++) {
		cpu_relax();
		if (mutex_trylock(&eb->mutex))
			return 0;
	}
	cpu_relax();
	mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
	return 0;
}
The lock in question is a mutex, but it is being acquired in an interesting
way. If the lock is held by another process, this function will poll it up
to 512 times, without
sleeping, in the hope that it will become available quickly. Should that
happen, the lock can be acquired without sleeping at all. After 512
unsuccessful attempts, the function will finally give up and go to sleep.
Chris justifies this behavior this way:
Btrfs is using mutexes to protect the btree blocks, and btree
searching often hits hot nodes that are always in cache. For these
nodes, the spinning is much faster, but btrfs also needs to be able
to sleep with the locks held so it can read from the disk and do
other complex operations.
For btrfs, dbench 50 performance doubles with the unconditional spin,
mostly because that workload is almost all in ram.
For 50 procs creating 4k files in parallel, the spin is 30-50% faster.
This workload is a mixture of disk bound and CPU bound.
That kind of performance increase seems worth going for. In fact, it
reflects a phenomenon which has been observed in other situations as well:
even when sleeping locks are used, performance often improves if a
processor spins for a while in the hope that a contended lock will become
available. If the lock can be acquired without sleeping, then the overhead
associated with putting the process to sleep and waking it up can be
avoided. Beyond that, though, there is the fact that the process seeking
to acquire the lock is probably well represented in the CPU's cache.
Allowing that process to continue to run will, if the lock can be acquired
quickly, almost certainly lead to better system performance.
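The spin-then-sleep idea can be tried outside the kernel as well. The sketch below is a user-space analogue built on POSIX threads, not the Btrfs code itself: the names spin_then_lock(), relax() and the lock_path enum are hypothetical, and relax() is only a compiler barrier (GCC/Clang syntax) standing in for the kernel's cpu_relax().

```c
#include <pthread.h>

#define SPIN_TRIES 512	/* same bound as the Btrfs loop quoted above */

enum lock_path { LOCK_FAST, LOCK_SPUN, LOCK_SLEPT };

/* Stand-in for cpu_relax(): a compiler barrier only. */
static inline void relax(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/*
 * Take 'm', polling up to SPIN_TRIES times before falling back to a
 * blocking pthread_mutex_lock(). Returns which path acquired the lock.
 */
enum lock_path spin_then_lock(pthread_mutex_t *m)
{
	int i;

	if (pthread_mutex_trylock(m) == 0)
		return LOCK_FAST;

	for (i = 0; i < SPIN_TRIES; i++) {
		relax();
		if (pthread_mutex_trylock(m) == 0)
			return LOCK_SPUN;
	}

	pthread_mutex_lock(m);	/* may sleep until the holder unlocks */
	return LOCK_SLEPT;
}
```

Returning the path taken makes it easy to count, under a real workload, how often the spin avoids a sleep; whether 512 is a good bound would have to be measured rather than assumed.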
For this reason, the adaptive
realtime locks patch was developed last year, though it never found its
way into the mainline. In response to the Btrfs discussion, Peter Zijlstra
proposed a spinning mutex
patch which is intended to provide the same benefits as the special
Btrfs locking function, but for more general use and without the addition
of magic constants. In Peter's patch, an attempt to acquire a contended
lock will spin for as long as the process holding that lock is actually
running on a CPU. If the lock holder goes to sleep, any process trying to
acquire the lock also goes to sleep. The heuristic seems to make sense,
though detailed benchmarks have not been posted.
The patch was received reasonably
well, though Linus has insisted that some
changes be made.
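The heuristic described for the spinning mutex can be reduced to a three-way decision. The sketch below models only that decision in plain C. It is not Peter's patch: the structure, the enum, and the function name are hypothetical, and a real implementation must read the owner's state safely while it is changing.

```c
#include <stdbool.h>

/* Toy snapshot of a lock's state; not the kernel's struct mutex. */
struct toy_lock_state {
	bool held;		/* some task owns the lock */
	bool owner_on_cpu;	/* that owner is currently running */
};

enum waiter_action { WAITER_ACQUIRE, WAITER_SPIN, WAITER_SLEEP };

/*
 * Decide what a task wanting the lock should do next: take it if it
 * is free, spin while the owner is running, otherwise go to sleep.
 */
enum waiter_action next_action(const struct toy_lock_state *s)
{
	if (!s->held)
		return WAITER_ACQUIRE;
	return s->owner_on_cpu ? WAITER_SPIN : WAITER_SLEEP;
}
```

Unlike a fixed iteration count, this rule needs no tuning constant: the spin ends as soon as there is evidence that the lock will not be released soon.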
So a more general spinning mutex may well find its way into the mainline.
Whether it will go in for 2.6.29 is not clear, though. Developers tend to
like their core locking primitives to be reasonably well tested; merging
something which was developed toward the end of the merge window could be a
hard sell. Until something like that happens, Chris is uninterested in removing his special locking
function:
But, if anyone working on adaptive mutexes is looking for a coder,
tester, use case, or benchmark for their locking scheme, my hand is
up. Until then, this is my for loop, there are many like it, but
this one is mine.
Finally, there is the question of the name. Some reviewers have suggested
that the filesystem should be merged with a name which makes it clear that
it's not meant for production use - "btrfsdev," for example. Chris is
resistant to that idea, noting that, unlike existing filesystems, Btrfs is
known to be new and has no reputation for stability. He has stated his
willingness to make the change, though, if it is truly considered to be
necessary. Bruce Fields pointed out that
calling it "Btrfs" from the beginning could possibly burn future developers
who boot an old kernel (with a non-production Btrfs) after switching to
a newer, production-ready version of the filesystem.
All of this adds up to an uncertain fate for Btrfs in 2.6.29; there are a fair
number of open issues and it's late in the merge window. Of course, Btrfs could
be merged after 2.6.29-rc1; since it is a completely new subsystem, it
won't cause regressions.
But if Linus concludes that there are enough loose ends in the current
Btrfs code, he may just decide to give it one more development cycle before
bringing it into the mainline. So, while nobody seems to doubt that Btrfs
will go in, the question of when remains open.
(With any luck, we will have an authoritative article on Btrfs for this
page in the near future, once the author - you know who you are! - gets it
written. Stay tuned.)
Patches and updates
Core kernel code
- Casey Dahlin: waitfd.
(January 7, 2009)
Device drivers
- Dave Airlie: drm.
(January 5, 2009)
Page editor: Jonathan Corbet