Kernel development [LWN.net]

Kernel release status

The 2.6.27 merge window remains open, so there is no 2.6 development kernel release as of this writing. Patches continue to flow into the mainline repository; see the summary below for the highlights.

The 2.6.25.12 stable update is in the review process as of this writing; it should be released sometime around July 24. The proposed update contains 47 patches implementing a wide variety of fixes.

Linus's mid-merge-window reflections

Linus has sent out an announcement that the 2.6.27 merge window is halfway done, and that he's taking a break for a few days. "In the last couple of days I _have_ merged 50+ trees, and while there's been some 'heated discussion' about some of them (you know who you are ;), I'm hoping that we're actually in reasonably good shape even though it's in the middle of the merge window, and that people will test out the snapshot kernels even though I'm not ready to do a -rc1 release."

Quotes of the week

There is no more distributed storage you knew before, instead there is completely new project being developed, which main goal is to provide a transport layer for the block requests only. Consider it as Network Block Device on huge steroids. Consider it as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even more huge steroids. It is just an example of what all those protocols should have. And only that.

-- Evgeniy Polyakov didn't get the "zero tolerance for doping" memo

If you want the kernel people to endorse your project, you'll have to please them. Its that simple. If that means having to radically re-structure your design, and/or break backwards compatibility then so be it. Such are the costs for not collaborating from the start.

If you stubbornly refuse to co-operate you'll either break the project or invite a fork/rewrite by someone else if the idea is deemed worthwhile enough.

-- Peter Zijlstra (on SystemTap)

Being a good citizen in Linux land often means improving whole subsystems rather than stuffing a bunch of fancy features into individual drivers. Working that way can be harder, but it spreads the benefits wider, and improves Linux as a whole.

-- Jesse Barnes

FWIW, I would rather see implications thought about *and* mentioned in the changelogs. OTOH, the above shows the real-world cases when breakage hadn't even been realized to be security-significant. Obviously broken behaviour (leak, for example) gets spotted and fixed. Fix looks obviously sane, bug it deals with - obviously real and worth fixing, so into a tree it goes... IOW, one _can't_ rely on having patches that close security holes marked as such. For that the authors have to notice that themselves in the first place.

-- Al Viro (read the whole thing)

More quotes of the week

Code cleanups sometimes expose fundamental disagreements about how the code should look; here some veteran kernel hackers show how it's done.

Rusty, in his peevish way, complained that macros defining constants should have a name which somewhat accurately reflects the actual purpose of the constant.

Aside from the fact that PTE_MASK gives no clue as to what's actually being masked, and is misleadingly similar to the functionally entirely different PMD_MASK, PUD_MASK and PGD_MASK, I don't really see what the problem is.

-- Jeremy Fitzhardinge

Has Rusty ever heard about the economy of the healthy flow of incoming regressions? What will we do without obscure names and hard to find bugs? First he writes a simple and readable hypervisor (ruining a whole industry based on obscurity!) and now that. It's _so_ unamerican and unaustralian. I'm worried.

-- Ingo Molnar

I am disgusted with this inappropriate emphasis on clarity over obscurity. It should be pretty clear to everyone here that we can't have both! Fortunately, there is a way to partially rectify the situation. Ingo, please apply.

[...]

+/* There's something suspicious about this line: see PTE_PFN_MASK comment. */
 #define __PHYSICAL_MASK ((phys_addr_t)(1ULL << __PHYSICAL_MASK_SHIFT) - 1)
 
@@ -19,6 +20,7 @@
 
 /* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
+/* This line is quite subtle.  See __PHYSICAL_MASK comment above. */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)

-- Rusty Russell

2.6.27 merge window, part 2

By Jonathan Corbet
July 23, 2008

As of this writing, just over 6200 changesets have been merged into the mainline git repository since the 2.6.26 release. Merge activity appears to be slowing down somewhat; it appears that most of the major trees have been pulled. Andrew Morton has not yet started to unload the -mm tree into the mainline, though; until that happens, the merge window can be expected to remain open.

User-visible changes merged since last week's summary include:

There are new drivers for Samsung S3C SD/MMC interfaces, Atmel Multimedia card interfaces, Ricoh Bay1Controller cards, S/390 QDIO controllers, Renesas SuperH SH7710 and SH7712 Ethernet controllers, Option HSDPA/HSUPA mobile network devices, Broadcom BCM57711 Ethernet adapters, Mikrotik RouterBoard 532 series boards, Anysee DVB-T/C USB2.0 receivers, Sensoray 2255 video capture devices, Siano SMS10xx digital television devices, SuperH Mobile CEU camera controllers, Niagara2 hardware random number generators, HTC Shift (X9500) touchscreens, iNexio serial touchscreens, Sahara TouchIT-213 touchscreens, Xilinx XPS PS/2 controllers, Maxim MAX7301 GPIO expanders, HP iLO/iLO2 management processors, Atheros L1E Gigabit Ethernet adapters, Marvell XOR DMA engines, Synopsys DesignWare DMA controllers, and Intel version 3.0 I/OAT DMA engines. There is also a new PCI "slot detection driver" which will attempt to find all PCI slots in the system and create corresponding entries in /sys/bus/pci/slots/.
Worthy of note: the "gspca" set of video drivers, long maintained outside of the mainline kernel tree, has been merged. These drivers support a large number of video devices; with their merge, most video camera devices on the market are supported by Linux.
The Fujitsu laptop driver has been updated with better hotkey and backlight support for more Fujitsu models.
The UBIFS filesystem for flash-based storage devices has been merged.
The multiqueue networking patches have been merged.
The IA-64 architecture has gained a paravirt_ops implementation to support virtualization.
The new directories found at /sys/dev/char and /sys/dev/block contain pointers to sysfs entries for devices organized by device number.
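The /sys/dev layout mentioned in the last item above can be sketched from user space. This is a minimal illustration, assuming the entries are named MAJOR:MINOR under /sys/dev/char and /sys/dev/block; the helper names sysfs_dev_path() and sysfs_path_for_node() are invented for this example and are not part of any kernel or library interface.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(), makedev() on glibc */
#include <sys/types.h>

/*
 * Build the /sys/dev path for a device number (hypothetical helper).
 * is_block selects /sys/dev/block; otherwise /sys/dev/char is used.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int sysfs_dev_path(dev_t dev, int is_block, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/sys/dev/%s/%u:%u",
                     is_block ? "block" : "char",
                     major(dev), minor(dev));

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Find the /sys/dev entry for an existing device node such as /dev/sda. */
int sysfs_path_for_node(const char *node, char *buf, size_t len)
{
    struct stat st;

    if (stat(node, &st) != 0)
        return -1;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return -1;              /* not a device node */
    return sysfs_dev_path(st.st_rdev, S_ISBLK(st.st_mode), buf, len);
}
```

Given a device node, sysfs_path_for_node() yields a path which, on a kernel with this feature, should lead to that device's sysfs entry without any searching of the /sys/devices hierarchy.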

Changes visible to kernel developers include:

The new suspend and hibernate infrastructure has been merged, providing a wider set of callbacks for power management events. The PCI and platform bus interfaces have been enhanced with support for this new infrastructure.
The TTY layer continues to evolve; significant changes include the introduction of a new tty_port structure meant to hold information common to all TTY ports and a rework of the line discipline code.
The mac80211 code has a new module which can simulate any number of IEEE 802.11 radios; it is suitable for testing mac80211 functionality and associated user-space tools.
There is a new "rfkill" mechanism for unified handling of "radio off" switches on wireless devices.
A number of Video4Linux2 format-related callbacks have been renamed to make them match the names used with the associated buffer types. In addition, the vidioc_enum_fmt_vbi_cap() callback has been deprecated and marked for removal in 2.6.28.
The videobuf layer now has support for controllers which cannot do scatter/gather I/O.
The USB "gadget" framework has been massively reworked to provide better support for composite devices.
The prototype for device_create() has changed:
```
    struct device *device_create(struct class *class,
                                 struct device *parent,
                                 dev_t devt,
                                 void *drvdata,
                                 const char *fmt, ...);
```
Those who see a resemblance to device_create_drvdata() are right; all in-tree users were converted over to that interface, the old device_create() was removed, and device_create_drvdata() was renamed. For now, a macro makes calls to device_create_drvdata() do the right thing, but that macro will probably go away before the 2.6.27 final release.
User-space UIO drivers can now write a signed value to the /dev/uioX device to enable and disable interrupts.
Debugfs (finally) has a function for removing an entire directory tree:
```
    void debugfs_remove_recursive(struct dentry *dentry);
```
As a result, code creating hierarchies in debugfs no longer needs to remember the dentry of every file it creates.
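The UIO interrupt control mentioned in the list above might look something like this from the user-space side. This is a sketch only: it assumes the convention that writing 1 enables interrupts and 0 disables them, with the actual interpretation left to the driver behind the device; uio_irq_control() is a name made up for this example.

```c
#include <stdint.h>
#include <unistd.h>

/*
 * Ask a UIO driver to enable (nonzero) or disable (zero) its interrupt by
 * writing one signed 32-bit value to an already-open /dev/uioX descriptor.
 * Assumption: 1 means "enable" and 0 means "disable".
 * Returns 0 on success, -1 on a failed or short write.
 */
int uio_irq_control(int fd, int enable)
{
    int32_t value = enable ? 1 : 0;
    ssize_t n = write(fd, &value, sizeof(value));

    return n == (ssize_t)sizeof(value) ? 0 : -1;
}
```

A driver would typically open its device (for example, a hypothetical /dev/uio0) with O_RDWR, call uio_irq_control(fd, 1), and then block in read() waiting for the next interrupt.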

The tail end of the 2.6.27 merge window will be covered in next week's LWN Kernel Page.

Linux-next meets the merge window

By Jonathan Corbet
July 23, 2008

Recent LWN articles on the linux-next tree have noted that, while this tree has been working well in its role of identifying merge conflicts between subsystem trees, it has not yet been through a full kernel development cycle. 2.6.27 will be the first kernel release where linux-next was in existence for the entire preceding cycle; in theory, everything which goes into 2.6.27 should have been aged in linux-next first. As the end of the 2.6.27 merge window nears, a look at how linux-next has affected the process seems warranted.

One might think that linux-next maintainer Stephen Rothwell would be able to take a break during the merge window; it should mostly be a matter of watching the linux-next tree drain into the mainline. As it happens, the daily linux-next postings suggest a fair amount of scrambling to deal with merge conflicts, build failures, and more. There are a number of reasons for this, one of which is that subsystem trees are merged into the mainline in an order which is completely unrelated to their order in linux-next. Patches which remain in linux-next are thus being applied to a highly unstable base.

Another interesting phenomenon has been a fair number of patches appearing in linux-next during the merge window. Some of these are actually patches intended for 2.6.28; once maintainers have dumped their 2.6.27 patches into the mainline, they are starting to acquire stuff for the next time around. Stephen has asked them not to do that, requesting that 2.6.28 material not be directed toward linux-next until after the 2.6.27-rc1 release. The goal is that linux-next should be nearly empty when 2.6.27-rc1 comes out.

Other patches, though, are intended for 2.6.27 but simply have not done their time in the linux-next tree. That has led to a certain amount of developer grumpiness at times. It is interesting to note, though, that one of the biggest examples of linux-next avoidance - David Miller's merging of the multiqueue networking code which he had finished writing hours before - has generated relatively few complaints. But various other types of conflicts have generated a steady stream of terse notes from Andrew Morton (who is in the unfortunate position of basing his work on top of linux-next) on how new stuff should have been in linux-next weeks ago.

Another area of, say, colorful conversation has been around the TTY subsystem, currently being subjected to a much-needed thrashing by Alan Cox. Some developers have been unhappy with Alan for merging code which failed to compile, even though those problems had already been identified in linux-next. Alan, instead, has become irritated with other developers who have surprised him with TTY-layer changes of their own, causing Alan's patches not to apply. Alan has some quaint notions about actually testing his patches, so the resolution of this kind of conflict requires the running of a new set of regression tests and such; after this had happened a few times in a row, he started getting a little short-tempered. These issues would appear to have been worked out at this point, but the idea behind linux-next was to keep them from happening in the first place.

Yet another source of occasional merge issues is the rebasing of trees. Rebasing, in git-speak, is the process of modifying the commit history in a repository to cause a series of patches to look like they were written against a later version of the code than they really were. Rebasing can be a useful technique; it generates a series of patches which applies cleanly to the current state of the tree without generating a bunch of unsightly merge commits.

Rebasing can be especially useful in the context of linux-next. If testing turns up a patch which breaks the build, simply committing a fix will leave a period in the history where the kernel cannot be built, and that is bad for people running bisections. With the use of git's history editing features, the offending patch can be fixed in place and all evidence of the mistake disappears. In essence, that embarrassing commit mentioning the Eurasian campaign can be fixed up to properly note that we've always been at war with Eastasia.

But rebasing a repository changes the history (by design), creating, in the process, an entirely new set of commits. Those commits are new code, to the point that any results from testing the older version may no longer apply. The commits also have new names, so any other developer who was using a version of the repository will be shaken off and unable to merge. Issues related to rebasing have come up a couple of times during the merge window, leading Linus to post a series of lectures on the problems that rebasing can cause. It is clearly a tool which must be used with restraint, but occasional use of rebasing can, in the linux-next context, lead to a better final merge. Finding the right balance is something each developer will have to learn.
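Why rebased commits get new names can be seen in a toy model. The sketch below is not how git is implemented: git names commits with SHA-1 over the full commit object, while this example substitutes a simple FNV-1a hash over the parent's name plus a patch string. The functions commit_id() and apply_series() are invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a: a toy stand-in for the SHA-1 hash that git really uses. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;          /* FNV prime */
    }
    return h;
}

/* A commit's name covers its parent's name as well as its own content. */
uint64_t commit_id(uint64_t parent_id, const char *patch)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */

    h = fnv1a(h, &parent_id, sizeof(parent_id));
    return fnv1a(h, patch, strlen(patch));
}

/* Apply a patch series on top of base_id, recording each commit's name. */
void apply_series(uint64_t base_id, const char *const patches[],
                  size_t n, uint64_t ids[])
{
    uint64_t parent = base_id;

    for (size_t i = 0; i < n; i++) {
        ids[i] = commit_id(parent, patches[i]);
        parent = ids[i];
    }
}
```

Applying the same series to a different base changes the first commit's name, and that change propagates to every commit after it. Anybody who had pulled the old series is left holding commits which no longer exist in the rebased tree.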

In the end, the merge window remains a bit of an unruly time. The process of channeling the work of several hundred developers into the mainline over a two-week period is unlikely to ever be an entirely smooth experience. But, for all its glitches, the 2.6.27 merge window has been (so far!) easier than 2.6.26. The presence of the linux-next tree almost certainly has something to do with that. This tree's role continues to evolve, but its benefits are starting to be felt.

Tracing: no shortage of options

By Jonathan Corbet
July 22, 2008

Three weeks ago, LWN looked at the renewed interest in dynamic tracing, with an emphasis on SystemTap. Tracing is a perennial presence on end-user wishlists; it remains a handy tool for companies like Sun Microsystems, which wish to show that their offerings (Solaris, for example) are superior to Linux. It is not surprising that there is a lot of interest in tracing implementations for Linux; the main surprise is that, after all this time, Linux still does not have a top-quality answer to DTrace - though, arguably, Linux had a working tracing mechanism long before DTrace made its appearance.

Even a casual reader of the kernel mailing list will have noticed that there are a lot of tracing-related patches in circulation at the moment. There are so many, in fact, that it is hard to keep track of them all. So this article will take a quick look at the code which has been posted in an attempt to make the various options a bit clearer.

SystemTap

SystemTap remains the presumptive Linux tracing solution of choice. It is hampered by a few problems, though, including usability issues, a complete lack of static trace points in the mainline kernel, and no user-space tracing capability. On the usability side, we are seeing a few more kernel developers trying to put SystemTap to work and posting about the problems they are having. If one takes as a working hypothesis the notion that, if kernel hackers cannot make SystemTap work, many other users are likely to encounter difficulties as well, then one might conclude that addressing the reported problems would be a priority for the SystemTap developers.

The SystemTap developers do seem to be interested in these reports, which is a good sign. There are other things happening in the SystemTap arena, including the release of version 0.7 on July 15. This release adds a number of new features and tapsets, and a substantial set of examples as well. Meanwhile, Anup Shan has posted an interesting integration of SystemTap and the fault injection framework, allowing tapsets to control fault injection and trace the results.

James Bottomley has been playing some with the SystemTap code; one result of that work is changes to SystemTap's internal relocation code in an attempt to make it more acceptable for mainline kernel inclusion. There can be no doubt that the out-of-tree nature of much of the SystemTap support code has made it harder for that code to progress, so any improvement which makes it more likely that some of this code will be merged is welcome.

Also by James is this patch implementing a new way to put markers into the kernel. The addition of markers (or static tracepoints) has always been problematic in that many of these markers, by their nature, need to go into some of the hottest code paths in the kernel. To support dynamic tracing, these markers need to be available on production systems, so they must work without creating any significant performance regressions. Quite a bit of work has gone into the static marker code which is in the kernel (but mostly unused) now, but some developers are still uncomfortable with putting them into performance-critical paths.

James's patch addresses these concerns by putting the tracepoints entirely outside of the code paths. Rather than add some sort of marker to the code, these markers simply make a note of where in the code the marker is supposed to be; this note is stored in a separate part of the kernel binary. That information is enough for a run-time tool to patch in an actual jump to a tracing function should somebody want to see the information from that tracepoint. An additional benefit is that these markers do not interfere with any optimizations done by the compiler. Other solutions can insert optimization barriers which, while they do make life easier for the tracing subsystem, also affect the speed of the code even when the trace points are not active.
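The general idea of keeping notes in a separate section of the binary can be tried in user space. What follows is not James's patch; it is a rough sketch which assumes GCC or Clang on an ELF system, where the linker provides __start_ and __stop_ symbols for sections whose names are valid C identifiers. The TRACE_NOTE() macro, the trace_notes section, and the functions below are all invented for this example, and the notes record only a source position rather than anything a tool could patch at run time.

```c
#include <stddef.h>
#include <string.h>

/* One note per tracepoint: where it lives, not what to do there. */
struct trace_note {
    const char *name;
    const char *file;
    int line;
};

/*
 * Record a note in the "trace_notes" section.  Only static data is
 * created; nothing is added to the surrounding instruction stream.
 * The section holds pointers so that entries are uniformly sized.
 */
#define TRACE_NOTE(label)                                               \
    do {                                                                \
        static const struct trace_note note_ = {                        \
            (label), __FILE__, __LINE__                                 \
        };                                                              \
        static const struct trace_note *const note_ptr_                 \
            __attribute__((used, section("trace_notes"))) = &note_;     \
    } while (0)

/* Section bounds supplied by the linker (GNU ld and lld behavior). */
extern const struct trace_note *const __start_trace_notes[];
extern const struct trace_note *const __stop_trace_notes[];

/* Two "hot paths": the generated code contains no tracing call or test. */
int hot_path_add(int a, int b)
{
    TRACE_NOTE("hot_path_add");
    return a + b;
}

int another_hot_path(int x)
{
    TRACE_NOTE("another_hot_path");
    return x * 2;
}

/* What a run-time tool would do first: enumerate the recorded notes. */
size_t trace_note_count(void)
{
    return (size_t)(__stop_trace_notes - __start_trace_notes);
}

const struct trace_note *trace_note_find(const char *name)
{
    for (const struct trace_note *const *p = __start_trace_notes;
         p < __stop_trace_notes; p++)
        if (strcmp((*p)->name, name) == 0)
            return *p;
    return NULL;
}
```

The notes exist whether or not the annotated functions ever run, which is the point: the cost of an unused tracepoint is a few bytes of data somewhere other than the hot path.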

Ftrace

The text above said that the kernel's static tracepoint code is "mostly unused." That would have been better expressed as "completely," except that the 2.6.27 kernel will include a user in the form of the ftrace framework. One of the things which makes ftrace truly unique is that its documentation was not only merged before the code itself, but well before: the 2.6.26 kernel includes the excellent Documentation/ftrace.txt file.

The ftrace (which stands for "function tracer") framework is one of the many improvements to come out of the realtime effort. Unlike SystemTap, it does not attempt to be a comprehensive, scriptable facility; ftrace is much more oriented toward simplicity. There is a set of virtual files in a debugfs directory which can be used to enable specific tracers and see the results. The function tracer after which ftrace is named simply outputs each function called in the kernel as it happens. Other tracers look at wakeup latency, events enabling and disabling interrupts and preemption, task switches, etc. As one might expect, the available information is best suited for developers working on improving realtime response in Linux. The ftrace framework makes it easy to add new tracers, though, so chances are good that other types of events will be added as developers think of things they would like to look at.

Tracepoints

The kernel markers mechanism is meant to be the way that static tracepoints are inserted into the kernel. To that end, a great deal of effort went into making these markers fast; they are, for all practical purposes, a set of no-op instructions until somebody wants to turn one on, at which point the real tracing code is patched into the running kernel. Since they were merged, however, kernel markers have been the subject of a few grumbles.

In particular, kernel markers use a somewhat awkward mechanism to ensure that any arguments passed to the tracing function are interpreted correctly there. Each marker has a printk()-style format string associated with it; that string describes the type of each "argument" (a variable or expression within the code being traced). When tracing code activates a marker, it will supply a function to be called when the marker is hit and a format string describing the arguments that the function expects. The marker code will ensure that both format strings match; otherwise the marker will not be enabled. The problem is that the format string requires extra work to write and is only approximate in its specification of the types involved. These strings can make it clear that a given argument is a pointer, for example, but they say nothing about what type is pointed to.
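The format-string check described above can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel's marker code: struct marker, marker_attach(), marker_hit(), and wakeup_probe() are names invented here, and the real implementation involves considerably more machinery.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

typedef void (*marker_probe_fn)(const char *fmt, va_list args);

struct marker {
    const char *name;
    const char *format;     /* declared at the marker site */
    marker_probe_fn probe;  /* NULL while the marker is disabled */
};

/*
 * Attach a probe only if it expects exactly the format the marker
 * declares.  Returns 0 on success; on a mismatch, returns -1 and leaves
 * the marker disabled.
 */
int marker_attach(struct marker *m, const char *expected_format,
                  marker_probe_fn probe)
{
    if (strcmp(m->format, expected_format) != 0)
        return -1;
    m->probe = probe;
    return 0;
}

/* Called at the marker site; does nothing until a probe is attached. */
void marker_hit(struct marker *m, ...)
{
    va_list args;

    if (!m->probe)
        return;
    va_start(args, m);
    m->probe(m->format, args);
    va_end(args);
}

static int seen_pid = -1;

/* Example probe for the format "pid %d rq %p". */
void wakeup_probe(const char *fmt, va_list args)
{
    (void)fmt;
    seen_pid = va_arg(args, int);   /* "%d" */
    (void)va_arg(args, void *);     /* "%p": the pointee type is unknown */
}
```

Note that the check is purely textual. A probe which claims "%p" passes regardless of what it then does with the pointer, which is the weakness described above.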

In response to various efforts to get around this issue, Mathieu Desnoyers (the original author of the kernel marker work) has proposed a new mechanism called tracepoints. They are another way of putting static trace points into the kernel, but with a simpler and more type-safe way of putting the pieces together.

With tracepoints, every trace point must be declared in a header file with a mildly ugly set of macros:

    #include <linux/tracepoint.h>

    DEFINE_TRACE(tracepoint_name,
                 TPPROTO(trace_function_prototype),
                 TPARGS(trace_function_args));

This definition will create a new tracepoint called tracepoint_name. Any function attached to that tracepoint must have a function prototype as provided in the TPPROTO() macro; the names of the associated arguments are provided with TPARGS().

Perhaps this is better understood with an example. The tracepoints patch set includes quite a few static points for use with the LTTng tracing toolkit. There is one called sched_wakeup which fires whenever the scheduler wakes up a process. It is defined with:

    DEFINE_TRACE(sched_wakeup,
                 TPPROTO(struct rq *rq, struct task_struct *p),
                 TPARGS(rq, p));

The actual insertion of the tracepoint is a line like this:

    trace_sched_wakeup(rq, p);

Note the trace_ prefix added to the supplied name. At this point in the code, a tracing function can be called with rq (the run queue of interest) and p (the process which is waking up) as parameters. Until an actual function is connected to the tracepoint, though, this call is essentially a no-op. A trace function is connected with a call like:

    void my_sched_wakeup_tracer(struct rq *rq, struct task_struct *p);

    register_trace_sched_wakeup(my_sched_wakeup_tracer);

The register_trace_sched_wakeup() function (created as part of the DEFINE_TRACE() definition) will connect the supplied trace function to the tracepoint. The fact that the function prototype for the trace function is supplied as part of the tracepoint definition means that the compiler can perform thorough type checking; if the prototypes do not match up, compilation will fail. And that, in turn, should put an end to those embarrassing situations where turning on tracing causes the system to go down in flames.
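To see how a single macro can generate both a typed call site and a typed registration function, consider the user-space model below. It borrows the macro names from the patch but is otherwise an assumption-laden sketch: it supports one probe per tracepoint, has no locking or unregistration, and uses toy versions of struct rq and struct task_struct. The real kernel implementation is different and rather more involved.

```c
#include <stddef.h>

#define TPPROTO(...) __VA_ARGS__
#define TPARGS(...)  __VA_ARGS__

/*
 * Simplified model: one probe slot per tracepoint.  The final struct
 * declaration exists only to absorb the caller's trailing semicolon.
 */
#define DEFINE_TRACE(name, proto, args)                                 \
    static void (*tp_probe_##name)(proto);                              \
    static inline void trace_##name(proto)                              \
    {                                                                   \
        if (tp_probe_##name)        /* no probe attached: do nothing */ \
            tp_probe_##name(args);                                      \
    }                                                                   \
    static inline int register_trace_##name(void (*probe)(proto))       \
    {                                                                   \
        if (tp_probe_##name)                                            \
            return -1;              /* slot already taken */            \
        tp_probe_##name = probe;                                        \
        return 0;                                                       \
    }                                                                   \
    struct tp_decl_##name

/* Toy stand-ins for the kernel structures. */
struct rq { int cpu; };
struct task_struct { int pid; };

DEFINE_TRACE(sched_wakeup,
             TPPROTO(struct rq *rq, struct task_struct *p),
             TPARGS(rq, p));

static int last_woken_pid = -1;

/* A probe whose prototype matches the TPPROTO() declaration above. */
void record_wakeup(struct rq *rq, struct task_struct *p)
{
    (void)rq;
    last_woken_pid = p->pid;
}
```

Passing a function with a different prototype to register_trace_sched_wakeup() draws a compiler diagnostic for the incompatible pointer type, an error in recent compilers. The format-string approach cannot provide that check.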

Interestingly, tracepoints have dispensed with much of the mechanism developed to minimize the runtime impact of kernel markers; in particular, they do not use the "immediate values" code. Profiling has shown that the performance impact of tracepoints is so low that there is little value in the added complexity of runtime patching of kernel code. Still, there are signs that some kernel developers will object to the addition of tracepoints in their current form. Developers want tracing support - but not at the cost of slower performance, even if that cost is hard to measure.

Tracehook

Finally, Roland McGrath recently surfaced with the tracehook patch set. Tracehook has a rather different focus; it is, essentially, a cleanup of the way the kernel handles the ptrace() system call. The tracehook patches try to organize all of the process tracing code (much of which is architecture-dependent) into one place where it can be dealt with as a unit.

Tracehook is meant to be a first step toward the merging of a new version of the utrace code. Utrace has long been planned as the successor to the current ptrace() implementation, which has few admirers. But utrace has encountered a number of difficulties, so its path into the kernel has been slow. It disappeared from the lists entirely for a while, but a new version of the patches is said to be coming soon; Roland notes that he expects "some vigorous feedback" when that happens.

The real importance of the ptrace() rework is that it is the path toward integrated tracing of kernel- and user-space events. And that, of course, is one of the biggest features offered by DTrace which is not yet available in SystemTap. Getting user-space tracing into the kernel - especially if it could work with the tracepoints already being inserted into some applications for DTrace - would be a major step forward for Linux. A lot of people will be watching when this patch set comes around again.

Meanwhile, Roland would like to see the tracehook code merged for 2.6.27. He is late to the party, though, and this code has not done any time in linux-next. So it is not yet clear whether tracehook will go in before the merge window closes, or whether, instead, it will have to wait for 2.6.28.

In summary...

As can be seen, there is a lot happening in the area of tracing support for Linux. Tracing, it seems, is an idea whose time has come, at last. If the pieces described here can be merged and integrated into a unified framework, and if it can all be made sufficiently easy to use, the time for "DTrace envy" will come to an end. Those "ifs" are not small ones, though. There is quite a bit of work to be done yet; hopefully the current level of energy will remain until the job is done.

Patches and updates

Adrian Bunk Linux 2.6.16.62-rc1
Adrian Bunk Linux 2.6.16.62
Adrian Bunk Linux 2.6.16.61
Roland McGrath x86 step/syscall-trace fixes & cleanups
Ingo Molnar x86 updates for v2.6.27, phase #2
David Miller : Sparc
Arjan van de Ven fastboot patches series 1
Ingo Molnar scheduler updates for v2.6.27, phase #2
Rusty Russell module and kmod patches
David Woodhouse Imprecise timers.
David Woodhouse schedule_timeout_range()
Mathieu Desnoyers Blktrace port to tracepoints
Eduard-Gabriel Munteanu kmemtrace RFC (resubmit 1)
Roland McGrath tracehook
Jason Wessel kgdb 2.6.27 mips
James Bottomley fix kallsyms to allow discrimination of local symbols
Andi Kleen Please pull ACPI updates
Marcin Obara Intel Management Engine Interface
Michael Buesch Add SPI over GPIO driver
Mauro Carvalho Chehab V4L/DVB updates
Luis R. Rodriguez Atheros IEEE 802.11n ath9k driver
Martin K. Petersen SCSI Data Integrity Support
Hannes Reinecke scsi_dh update
Greg KH USB patches for 2.6.26
Greg KH driver core patches against 2.6.26
Bartlomiej Zolnierkiewicz IDE updates (part 3)
Michael Kerrisk man-pages-3.05 is released
Alasdair G Kergon device-mapper update for 2.6.27
J. Bruce Fields nfsd changes for 2.6.27
Chris Mason New data=ordered code pushed out to btrfs-unstable
Balaji Rao NFS support for btrfs - v2
Daniel Phillips A better solution to the HTree telldir problem
Daniel Phillips Tux3, a Versioning Filesystem
Takashi Sato freeze feature ver 1.9
Krzysztof Halasa pull request: WAN
David Miller : Networking
David Miller : Final set of TX multiqueue changes.
Adam Langley TCP: Add TCP-AO support
Tetsuo Handa Add a counter in task_struct for skipping permission check. (Was: Should LSM hooks be called by filesystem's requests?)
Nicholas A. Bellinger - VHACS-VM x86_64 Alpha Preview - FOLLOWUP
Ranjit Manomohan Traffic control cgroups subsystem
Rusty Russell lguest and virtio patches
Jonathan Corbet gitdm v0.10 available
Kay Sievers udev 125 release
Hans de Goede Announcing libv4l 0.3.7
