Kernel development
Brief items
Kernel release status
The 3.2 kernel was released on January 4, after 72 days of development. Among other things, this kernel adds the proportional rate reduction TCP algorithm, the extended verification module, the CPU scheduler bandwidth controller, the cross-memory attach IPC mechanism, the Hexagon DSP architecture, improved recovery of corrupted Btrfs filesystems, and the I/O-less dirty throttling code. See the Kernelnewbies 3.2 page for lots more information. As of this writing, the 3.3 merge window is open; see below for details on what has been merged so far.
Stable updates: the 2.6.32.53, 3.0.16, and 3.1.8 stable kernel updates were released on January 6. Each contains the usual long list of important fixes (OK, 2.6.32.53 only has nine fixes, but the newer kernels have quite a few more).
The 2.6.32.54, 3.0.17, 3.1.9, and 3.2.1 stable updates are in the review process; they can be expected on or after January 12. 3.1.9 is likely to be the final update for the 3.1 kernel.
Quotes of the week
Code that specializes the kernel in weird ways is accepted into the kernel all the time, and I've tried to figure out why this particular bit of code is treated differently. Especially since this code is self-contained, configurable, and imposes no perceivable long-term maintenance burden.
Yes, you are special and unique, just like everyone else.
The next person who says the "embedded is different" phrase again, owes me a beer of my choice.
A long-term kernel support update
Greg Kroah-Hartman has posted an update on his plans for long-term kernel maintenance. As he announced before, the 3.1 series is almost at the end of its update period; he is also approaching the end of his maintenance for the long-lived 2.6.32 release. "It is approaching it's end-of-life, and I think I only have another month or so doing releases of this. After I am finished with it, it might be picked up by someone else, but I'm not going to promise anything." As it happens, Tim Gardner has stated that Ubuntu will support 2.6.32 through April 2015 - though whether that support will translate into kernel releases outside of the Ubuntu distribution is not clear. Ubuntu also plans to take on 3.2 as a long-term supported kernel.
No more system devices
Since the early days of the Linux device model, there has been a special device class for "system devices," typically those built into the platform itself. For almost as long, the driver core developers have felt that there was no real need for this device type, which looks weirdly different from every other type of device. For 3.3, they have actually done something about it; system devices are no more. All in-tree system device drivers have been fixed up to use regular devices instead. The process is relatively simple; it can be seen in, for example, this commit updating kernel/time/clocksource.c. In short, the embedded struct sys_device becomes a simple struct device, attributes defined with SYSDEV_ATTR() are switched to DEVICE_ATTR(), and the sysdev_class structure is turned into a nearly empty bus_type structure. That is about all that is required.
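The conversion pattern looks roughly like the sketch below; the "foo" names and the "enable" attribute are purely illustrative and not taken from any specific driver.

/* Before 3.3: a system device class with a SYSDEV_ATTR() attribute.
 *
 *	static struct sysdev_class foo_sysdev_class = {
 *		.name = "foo",
 *	};
 *	static SYSDEV_ATTR(enable, 0644, foo_show_enable, foo_store_enable);
 *
 * In 3.3, the same thing becomes a nearly empty bus_type plus a regular
 * struct device, with the attribute declared via DEVICE_ATTR():
 */
static struct bus_type foo_subsys = {
	.name = "foo",
	.dev_name = "foo",
};

static DEVICE_ATTR(enable, 0644, foo_show_enable, foo_store_enable);

static struct device foo_device = {
	.id = 0,
	.bus = &foo_subsys,
};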
These changes, naturally, cause a user-space ABI change; system devices had their own special area under /sys which goes away. That has the potential to break programs and scripts, which would not be a good thing. To avoid this problem, a special function has been added:
int subsys_system_register(struct bus_type *subsys, const struct attribute_group **groups);
Registering a subsystem in this way will restore its old /sys/devices/system hierarchy. Needless to say, this function exists for backward compatibility purposes only; using it in new drivers is not likely to be received well.
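For a driver that does need to preserve the old location, the call is straightforward; continuing the hypothetical "foo" example above (with no extra attribute groups), it might look like:

	int ret;

	/* Recreate the old /sys/devices/system/foo hierarchy */
	ret = subsys_system_register(&foo_subsys, NULL);
	if (ret)
		return ret;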
Yet another new approach to seccomp
Over the years, we have seen a number of attempts to use the seccomp ("secure computing") mechanism to reduce the range of operations available to a given process. The hope is to use such a mechanism as part of a sandboxing solution that would allow (for example) a web browser to run third-party code in a safer manner. Thus far, all of these attempts have gone down in flames; see Seccomp filters: no clear path from last May for the most recent episode in this particular story. Things have been quiet on the seccomp front recently - until now. Will Drewry, who has been behind the recent attempts to enhance seccomp, has come up with an interesting new approach to the problem. Whether this attempt will be more successful than its predecessors remains to be seen, but Will has managed to step around some of the traps that doomed his previous attempt.
In the last seccomp discussion, there was a fair amount of pressure to adapt the kernel's tracing infrastructure to this task; there was also resistance to using that infrastructure in that way. As explained in detail in the patch posting, Will has come to the conclusion that the tracing infrastructure is not really fit for the task anyway.
Will's new approach has a stroke of brilliance to it: rather than use the ftrace filter mechanism, he has repurposed the networking layer's packet filtering mechanism (BPF). The BPF code normally operates on packets; in the seccomp context, instead, it operates on the register set at the time of each system call. The registers will contain the system call number and its parameters, allowing the filter to make a wide range of decisions on what will (or will not) be allowed. BPF is also well-maintained and well-optimized code; it even has an in-kernel just-in-time compiler. Some of these advantages are lost because seccomp uses its own BPF interpreter; one assumes that a way could be found to merge the two implementations if the underlying idea looks like it will pass muster.
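To give a sense for what such a filter might look like, here is a minimal user-space sketch. The prctl() commands, the struct seccomp_data layout, and the SECCOMP_RET_* values shown here are assumptions based on the form the filter mechanism later took, not a description of the exact interface in the posting discussed above. The filter allows read(), write(), and exit_group() and kills the process on any other system call; a real filter would also check the arch field of seccomp_data before trusting the system call number.

#include <stdio.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	struct sock_filter filter[] = {
		/* Load the system call number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Allow read(), write(), and exit_group()... */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 3, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
		/* ...and kill the process for anything else. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Assumption: later revisions require no_new_privs for
	 * unprivileged use of the filter mode. */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		perror("prctl(PR_SET_SECCOMP)");
	/* From here on, only the three calls above will succeed. */
	return 0;
}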
As of this writing, there has not really been time for comments on the new patch. It will be interesting to see what the developers think. Meanwhile, those wanting more information should see the patch posting and the documentation file, which includes a sample program showing how to use the new facility.
Kernel development news
The first half of the 3.3 merge window
As of this writing, just over 5,700 non-merge changesets have been pulled into the mainline for the 3.3 development cycle. A fair amount of work remains to be pulled, so it looks like another fairly active cycle, though perhaps not quite up to the level of 3.2. Some of the more significant, user-visible changes merged so far include:
- The "team" network driver - a lightweight mechanism for bonding
multiple interfaces together - has been merged. The libteam project has the
user-space code needed to operate this device.
- The network priority control group controller has been added. This
controller allows the administrator to specify the priority with which
members of each control group have access to the network interfaces
available on the system. See net_prio.txt from the documentation
directory for more information.
- Also added is the TCP buffer size
controller which can be used to place limits on the amount of
kernel memory used to hold TCP buffers.
- The byte queue limits
infrastructure has been added, enabling control over how much data can
be queued for transmission over a network interface at any time.
- The Open vSwitch virtual network
switch has been merged.
- The ARM architecture has gained support for the "large physical
address extension," allowing 32-bit processors to address more than
4GB of installed memory.
- The "adaptive RED" queue management algorithm is now supported by the
networking layer.
- The near-field communications (NFC) layer has gained support for the
logical link control protocol (LLCP).
- The beginnings of dynamic frequency
selection support have been added to the wireless networking
subsystem.
- For S390 users who find the current limit of 3.8TB of RAM to be
constraining: 3.3 will add support for four-level page tables and an
upper limit of 64TB (for now).
- Various Android drivers have returned to the staging tree; see this article for more information.
- The C6X architecture (described in this
article) has been merged.
- The ext4 filesystem has added support for online resizing via the
EXT4_IOC_RESIZE_FS ioctl() command. This operation
does not (yet) work with filesystems using the "bigalloc" or "meta_bg"
features.
- The /proc filesystem has a new subdirectory for each process
called map_files; it contains a symbolic link describing
every file-backed mapping used by the relevant process. This feature
is one of many needed to support the desired checkpoint/restart
feature.
- /proc also supports a couple of new mount options. When
  mounted with hidepid=1, /proc will deny access to
  any process directories not owned by the requesting process. With
  hidepid=2, even the existence of other processes will be
  hidden. The default (hidepid=0) behavior is unchanged. The
  other new option (gid=N) provides an ID for a group that is
  allowed to access information for all processes regardless of the
  hidepid= setting. See the example after this list.
- New drivers:
- Systems and processors:
AppliedMicro APM8018X PowerPC processors,
Numascale NumaChip systems,
IBM Currituck (476fpe) boards, and
NVIDIA Tegra30 processors.
- Input:
TI TCA8418 keypad decoders,
Wacom Intuos4 wireless tablets,
EETI eGalax multi-touch panels,
GPIO-connected tilt switches,
Sharp GP2AP002A00F I2C Proximity/Opto sensors, and
PIXCIR I2C touchscreens.
- Miscellaneous: P7IOC PowerPC I/O hubs,
Dialog Semiconductor DA9052/53 PMIC devices,
SiRF SoC Platform Serial ports,
Analog Devices AD5421, AD5764, AD5744, and AD5380 digital to
analog converters,
GE PIO2 VME Parallel I/O cards,
OMAP 2/3/4 displays,
OMAP "Tiling and Isometric Lightweight Engine for Rotation" devices,
Dialog DA9052/DA9053 regulators,
VIA hardware watchdog timers, and
TI TCA6507 I2C LED controllers.
- Network: Calxeda 1G/10G XGMAC Ethernet interfaces and
ISA-based CC770 CAN controllers.
- USB: Marvell USB OTG transceivers and
Marvell EHCI host controllers.
- Graduations: Microsoft's Hyper-V virtual network driver and the gma500 graphics driver have moved out of staging into the mainline.
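Returning to the /proc mount options mentioned above: they are applied at mount (or remount) time. A hypothetical sketch in C (the group ID and flag choices are illustrative, and CAP_SYS_ADMIN is required) could look like:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Hide other users' processes entirely, but let members of
	 * group 1000 see everything. */
	if (mount("proc", "/proc", "proc", MS_REMOUNT,
		  "hidepid=2,gid=1000") < 0) {
		perror("remount /proc");
		return 1;
	}
	return 0;
}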
Changes visible to kernel developers include:
- A reworked version of the DMA buffer sharing API has been merged; this
API has been described in a separate
article.
- The "memblock" low-level memory allocation API has been substantially
reworked.
- Quite a few VFS interfaces have been changed to use the
umode_t type for file mode bits.
- Also in the VFS: most of
the members of struct vfsmount have been moved elsewhere
(to a containing struct mount) and hidden from
filesystem code. A number of callbacks in struct
super_operations (specifically: show_stats(),
show_devname(), show_path() and
show_options()) now take a pointer to struct dentry
instead of struct vfsmount.
- The pin control subsystem has gained a
new configuration interface.
- Boolean module parameters have traditionally allowed the underlying
  module variable to be of either bool or int type.
  That tolerance is coming to an end with 3.3, where non-bool
  types will generate a warning; the plan is apparently to change those
  warnings to fatal compilation errors
  in the 3.4 cycle. A lot of modules have seen type changes for their
  parameters in preparation for the new regime; a brief sketch of the
  required pattern appears after this list.
- The "system device" type has been removed from the kernel; all instances have been converted to regular devices instead. See this article for more information.
The merge window can be expected to remain open through approximately January 18.
Rethinking power-aware scheduling
Sometimes it seems that there are few uncontroversial topics in kernel development, but saving power would normally be among them. Whether the concern is keeping a battery from running down too soon or keeping the planet from running down too soon, the ability to use less power per unit of computation is seen as a good thing. So when the kernel's scheduler maintainer threatened to rip out a bunch of power-saving code, it got some people's attention. The main thing the scheduler can do to reduce power consumption is to allow as many CPUs as possible to stay in a deep sleep state for as long as possible. With contemporary hardware, a comatose CPU draws almost no power at all. If there is a lot of CPU-intensive work to do, there will be obvious limits on how much sleeping the CPUs can get away with. But, if the system is lightly loaded, the way the scheduler distributes running processes can have a significant effect on both performance and power use.
Since there is a bit of a performance tradeoff, the scheduler exports a couple of tuning knobs under /sys/devices/system/cpu. The first, called sched_mc_power_savings, has three possible settings:
- 0: The scheduler will not consider power usage when distributing tasks;
  instead, tasks will be distributed across the system for maximum
  performance. This is the default value.
- 1: One core will be filled with tasks before tasks will be moved to other
  cores. The idea is to concentrate the running tasks on a relatively
  small number of cores, allowing the others to remain idle.
- 2: Like setting 1, but with the additional tweak that newly awakened tasks will be directed toward "semi-idle" cores rather than started on an idle core.
There is another knob, sched_smt_power_savings, that takes the same set of values, but applies the results to the threads of symmetric multithreading (SMT) processors instead. These threads look a lot like independent processors, but, since they share most of the underlying hardware, they are not truly independent from each other.
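Both knobs are ordinary sysfs files, so switching modes is just a matter of writing a value to them; a quick, illustrative user-space sketch for selecting the task-packing mode might be:

#include <stdio.h>

/* Write an integer value to one of the power-saving knobs. */
static int set_knob(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", value);
	return fclose(f);
}

int main(void)
{
	/* Pack tasks onto as few cores (and SMT siblings) as possible. */
	set_knob("/sys/devices/system/cpu/sched_mc_power_savings", 1);
	set_knob("/sys/devices/system/cpu/sched_smt_power_savings", 1);
	return 0;
}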
Recently, Youquan Song noticed that sched_smt_power_savings did not actually work as advertised; a quick patch followed to fix the problem. Scheduler maintainer Peter Zijlstra objected to the fix, but he also made it clear that he objects to the power-saving machinery in general. Just to make that clear, he came back with a patch removing the whole thing and a threat to merge that patch unless somebody puts some effort into cleaning up the power-saving code.
Peter subsequently made it clear that he sees the value of power-aware scheduling; the real problem is in the implementation. And, within that, the real problem seems to be the control knobs. The two knobs provide similar behavioral controls at two levels of the scheduler domain hierarchy. But, with three possible values for each, the result is nine different modes that the scheduler can run in. That seems like too much complexity for a situation where the real choice comes down to "run as fast as possible," or "use as little power as possible."
In truth, it is not quite that simple. The performance cost of loading up every thread in an SMT processor is likely to be higher than that of concentrating tasks at higher levels. Those threads contend for the actual CPU hardware, so they will slow each other down. So one could conceive of situations where an administrator might want to enable different behavior at different levels, but such situations are likely to be quite rare. It is probably not worth the trouble of maintaining the infrastructure to support nine separate scheduler modes just in case somebody wants to do something special.
For added fun, early versions of the patch adding the "book" scheduling level (used only by the s390 architecture) included a sched_book_power_savings switch, though that switch went away before the patch was merged. There is also the looming possibility that somebody may want to do the same for scheduling at the NUMA node level. There comes a point where the number of possibilities becomes ridiculous. Some people - Peter, for example - think that point has already been reached.
That conclusion leads naturally to talk of what should replace the current mechanism. One solution would be a simple knob with two settings: "performance" or "low power." It could, as Ingo Molnar suggested, default to performance for line-connected systems and low power for systems on battery. That seems like a straightforward solution, but there is also a completely different approach suggested by Indan Zupancic: move that decision making into the CPU governor instead. The governor is charged with deciding which power state a CPU should be in at any given (idle) time. It could be given the additional task of deciding when CPUs should be taken offline entirely; the scheduler could then just do its normal job of distributing tasks among the CPUs that are available to it. Moving this responsibility to the governor is an interesting thought, but one which does not currently have any code to back it up; until somebody rectifies that little problem, a governor-based approach probably will not receive a whole lot more consideration.
Somebody probably will come through with the single-knob approach, though; whether they will follow through and clean up the power-saving implementation within the scheduler is harder to say. But it should be enough to avert the threat of seeing that code removed altogether. And that is certainly a good thing; imagine the power that would be uselessly consumed in a flamewar over a regression in the kernel's power-aware scheduling ability.
DMA buffer sharing in 3.3
Back in August 2011, LWN looked at the DMA buffer sharing patch set posted by Marek Szyprowski. Since then, that patch has been picked up by Sumit Semwal, who modified it considerably in response to comments from a number of developers. The version of this patch that was merged for 3.3 differs enough from its predecessors that it merits another look here. The core idea remains the same, though: this mechanism allows DMA buffers to be shared between drivers that might otherwise be unaware of each other. The initial target use is sharing buffers between producers and consumers of video streams; a camera device, for example, could acquire a stream of frames into a series of buffers that are shared with the graphics adapter, enabling the capture and display of the data with no copying in the kernel.
In the 3.3 sharing scheme, one driver will set itself up as an exporter of sharable buffers. That requires providing a set of callbacks to the buffer sharing code:
struct dma_buf_ops {
	int (*attach)(struct dma_buf *buf, struct device *dev,
		      struct dma_buf_attachment *dma_attach);
	void (*detach)(struct dma_buf *buf,
		       struct dma_buf_attachment *dma_attach);
	struct sg_table *(*map_dma_buf)(struct dma_buf_attachment *dma_attach,
					enum dma_data_direction dir);
	void (*unmap_dma_buf)(struct dma_buf_attachment *dma_attach,
			      struct sg_table *sg);
	void (*release)(struct dma_buf *);
};
Briefly, attach() and detach() inform the exporting driver when others take or release references to the buffer. The map_dma_buf() and unmap_dma_buf() callbacks, instead, cause the buffer to be prepared (or unprepared) for DMA and pass ownership between drivers. A call to release() will be made when the last reference to the buffer is released.
The exporting driver makes the buffer available with a call to:
struct dma_buf *dma_buf_export(void *priv, struct dma_buf_ops *ops, size_t size, int flags);
Note that the size of the buffer is specified here, but there is no pointer to the buffer itself. In fact, the current version of the interface never passes around CPU-accessible buffer pointers at all. One of the actions performed by dma_buf_export() is the creation of an anonymous file to represent the buffer; flags is used to set the mode bits on that file.
Since the file is anonymous, it is not visible to the rest of the kernel (or user space) in any useful way. Truly exporting the buffer, instead, requires obtaining a file descriptor for it and making that descriptor available to user space. The descriptor can be had with:
int dma_buf_fd(struct dma_buf *dmabuf);
There is no standardized mechanism for passing that file descriptor to user space, so it seems likely that any subsystem implementing this functionality will add its own special ioctl() operation to get a buffer's file descriptor. The same is true for the act of passing a file descriptor to drivers that will share this buffer; it is something that will happen outside of the buffer-sharing API.
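Putting the exporting side together: a minimal sketch, with my_buf_ops assumed to implement the callbacks shown above and the ioctl() plumbing omitted, might look like:

static int export_my_buffer(void *my_buffer, size_t size)
{
	struct dma_buf *dmabuf;

	/* Wrap the buffer; the final argument sets the mode of the
	 * anonymous file created to represent it. */
	dmabuf = dma_buf_export(my_buffer, &my_buf_ops, size, O_RDWR);
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	/* The resulting descriptor would be handed to user space via a
	 * subsystem-specific ioctl(). */
	return dma_buf_fd(dmabuf);
}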
A driver wishing to share a DMA buffer has to go through a series of calls after obtaining the corresponding file descriptor, the first of which is:
struct dma_buf *dma_buf_get(int fd);
This function obtains a reference to the buffer and returns a dma_buf structure pointer that can be used with the other API calls to refer to the buffer. When the driver is finished with the buffer, it should be returned with a call to dma_buf_put().
The next step is to "attach" to the buffer with:
struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev);
This function will allocate and fill in yet another structure:
struct dma_buf_attachment {
	struct dma_buf *dmabuf;
	struct device *dev;
	struct list_head node;
	void *priv;
};
That structure will then be passed to the exporting driver's attach() callback. There seem to be a couple of reasons for the existence of this step, the first of which is simply to let the exporting driver know about the consumers of the buffer. Beyond that, the device structure passed by the calling driver can contain a pointer (in its dma_parms field) to one of these structures:
struct device_dma_parameters {
	unsigned int max_segment_size;
	unsigned long segment_boundary_mask;
};
The exporting driver should look at these constraints and ensure that the buffer it is exporting can satisfy them; if not, the attach() call should fail. If multiple drivers attach to the buffer, the exporting driver will need to allocate the buffer in a way that satisfies all of their constraints.
The final step is to map the buffer for DMA:
struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, enum dma_data_direction direction);
This call turns into a call to the exporting driver's map_dma_buf() callback. If this call succeeds, the return value will be a scatterlist that can be used to program the DMA operation into the device. A successful return also means that the calling driver's device owns the buffer; it should not be touched by the CPU during this time.
Note that mapping a buffer is an operation that can block for a number of reasons - if the buffer is busy elsewhere, for example. Also worth noting is that, until this call is made, the buffer need not necessarily be allocated anywhere. The exporting driver can wait until others have attached to the buffer so that it can see their DMA constraints and allocate the buffer accordingly. Of course, if the buffer lives in device memory or is otherwise constrained on the exporting side, it can be allocated sooner.
After the DMA operation is completed, the sharing driver should unmap the buffer with:
void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, struct sg_table *sg_table);
That will, in turn, generate a call to the exporting driver's unmap_dma_buf() function. Detaching from the buffer (when it is no longer needed) can be done with:
void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach);
As might be expected, this function will call the exporting driver's detach() callback.
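Putting the consuming side together, the full sequence for an importing driver looks roughly like the following sketch; the device pointer, DMA direction, and error handling are illustrative, and the file descriptor is assumed to have arrived via a subsystem-specific ioctl().

static int import_buffer(int fd, struct device *my_device)
{
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	dmabuf = dma_buf_get(fd);			/* take a reference */
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	attach = dma_buf_attach(dmabuf, my_device);	/* announce ourselves */
	if (IS_ERR(attach)) {
		dma_buf_put(dmabuf);
		return PTR_ERR(attach);
	}

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		dma_buf_put(dmabuf);
		return PTR_ERR(sgt);
	}

	/* ... program the device's DMA engine using sgt ... */

	dma_buf_unmap_attachment(attach, sgt);
	dma_buf_detach(dmabuf, attach);
	dma_buf_put(dmabuf);
	return 0;
}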
As of 3.3, there are no users for this interface in the mainline kernel. There seems to be a fair amount of interest in using it, though, so Dave Airlie pushed it into the mainline with the idea that it would make the development of users easier. Some of those users can be seen (in an early form) in Dave's drm-prime repository and Rob Clark's OMAP4 tree.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet