Kernel development
Brief items
Kernel release status
The 3.2 kernel was released on January 4, after 72 days of development. Among other things, this kernel adds the proportional rate reduction TCP algorithm, the extended verification module, the CPU scheduler bandwidth controller, the cross-memory attach IPC mechanism, the Hexagon DSP architecture, improved recovery of corrupted Btrfs filesystems, and the I/O-less dirty throttling code. See the Kernelnewbies 3.2 page for lots more information. As of this writing, the 3.3 merge window is open; see below for details on what has been merged so far.
Stable updates: the 2.6.32.53, 3.0.16, and 3.1.8 stable kernel updates were released on January 6. Each contains the usual long list of important fixes (OK, 2.6.32.53 only has nine fixes, but the newer kernels have quite a few more).
The 2.6.32.54, 3.0.17, 3.1.9, and 3.2.1 stable updates are in the review process; they can be expected on or after January 12. 3.1.9 is likely to be the final update for the 3.1 kernel.
Quotes of the week
Code that specializes the kernel in weird ways is accepted into the kernel all the time, and I've tried to figure out why this particular bit of code is treated differently. Especially since this code is self-contained, configurable, and imposes no perceivable long-term maintenance burden.
Yes, you are special and unique, just like everyone else.
The next person who says the "embedded is different" phrase again, owes me a beer of my choice.
A long-term kernel support update
Greg Kroah-Hartman has posted an update on his plans for long-term kernel maintenance. As he announced before, the 3.1 series is almost at the end of its update period; he is also approaching the end of his maintenance for the long-lived 2.6.32 release. "It is approaching it's end-of-life, and I think I only have another month or so doing releases of this. After I am finished with it, it might be picked up by someone else, but I'm not going to promise anything." As it happens, Tim Gardner has stated that Ubuntu will support 2.6.32 through April 2015 - though whether that support will translate into kernel releases outside of the Ubuntu distribution is not clear. Ubuntu also plans to take on 3.2 as a long-term supported kernel.
No more system devices
Since the early days of the Linux device model, there has been a special device class for "system devices," typically those built into the platform itself. For almost as long, the driver core developers have felt that there was no real need for this device type, which looks weirdly different from every other type of device. For 3.3, they have actually done something about it; system devices are no more. All in-tree system device drivers have been fixed up to use regular devices instead. The process is relatively simple; it can be seen in, for example, this commit updating kernel/time/clocksource.c. In short, the embedded struct sys_device becomes a simple struct device, attributes defined with SYSDEV_ATTR() are switched to DEVICE_ATTR(), and the sysdev_class structure is turned into a nearly empty bus_type structure. That is about all that is required.
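The conversion pattern looks roughly like the sketch below; the "foo" names and the "enable" attribute are purely illustrative and not taken from any specific driver.

/* Before 3.3: a system device class with a SYSDEV_ATTR() attribute.
 *
 *	static struct sysdev_class foo_sysdev_class = {
 *		.name = "foo",
 *	};
 *	static SYSDEV_ATTR(enable, 0644, foo_show_enable, foo_store_enable);
 *
 * In 3.3, the same thing becomes a nearly empty bus_type plus a regular
 * struct device, with the attribute declared via DEVICE_ATTR():
 */
static struct bus_type foo_subsys = {
	.name = "foo",
	.dev_name = "foo",
};

static DEVICE_ATTR(enable, 0644, foo_show_enable, foo_store_enable);

static struct device foo_device = {
	.id = 0,
	.bus = &foo_subsys,
};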
These changes, naturally, cause a user-space ABI change; system devices had their own special area under /sys which goes away. That has the potential to break programs and scripts, which would not be a good thing. To avoid this problem, a special function has been added:
int subsys_system_register(struct bus_type *subsys, const struct attribute_group **groups);
Registering a subsystem in this way will restore its old /sys/devices/system hierarchy. Needless to say, this function exists for backward compatibility purposes only; using it in new drivers is not likely to be received well.
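For a driver that does need to preserve the old location, the call is straightforward; continuing the hypothetical "foo" example above (with no extra attribute groups), it might look like:

	int ret;

	/* Recreate the old /sys/devices/system/foo hierarchy */
	ret = subsys_system_register(&foo_subsys, NULL);
	if (ret)
		return ret;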
Yet another new approach to seccomp
Over the years, we have seen a number of attempts to use the seccomp ("secure computing") mechanism to reduce the range of operations available to a given process. The hope is to use such a mechanism as part of a sandboxing solution that would allow (for example) a web browser to run third-party code in a safer manner. Thus far, all of these attempts have gone down in flames; see Seccomp filters: no clear path from last May for the most recent episode in this particular story. Things have been quiet on the seccomp front recently - until now. Will Drewry, who has been behind the recent attempts to enhance seccomp, has come up with an interesting new approach to the problem. Whether this attempt will be more successful than its predecessors remains to be seen, but Will has managed to step around some of the traps that doomed his previous attempt.
In the last seccomp discussion, there was a fair amount of pressure to adapt the kernel's tracing infrastructure to this task; there was also resistance to using that infrastructure in that way. As explained in detail in the patch posting, Will has come to the conclusion that the tracing infrastructure is not really fit for the task anyway.
Will's new approach has a stroke of brilliance to it: rather than use the ftrace filter mechanism, he has repurposed the networking layer's packet filtering mechanism (BPF). The BPF code normally operates on packets; in the seccomp context, instead, it operates on the register set at the time of each system call. The registers will contain the system call number and its parameters, allowing the filter to make a wide range of decisions on what will (or will not) be allowed. BPF is also well-maintained and well-optimized code; it even has an in-kernel just-in-time compiler. Some of these advantages are lost because seccomp uses its own BPF interpreter; one assumes that a way could be found to merge the two implementations if the underlying idea looks like it will pass muster.
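To give a sense for what such a filter might look like, here is a minimal user-space sketch. The prctl() commands, the struct seccomp_data layout, and the SECCOMP_RET_* values shown here are assumptions based on the form the filter mechanism later took, not a description of the exact interface in the posting discussed above. The filter allows read(), write(), and exit_group() and kills the process on any other system call; a real filter would also check the arch field of seccomp_data before trusting the system call number.

#include <stdio.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	struct sock_filter filter[] = {
		/* Load the system call number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Allow read(), write(), and exit_group()... */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 3, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
		/* ...and kill the process for anything else. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Assumption: later revisions require no_new_privs for
	 * unprivileged use of the filter mode. */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		perror("prctl(PR_SET_SECCOMP)");
	/* From here on, only the three calls above will succeed. */
	return 0;
}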
As of this writing, there has not really been time for comments on the new patch. It will be interesting to see what the developers think. Meanwhile, those wanting more information should see the patch posting and the documentation file, which includes a sample program showing how to use the new facility.
Kernel development news
The first half of the 3.3 merge window
As of this writing, just over 5,700 non-merge changesets have been pulled into the mainline for the 3.3 development cycle. A fair amount of work remains to be pulled, so it looks like another fairly active cycle, though perhaps not quite up to the level of 3.2. Some of the more significant, user-visible changes merged so far include:
- The "team" network driver - a lightweight mechanism for bonding
multiple interfaces together - has been merged. The libteam project has the
user-space code needed to operate this device.
- The network priority control group controller has been added. This
controller allows the administrator to specify the priority with which
members of each control group have access to the network interfaces
available on the system. See net_prio.txt from the documentation
directory for more information.
- Also added is the TCP buffer size
controller which can be used to place limits on the amount of
kernel memory used to hold TCP buffers.
- The byte queue limits
infrastructure has been added, enabling control over how much data can
be queued for transmission over a network interface at any time.
- The Open vSwitch virtual network
switch has been merged.
- The ARM architecture has gained support for the "large physical
address extension," allowing 32-bit processors to address more than
4GB of installed memory.
- The "adaptive RED" queue management algorithm is now supported by the
networking layer.
- The near-field communications (NFC) layer has gained support for the
logical link control protocol (LLCP).
- The beginnings of dynamic frequency
selection support have been added to the wireless networking
subsystem.
- For S390 users who find the current limit of 3.8TB of RAM to be
constraining: 3.3 will add support for four-level page tables and an
upper limit of 64TB (for now).
- Various Android drivers have returned to the staging tree; see this article for more information.
- The C6X architecture (described in this
article) has been merged.
- The ext4 filesystem has added support for online resizing via the
EXT4_IOC_RESIZE_FS ioctl() command. This operation
does not (yet) work with filesystems using the "bigalloc" or "meta_bg"
features.
- The /proc filesystem has a new subdirectory for each process
called map_files; it contains a symbolic link describing
every file-backed mapping used by the relevant process. This feature
is one of many needed to support the desired checkpoint/restart
feature.
- /proc also supports a couple of new mount options. When
  mounted with hidepid=1, /proc will deny access to
  any process directories not owned by the requesting process. With
  hidepid=2, even the existence of other processes will be
  hidden. The default (hidepid=0) behavior is unchanged. The
  other new option (gid=N) provides an ID for a group that is
  allowed to access information for all processes regardless of the
  hidepid= setting. See the example after this list.
- New drivers:
- Systems and processors:
AppliedMicro APM8018X PowerPC processors,
Numascale NumaChip systems,
IBM Currituck (476fpe) boards, and
NVIDIA Tegra30 processors.
- Input:
TI TCA8418 keypad decoders,
Wacom Intuos4 wireless tablets,
EETI eGalax multi-touch panels,
GPIO-connected tilt switches,
Sharp GP2AP002A00F I2C Proximity/Opto sensors, and
PIXCIR I2C touchscreens.
- Miscellaneous: P7IOC PowerPC I/O hubs,
Dialog Semiconductor DA9052/53 PMIC devices,
SiRF SoC Platform Serial ports,
Analog Devices AD5421, AD5764, AD5744, and AD5380 digital to
analog converters,
GE PIO2 VME Parallel I/O cards,
OMAP 2/3/4 displays,
OMAP "Tiling and Isometric Lightweight Engine for Rotation" devices,
Dialog DA9052/DA9053 regulators,
VIA hardware watchdog timers, and
TI TCA6507 I2C LED controllers.
- Network: Calxeda 1G/10G XGMAC Ethernet interfaces and
ISA-based CC770 CAN controllers.
- USB: Marvell USB OTG transceivers and
Marvell EHCI host controllers.
- Graduations: Microsoft's Hyper-V virtual network driver and the gma500 graphics driver have moved out of staging into the mainline.
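Returning to the /proc mount options mentioned above: they are applied at mount (or remount) time. A hypothetical sketch in C (the group ID and flag choices are illustrative, and CAP_SYS_ADMIN is required) could look like:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Hide other users' processes entirely, but let members of
	 * group 1000 see everything. */
	if (mount("proc", "/proc", "proc", MS_REMOUNT,
		  "hidepid=2,gid=1000") < 0) {
		perror("remount /proc");
		return 1;
	}
	return 0;
}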
Changes visible to kernel developers include:
- A reworked version of the DMA buffer sharing API has been merged; this
API has been described in a separate
article.
- The "memblock" low-level memory allocation API has been substantially
reworked.
- Quite a few VFS interfaces have been changed to use the
umode_t type for file mode bits.
- Also in the VFS: most of
the members of struct vfsmount have been moved elsewhere
(to a containing struct mount) and hidden from
filesystem code. A number of callbacks in struct
super_operations (specifically: show_stats(),
show_devname(), show_path() and
show_options()) now take a pointer to struct dentry
instead of struct vfsmount.
- The pin control subsystem has gained a
new configuration interface.
- Boolean module parameters have traditionally allowed the underlying
  module variable to be of either bool or int type.
  That tolerance is coming to an end with 3.3, where non-bool
  types will generate a warning; the plan is apparently to change those
  warnings to fatal compilation errors
  in the 3.4 cycle. A lot of modules have seen type changes for their
  parameters in preparation for the new regime; a brief sketch of the
  required pattern appears after this list.
- The "system device" type has been removed from the kernel; all instances have been converted to regular devices instead. See this article for more information.
The merge window can be expected to remain open through approximately January 18.
Rethinking power-aware scheduling
Sometimes it seems that there are few uncontroversial topics in kernel development, but saving power would normally be among them. Whether the concern is keeping a battery from running down too soon or keeping the planet from running down too soon, the ability to use less power per unit of computation is seen as a good thing. So when the kernel's scheduler maintainer threatened to rip out a bunch of power-saving code, it got some people's attention. The main thing the scheduler can do to reduce power consumption is to allow as many CPUs as possible to stay in a deep sleep state for as long as possible. With contemporary hardware, a comatose CPU draws almost no power at all. If there is a lot of CPU-intensive work to do, there will be obvious limits on how much sleeping the CPUs can get away with. But, if the system is lightly loaded, the way the scheduler distributes running processes can have a significant effect on both performance and power use.
Since there is a bit of a performance tradeoff, the scheduler exports a couple of tuning knobs under /sys/devices/system/cpu. The first, called sched_mc_power_savings, has three possible settings:
- 0: The scheduler will not consider power usage when distributing tasks;
  instead, tasks will be distributed across the system for maximum
  performance. This is the default value.
- 1: One core will be filled with tasks before tasks will be moved to other
  cores. The idea is to concentrate the running tasks on a relatively
  small number of cores, allowing the others to remain idle.
- 2: Like setting 1, but with the additional tweak that newly awakened tasks will be directed toward "semi-idle" cores rather than started on an idle core.
There is another knob, sched_smt_power_savings, that takes the same set of values, but applies the results to the threads of symmetric multithreading (SMT) processors instead. These threads look a lot like independent processors, but, since they share most of the underlying hardware, they are not truly independent from each other.
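Both knobs are ordinary sysfs files, so switching modes is just a matter of writing a value to them; a quick, illustrative user-space sketch for selecting the task-packing mode might be:

#include <stdio.h>

/* Write an integer value to one of the power-saving knobs. */
static int set_knob(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", value);
	return fclose(f);
}

int main(void)
{
	/* Pack tasks onto as few cores (and SMT siblings) as possible. */
	set_knob("/sys/devices/system/cpu/sched_mc_power_savings", 1);
	set_knob("/sys/devices/system/cpu/sched_smt_power_savings", 1);
	return 0;
}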
Recently, Youquan Song noticed that sched_smt_power_savings did not actually work as advertised; a quick patch followed to fix the problem. Scheduler maintainer Peter Zijlstra objected to the fix, but he also made it clear that he objects to the power-saving machinery in general. Just to make that clear, he came back with a patch removing the whole thing and a threat to merge that patch unless somebody puts some effort into cleaning up the power-saving code.
Peter subsequently made it clear that he sees the value of power-aware scheduling; the real problem is in the implementation. And, within that, the real problem seems to be the control knobs. The two knobs provide similar behavioral controls at two levels of the scheduler domain hierarchy. But, with three possible values for each, the result is nine different modes that the scheduler can run in. That seems like too much complexity for a situation where the real choice comes down to "run as fast as possible," or "use as little power as possible."
In truth, it is not quite that simple. The performance cost of loading up every thread in an SMT processor is likely to be higher than that of concentrating tasks at higher levels. Those threads contend for the actual CPU hardware, so they will slow each other down. So one could conceive of situations where an administrator might want to enable different behavior at different levels, but such situations are likely to be quite rare. It is probably not worth the trouble of maintaining the infrastructure to support nine separate scheduler modes just in case somebody wants to do something special.
For added fun, early versions of the patch adding the "book" scheduling level (used only by the s390 architecture) included a sched_book_power_savings switch, though that switch went away before the patch was merged. There is also the looming possibility that somebody may want to do the same for scheduling at the NUMA node level. There comes a point where the number of possibilities becomes ridiculous. Some people - Peter, for example - think that point has already been reached.
That conclusion leads naturally to talk of what should replace the current mechanism. One solution would be a simple knob with two settings: "performance" or "low power." It could, as Ingo Molnar suggested, default to performance for line-connected systems and low power for systems on battery. That seems like a straightforward solution, but there is also a completely different approach suggested by Indan Zupancic: move that decision making into the CPU governor instead. The governor is charged with deciding which power state a CPU should be in at any given (idle) time. It could be given the additional task of deciding when CPUs should be taken offline entirely; the scheduler could then just do its normal job of distributing tasks among the CPUs that are available to it. Moving this responsibility to the governor is an interesting thought, but one which does not currently have any code to back it up; until somebody rectifies that little problem, a governor-based approach probably will not receive a whole lot more consideration.
Somebody probably will come through with the single-knob approach, though; whether they will follow through and clean up the power-saving implementation within the scheduler is harder to say. But it should be enough to avert the threat of seeing that code removed altogether. And that is certainly a good thing; imagine the power that would be uselessly consumed in a flamewar over a regression in the kernel's power-aware scheduling ability.
DMA buffer sharing in 3.3
Back in August 2011, LWN looked at the DMA buffer sharing patch set posted by Marek Szyprowski. Since then, that patch has been picked up by Sumit Semwal, who modified it considerably in response to comments from a number of developers. The version of this patch that was merged for 3.3 differs enough from its predecessors that it merits another look here. The core idea remains the same, though: this mechanism allows DMA buffers to be shared between drivers that might otherwise be unaware of each other. The initial target use is sharing buffers between producers and consumers of video streams; a camera device, for example, could acquire a stream of frames into a series of buffers that are shared with the graphics adapter, enabling the capture and display of the data with no copying in the kernel.
In the 3.3 sharing scheme, one driver will set itself up as an exporter of sharable buffers. That requires providing a set of callbacks to the buffer sharing code:
struct dma_buf_ops {
	int (*attach)(struct dma_buf *buf, struct device *dev,
		      struct dma_buf_attachment *dma_attach);
	void (*detach)(struct dma_buf *buf,
		       struct dma_buf_attachment *dma_attach);
	struct sg_table *(*map_dma_buf)(struct dma_buf_attachment *dma_attach,
					enum dma_data_direction dir);
	void (*unmap_dma_buf)(struct dma_buf_attachment *dma_attach,
			      struct sg_table *sg);
	void (*release)(struct dma_buf *);
};
Briefly, attach() and detach() inform the exporting driver when others take or release references to the buffer. The map_dma_buf() and unmap_dma_buf() callbacks, instead, cause the buffer to be prepared (or unprepared) for DMA and pass ownership between drivers. A call to release() will be made when the last reference to the buffer is released.
The exporting driver makes the buffer available with a call to:
struct dma_buf *dma_buf_export(void *priv, struct dma_buf_ops *ops, size_t size, int flags);
Note that the size of the buffer is specified here, but there is no pointer to the buffer itself. In fact, the current version of the interface never passes around CPU-accessible buffer pointers at all. One of the actions performed by dma_buf_export() is the creation of an anonymous file to represent the buffer; flags is used to set the mode bits on that file.
Since the file is anonymous, it is not visible to the rest of the kernel (or user space) in any useful way. Truly exporting the buffer, instead, requires obtaining a file descriptor for it and making that descriptor available to user space. The descriptor can be had with:
int dma_buf_fd(struct dma_buf *dmabuf);
There is no standardized mechanism for passing that file descriptor to user space, so it seems likely that any subsystem implementing this functionality will add its own special ioctl() operation to get a buffer's file descriptor. The same is true for the act of passing a file descriptor to drivers that will share this buffer; it is something that will happen outside of the buffer-sharing API.
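Putting the exporting side together: a minimal sketch, with my_buf_ops assumed to implement the callbacks shown above and the ioctl() plumbing omitted, might look like:

static int export_my_buffer(void *my_buffer, size_t size)
{
	struct dma_buf *dmabuf;

	/* Wrap the buffer; the final argument sets the mode of the
	 * anonymous file created to represent it. */
	dmabuf = dma_buf_export(my_buffer, &my_buf_ops, size, O_RDWR);
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	/* The resulting descriptor would be handed to user space via a
	 * subsystem-specific ioctl(). */
	return dma_buf_fd(dmabuf);
}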
A driver wishing to share a DMA buffer has to go through a series of calls after obtaining the corresponding file descriptor, the first of which is:
struct dma_buf *dma_buf_get(int fd);
This function obtains a reference to the buffer and returns a dma_buf structure pointer that can be used with the other API calls to refer to the buffer. When the driver is finished with the buffer, it should be returned with a call to dma_buf_put().
The next step is to "attach" to the buffer with:
struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev);
This function will allocate and fill in yet another structure:
struct dma_buf_attachment {
	struct dma_buf *dmabuf;
	struct device *dev;
	struct list_head node;
	void *priv;
};
That structure will then be passed to the exporting driver's attach() callback. There seem to be a couple of reasons for the existence of this step, the first of which is simply to let the exporting driver know about the consumers of the buffer. Beyond that, the device structure passed by the calling driver can contain a pointer (in its dma_parms field) to one of these structures:
struct device_dma_parameters {
	unsigned int max_segment_size;
	unsigned long segment_boundary_mask;
};
The exporting driver should look at these constraints and ensure that the buffer it is exporting can satisfy them; if not, the attach() call should fail. If multiple drivers attach to the buffer, the exporting driver will need to allocate the buffer in a way that satisfies all of their constraints.
The final step is to map the buffer for DMA:
struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, enum dma_data_direction direction);
This call turns into a call to the exporting driver's map_dma_buf() callback. If this call succeeds, the return value will be a scatterlist that can be used to program the DMA operation into the device. A successful return also means that the calling driver's device owns the buffer; it should not be touched by the CPU during this time.
Note that mapping a buffer is an operation that can block for a number of reasons - if the buffer is busy elsewhere, for example. Also worth noting is that, until this call is made, the buffer need not necessarily be allocated anywhere. The exporting driver can wait until others have attached to the buffer so that it can see their DMA constraints and allocate the buffer accordingly. Of course, if the buffer lives in device memory or is otherwise constrained on the exporting side, it can be allocated sooner.
After the DMA operation is completed, the sharing driver should unmap the buffer with:
void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, struct sg_table *sg_table);
That will, in turn, generate a call to the exporting driver's unmap_dma_buf() function. Detaching from the buffer (when it is no longer needed) can be done with:
void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach);
As might be expected, this function will call the exporting driver's detach() callback.
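Putting the consuming side together, the full sequence for an importing driver looks roughly like the following sketch; the device pointer, DMA direction, and error handling are illustrative, and the file descriptor is assumed to have arrived via a subsystem-specific ioctl().

static int import_buffer(int fd, struct device *my_device)
{
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	dmabuf = dma_buf_get(fd);			/* take a reference */
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	attach = dma_buf_attach(dmabuf, my_device);	/* announce ourselves */
	if (IS_ERR(attach)) {
		dma_buf_put(dmabuf);
		return PTR_ERR(attach);
	}

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		dma_buf_put(dmabuf);
		return PTR_ERR(sgt);
	}

	/* ... program the device's DMA engine using sgt ... */

	dma_buf_unmap_attachment(attach, sgt);
	dma_buf_detach(dmabuf, attach);
	dma_buf_put(dmabuf);
	return 0;
}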
As of 3.3, there are no users for this interface in the mainline kernel. There seems to be a fair amount of interest in using it, though, so Dave Airlie pushed it into the mainline with the idea that it would make the development of users easier. Some of those users can be seen (in an early form) in Dave's drm-prime repository and Rob Clark's OMAP4 tree.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet