
Kernel development

Brief items

Kernel release status

The current development kernel is 4.5-rc1, released on January 24. The 4.5 merge window is now closed. Linus said: "It's a fairly normal release - neither unusually big or unusually small. The statistics look fairly normal too, with drivers being a bit over 70% of the bulk (the big driver areas being gpu, networking, sound, staging, fbdev, but its all over)."

Stable updates: 4.3.4, 4.1.16, 3.14.59, and 3.10.95 were all released on January 23. The 4.4.1, 4.3.5, 4.1.17, 3.14.60, and 3.10.96 updates are in the review process as of this writing; they can be expected on or after January 29. Greg Kroah-Hartman warns: "There are still a lot of pending stable patches in the queue, well over 400 of them to be specific, so some of your favorite/pet patches might not be included in these releases." One should thus expect more stable updates in the near future.

Quotes of the week

But unless I add text like this occasionally, such people could easily read through much of memory-barriers.txt and think that they did in fact understand it. So I have to occasionally trip an assertion in their brain.
Paul McKenney

Many projects would consider 400 patches a major release, and here they are behind two dots.
Avi Kivity

Kernel development news

4.5 merge window part 3

By Jonathan Corbet
January 25, 2016
As expected, Linus released the 4.5-rc1 development kernel and closed the merge window for this cycle on January 24. Fewer than 2,000 changes were pulled since last week's summary, but there were some significant ones among them. The more interesting changes include:

  • A new tool called UBSAN checks a running kernel for various types of undefined behavior that can lead to obscure bugs; the commit changelog contains a list of bugs that have already been found by UBSAN and fixed. See Documentation/ubsan.txt for an introduction to this tool.

  • The new CONFIG_IO_STRICT_DEVMEM option, which blocks access to memory (via /dev/mem) claimed by device drivers, turned out to break booting on some systems, so it is now off by default.

  • The ARM multiplatform work, which aims to build a single ARM kernel that can boot on a wide variety of processors, has reached an important milestone with the merging of work to bring a number of minor platforms into the fold.

    This branch is the culmination of 5 years of effort to bring the ARMv6 and ARMv7 platforms together such that they can all be enabled and boot the same kernel. It has been a tremendous amount of cleanup and refactoring by a huge number of people, and creation of several new (and major) subsystems to better abstract out all the platform details in an appropriate manner.

  • The filesystems in user space (FUSE) subsystem has added support for the SEEK_HOLE and SEEK_DATA options to the lseek() system call.

  • The epoll_ctl() system call supports a new flag, EPOLLEXCLUSIVE, that causes epoll_wait() to only wake one process when a file descriptor becomes ready. See this article for a description of this option and the use case for it.

  • Direct-access ("DAX") mappings now work properly with the msync() and fsync() system calls.

  • The ext4 filesystem has gained "project quota" support, wherein dispersed files can be assigned to the same "project" and given their own quota. The feature is rigorously undocumented, but some information can be found in the header of this patch posting.

  • The implementation of the XFS XFS_IOC_FSSETXATTR and XFS_IOC_FSGETXATTR ioctl() commands has been moved up to the virtual filesystem level, and an implementation for the ext4 filesystem has been added. This operation, also severely undocumented, allows the querying (and setting) of various file attributes, including immutability, whether writes should always be synchronous, exclusion from backups, and more. See the defines near the top of this commit for the list of supported attributes.

  • The Ceph filesystem now has support for asynchronous I/O.

  • New hardware support includes:

    • Systems and processors: Renesas R-Car H3 systems, Ralink MT7621 processors, Microchip PIC32MZDA processors, Socionext UniPhier systems, and NVIDIA Tegra132 processors.

    • Miscellaneous: Qualcomm "shared memory state machine" controllers, Qualcomm wireless connectivity subsystem controllers, Qualcomm PCIe controllers, TI AMx3 Wkup-M3 inter-processor communication subsystems, Raspberry Pi power domain controllers, TI OMAP dual-mode timers, HiSilicon Hip06 PCIe host controllers, Intel "volume management device" PCI host bridges, and AMD "non-transparent bridge" performance-monitoring hardware.
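For readers unfamiliar with the SEEK_DATA and SEEK_HOLE options mentioned in the FUSE item above, here is a minimal user-space sketch of how a program might use them to total the data regions of a possibly sparse file. Nothing in it is specific to FUSE; count_data_bytes() is a name invented for this illustration, and the fallback definitions of the two constants assume their Linux values.

```c
#define _GNU_SOURCE             /* exposes SEEK_DATA and SEEK_HOLE in glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA               /* Linux values, in case the headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: add up the sizes of the data (non-hole) regions
 * of an open file. Returns -1 if the filesystem rejects the requests.
 */
off_t count_data_bytes(int fd)
{
    off_t total = 0, pos = 0;
    off_t end = lseek(fd, 0, SEEK_END);

    if (end < 0)
        return -1;
    while (pos < end) {
        /* Find the start of the next data region at or after pos. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            return (errno == ENXIO) ? total : -1; /* ENXIO: only a hole is left */
        /* Find where that region ends; end-of-file counts as a hole. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole <= data)
            return -1;
        total += hole - data;
        pos = hole;
    }
    return total;
}
```

On a filesystem that does not track holes, the kernel's generic fallback reports the whole file as one data region, so the result is simply the file size.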

Finally, back in December, Linus noticed that the user-space access utilities (get_user() and friends) were showing up heavily on some profiles, especially on systems where supervisor-mode access prevention is in use. The problem is that, often, the kernel needs to perform several accesses in a sequence, with the result that access prevention is turned off and back on numerous times.

The solution, as is so often the case, is batching: turn off access prevention once, do all the work, then turn it back on. To support this mode of access, Linus has introduced a new set of macros:

    user_access_begin();
    unsafe_put_user(value, user_space_pointer);
    unsafe_get_user(value, user_space_pointer);
    user_access_end();

As he puts it in the comments, the "unsafe" functions are not actually unsafe if they are used correctly, but developers must pay attention. The unsafe_put_user() and unsafe_get_user() macros can only be used after a user_access_begin() call is made, and the usual access_ok() checks must be done first. The first use of these functions is in the user-space string-manipulation functions. Only x86 is supported in 4.5, but support for other architectures should be forthcoming.
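The effect of batching can be illustrated with a small user-space model. This is only a sketch: model_access_begin(), model_access_end(), and the toggle counter are hypothetical stand-ins for the cost of disabling and re-enabling access prevention, and none of this is the kernel's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many times the modeled protection has been switched. */
unsigned long model_toggles;

void model_access_begin(void) { model_toggles++; }  /* protection off */
void model_access_end(void)   { model_toggles++; }  /* protection on  */

/* Unbatched: every single access pays for its own begin/end pair. */
void copy_unbatched(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        model_access_begin();
        dst[i] = src[i];
        model_access_end();
    }
}

/* Batched: one begin/end pair covers the whole sequence of accesses. */
void copy_batched(char *dst, const char *src, size_t n)
{
    model_access_begin();
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    model_access_end();
}
```

In this model, copying 16 bytes costs 32 toggles in the unbatched version and 2 in the batched one; the real savings depend on how expensive the toggling instructions are on a given processor.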

At the close of the merge window, 10,305 non-merge changesets had been pulled into the mainline repository. That suggests that 4.5 will be a relatively slow development cycle with regard to the number of changes merged. Much of that "slowness" can be attributed to a relatively small merge from the staging tree this time around; otherwise, the kernel developers appear to be working at full speed.

If the usual 63-day cycle holds, the release of the final 4.5 kernel can be expected to happen on March 13. Between now and then, though, there are certainly numerous bugs to be found and fixed.

Controlling access to user namespaces

By Jonathan Corbet
January 27, 2016
The user namespaces feature holds an interesting promise for system security: users can be confined within a namespace, given full root privileges within that namespace, and still be unable to adversely affect the system as a whole. The path to better security has, perhaps predictably, proved to be a bit rocky, however. In response, there is now an effort to make the feature configurable by system administrators, but this new configuration knob is proving to be a harder sell than one might expect.

User namespaces are created by passing the CLONE_NEWUSER flag to the clone() or unshare() system calls. Administrators who are nervous about allowing access to this feature currently only have one option: configure out support at kernel build time. That option is not easily available to the many systems running distribution-built kernels, though. Kees Cook set out to create an easier way with this patch set creating a new sysctl knob to control access to the user-namespace feature, saying:

There continues to be unexpected side-effects and security exposures via CLONE_NEWUSER. For many end-users running distro kernels with CONFIG_USER_NS enabled, there is no way to disable this feature when desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so admins not running containers or Chrome can avoid the risks of this feature.

In particular, the patch adds a knob called /proc/sys/kernel/userns_restrict. When it is set to the default value (zero), user namespaces are unrestricted. Setting it to one allows only privileged users to create user namespaces; a setting of two disables user namespaces altogether. In that final case, it is not possible to re-enable user namespaces without rebooting the system.
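The semantics described above can be summarized in a short sketch. This models only the behavior as stated in the text, not the code in Kees's patch; the enum and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The three knob values described in the text; the names are hypothetical. */
enum userns_restrict {
    USERNS_UNRESTRICTED    = 0,  /* default: anybody may create one */
    USERNS_PRIVILEGED_ONLY = 1,  /* only privileged users may create one */
    USERNS_DISABLED        = 2,  /* nobody may, until the next reboot */
};

/* Would a CLONE_NEWUSER request be allowed at this knob setting? */
bool userns_create_allowed(enum userns_restrict level, bool privileged)
{
    if (level == USERNS_UNRESTRICTED)
        return true;
    if (level == USERNS_PRIVILEGED_ONLY)
        return privileged;
    return false;
}

/* Model a write to the knob: once it reaches 2, it can no longer change. */
int userns_restrict_set(enum userns_restrict *level, int new_value)
{
    if (*level == USERNS_DISABLED)
        return -1;               /* locked until reboot */
    if (new_value < 0 || new_value > 2)
        return -1;               /* not a valid setting */
    *level = (enum userns_restrict)new_value;
    return 0;
}
```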

One of the first issues to be aired had to do with naming: it turns out that Debian currently carries a similar patch, but, on Debian systems, the knob is called unprivileged_userns_clone and doesn't support the "privileged users only" setting. Ben Hutchings agreed that the new naming was probably better and said that, should Kees's patch go upstream, Debian would slowly move over to it.

Some developers worried that allowing user namespaces to be turned off would slow the process of finding and fixing any remaining security issues. Additionally, Serge Hallyn suggested that, if application developers could not count on the availability of user namespaces, they wouldn't use them at all. He suggested that, if the knob is accepted, it be marked as a short-term workaround that would eventually be removed.

The strongest opposition, though, came from Eric Biederman, the creator of user namespaces and also the developer who has done the most work on the sysctl code in recent times. He stated flat out that "the code is buggy, and poorly thought through" and would not be merged. In another message he described his objections in detail, starting with a challenge to the idea that user namespaces are a security risk at all:

I don't actually think there do continue to be unexpected side-effects and security exposures with CLONE_NEWUSER. It takes a while for all of the fixes to trickle out to distros. At most what I have seen recently are problems with other kernel interfaces being amplified with user namespaces.

Others, though, seem to think that, if problems elsewhere are being "amplified," there is indeed a security exposure. Andy Lutomirski described some concerns of his own:

I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN over /any/ network namespace and to thus access the network configuration API to be a huge risk. For example, unprivileged users can program iptables. I'll eat my hat if there are no privilege escalations in there.

Eric echoed the point that making it possible to disable user namespaces would be a net loss in security, since the feature would not be available on all systems. He cited web browsing with Chrome as a use case; Kees responded that this patch wasn't really aimed at desktop systems in the first place.

Next on Eric's list was a complaint that a system-wide knob was too coarse; he suggested that perhaps the seccomp() mechanism should be used instead if access to user namespaces must really be restricted. Kees's answer here is that it's not really possible to set a global seccomp() policy, that performance would suffer in any case, and that seccomp() is meant for developers to use rather than system administrators. "It's an extraordinarily big hammer for wanting to turn off a single area of the kernel with a long history of problems." He noted that trying to use a Linux security module to achieve this end would have a number of similar problems.

Then, Eric said, the sysctl knob could create "a false sense of security" since it would have no effect on processes that are already running in a user namespace. If a security issue comes to light, just turning off the knob will not be enough to protect a system; a reboot will also be necessary. Eric returned to this point later, calling the patch "fatally flawed" as a result of the "subtlety and nuance" involved in using it.

Kees acknowledged the "corner case" in the sysctl implementation, one that, he said, applies to a number of other, existing knobs as well. But, he said, it really does not matter to an administrator who simply wants to disable the feature outright as a way of reducing the attack surface of a system. Even so, he allowed: "I'm open to having this sysctl kill all CLONE_NEWUSERed process trees", without noting that having a sysctl knob kill off processes might pose some interesting "subtlety and nuance" of its own.

As a sort of postscript, Eric suggested that, perhaps, the desired restriction could be implemented as a resource limit controlling the number of user namespaces that any user would be allowed to create. Setting that number to zero would effectively disable the feature. Kees indicated a willingness to look at this idea; it is the end result he wants, rather than the sysctl knob itself.

There is an evident desire for the ability to turn off access to user namespaces; various other developers spoke in its favor over the course of the discussion. But this desire is clearly not universal and, as a result, the current patches do not appear to have an easy path into the mainline. It is entirely possible that the concerns blocking this feature may eventually be addressed and overcome, but it also seems possible that, in the end, this knob ends up being part of the patch set carried by distributors and users. It seems that getting security-related changes into the kernel is still a difficult task.

Next-interrupt prediction

By Jonathan Corbet
January 27, 2016
There are many things an operating system would like to know about the future; one of those is when the next interrupt might come in. This information could be put to good use when it comes time to put an idle processor to sleep. Unfortunately, wormhole peripherals that can read information from the future are expensive, so the vast majority of processors are not equipped with them. That leaves no alternative to trying to guess this information using past behavior as a guide.

When a CPU has no work to do, it should go into a sleep state to save power. Modern processors offer a number of sleep states, though, with different characteristics. A shallow sleep is quick to get into and out of, but the power savings offered by shallow sleeps are relatively small. The deeper sleep states can reduce power consumption to nearly zero, but getting a processor back into a running state from a deep sleep state takes a long time and consumes a certain amount of power in its own right. So it only makes sense to enter a deep sleep state if the processor will remain asleep for a relatively long time.
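That tradeoff can be sketched as a simple selection rule: pick the deepest state whose minimum worthwhile sleep time fits within the expected idle period. The state names and residency numbers below are made up for illustration; real values are hardware-specific, and this is not the kernel's governor code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A sleep state and the minimum idle time that makes entering it pay off. */
struct sleep_state {
    const char *name;
    uint32_t target_residency_us;
};

/* Ordered from shallowest to deepest; all numbers are hypothetical. */
static const struct sleep_state sleep_states[] = {
    { "shallow", 0 },
    { "medium",  200 },
    { "deep",    5000 },
};

#define NR_SLEEP_STATES (sizeof(sleep_states) / sizeof(sleep_states[0]))

/* Return the index of the deepest state that the expected idle time justifies. */
size_t pick_sleep_state(uint64_t expected_idle_us)
{
    size_t best = 0;    /* the shallowest state is always acceptable */

    for (size_t i = 1; i < NR_SLEEP_STATES; i++)
        if (sleep_states[i].target_residency_us <= expected_idle_us)
            best = i;
    return best;
}
```

An overestimated idle period sends the processor into a state it must leave too early, while an underestimate leaves power savings on the table; that is why the quality of the guess matters.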

The kernel can never really know how long a processor will be able to sleep, so it has to make its best guess. One way to do that is to look at the next scheduled timer event; that provides an upper bound on how long the processor will remain idle, but it is not the whole picture. The other thing that can wake a processor is an interrupt from a peripheral device (or an interprocessor interrupt (IPI) from another CPU). Current kernels try to take interrupts into account by looking at the length of recent idle cycles; if those cycles are reasonably regular, their length can be taken as a good guess for when the next wakeup will occur.

But, as Daniel Lezcano notes in his IRQ-based wakeup prediction patches, this approach has some shortcomings. It looks only at idle periods without taking the wakeup event into account, so it cannot separate the effects of timer events and interrupts. IPIs factor into the estimate as well, but IPIs are often generated by the scheduler, which is also trying to figure out what the next idle period might be, leading to interesting feedback loops. This approach is also unable to take into account the behavior of individual interrupt sources or to respond to their addition or removal.

Daniel has been working on this problem for a while; he presented one solution at the 2014 Linux Plumbers Conference. His approach at that time was a relatively elaborate, bucket-based system that tracked interrupts associated with each process on the system. There were some complaints at the time that interrupt behavior has more to do with devices than processes, and this work was never pushed into the mainline.

The new approach is conceptually simpler. In short, it tracks the recent interrupt behavior of each device in the system on a per-CPU basis. When the time comes to guess at the length of an idle stretch, each device's behavior is examined separately, and a guess is made regarding which device will interrupt next and when that will happen.

To gather this information, Daniel introduces a new mechanism to track interrupt timings. It is all based around a structure with a handful of functions to be called out of the interrupt-handling subsystem:

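    typedef void (*irqt_handler_t)(unsigned int irq, ktime_t time, void *dev_id);

    struct irqtimings_ops {
        int (*alloc)(unsigned int irq);     /* interrupt descriptor added */
        void (*free)(unsigned int irq);     /* interrupt descriptor removed */
        int (*setup)(unsigned int irq, struct irqaction *act);  /* first handler set up */
        void (*remove)(unsigned int irq, void *dev_id);         /* last handler removed */
        irqt_handler_t handler;             /* called on each interrupt */
    };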

Interestingly, there can only be one of these structures in the system, and it must be declared with the DECLARE_IRQ_TIMINGS() macro. This mechanism runs in interrupt mode, so it must do as little work as possible; that means there is no desire to add an elaborate mechanism to call multiple handlers. The creation of a single, global structure also ensures that, if the mechanism is built into the kernel (via the CONFIG_IRQ_TIMINGS parameter), there is also a consumer for the timing information. In the absence of that consumer, the global structure will not be defined, and the kernel build will fail.

The alloc() and free() operations are called when interrupt descriptors (the core data structure for managing interrupt sources) are added to or removed from the kernel. The setup() and remove() operations, by contrast, are called when the first handler is set up for a given interrupt or the last one is removed. Finally, handler() is called whenever an actual interrupt happens for the given IRQ number; it is passed a timestamp saying when the interrupt occurred.

On the consumer side (the scheduler's idle-time estimation code), Daniel's patch sets up a data structure that looks like this:

    #define STATS_NR_VALUES 4

    struct stats {
        u64           sum;                     /* sum of values */
        u32           values[STATS_NR_VALUES]; /* array of values */
        unsigned char w_ptr;                   /* current window pointer */
    };

    struct wakeup {
        struct stats stats;
        ktime_t timestamp;                     /* time of the last observed interrupt */
    };

Each CPU gets its own array of wakeup structures, with one entry for each interrupt number. That structure holds the time of the last observed interrupt and the stats structure which, in turn, holds a simple circular buffer of observed interrupt timings.

When the interrupt timing handler is called, it looks up the appropriate wakeup structure. The time since the last interrupt is calculated and inserted into the circular buffer; the sum of all the interrupt timings is updated as well. If it has been more than one second since the previous interrupt, though, the accumulated information is discarded instead and the statistics collection is restarted from the beginning. Once collection has been active for a bit, the code can easily calculate the mean time between interrupts and the variance in that time as well.
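The bookkeeping described above might look roughly like the following user-space sketch. It reuses the structure layout shown earlier but substitutes standard integer types for the kernel's u64, u32, and ktime_t; wakeup_record(), stats_mean(), and stats_variance() are invented names, times are assumed to be in microseconds, and the actual patch may well differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STATS_NR_VALUES 4
#define RESET_GAP_US    1000000ULL  /* one second, per the description above */

struct stats {
    uint64_t      sum;                     /* sum of values */
    uint32_t      values[STATS_NR_VALUES]; /* array of values */
    unsigned char w_ptr;                   /* current window pointer */
};

struct wakeup {
    struct stats stats;
    uint64_t timestamp;                    /* time of the last interrupt, in us */
};

/* Record an interrupt at time now_us, updating the circular buffer. */
void wakeup_record(struct wakeup *w, uint64_t now_us)
{
    struct stats *s = &w->stats;
    uint64_t delta = now_us - w->timestamp;

    w->timestamp = now_us;
    if (delta > RESET_GAP_US) {
        /* Too long since the last interrupt: start collecting afresh. */
        memset(s, 0, sizeof(*s));
        return;
    }
    s->sum -= s->values[s->w_ptr];         /* drop the oldest interval */
    s->values[s->w_ptr] = (uint32_t)delta; /* store the newest one */
    s->sum += delta;
    s->w_ptr = (unsigned char)((s->w_ptr + 1) % STATS_NR_VALUES);
}

/* Mean interval; only meaningful once the window has filled. */
uint64_t stats_mean(const struct stats *s)
{
    return s->sum / STATS_NR_VALUES;
}

/* Variance of the intervals around the mean, in integer arithmetic. */
uint64_t stats_variance(const struct stats *s)
{
    int64_t mean = (int64_t)stats_mean(s);
    uint64_t acc = 0;

    for (int i = 0; i < STATS_NR_VALUES; i++) {
        int64_t d = (int64_t)s->values[i] - mean;
        acc += (uint64_t)(d * d);
    }
    return acc / STATS_NR_VALUES;
}
```

Keeping the running sum alongside the buffer means the mean costs a single division, which matters for code that runs out of the interrupt path.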

In the current patch set, there is no tracking at the level of individual devices; if multiple devices share an interrupt number, they will all appear together in the statistics. That may prove to be a shortcoming on systems with large amounts of interrupt sharing, but it should also be easy to fix should that turn out to be the case. Given that interrupt sharing appears to be slowly fading away, this may not be a concern in the end.

When the time comes to make a guess for the duration of the next idle period for a given CPU, the code iterates through all of the interrupts that have been active on that CPU. For each, the mean time between interrupts and the variance are calculated. If the timing between the last two interrupts was within one standard deviation of the mean, the code concludes that interrupts on this line are predictable; the next interrupt time is then calculated by adding the mean time to the time of the last interrupt. The interrupt that is predicted to happen the soonest is used to make a guess at the expected idle time.
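As a rough illustration of that selection step, the sketch below predicts the next wakeup from a set of per-interrupt histories. The irq_history structure and both function names are hypothetical, times are assumed to be in microseconds, and the comparison is done on squared values to avoid needing a square root; the real patch may compute things differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VALUES 4

/* Hypothetical per-interrupt record of recent timings. */
struct irq_history {
    uint32_t intervals[NR_VALUES]; /* recent gaps between interrupts */
    uint32_t last_interval;        /* the most recent of those gaps */
    uint64_t last_timestamp;       /* when the latest interrupt arrived */
};

/* Predict one interrupt line; returns false if it looks unpredictable. */
static bool predict_one(const struct irq_history *h, uint64_t *next)
{
    int64_t sum = 0, squares = 0;

    for (int i = 0; i < NR_VALUES; i++)
        sum += h->intervals[i];
    int64_t mean = sum / NR_VALUES;

    for (int i = 0; i < NR_VALUES; i++) {
        int64_t d = (int64_t)h->intervals[i] - mean;
        squares += d * d;
    }
    int64_t variance = squares / NR_VALUES;

    /* "Within one standard deviation": compare squared deviation to variance. */
    int64_t dev = (int64_t)h->last_interval - mean;
    if (dev * dev > variance)
        return false;

    *next = h->last_timestamp + (uint64_t)mean;
    return true;
}

/* Find the soonest predicted interrupt among all predictable lines. */
bool predict_next_wakeup(const struct irq_history *irqs, size_t n, uint64_t *next)
{
    bool found = false;
    uint64_t best = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t t = 0;
        if (!predict_one(&irqs[i], &t))
            continue;              /* skip unpredictable lines */
        if (!found || t < best) {
            best = t;
            found = true;
        }
    }
    if (found)
        *next = best;
    return found;
}
```

Given one line with steady 100-microsecond gaps last seen at time 1000, and another whose latest gap was far from its mean, the sketch ignores the second and predicts a wakeup at time 1100.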

One could certainly try to poke holes in this mechanism. Four samples seems like a small number to be drawing conclusions from. The fact that the algorithm skips over interrupts that seem unpredictable suggests that it might overestimate the length of the coming idle period. Daniel acknowledges some of these limitations, but says: "The statistics are very trivial and could be improved later but this first step shows we have a nice overall improvement in SMP." This mechanism does not show an improvement on uniprocessor systems, though.

There has been a fair amount of discussion around these patches, mostly focused on relatively low-level implementation details (whether timestamps should be kept in microseconds or nanoseconds, for example). One more significant issue is that the simple arrays indexed by interrupt number will not work on some systems with more complex interrupt setups. Instead, Thomas Gleixner said, the code needs to use a radix tree to track interrupt sources.

There does not appear to be opposition to the underlying approach, though. So, once the details have been worked out, this work may get the green light to go into the mainline kernel. Then, perhaps, our batteries will last a little longer, which cannot be a bad thing.

Patches and updates

Kernel trees

Linus Torvalds Linux 4.5-rc1
Sebastian Andrzej Siewior 4.4-rt3
Greg KH Linux 4.3.4
Greg KH Linux 4.1.16
Kamal Mostafa Linux 3.19.8-ckt13
Greg KH Linux 3.14.59
Greg KH Linux 3.10.95
Ben Hutchings Linux 3.2.76

Virtualization and containers

Boris Ostrovsky HVMlite domU support

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds