Kernel development
Brief items
Kernel release status
The current development kernel is 4.5-rc1, released on January 24. The 4.5 merge window is now closed. Linus said: "It's a fairly normal release - neither unusually big or unusually small. The statistics look fairly normal too, with drivers being a bit over 70% of the bulk (the big driver areas being gpu, networking, sound, staging, fbdev, but its all over)."
Stable updates: 4.3.4, 4.1.16, 3.14.59, and 3.10.95 were all released on
January 23.
The 4.4.1, 4.3.5, 4.1.17, 3.14.60, and 3.10.96 updates are in the review
process as of this writing; they can be expected on or after January 29. Greg
Kroah-Hartman warns:
"There are still a lot of pending stable patches in the queue, well
over 400 of them to be specific, so some of your favorite/pet
patches might not be included in these releases."
One should thus expect more stable updates in the near future.
Kernel development news
4.5 merge window part 3
As expected, Linus released the 4.5-rc1 development kernel and closed the merge window for this cycle on January 24. Fewer than 2,000 changes were pulled since last week's summary, but there were some significant changes to be found among them. Some of the more interesting changes include:
- A new tool called UBSAN checks a running kernel for various types of
undefined behavior that can lead to obscure bugs; the
commit changelog contains a list of bugs that have already been
found by UBSAN and fixed. See Documentation/ubsan.txt for an
introduction to this tool.
- The new CONFIG_IO_STRICT_DEVMEM option, which blocks access
to memory (via /dev/mem) claimed by device drivers, turned out
to break booting on some systems, so it is now off by default.
- The ARM multiplatform work, which aims to build a single ARM kernel
that can boot on a wide variety of processors, has reached an
important milestone with the merging
of work to bring a number of minor platforms into the fold.
This branch is the culmination of 5 years of effort to bring the ARMv6 and ARMv7 platforms together such that they can all be enabled and boot the same kernel. It has been a tremendous amount of cleanup and refactoring by a huge number of people, and creation of several new (and major) subsystems to better abstract out all the platform details in an appropriate manner.
- The filesystems in user space (FUSE) subsystem has added support for
the SEEK_HOLE and SEEK_DATA options to the
lseek() system call.
- The epoll_ctl() system call supports a new flag,
EPOLLEXCLUSIVE, that causes epoll_wait() to only wake
one process when a file descriptor becomes ready. See this article for a description of
this option and the use case for it.
- Direct-access ("DAX") mappings now work properly with the
msync() and fsync() system calls.
- The ext4 filesystem has gained "project quota" support, wherein
dispersed files can be assigned to the same "project" and given their
own quota. The feature is rigorously undocumented, but some
information can be found in the header of this
patch posting.
- The implementation of the XFS XFS_IOC_FSSETXATTR and
XFS_IOC_FSGETXATTR ioctl() commands has been moved
up to the virtual filesystem level, and an implementation for the ext4
filesystem has been added. This operation, also severely
undocumented, allows the querying (and setting) of various file
attributes, including immutability, whether writes should always be
synchronous, exclusion from backups, and more. See the defines near
the top of this
commit for the list of supported attributes.
- The Ceph filesystem now has support for asynchronous I/O.
- New hardware support includes:
- Systems and processors:
Renesas R-Car H3 systems,
Ralink MT7621 processors,
Microchip PIC32MZDA processors,
Socionext UniPhier systems, and
NVIDIA Tegra132 processors.
- Miscellaneous: Qualcomm "shared memory state machine" controllers, Qualcomm wireless connectivity subsystem controllers, Qualcomm PCIe controllers, TI AMx3 Wkup-M3 inter-processor communication subsystems, Raspberry Pi power domain controllers, TI OMAP dual-mode timers, HiSilicon Hip06 PCIe host controllers, Intel "volume management device" PCI host bridges, and AMD "non-transparent bridge" performance-monitoring hardware.
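The EPOLLEXCLUSIVE flag listed above is requested per file descriptor when that descriptor is added to an epoll instance. The following is a minimal user-space sketch, not taken from the kernel patches; the helper name add_exclusive_reader() is hypothetical, and the fallback definition is only there for C library headers that predate the flag.

```c
#include <sys/epoll.h>

/* Fallback for older C library headers; the value is the one used by the
 * kernel's UAPI header. */
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/*
 * Hypothetical helper: register fd for read readiness in exclusive mode.
 * Returns whatever epoll_ctl() returns: 0 on success, -1 with errno set
 * on failure.
 */
static int add_exclusive_reader(int epfd, int fd)
{
    struct epoll_event ev = { 0 };

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

In the intended use case, several processes would each create their own epoll instance, add the same descriptor this way, and block in epoll_wait(); the sketch only shows the registration step.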
Finally, back in December, Linus noticed that the user-space access utilities (get_user() and friends) were showing up heavily on some profiles, especially on systems where supervisor-mode access prevention is in use. The problem is that, often, the kernel needs to perform several accesses in a sequence, with the result that access prevention is turned off and back on numerous times.
The solution, as is so often the case, is batching: turn off access prevention once, do all the work, then turn it back on. To support this mode of access, Linus has introduced a new set of macros:
user_access_begin();
unsafe_put_user(value, user_space_pointer);
unsafe_get_user(value, user_space_pointer);
user_access_end();
As he puts it in the comments, the "unsafe" functions are not actually unsafe if they are used correctly, but developers must pay attention. The unsafe_put_user() and unsafe_get_user() macros can only be used after a user_access_begin() call is made, and the usual access_ok() checks must be done first. The first use of these functions is in the user-space string-manipulation functions. Only x86 is supported in 4.5, but support for other architectures should be forthcoming.
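The benefit of batching can be illustrated with a small user-space mock. This is only a sketch: the mock_ functions below are hypothetical stand-ins that count how often access prevention would be toggled, and they are not the kernel macros, which cannot be run outside the kernel.

```c
#include <stddef.h>

/* User-space mock: counts how often "access prevention" is toggled. */
static unsigned int mock_toggles;

static void mock_user_access_begin(void) { mock_toggles++; } /* prevention off */
static void mock_user_access_end(void)   { mock_toggles++; } /* prevention back on */

/* Unbatched store: opens and closes its own access window every time. */
static void mock_put_user(int value, int *dst)
{
    mock_user_access_begin();
    *dst = value;
    mock_user_access_end();
}

/* Batched store: only valid inside a begin/end window opened by the caller. */
static void mock_unsafe_put_user(int value, int *dst)
{
    *dst = value;
}

/* Fill n slots one access window at a time; returns the toggle count. */
static unsigned int fill_unbatched(int *dst, size_t n, int value)
{
    mock_toggles = 0;
    for (size_t i = 0; i < n; i++)
        mock_put_user(value, &dst[i]);
    return mock_toggles;
}

/* Fill n slots inside a single access window; returns the toggle count. */
static unsigned int fill_batched(int *dst, size_t n, int value)
{
    mock_toggles = 0;
    mock_user_access_begin();
    for (size_t i = 0; i < n; i++)
        mock_unsafe_put_user(value, &dst[i]);
    mock_user_access_end();
    return mock_toggles;
}
```

For n stores, the unbatched version toggles 2n times while the batched version toggles twice, regardless of n. The mock leaves out the access_ok() checks that real kernel code must still perform before entering the window.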
At the close of the merge window, 10,305 non-merge changesets had been pulled into the mainline repository. That suggests that 4.5 will be a relatively slow development cycle with regard to the number of changes merged. Much of that "slowness" can be attributed to a relatively small merge from the staging tree this time around; otherwise, the kernel developers appear to be working at full speed.
If the usual 63-day cycle holds, the release of the final 4.5 kernel can be expected to happen on March 13. Between now and then, though, there are certainly numerous bugs to be found and fixed.
Controlling access to user namespaces
The user namespaces feature holds an interesting promise for system security: users can be confined within a namespace, given full root privileges within that namespace, and still be unable to adversely affect the system as a whole. The path to better security has, perhaps predictably, proved to be a bit rocky, however. In response, there is now an effort to make the feature configurable by system administrators, but this new configuration knob is proving to be a harder sell than one might expect.
User namespaces are created by passing the CLONE_NEWUSER flag to the clone() or unshare() system calls. Administrators who are nervous about allowing access to this feature currently only have one option: configure out support at kernel build time. That option is not easily available to the many systems running distribution-built kernels, though. Kees Cook set out to create an easier way with this patch set creating a new sysctl knob to control access to the user-namespace feature, saying:
In particular, the patch adds a knob called /proc/sys/kernel/userns_restrict. When it is set to the default value (zero), user namespaces are unrestricted. Setting it to one allows only privileged users to create user namespaces; a setting of two disables user namespaces altogether. In that final case, it is not possible to re-enable user namespaces without rebooting the system.
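The three settings amount to a small policy table. The sketch below models that table in user space; it is not code from the patch, the function and enum names are hypothetical, and what counts as a "privileged" caller is left to the caller of the function since the text above does not specify it.

```c
#include <stdbool.h>

/* Values of the proposed userns_restrict knob, as described above. */
enum userns_restrict {
    USERNS_UNRESTRICTED    = 0, /* default: anyone may create namespaces */
    USERNS_PRIVILEGED_ONLY = 1, /* only privileged users may create them */
    USERNS_DISABLED        = 2, /* nobody may, until the next reboot */
};

/* Hypothetical check made when CLONE_NEWUSER is requested. */
static bool userns_creation_allowed(int level, bool privileged)
{
    switch (level) {
    case USERNS_UNRESTRICTED:
        return true;
    case USERNS_PRIVILEGED_ONLY:
        return privileged;
    default:
        return false;
    }
}

/*
 * Hypothetical model of a write to the knob: returns the new setting.
 * Once the knob has reached USERNS_DISABLED it stays there, and
 * out-of-range requests are ignored.
 */
static int userns_restrict_write(int cur, int requested)
{
    if (cur == USERNS_DISABLED)
        return cur;
    if (requested < USERNS_UNRESTRICTED || requested > USERNS_DISABLED)
        return cur;
    return requested;
}
```

The one-way behavior of the final setting is the part that matters for administrators: a system can be locked down at run time, but it cannot be opened back up without a reboot.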
One of the first issues to be aired had to do with naming: it turns out that Debian currently carries a similar patch, but, on Debian systems, the knob is called unprivileged_userns_clone and doesn't support the "privileged users only" setting. Ben Hutchings agreed that the new naming was probably better and said that, should Kees's patch go upstream, Debian would slowly move over to it.
Some developers worried that allowing user namespaces to be turned off would slow the process of finding and fixing any remaining security issues. Additionally, Serge Hallyn suggested that, if application developers could not count on the availability of user namespaces, they wouldn't use them at all. He suggested that, if the knob is accepted, it be marked as a short-term workaround that would eventually be removed.
The strongest opposition, though, came from Eric Biederman, the creator of
user namespaces and also the developer who has done the most work on the
sysctl code in recent times. He stated
flat out that "the code is buggy, and poorly thought through"
and would not be merged. In another
message he described his objections in detail, starting with a challenge
to the idea that user namespaces are a security risk at all:
Others, though, seem to think that, if problems elsewhere are being "amplified," there is indeed a security exposure. Andy Lutomirski described some concerns of his own:
Eric echoed the point that making it possible to disable user namespaces would be a net loss in security, since the feature would not be available on all systems. He cited web browsing with Chrome as a use case; Kees responded that this patch wasn't really aimed at desktop systems in the first place.
Next on Eric's list was a complaint that a system-wide knob was too coarse;
he suggested that perhaps the seccomp() mechanism should be used
instead if access to user namespaces must really be restricted. Kees's
answer here is that it's not really possible to set a global
seccomp() policy, that performance would suffer in any case, and
that seccomp() is meant for developers to use rather than system
administrators. "It's an extraordinarily big hammer for wanting to
turn off a single area of the kernel with a long history of
problems." He noted that trying to use a Linux security module to
achieve this end would have a number of similar problems.
Then, Eric said, the sysctl knob could create "a false sense of
security" since it would have no effect on processes that are
already running in a user namespace. If a security issue comes to light,
just turning off the knob will not be enough to protect a system; a reboot
will also be necessary. Eric returned to
this point later, calling the patch "fatally flawed" as a result of
the "subtlety and nuance" involved in using it.
Kees acknowledged the "corner case" in the
sysctl implementation, one that, he said, applies to a number of other,
existing knobs as well. But, he said, it really does not matter to an
administrator who simply wants to disable the feature outright as a way of
reducing the attack surface of a system. Even so, he allowed: "I'm
open to having this sysctl kill all CLONE_NEWUSERed process trees",
without noting that having a sysctl knob kill off processes might pose some
interesting "subtlety and nuance" of its own.
As a sort of postscript,
Eric suggested that, perhaps, the desired restriction could be
implemented as a resource limit controlling the number of user namespaces
that any user would be allowed to create. Setting that number to zero
would effectively disable the feature. Kees indicated a willingness to
look at this idea; it is the end result he wants, rather than the sysctl
knob itself.
There is an evident desire for the ability to turn off access to user
namespaces; various other developers spoke in its favor over the course of
the discussion. But this desire is clearly not universal and, as a
result, the current
patches do not appear to have an easy path into the mainline. It is
entirely possible that the concerns blocking this feature may eventually be
addressed and overcome, but it also seems possible that, in the end, this
knob ends up being part of the patch set carried by distributors and
users. It seems that getting security-related changes into the kernel is
still a difficult task.
Next-interrupt prediction
There are many things an operating system would like to know about the
future; one of those is when the next interrupt might come in. This
information could be put to good use when it comes time to put an idle
processor to sleep. Unfortunately, wormhole peripherals that can read
information from the future are expensive, so the vast majority of
processors are not equipped with them. That leaves no alternative to
trying to guess this information using past behavior as a guide.
When a CPU has no work to do, it should go into a sleep state to save
power. Modern processors offer a number of sleep states, though, with
different characteristics. A shallow sleep is quick to get into and out
of, but the power savings offered by shallow sleeps are relatively small.
The deeper sleep states can reduce power consumption to nearly zero, but
getting a processor back into a running state from a deep sleep state takes
a long time and consumes a certain amount of power in its own right. So it
only makes sense to enter a deep sleep state if the processor will remain
asleep for a relatively long time.
The kernel can never really know how long a processor will be able to sleep, so
it has to make its best guess. One way to do that is to look at the next
scheduled timer event; that provides an upper bound on how long the
processor will remain idle, but it is not the whole picture. The other
thing that can wake a processor is an interrupt from a peripheral device
(or an interprocessor interrupt (IPI) from another CPU). Current kernels
try to take interrupts into account by looking at the length of recent idle
cycles; if those cycles are reasonably regular, their length can be taken
as a good guess for when the next wakeup will occur.
But, as Daniel Lezcano notes in his IRQ-based
wakeup prediction patches, this approach has some shortcomings. It
looks only at idle periods without taking the wakeup event into account, so
it cannot separate the effects of timer events and interrupts. IPIs factor
into the estimate as well, but IPIs are often generated by the scheduler,
which is also trying to figure out what the next idle period might be,
leading to interesting feedback loops. This approach is also unable to
take into account the behavior of individual interrupt sources or to respond
to their addition or removal.
Daniel has been working on this problem for a while; he presented one solution at the 2014 Linux
Plumbers Conference. His approach at that time was a relatively elaborate,
bucket-based system that tracked interrupts associated with each process on
the system. There were some complaints at the time that interrupt behavior
has more to do with devices than processes, and this work was never pushed
into the mainline.
The new approach is conceptually simpler. In short, it tracks the recent
interrupt behavior of each device in the system on a per-CPU basis. When
the time comes to guess at the length of an idle stretch, each device's
behavior is examined separately, and a guess is made regarding which device
will interrupt next and when that will happen.
To gather this information, Daniel introduces a new mechanism to track
interrupt timings. It is all based around a structure with a handful of
functions to be called out of the interrupt-handling subsystem:
    typedef void (*irqt_handler_t)(unsigned int irq, ktime_t time, void *dev_id);
    struct irqtimings_ops {
        int (*alloc)(unsigned int irq);
        void (*free)(unsigned int irq);
        int (*setup)(unsigned int irq, struct irqaction *act);
        void (*remove)(unsigned int irq, void *dev_id);
        irqt_handler_t handler;
    };
Interestingly, there can only be one of these structures in the system, and
it must be declared with the DECLARE_IRQ_TIMINGS() macro. This
mechanism runs in interrupt mode, so it must do as little work as possible;
that means there is no desire to add an elaborate mechanism to call multiple
handlers. The creation of a single, global structure also ensures that, if
the mechanism is built into the kernel (via the CONFIG_IRQ_TIMINGS
parameter), there is also a consumer for the timing information. In the
absence of that consumer, the global structure will not be defined, and the
kernel build will fail.
The alloc() and free() operations are called when
interrupt descriptors (the core data structure for managing interrupt
sources) are added to or removed from the kernel. setup() and
remove(), instead, are called when the first handler is set up for
a given interrupt (or the last one removed). Finally, handler() is
called whenever an actual interrupt happens for the given irq
number; it is passed a timestamp saying when the interrupt occurred.
On the consumer side (the scheduler's idle-time estimation code),
Daniel's patch sets up a data structure that looks like this:
    #define STATS_NR_VALUES 4
    struct stats {
        u64 sum; /* sum of values */
        u32 values[STATS_NR_VALUES]; /* array of values */
        unsigned char w_ptr; /* current window pointer */
    };
    struct wakeup {
        struct stats stats;
        ktime_t timestamp;
    };
Each CPU gets its own array of wakeup structures, with one entry
for each interrupt number. That structure holds the time of the last
observed interrupt and the stats structure which, in turn, holds a
simple circular buffer of observed interrupt timings.
When the interrupt timing handler is called, it looks up the appropriate
wakeup structure. The time since the last interrupt is calculated
and inserted into the circular buffer; the sum of all the
interrupt timings is updated as well. If it has been more than one second
since the previous interrupt, though, the accumulated information is
discarded instead and the statistics collection is restarted from the
beginning. Once collection has been active for a bit, the code can easily
calculate the mean time between interrupts and the variance in that time as
well.
In the current patch set, there is no tracking at the level of individual
devices; if multiple devices share an interrupt number, they will all
appear together in the statistics. That may prove to be a shortcoming on
systems with large amounts of interrupt sharing, but it should also be easy
to fix should that turn out to be the case. Given that interrupt sharing
appears to be slowly fading away, this may not be a concern in the end.
When the time comes to make a guess for the duration of the next idle
period for a given CPU, the code iterates through all of the interrupts
that have been active on that CPU. For each, the mean time between
interrupts and the variance are calculated. If the timing between the last
two interrupts was within one standard deviation of the mean, the code
concludes that interrupts on this line are predictable; the next interrupt
time is then calculated by adding the mean time to the time of the last
interrupt. The interrupt that is predicted to happen the soonest is used
to make a guess at the expected idle time.
One could certainly try to poke holes in this mechanism. Four samples
seems like a small number to be drawing conclusions from. The fact that
the algorithm skips over interrupts that seem unpredictable suggests that
it might overestimate the length of the coming idle period. Daniel
acknowledges some of these limitations, but says: "The statistics are
very trivial and could be improved later but this first step shows we have
a nice overall improvement in SMP." This mechanism does not show an
improvement on uniprocessor systems, though.
There has been a fair amount of discussion around these patches, mostly
focused on relatively low-level implementation details (whether timestamps
should be kept in microseconds or nanoseconds, for example). One more
significant issue is that the simple arrays indexed by interrupt number
will not work on some systems with more complex interrupt setups.
Instead, Thomas Gleixner said, the code
needs to use a radix tree to track
interrupt sources.
There does not appear to be opposition to the underlying approach, though.
So, once the details have been worked out, this work may get the green
light to go into the mainline kernel. Then, perhaps, our batteries will
last a little longer, which cannot be a bad thing.
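The interval tracking and prediction steps described in this article can be sketched in user space as follows. This is an illustration under stated assumptions rather than Daniel's code: timestamps are plain microsecond counters instead of ktime_t, the count and seen fields and all three function names are additions made for the sketch, and the standard-deviation test is done on squared values to avoid a square root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STATS_NR_VALUES 4
#define RESET_GAP_US 1000000ULL /* one second, per the description above */

struct stats {
    uint64_t sum;                      /* sum of values */
    uint32_t values[STATS_NR_VALUES];  /* circular buffer of intervals */
    unsigned char w_ptr;               /* next slot to write */
    unsigned char count;               /* assumption: number of valid samples */
};

struct wakeup {
    struct stats stats;
    uint64_t timestamp;                /* time of last interrupt, in us */
    bool seen;                         /* assumption: any interrupt seen yet? */
};

/* Record an interrupt at now_us: store the interval since the previous one. */
static void wakeup_record(struct wakeup *w, uint64_t now_us)
{
    uint64_t delta = now_us - w->timestamp;
    bool first = !w->seen;
    struct stats *s = &w->stats;

    w->timestamp = now_us;
    w->seen = true;
    if (first || delta > RESET_GAP_US) {
        *s = (struct stats){ 0 };      /* restart collection */
        return;
    }
    if (s->count == STATS_NR_VALUES)
        s->sum -= s->values[s->w_ptr]; /* drop the oldest interval */
    else
        s->count++;
    s->values[s->w_ptr] = (uint32_t)delta;
    s->sum += delta;
    s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
}

/* True (and *next_us set) if the last interval is within one standard
 * deviation of the mean, i.e. the line looks predictable. */
static bool wakeup_predict(const struct wakeup *w, uint64_t *next_us)
{
    const struct stats *s = &w->stats;
    uint64_t mean, var = 0;
    int64_t d;

    if (s->count < STATS_NR_VALUES)
        return false;
    mean = s->sum / STATS_NR_VALUES;
    for (size_t i = 0; i < STATS_NR_VALUES; i++) {
        d = (int64_t)s->values[i] - (int64_t)mean;
        var += (uint64_t)(d * d);
    }
    var /= STATS_NR_VALUES;
    d = (int64_t)s->values[(s->w_ptr + STATS_NR_VALUES - 1) % STATS_NR_VALUES]
        - (int64_t)mean;
    if ((uint64_t)(d * d) > var)       /* |last - mean| > stddev */
        return false;
    *next_us = w->timestamp + mean;
    return true;
}

/* Scan all lines of one CPU and report the soonest predicted interrupt. */
static bool predict_next_event(const struct wakeup *w, size_t n, uint64_t *next_us)
{
    bool found = false;
    uint64_t t;

    for (size_t i = 0; i < n; i++) {
        if (wakeup_predict(&w[i], &t) && (!found || t < *next_us)) {
            *next_us = t;
            found = true;
        }
    }
    return found;
}
```

A caller would compare the predicted time against the next timer event and use the earlier of the two as the expected end of the idle period. Lines that fail the predictability test simply drop out of the scan, which is the source of the possible overestimate mentioned in the article.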
Page editor: Jonathan Corbet
