
Kernel development

Brief items

Kernel release status

The 3.16 kernel was released on August 3 (announcement), so there is no current development kernel. The 3.17 merge window has opened; see the separate article below for a summary of what has been pulled so far.

Stable updates: 3.15.8, 3.14.15, 3.10.51, and 3.4.101 were released on July 31. The 3.15.9, 3.14.16, 3.10.52, and 3.4.102 updates are in the review process as of this writing; they can be expected on or after August 7.


Quotes of the week

All that to basically say "don't panic" (that's the kernel's job).
Ted Unangst

Fact is, I'm quite happy to be flamed by people who think that such behavioural changes is the work of the $ANTIDEITY. If we are not making some people unhappy, then we are not pushing the boundaries of what is possible enough. The key is to listen to why people are unhappy about the change and to then address those concerns without compromising the benefits of the original change.
Dave Chinner

Always warn your family and give them an estimate return hour before opening that file.
Frederic Weisbecker on entry_64.S

Our goal over the next few years is for the Linux kernel network stack to rival or exceed that of FreeBSD.
Facebook


Kernel development news

3.17 merge window part 1

By Jonathan Corbet
August 6, 2014
As of this writing, some 6,144 non-merge changesets have been pulled into the mainline repository for the 3.17 development cycle. The merge window, in other words, has just begun. Even so, a number of interesting features and internal changes have found their way in. Some of the more interesting user-visible changes include:

  • The perf tool can now perform page fault tracing and generate associated statistics. Additionally, "perf timechart" can generate charts displaying I/O events.

  • Four-level page tables have been added to the arm64 architecture, greatly expanding the amount of virtual memory that can be addressed. This feature will likely not be enabled in 3.17 kernels, though, due to the need for more work. The arm64 architecture has also gained audit support.

  • The cryptographic subsystem has gained support for the NIST-specified SP800-90A deterministic random bit generator. There is also support for parsing PKCS#7 signed-data messages and verifying the signatures.

  • The netfilter "ulog" target, long deprecated, has finally been removed. The "NFLOG" target should be used instead.

  • Cross-thread filter setting for the secure computing ("seccomp") facility has been added. The interface has changed over time, though; there is now a new system call for seccomp:

        int seccomp(unsigned int operation, unsigned int flags, const char *args);
    
    The new seccomp(2) man page describes the new system call and the available commands; a brief usage sketch appears at the end of this list.

  • The security module subsystem has gained a new hook (kernel_fw_from_file()) that may be used to verify the integrity of firmware blobs before accepting them into the kernel.

  • The getrandom() system call has been merged; a brief usage sketch appears at the end of this list.

  • New hardware support includes:

    • Block: NVIDIA Tegra124 AHCI SATA controllers, Qualcomm APQ8064 and IPQ806x SATA SerDes/PHY controllers, Marvell Berlin SATA PHY controllers, and STMicroelectronics MIPHY365X SATA PHY controllers.

    • Clock: TI Palmas clk32kg and clk32kgaudio clocks, Rockchip rk3188, rk3066, and rk3288 clocks, Qualcomm APQ8084 clock controllers, and Cirrus Logic CLPS711X clocks.

    • Hardware monitoring: Lattice POWR1220AT8 power-management ICs, Texas Instruments TMP103 temperature sensors, Texas Instruments TPS40422 power management chips, IBM POWERNV platform sensors, and PWM-connected fans.

    • IIO: EPCOS T5403 digital barometric pressure sensors, Kionix KXCJK-1013 triaxial acceleration sensors, Asahi Kasei AK09911 3-axis compasses, Microchip MCP4902, MCP4912, and MCP4922 digital-to-analog converters, Maxim max1027, max1029 and max1031 analog-to-digital converters, Intersil ISL29125 digital color light sensors, TAOS TCS3414 digital color sensors, and Honeywell HMC5983 3-Axis magnetometers.

    • Miscellaneous: Intel E3-1200 DRAM controllers, Intel DH895xcc crypto accelerators (and Intel "Quick Assist Technology" devices in general), Intel MIC X100 DMA controllers, Qualcomm crypto engine accelerators, Thunderbolt devices on Apple systems, Maxim DS2406 addressable switches, Rockchip SPI controllers, Dialog Semiconductor DA9211/DA9212 regulators, and Atmel AT91 interrupt controllers.

    • Networking: TI CC2520 802.15.4 wireless-PAN networking controllers, Marvell Armada 375 Ethernet controllers, and STMicroelectronics ST21NFCB NFC controllers.

    • USB: Silicon Mitus SM5502 USB port switches, Emma Mobile EMXX USB function device controllers, Renesas R-Car xHCI host controllers, and NetChip PLX USB3380 and USB3382 USB peripheral controllers.

    • Video4Linux: Allwinner sunXi IR controllers, AirSpy software-defined radio controllers, Silicon Labs SI2165 DVB-C/-T demodulators, and Samsung Exynos3250 JPEG codecs.

    It's worth noting that 14 unloved drivers were removed from the staging tree, eliminating some 250,000 lines of code. Indeed, as of this writing, 3.17 is actually smaller than 3.16. A kernel release has been smaller than its predecessor only one other time in kernel development history (2.6.36).
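
As a footnote to two of the items above: neither seccomp() nor getrandom() has a C-library wrapper yet, so early users will be invoking them through syscall(). What follows is a minimal, hedged sketch of that usage; the one-instruction BPF filter is just a placeholder that allows everything, and the constants require 3.17-era kernel headers.

    /*
     * Hedged usage sketch for the new seccomp() and getrandom() system
     * calls, invoked via syscall() since no glibc wrappers exist yet.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/prctl.h>
    #include <linux/seccomp.h>
    #include <linux/filter.h>
    #include <linux/random.h>

    int main(void)
    {
	struct sock_filter allow = BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
	struct sock_fprog prog = { .len = 1, .filter = &allow };
	unsigned char buf[16];

	/* Required before an unprivileged process may install a filter. */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

	/* SECCOMP_FILTER_FLAG_TSYNC applies the filter to every thread in
	   the process, which is the cross-thread feature added for 3.17. */
	if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		    SECCOMP_FILTER_FLAG_TSYNC, &prog))
		fprintf(stderr, "seccomp filter not installed\n");

	/* getrandom() fills buf from the urandom pool; GRND_NONBLOCK asks it
	   to fail rather than block if the pool is not yet initialized. */
	if (syscall(__NR_getrandom, buf, sizeof(buf), GRND_NONBLOCK) < 0)
		perror("getrandom");

	return 0;
    }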

Changes visible to kernel developers include:

  • A number of changes have been made to the timekeeping core in order to make it ready for the year 2038; see the separate article below for details.

  • The dma-buf fence API has been added. This subsystem enables cross-device synchronization and coordination, especially around the use of DMA buffers. In the end, this API was made available to all kernel modules; a push to export it with EXPORT_SYMBOL_GPL() instead was not successful.

  • The "config-bisect" mode in the ktest utility has been reworked to be much smarter about finding the actual configuration change that causes problems.

  • The ftrace subsystem has been reworked to make things more efficient when multiple tracers are active, and, in particular, when function graph tracing is being performed.

  • Arm64 kernels can now be built with the -fstack-protector option to detect stack corruption.

  • The wait_on_bit() function (along with its variants) has been reworked to no longer require an "action" function since, as it turned out, most of those functions were duplicates of each other.
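
For those who have not seen the reworked wait_on_bit() calling convention, here is a rough sketch; the flag word and bit name are hypothetical, the point being that callers now pass a sleep mode (TASK_UNINTERRUPTIBLE below) rather than supplying their own "action" function.

    /* Hedged sketch of the simplified interface; MY_FLAG_BUSY and my_flags
     * are made-up names.  Before 3.17, callers had to pass an action function
     * that performed the actual sleep; now a sleep mode is enough. */
    #include <linux/wait.h>
    #include <linux/sched.h>

    #define MY_FLAG_BUSY	0	/* hypothetical bit number */

    static unsigned long my_flags;	/* hypothetical flag word */

    static int wait_until_not_busy(void)
    {
	/* Sleep until another thread clears the bit and calls
	   wake_up_bit(&my_flags, MY_FLAG_BUSY). */
	return wait_on_bit(&my_flags, MY_FLAG_BUSY, TASK_UNINTERRUPTIBLE);
    }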

In the 3.16 announcement, Linus noted that he will be traveling during the latter part of the 3.17 merge window. That may, he said, cause this merge window to be a little longer than usual. Subsystem maintainers would be wise to not count on that when sending their pull requests, though. It seems likely that Linus will feel motivated to close the merge window and get the 3.17-rc1 release out before the Kernel Summit starts on August 18.


Year 2038 preparations in 3.17

By Jonathan Corbet
August 6, 2014
January 19, 2038 is, of course, that fateful date when 32-bit time_t variables wrap around, bringing about the end of time as Unix-like systems know it. Though 2038 may seem far away, it is not too soon to be worrying about this problem; code needs to be able to work with future dates, and some systems being deployed now will certainly still be around 24 years from now. A full solution to the problem of keeping 32-bit systems working in 2038 is still a long way off. But some of the first baby steps in that direction were taken during the 3.17 merge window.

Much of the work involves changes to two structures used within the kernel: union ktime (usually referred to as ktime_t) and struct timespec. The ktime_t structure was introduced with the high-resolution timer patch set back in 2006; it is meant to be an opaque type for the storage of time values within the kernel. Indeed, it is sufficiently opaque that its definition varies widely depending on the underlying architecture.

For 64-bit systems, ktime_t has always been a simple integer count of nanoseconds. This "scalar" format is easy to manipulate and perform arithmetic on — as long as the CPU has fast 64-bit operations. Since such operations tend to be absent on 32-bit systems, ktime_t has often been defined differently there; it is represented as a structure with separate, 32-bit fields for seconds and nanoseconds. Kernel code uses a set of accessor functions for working with ktime_t values, so the difference in representation is well hidden and does not affect how the rest of the kernel works.

That difference will be rather less well hidden in 2038, though, when the 32-bit seconds field overflows with all the rest. So, for things to keep working in 2038, the ktime_t structure will have to change. One of the first changes merged for 3.17 is to simply get rid of the non-scalar form of ktime_t and force all architectures to use the 64-bit nanosecond count representation. This change may slow things down on 32-bit systems; in particular, conversions from other time formats may be significantly slower. But, as noted in the changelog, the ARM and x86 architectures were already using the scalar format anyway, so they will not get any slower.
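
Concretely, the unified type now looks roughly like the sketch below (simplified from <linux/ktime.h>), and the rest of the kernel keeps manipulating it through the existing accessor functions.

    /*
     * Simplified sketch: after this change, ktime_t is a single signed
     * 64-bit nanosecond count on every architecture; the 32-bit {sec, nsec}
     * variant is gone.
     *
     *     union ktime {
     *             s64     tv64;
     *     };
     *     typedef union ktime ktime_t;
     *
     * Typical accessor-based usage elsewhere in the kernel:
     */
    #include <linux/kernel.h>
    #include <linux/ktime.h>

    static void ktime_example(void)
    {
	ktime_t timeout = ktime_set(5, 0);	/* 5 seconds, 0 nanoseconds */

	timeout = ktime_add_ns(timeout, 500);	/* arithmetic stays in nanoseconds */
	pr_info("timeout is %lld ns\n", ktime_to_ns(timeout));
    }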

Regardless of whether conversions between ktime_t and other formats are fast or not, avoiding those conversions whenever possible is a promising way of optimizing code within the kernel. The 3.17 changes thus switch a number of kernel subsystems over to working directly with straight nanosecond time values. The result is generally a simplification of the code and, presumably, faster execution.

The other time-related structure used heavily within the kernel is struct timespec:

    struct timespec {
	__kernel_time_t	tv_sec;			/* seconds */
	long		tv_nsec;		/* nanoseconds */
    };

The __kernel_time_t type is just another name for time_t in current kernels; it is, thus, a 32-bit value on 32-bit systems. Unlike ktime_t, though, struct timespec cannot just be changed; it is used in user space as well and is a part of the kernel's ABI. What has been done instead in 3.17 is to add a new structure:

    struct timespec64 {
	time64_t	tv_sec;			/* seconds */
	long		tv_nsec;		/* nanoseconds */
    };

On 64-bit systems, this structure is identical to struct timespec. Within the core timekeeping code, every struct timespec has been changed to be struct timespec64 instead. The interfaces providing access to timekeeping functionality to the rest of the kernel have been tweaked to hide this change (so far), and a new set of interfaces has been added for code that is using struct timespec64. After this change, there are no more time values using 32-bit seconds counts in the timekeeping core.
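
The flavor of those interfaces is easy to sketch. The example below is a hedged illustration rather than an exhaustive list: the core now offers 64-bit-safe calls such as getnstimeofday64(), and conversion helpers in <linux/time64.h> let existing struct timespec users carry on unchanged (on 64-bit systems the conversions compile down to nothing).

    /* Hedged sketch of the 3.17-style split between the 64-bit-safe core
     * interface and the compatibility path for struct timespec users. */
    #include <linux/kernel.h>
    #include <linux/time64.h>
    #include <linux/timekeeping.h>

    static void timespec64_example(void)
    {
	struct timespec64 ts64;
	struct timespec ts;

	getnstimeofday64(&ts64);		/* new style: 64-bit seconds */
	ts = timespec64_to_timespec(ts64);	/* serve an old-style caller */
	ts64 = timespec_to_timespec64(ts);	/* and convert back again */

	pr_info("now: %lld.%09ld\n", (long long)ts64.tv_sec, ts64.tv_nsec);
    }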

The result of all this work is a long way from a solution to the year-2038 problem. But it is one important step in that direction: the core timekeeping code within the Linux kernel will no longer have problems when 2038 rolls around. With a couple more steps, a full solution to the problem may well be in sight. The first of those steps is to push use of struct timespec64 outward from the timekeeping core into the rest of the kernel. This task may involve a fair amount of work, but it is an example of the sort of evolutionary change that the kernel community is relatively good at. Given a bit of time, kernel code should be almost entirely free of year-2038 issues.

The harder step, of course, is to incorporate year-2038-safe structures into the kernel ABI and get user-space developers to change their code over. That will require cooperation with user-space developers from the C-library level on up and a lot of thought into how this change can be made with a minimum of pain. One should not expect it to happen quickly. But the problem is now well established on the radar of many of the relevant developers, so the chances of resolving most of the problems without a last-minute panic seem to be reasonably good. The first steps have been taken; hopefully the rest will follow before too long.


Control groups, part 6: A look under the hood

August 6, 2014

This article was contributed by Neil Brown



I must confess that when I look under the hood of an automobile I can only just figure out which of the major components are which, and certainly wouldn't want to try changing the oil. Looking inside a sophisticated piece of software is quite a different story: even the smallest elements jump and dance around together like a scene from Disney's Fantasia. So, no tour of something as rich as Linux control groups could be complete without exploring the code to see what patterns we can find.

We have had a few little glimpses into several of the cgroup subsystems already, so this excursion is going to focus on the cgroups core and, particularly, on the interconnections between processes and cgroups. One question we want to keep in mind is the comparative costs of different approaches. Last time, we saw a concern that trying to use a single hierarchy and still allow independent classification against multiple resources could lead to a combinatorial explosion of groups. If there are Q administrative groups, such as login sessions, and in each we potentially want to classify processes M different ways against each of N different resources, then we might need Q x M x N different groups. If we allow a separate hierarchy for each resource, there would only be Q + M x N groups. So the question is: does the simplicity of a single hierarchy outweigh the cost of the larger number of cgroups?

Managing the hierarchies themselves is unlikely to be particularly interesting — basic tree data structures are taught in most computer science courses and the shape of a cgroup hierarchy is unlikely to differ greatly from such structures. The interesting question is how processes are connected to cgroups. As we found when exploring the various cgroup subsystems, it is sometimes necessary to map from a process to the cgroup that is relevant for a particular subsystem, and it is sometimes necessary to get the list of all processes that are in a cgroup. It is the data structures which enable those mappings that will be our main focus.

However, a clarification is in order. Cgroups are not actually groups of processes, despite the fact that we have repeatedly described them that way. They are in fact groups of threads. So we will start out by clarifying the distinction by looking at how threads and processes are connected to each other and to some related objects.

Joining the dots ... er ... processes ... er ... threads

In V6 Unix, on through the early BSDs and several years of Linux, the process was a central, well-defined object. A process had a single thread of execution, an address space for memory, a process ID number, and a set of signal handlers, among other details. There was nothing else like a process and no room for confusion.

In V6 Unix (released in 1975), these processes were represented by a pre-allocated array of struct proc that was indexed by PID. If there was ever a need to find a process by some other key, such as by controlling tty, the code would simply search the whole array. This was clearly a simpler, less complicated age. By the time of 4.3BSD (1986), this fixed array had become a more dynamic linked list and in 4.4BSD there were even secondary lists so the processes in a process group could be found without searching the whole process table. Since then, things have only become more complex, though there is still some room for simple elegance.

[process diagram]

In Linux, the three-level hierarchy we first saw in 4.4BSD (session, process group, and process) has gained an extra level below processes: the thread. The result is a hierarchy something like that pictured to the right. A thread has its own execution context and its own "thread ID" number but can share most other details, particularly an address space, a set of signal handlers, and a process ID number, with other threads in the same process.

Internally, a thread is represented by a struct task_struct and is sometimes referred to as a task. Unfortunately, processes are also sometimes referred to as tasks — there is a do_each_pid_thread() macro that, quite reasonably, iterates over threads. The matching macro for iterating over processes is do_each_pid_task() (both macros are defined in pid.h). The term "process" is somewhat more reliable, but, since the term PID (or "process ID") is sometimes used for threads, processes, process groups, and sessions, it is sometimes safer to stick with the more precise term "thread group".

It would be particularly elegant if all four of these objects, the session, process group, process, and thread, were managed in a uniform way. While that is nearly the case, threads are still handled a bit specially. While it may not justify all of the handling differences, there are two properties of threads that do make them truly different. First, there is always one distinguished thread within a process that is known as the "group_leader". This provides something clear to point to and say "this is the process". The thread ID of the group_leader is the process ID of the whole process. A session does have a leader process, though only in a weak sense, as this process can exit before the rest of the session; process groups don't have any sort of leader at all. Secondly, a thread can only leave a process by exiting — it is not possible for a thread to move to a different process, unlike process groups where a process can move from one process group to another. This has important implications for locking.

Since Linux gained the concept of "PID namespaces", where the same process can have a different PID in different namespaces, there has been a "struct pid" (pictured blue) that links the first three object types together. For this discussion the important member of struct pid is

    struct hlist_head tasks[PIDTYPE_MAX];

where PIDTYPE_MAX is defined by:

    enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

(and so is clearly a number of PIDTYPEs, not the MAXimum) along with the corresponding field (shown in yellow):

    struct pid_link
    {
	struct hlist_node node;
	struct pid *pid;
    } pids[PIDTYPE_MAX];

in struct task_struct (note that some license has been taken in reformatting this code for publication).

Each PID has three lists (the hlist_head) of task_structs, which are linked together through the three hlist_nodes in each task_struct. The session and process group lists (PIDTYPE_SID and PIDTYPE_PGID) are really lists of processes and only the thread group leaders appear in these lists.

The PIDTYPE_PID list is not, as one might hope, a list of all the threads in a thread group, but is a list of just the thread (if it still exists) with this PID as its thread ID. It seems strange to use a list to contain just a single entry, but there is a reason. When one thread in a process makes an exec() system call (in any of its flavors), all other threads in the process are killed (SIGKILL) and the thread performing exec() takes over the identity (particularly the PID) of the thread group leader. This can result in two different threads having the same thread ID for a short time. Having a list for PIDTYPE_PID makes this possible.

The list of threads in a process is managed quite separately, using the field:

    struct list_head thread_group;

The first three lists have a distinct "head", in the struct pid, and then a number of "nodes", one in each struct task_struct. This list of threads forms a loop (drawn in red) with no head or tail. This is problematic for a subtle, but important, reason that, perhaps surprisingly, would not be a problem for the other three lists.

Because threads are only removed from a thread group (aka process) when they are being destroyed, it is safe to walk the list without strong locking. The lightweight "RCU read lock" is sufficient. When a thread is removed from the list, it remains half-in for just long enough that any code in the middle of walking the list and currently looking at the thread being deleted can still continue on to reach the end — though, given that the list is a loop, we should really say that it reaches the point in the loop where it started.

A potential problem with this picture is that if some code iterates over the thread list and starts from a thread which exits before the iteration finishes, it will never find its starting point and could loop forever. It is always safe to start from the thread group leader (as it remains a "zombie" until all other threads exit), but there is nothing in the API to force that starting point. One example of such questionable usage is in fill_stats_for_tgid(), which accumulates some statistics across all threads in a list. If this were requested for a PID that was not a thread group leader (which would be odd, but quite possible), it could hit problems if the thread exited at the wrong time. According to Oleg Nesterov (writing during Linux 3.14 development), "almost every lockless usage [of this thread linkage] is wrong".

Consequently, in Linux 3.14, a new linkage (not pictured) was introduced with a distinct thread_head; that head lives in the signal_struct, the structure that holds the fields specific to a process (as opposed to a thread). I find it a little disappointing that this head doesn't appear in struct pid like the other list heads, but that probably wouldn't have much practical benefit. In due course, all users should move away from the thread_group linkage, which can then be discarded.
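
In use, the difference is mostly a matter of which helpers one calls; a hedged sketch of walking the threads of a process with the helpers added alongside that 3.14 change might look like the following (count_threads() is, of course, a made-up stand-in for real per-thread work).

    /* Hedged sketch: iterating the threads of a process with the post-3.14
     * helpers, which walk the list headed in signal_struct rather than the
     * headless thread_group loop.  The RCU read lock is sufficient. */
    #include <linux/sched.h>
    #include <linux/rcupdate.h>

    static unsigned long count_threads(struct task_struct *task)
    {
	struct task_struct *t;
	unsigned long n = 0;

	rcu_read_lock();
	for_each_thread(task, t)	/* safe even if task is not the group leader */
		n++;
	rcu_read_unlock();

	return n;
    }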

This issue reminds us that locking is both subtle and important, so it must be understood properly. For all the lists we have looked at here, as well as the children/sibling list that links children of a process together, and the init_task/tasks list that links all processes together, there is a reader/writer spinlock, helpfully called tasklist_lock, that protects all accesses and changes. The only list accesses that are permitted without that lock are to follow the threads in a process (preferably using the new thread_head list) and to iterate over all thread group leaders starting from init_task. These are safe with only the RCU read lock because threads are never moved out of these lists into another.

There is some suggestion that this tasklist_lock has problems. One of the problems is that multiple overlapping readers can starve a process which needs to get write access. From Oleg Nesterov again (in a reply to this patchset): "Everyone seem to agree that tasklist[_lock] should die".

It would be an interesting exercise to make all the lists of processes and threads RCU-safe. This certainly wouldn't be trivial and, despite being proposed by Thomas Gleixner over four years ago, it still hasn't happened. However, given that the VFS's directory entry (dentry) cache can often be accessed under RCU, it seems likely that something can be done for the process tree too.

Cgroups linkages

These various linkages between threads and processes are made even more interesting by cgroups. The linkage to form cgroups into a hierarchy is fairly unsurprising (it is being rearranged substantially for 3.16, but the principle is unchanged):

    struct list_head sibling;   /* my parent's children */
    struct list_head children;  /* my children */
    struct cgroup *parent;      /* my parent */

Much more interesting is the linkage between threads and cgroups. As noted previously, there can be multiple cgroup hierarchies and each thread belongs to one cgroup in each of them. This requires an M x N mapping, so a few lists won't do. The required mapping is achieved using two intermediate data structures: the css_set and the cgrp_cset_link.

[Cgroup hierarchy structures]

When a process forks, the child will be in all the same cgroups that the parent is in. While either process could be moved around, they very often are not. This means that it is quite common for a collection of processes (and the threads within them) to all be in the same set of cgroups. To make use of this commonality, the struct css_set exists. It identifies a set of cgroups (css stands for "cgroup subsystem state") and each thread is attached to precisely one css_set. All css_sets are linked together in a hash table so that when a process or thread is moved to a new cgroup, a pre-existing css_set can be reused, if one exists with the required set of cgroups.

With similar threads and processes linked into a css_set, we still have an M x N mapping problem; it is just that the M is no longer the number of threads, but is now the much smaller number of css_sets. To link the various cgroups to the various css_sets we have the aptly-named cgrp_cset_link:

    struct cgrp_cset_link {
	struct cgroup       *cgrp;
	struct css_set      *cset;
	struct list_head    cset_link;
	struct list_head    cgrp_link;
    };

For each css_set, there is one cgrp_cset_link for every hierarchy. Each cgroup has a cset_links field that connects together all the cgrp_cset_links for that cgroup and from which all the threads in that cgroup can be found. Similarly each css_set has a cgrp_links field from which all the cgroups containing any of the threads in that css_set can be found (if all these list_heads are making your head spin, makelinux has a fairly good introduction to Linux linked lists).
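
The walk over these link structures is, for the "all threads in a cgroup" direction, hidden behind an iterator so that subsystems never touch cgrp_cset_link directly. A hedged sketch using the 3.15-era css_task_iter helpers (count_css_tasks() is a made-up example) looks something like this:

    /* Hedged sketch: counting every task attached to one cgroup's subsystem
     * state.  Internally the iterator walks the cgroup's cset_links list
     * and, for each css_set found there, the tasks attached to that set. */
    #include <linux/cgroup.h>

    static unsigned long count_css_tasks(struct cgroup_subsys_state *css)
    {
	struct css_task_iter it;
	struct task_struct *task;
	unsigned long n = 0;

	css_task_iter_start(css, &it);
	while ((task = css_task_iter_next(&it)))
		n++;
	css_task_iter_end(&it);

	return n;
    }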

This data structure quite effectively links all the threads and all the cgroups together. Whether it does so efficiently is a different question. It is quite good for finding all of the cgroups for a thread, or all the threads for a cgroup. It is far from suitable for finding the cgroup that provides, say, the net_cls subsystem for a particular process (in order to determine the class ID to assign to a new socket).

To meet this need there is some extra linkage (not shown in the picture above). Each css_set contains an array of pointers indexed by subsystem number that provides a direct link to the relevant cgroup for each subsystem, bypassing the cgrp_cset_link structures.

    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

(cgroup_subsys_state holds subsystem-specific information tightly connected to a cgroup).
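
The net_cls subsystem gives a feel for how that array is used. The sketch below is modeled loosely on the 3.15 code (the helper and identifier names should be treated as illustrative) and finds a task's network class ID without walking any of the link lists.

    /* Hedged sketch: a direct task-to-subsystem-state lookup through the
     * css_set's subsys[] array (via the task_css() helper), bypassing the
     * cgrp_cset_link structures entirely. */
    #include <linux/cgroup.h>
    #include <linux/rcupdate.h>
    #include <net/cls_cgroup.h>

    static u32 classid_for_task(struct task_struct *p)
    {
	struct cgroup_cls_state *cs;
	u32 classid;

	rcu_read_lock();
	cs = container_of(task_css(p, net_cls_cgrp_id),
			  struct cgroup_cls_state, css);
	classid = cs->classid;
	rcu_read_unlock();

	return classid;
    }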

If we turn now to the locking used to manage these lists, there are a couple of surprises. First, while there is a reader/writer lock that protects all of these linkages, as there was with processes and threads, this lock — css_set_rwsem — is a semaphore rather than a spinlock. This is surprising because semaphores are normally used when locks can be held for longer, so the waiting process might need to sleep (rather than just spin). Examining the history of this lock, it turns out that it was a spinlock until quite recently and was changed to a semaphore as part of a process to incrementally tidy up some code. It seems likely that this will become a spinlock again.

The other surprise is more interesting: cgroups adds another lock that can stop threads from joining or leaving a thread group: group_rwsem. There is one of these locks for each thread group (stored in the signal_struct); it is rarely taken exclusively (i.e. for write), so it is unlikely to affect performance, though avoiding new locks is generally best. It does raise a question, though: why do we need this new lock to protect the threads list?

The lock is needed because cgroups are groups of threads, rather than groups of processes. This means that when a process is added to a cgroup, each individual thread must be added separately, and it is necessary to keep the list of threads stable while they are being removed from one cgroup and added to another. If cgroups were truly groups of processes, there would be no need to move threads individually, and this lock could be discarded. The value of listing threads, rather than just processes (or their group leaders), is not obvious in the code. An analysis of the various subsystems from the perspective of whether any could make use of being able to distinguish between different threads is left as an exercise for the interested reader.

What did we learn?

Most of the value of looking into data structures is simply fleshing out a picture of how things plug together so that we can reason about the consequences of any possible changes. Nonetheless, there are a few specific lessons we can take away from this exploration.

First, we had a reminder that terminology can be challenging, and that we should be careful when interpreting what we read in the kernel code. A task is often a task, and a MAX is mostly a maximum. But not always.

Secondly, locking can be tricky. It is generally best to avoid requiring locks where possible, and doing so is easier when the data structures are simple and elegant. Reducing the impact of tasklist_lock is probably possible and apparently desirable. Reducing the impact of the cgroups locks, should that be necessary, would be harder simply because the data structures are more complex.

Finally, a key point of that complexity is the proliferation of multiply-linked cgrp_cset_link structures. There would be as many of these link structures as there would be cgroups if we had a single hierarchy and created a multitude of cgroups, as discussed last time, to allow different combinations of process classifications for different resources.

Put another way, the combinatorial explosion we were concerned about is unavoidable. It may be explicit in a multitude of cgroups, or it may be implicit in a multitude of cgrp_cset_link structures, but it will exist. Currently, cgroups are much bigger structures than the link structures, and cgrp_cset_links are created automatically, rather than requiring mkdir requests, so a proliferation of links is probably cheaper than a proliferation of cgroups.

This does suggest, though, that we could express the concern a different way. Rather than being concerned about a proliferation, we could instead be concerned about the size of cgroups and what mechanisms could be used to create cgroups automatically.

With that puzzle left as an open question, we come to the end of what we can discover about cgroups as they are found in Linux 3.15, the current release of Linux at the time of writing. We aren't quite finished yet though. The next and final installment will look beyond 3.15 to find out what value the proposed unified hierarchy might bring, and to try to present all the core issues into one coherent picture.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.16
Greg KH: Linux 3.15.8
Greg KH: Linux 3.14.15
Jiri Slaby: Linux 3.12.26
Greg KH: Linux 3.10.51
Greg KH: Linux 3.4.101

Virtualization and containers

jgross@suse.com: Add XEN pvSCSI support


Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds