Kernel development

Brief items

Kernel release status

The current development kernel is 4.9-rc7, released on November 27. Linus said that things are shaping up and it is possible, but perhaps not likely, that the final 4.9 release will happen on December 4. "I basically reserve the right to make up my mind next weekend."

4.9-rc6 was released on November 20.

The latest 4.9 regressions list, posted on November 20, shows ten open issues.

Stable updates: the last two weeks have seen the release of 4.8.9 and 4.4.33 (November 19), 4.8.10 and 4.4.34 (November 21), and 4.8.11 and 4.4.35 (November 26). As if that weren't enough, 4.8.12 and 4.4.36 are in the review process with an expected release date of December 2.

Quotes of the week

Given the history of RCU, I am clearly a dishonest practitioner. Nevertheless, I believe that for routine work, honesty is the best policy. But when the going gets tough, a robust combination of contamination and dishonesty is often the only way forward.
Paul McKenney

We've secretly replaced their regular MODVERSIONS with nothing at all, let's see if they notice.
Linus Torvalds

Kernel development news

The end of modversions?

By Jonathan Corbet
November 30, 2016
The 4.9-rc1 kernel prepatch, released on October 15, introduced a large set of new features — and, inevitably, a smaller set of new regressions. One of those problems, a module-related bootstrap failure, remains unfixed in the mainline even after the 4.9-rc7 release. A fix to the problem has been written and is known to work, but it may never be merged if, as seems reasonably likely, the community chooses a simpler option.

The problem of module compatibility

Loading modules into the kernel is a tricky business. Among other things, the module must precisely match the kernel into which it is being loaded in a number of ways. If a function prototype differs between the module and the kernel, bad things are sure to happen when that function is called. The same holds for data-structure layouts, configuration options, and even the version of the compiler used to build the various pieces. The obvious way to be sure that everything matches is to build the kernel and all loadable modules together; that is, indeed, how it is done most of the time. But there are users who want to be able to build the kernel and its modules separately.

One obvious use case for separately built modules is code that is not in the mainline, and, thus, cannot be built with the rest. There are also cases where users want to build and run a new kernel without necessarily rebuilding the modules that they use. Supporting these users while trying to protect the kernel against the loading of incompatible modules has led to the addition of a couple of layers of infrastructure.

The first of those is the "vermagic" string compiled into the kernel and into every loadable module. The system on which this article is being written features the following vermagic string:

    4.8.6-2-default SMP preempt mod_unload modversions

In the simplest configuration, the module loader will simply check to ensure that a module and the kernel have the exact same vermagic string. That ensures that the module was built for the same kernel version and that major options like SMP support were configured in the same way. If the test fails, the module will not be loaded.
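The simplest form of the test amounts to a string comparison. What follows is a minimal user-space sketch of that idea, not the kernel's actual loader code; the helper name vermagic_exact_match() is made up for illustration.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: accept a module only when its vermagic string
 * is byte-for-byte identical to the kernel's.
 */
static bool vermagic_exact_match(const char *kernel_magic,
                                 const char *module_magic)
{
    return strcmp(kernel_magic, module_magic) == 0;
}
```

With this check, a module built for "4.8.6-2-default" is rejected by a kernel reporting "4.8.6-3-default", even when every other field in the string is the same.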

That test, however, will thwart users who want to use the same binary module in multiple versions of the kernel. Even users who have a module built for a distribution kernel will run into trouble when the distributor ships an update; the version number will increment to something like 4.8.6-3 and the test will fail, even though the new kernel only adds a few fixes and is almost certainly compatible with the old module. Supporting those users requires a more nuanced compatibility test.

The "modversions" configuration option is meant to be that test. When enabled, modversions changes both the compilation process and the module loader. When the kernel is built, a checksum is calculated from the prototype of every exported function; those checksums are stored in a special section of the binary. When modules are built, those same checksums are calculated for every exported function that the module calls; the result is built into the module binary. At module-load time, the kernel will drop the first part of the vermagic string (the kernel version number) before comparing it, meaning that modules can now be loaded into versions other than the one they were built for. But the loader will also compare the checksums for all kernel symbols used by the module; should one of those checksums fail to match, the module will not be loaded. This test will, thus, catch major changes in the functions used by modules, but it still cannot catch more subtle changes.
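The two-part test described above might be sketched as follows. This is an illustration in user-space C under stated assumptions: the sym_crc structure, the function names, and the flat lookup table are all invented here, and the kernel's real data structures differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record pairing an exported symbol with its checksum. */
struct sym_crc {
    const char *name;
    uint32_t crc;
};

/* Skip the leading kernel-version token of a vermagic string. */
static const char *skip_version(const char *magic)
{
    const char *space = strchr(magic, ' ');

    return space ? space + 1 : magic + strlen(magic);
}

/* Compare vermagic strings while ignoring the version token. */
static bool vermagic_options_match(const char *kernel_magic,
                                   const char *module_magic)
{
    return strcmp(skip_version(kernel_magic),
                  skip_version(module_magic)) == 0;
}

/* Look up a symbol in the kernel's table; NULL if it is not exported. */
static const struct sym_crc *find_sym(const struct sym_crc *tab, size_t n,
                                      const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Every symbol the module uses must exist in the kernel with the same CRC. */
static bool crcs_match(const struct sym_crc *kernel, size_t nk,
                       const struct sym_crc *module, size_t nm)
{
    for (size_t i = 0; i < nm; i++) {
        const struct sym_crc *k = find_sym(kernel, nk, module[i].name);

        if (!k || k->crc != module[i].crc)
            return false;
    }
    return true;
}
```

A module would be accepted only if both vermagic_options_match() and crcs_match() succeed. Note that nothing here looks inside the structures passed to a function, which is why more subtle changes slip through.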

Recent changes and modversions

Back in February, Al Viro posted a set of changes to the symbol-export mechanism; these changes were designed to, among other things, allow the placement of EXPORT_SYMBOL() directives in assembly code for functions defined there. These changes, merged into the mainline for 4.9-rc1, improved symbol exports in a number of ways, but there was one little problem: the generation of checksums for symbols exported from assembly code does not work properly with binutils 2.27. In particular, those checksums (which were set to zero anyway) would be dropped entirely; the module loader would then complain about the missing checksums and refuse to load the module. As a result, systems with that version of binutils and with modversions enabled will fail to boot if they require a module that uses symbols defined in assembly code.

One fix, developed by Nick Piggin, is to create a special include file containing prototypes for functions exported from assembly code; the build process can read that file to generate the necessary modversions checksums. That ensures that the checksums are not only present, but also that they correspond to the symbols and can be meaningfully checked. This fix was merged for 4.9-rc6, but it failed to actually fix the problem because it did not finish the job. Functions defined in assembly code are, by their nature, architecture-specific, so the include file containing the prototypes must be created for each architecture. Those files were not actually created for any architecture beyond PowerPC so, as of 4.9-rc7, users of other architectures (i.e. most of us) can still run into the problem. Adam Borowski has posted a patch adding this file for the x86 architecture, but it has not been merged as of this writing.

And, indeed, it may never be merged, because it seems that most of the use cases for modversions no longer exist. Some distributors (notably Debian) make use of it but, since they take pains to not change APIs in supported kernels, all they really gain is the ability to avoid the kernel-version check (though Debian also counts on modversions to allow internal API changes to be made without changing the kernel version). As Linus Torvalds noted, the feature was once useful for developers who were tired of tracking down problems that were caused by stale kernel modules. In 2016, where the kernel version can contain the actual Git revision that was built and where the time required to build a full set of modules is short, modversions is no longer as useful as it once was. And, Piggin noted, modversions uses a fair amount of complicated machinery for a mediocre result:

But still, modversions is pretty complicated for what it gives us. It sends preprocessed C into a C parser that makes CRCs using type definitions of exported symbols, then turns those CRCs into a linker script which is used to link the .o file with. What we get in return is a quite limited symbol "versioning" system.

By "quite limited," he is referring to the fact that many changes will elude the modversions check. In particular, changes to a structure passed to a function will not be caught. Piggin suggested that a better result could be obtained if the whole mechanism were removed and replaced by a simple, manually maintained version number attached to each exported symbol. Whenever a developer made an incompatible change, they would be expected to increment the version number; modules using the affected interface would then fail to load until they were rebuilt.

The version-number suggestion did not get far; the chances of those numbers actually being maintained in a useful manner are quite small. But the idea of removing modversions was better received. Torvalds agreed that the whole thing "may just be too painful to bother with" and that the number of users is quite small, an idea reinforced by the fact that few testers complained about this issue. So, rather than apply the fix, Torvalds chose to mark modversions as "broken" (essentially disabling the feature altogether) instead. That change was merged just prior to the 4.9-rc7 release.

It seems, though, that not everybody is ready to see modversions go away quite yet; in particular, Debian, which is planning on using 4.9 for the upcoming "stretch" release, would like to have modversions available. So, after 4.9-rc7 was released, Torvalds committed another change re-enabling modversions, with one difference: rather than refusing to load a module when a checksum is missing, the loader will log a complaint and continue. That should suffice to get modversions working again on all systems without requiring the addition of architecture-specific include files. His real goal is clear, though: "Some day I really do want to remove MODVERSIONS entirely. Sadly, today does not appear to be that day."
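The relaxed behavior can be pictured with a small sketch. This is not the code from the actual commit; the function name, the NULL-pointer convention for "no checksum recorded", and the warning text are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical per-symbol check. 'recorded_crc' is NULL when no
 * checksum was recorded for the symbol.
 */
static bool symbol_version_ok(const char *name, uint32_t kernel_crc,
                              const uint32_t *recorded_crc)
{
    if (!recorded_crc) {
        /* Missing checksum: complain, but let the load proceed. */
        fprintf(stderr, "warning: no checksum for symbol %s\n", name);
        return true;
    }
    /* A checksum that is present must still match. */
    return *recorded_crc == kernel_crc;
}
```

The point of the design is that a missing checksum is now treated as "unknown" rather than "wrong", while a checksum that is present and differs still blocks the load.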

When that day does come, Piggin has a patch removing modversions altogether and replacing it with a simple option for distributors to supply their own ABI version string to be used instead of vermagic. Getting rid of modversions removes about 7,700 lines of code (much of which is generated by lex and bison) and simplifies the module-loading logic. It seems like a relatively easy sell — if distributors agree that they can do without modversions in the future.

statx() v3

By Jonathan Corbet
November 30, 2016
Some developments just take a long time to truly come to fruition. That has proved to be the case for the proposed statx() system call — at least, the "long time" part has, even if we may still be waiting for "fruition". By most accounts, though, this extension to the stat() system call would appear to be getting closer to being ready. Recent patches show the current state of statx() and where the remaining sticking points are.

The stat() system call, which returns metadata about a file, has a long history, having made its debut in the Version 1 Unix release in 1971. It has changed little in the following 45 years, even though the rest of the operating system has changed around it. Thus, it's unsurprising that stat() tends to fall short of current requirements. It is unable to represent much of the information relevant to files now, including generation and version numbers, file creation time, encryption status, whether they are stored on a remote server, and so on. It gives the caller no choice about which information to obtain, possibly forcing expensive operations to obtain data that the application does not need. The timestamp fields have year-2038 problems. And so on.

David Howells has been sporadically working on replacing stat() since 2010; his version 3 patch (counting since he restarted the effort earlier this year) came out on November 23. While the proposed statx() system call looks much the same as it did when we looked at it in May, there have been a few changes.

The prototype for statx() is still:

    int statx(int dfd, const char *filename, unsigned atflag, unsigned mask,
              struct statx *buffer);

Normally, dfd is a file descriptor identifying a directory, and filename is the name of the file of interest; that file is expected to be found relative to the given directory. If filename is passed as NULL, then dfd is interpreted as referring directly to the file being queried. Thus, statx() supersedes the functionality of both stat() and fstat().

The atflag argument modifies the behavior of the system call. It handles a couple of flags that already exist in current kernels: AT_SYMLINK_NOFOLLOW to return information about a symbolic link rather than following it, and AT_NO_AUTOMOUNT to prevent the automounting of remote filesystems. A set of new flags just for statx() controls the synchronization of data with remote servers, allowing applications to adjust the balance between I/O activity and accurate results. AT_STATX_FORCE_SYNC will force a synchronization with a remote server, even if the local kernel thinks its information is current, while AT_STATX_DONT_SYNC inhibits queries to the remote server, yielding fast results that may be out-of-date or entirely unavailable.

The atflag parameter, thus, controls what statx() will do to obtain the data; mask, instead, controls which data is obtained. The available flags here allow the application to request file permissions, type, number of links, ownership, timestamps, and more. The special value STATX_BASIC_STATS returns everything stat() would, while STATX_ALL returns everything available. Reducing the amount of information requested might reduce the amount of I/O required to execute the system call, but some reviewers worry that developers will just use STATX_ALL to avoid the need to think about it.

The final argument, buffer, contains a structure to be filled with the relevant information; in this version of the patch this structure looks like:

    struct statx {
	__u32	stx_mask;	/* What results were written [uncond] */
	__u32	stx_blksize;	/* Preferred general I/O size [uncond] */
	__u64	stx_attributes;	/* Flags conveying information about the file [uncond] */
	__u32	stx_nlink;	/* Number of hard links */
	__u32	stx_uid;	/* User ID of owner */
	__u32	stx_gid;	/* Group ID of owner */
	__u16	stx_mode;	/* File mode */
	__u16	__spare0[1];
	__u64	stx_ino;	/* Inode number */
	__u64	stx_size;	/* File size */
	__u64	stx_blocks;	/* Number of 512-byte blocks allocated */
	__u64	__spare1[1];
	struct statx_timestamp	stx_atime;	/* Last access time */
	struct statx_timestamp	stx_btime;	/* File creation time */
	struct statx_timestamp	stx_ctime;	/* Last attribute change time */
	struct statx_timestamp	stx_mtime;	/* Last data modification time */
	__u32	stx_rdev_major;	/* Device ID of special file [if bdev/cdev] */
	__u32	stx_rdev_minor;
	__u32	stx_dev_major;	/* ID of device containing file [uncond] */
	__u32	stx_dev_minor;
	__u64	__spare2[14];	/* Spare space for future expansion */
    };

Here, stx_mask indicates which fields are actually valid; it will be the intersection of the information requested by the application and what the filesystem is able to provide. stx_attributes contains flags describing the state of the file; they indicate whether the file is compressed, encrypted, immutable, append-only, not to be included in backups, or an automount point.
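The relationship between the request mask and the returned stx_mask can be illustrated with a few lines of C. The flag names and values below are placeholders invented for this sketch rather than the definitions from the patch set, and the helpers only model the bookkeeping; they do not call into the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative mask bits; the names and values are assumptions. */
#define DEMO_STATX_SIZE   0x1u
#define DEMO_STATX_BTIME  0x2u

/* What the filesystem can supply, intersected with what was asked for. */
static uint32_t result_mask(uint32_t requested, uint32_t supported)
{
    return requested & supported;
}

/* A field may be trusted only if its bit is set in the returned mask. */
static bool field_valid(uint32_t stx_mask, uint32_t bit)
{
    return (stx_mask & bit) != 0;
}
```

An application asking for both the size and the creation time from a filesystem that does not record creation times would get back a mask with only the size bit set, and should leave stx_btime alone.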

The timestamp fields contain this structure:

    struct statx_timestamp {
	__s64	tv_sec;
	__s32	tv_nsec;
	__s32	__reserved;
    };

The __reserved field was added in the version 3 patch as the result of one of the strongest points of disagreement in recent discussions about statx(). Dave Chinner suggested that, at some point in the future, nanosecond resolution may no longer be adequate; he said that the interface should be able to handle femtosecond timestamps. He was mostly alone on that point; other participants, such as Alan Cox, said that the speed of light will ensure that we never need timestamps below nanosecond resolution. Chinner insisted, though, so Howells added the __reserved field with the idea that it can be pressed into service should the need arise in the future.
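A user-space mirror of that structure shows two of its properties: the reserved field brings the size to an even 16 bytes, and the 64-bit seconds count has room for times past the point where a signed 32-bit counter runs out in 2038. The structure and function names here are stand-ins chosen for this sketch, with fixed-width C types substituted for the kernel's __s64 and __s32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct statx_timestamp using standard C types. */
struct demo_timestamp {
    int64_t tv_sec;     /* seconds since the epoch */
    int32_t tv_nsec;    /* nanoseconds within the second */
    int32_t reserved;   /* unused for now */
};

/* 8 + 4 + 4 bytes, with no room for hidden padding. */
_Static_assert(sizeof(struct demo_timestamp) == 16,
               "timestamp should occupy exactly 16 bytes");

/* True if the second count would not fit in a signed 32-bit value. */
static bool needs_64bit_seconds(const struct demo_timestamp *ts)
{
    return ts->tv_sec > INT32_MAX || ts->tv_sec < INT32_MIN;
}
```

The last second that a signed 32-bit counter can represent is 2147483647; one second later, needs_64bit_seconds() returns true while the structure itself still holds the value without trouble.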

Chinner had a number of other objections about the interface, some of which have not yet been addressed. These include the definition of the STATX_ATTR_ flags, which shadow a set of existing flags used with the FS_IOC_GETFLAGS and FS_IOC_SETFLAGS ioctl() calls. Reusing the flags allows a micro-optimization of the statx() code but, Chinner says, it perpetuates some interface mistakes made in the past. Ted Ts'o offered similar advice when reviewing a 2015 version of the patch set, but version 3 retains the same flag definitions.

The largest of Chinner's objections, though, may well be the absence of a comprehensive set of tests for statx(). This code, he said, should not go in until those tests are provided:

Quite frankly, I think this has to be an unconditional requirement for such generic, expandable new syscall functionality - either we get test coverage for it before merge, or we don't merge it. We've demonstrated time and time again that shit doesn't work if it's not tested and cannot be widely verified by independent filesystem developers.

This position has been echoed by others (Michael Kerrisk, for example) recently. The kernel does have a long history of merging new system calls that do not work as advertised, with corresponding pain resulting later on. Howells will likely end up providing such tests, but not yet:

Given the amount of bikeshedding that's taken place on this, I'm glad I *haven't* done the testsuite yet - it would have much more than doubled the amount of work. I *still* don't know what the final form is going to be.

The rate of change of the patch set does seem to be slowing so, perhaps, its final form is beginning to come into focus. The history of this work suggests that it would not be wise to predict its merging in the near future, though. The stat() system call has been with us for a long time; it's reasonable to expect that statx() will last for just as long. A bit of extra "bikeshedding" to get the interface right seems understandable in that context.

Patches and updates

Kernel trees

Linus Torvalds Linux 4.9-rc7 Nov 27
Linus Torvalds Linux 4.9-rc6 Nov 20
Greg KH Linux 4.8.11 Nov 26
Greg KH Linux 4.8.10 Nov 21
Greg KH Linux 4.8.9 Nov 19
Greg KH Linux 4.4.35 Nov 26
Greg KH Linux 4.4.34 Nov 21
Greg KH Linux 4.4.33 Nov 19
Steven Rostedt 4.4.32-rt43 Nov 18
Ben Hutchings Linux 3.16.39 Nov 20
Jiri Slaby Linux 3.12.68 Nov 29
Steven Rostedt 3.4.113-rt145 Nov 16
Ben Hutchings Linux 3.2.84 Nov 20

Architecture-specific

Vladimir Murzin Allow NOMMU for MULTIPLATFORM Nov 29
Hemant Kumar IMA Instrumentation Support Nov 21
Michael Ellerman powerpc: Implement kexec_file_load() Nov 29
David Matlack VMX Capability MSRs Nov 22

Device drivers

Rick Chang Add Mediatek JPEG Decoder Nov 17
YT Shen MT2701 DRM support Nov 25
Bartosz Golaszewski iio: misc: add a generic regulator driver Nov 29
Lorenzo Pieralisi ACPI IORT ARM SMMU support Nov 21
Benjamin Gaignard Add pwm and IIO timer drivers for stm32 Nov 22
Benjamin Gaignard Add pwm and IIO timer drivers for stm32 Nov 24
Vishwanathapura, Niranjana HFI Virtual Network Interface Controller (VNIC) Nov 18
Jordan Crouse Adreno A5XX support Nov 28

Device driver infrastructure

Heikki Krogerus USB Type-C Connector class Nov 17
Anshuman Khandual Define coherent device memory node Nov 22
Brian Starkey Introduce writeback connectors Nov 25
Sricharan R IOMMU probe deferral support Nov 30


Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds