
Kernel development

Brief items

Kernel release status

The current development kernel is 4.1-rc2, released on May 3. "As usual, it's a mixture of driver fixes, arch updates (with s390 really standing out due to that one prng commit), and some filesystem and networking."

Stable updates: none have been released in the last week. The 4.0.2, 3.19.7, 3.14.41, and 3.10.77 updates are in the review process as of this writing. They were originally expected on May 4, but have been held up due to some problems caused by the inclusion of some ill-advised patches.


Quotes of the week

Over the past 10 years the x86 FPU has organically grown into somewhat of a spaghetti monster that few (if any) kernel developers understand and which code few people enjoy to hack.

Many people suggested over the years that it needs a major cleanup, and some time ago I went "what the heck" and started doing it step by step to see where it leads - it cannot be that hard!

Three weeks and 200+ patches later I think I have to admit that I seriously underestimated the magnitude of the project! ;-)

Ingo Molnar embarks on a small cleanup task

PLEASE. We're not programming in Pascal (and thank all Gods for that), so we can have labels that have meaningful names. Also, we're not ashamed of using goto where it makes sense, so we don't need to try to hide the labels by making them look like specks of dirt on our monitor.
Linus Torvalds


Kernel development news

System call conversion for year 2038

By Jonathan Corbet
May 5, 2015
There are now less than 23 years remaining until that fateful day in January 2038 when signed 32-bit time_t values — used to represent time values in Unix-like systems — run out of bits and overflow. As that date approaches, 32-bit systems can be expected to fail in all kinds of entertaining ways and current LWN readers can look forward to being called out of retirement in a heroic (and lucrative) effort to stave off the approaching apocalypse. Or that would be the case if it weren't for a group of spoilsport developers who are trying to solve the year-2038 problem now and ruin the whole thing. The shape of that effort has come a bit more into focus with the posting by Arnd Bergmann of a new patch set (later updated) showing the expected migration path for time-related system calls.

Current Linux system calls use a number of different data types to represent times, from the simple time_t value through the timeval and timespec structures and others. Each, though, has one thing in common: an integer value counting the number of seconds since the beginning of 1970 (or from the current time in places where a relative time value is needed). On 32-bit systems, that count is a signed 32-bit value; it clearly needs to gain more bits to function in a world where post-2038 dates need to be represented.

Time representations

One possibility is to simply create 64-bit versions of these time-related structures and use them. But if an incompatible change is to be made, it might be worthwhile thinking a bit more broadly; to that end, Thomas Gleixner recently suggested the creation of a new set of (Linux-specific) system calls that would use a signed, 64-bit nanosecond counter instead. This counter would mirror the ktime_t type (defined in <include/linux/ktime.h>) used to represent times within the kernel:

    union ktime {
	s64	tv64;
    };
    typedef union ktime ktime_t;		/* Kill this */

(Incidentally, the "kill this" comment was added by Andrew Morton in 2007; nobody has killed it yet.)

Having user space work with values that mirror those used within the kernel has a certain appeal; a lot of time-conversion operations could be eliminated. But Arnd Bergmann pointed out a number of difficulties with this approach, including the fact that it makes a complicated changeover even more so. The fatal flaw, though, turns up in this survey of time-related system calls posted by Arnd shortly thereafter: system calls that deal with file timestamps need to be able to represent times prior to 1970. They also need to be able to express a wider range of times than is possible with a 64-bit ktime_t. So some variant of time_t must be used with them, at least. (The need to represent times before 1970 also precludes the use of an unsigned value to extend the forward range of a 32-bit time_t value).

So universal use of signed nanosecond time values does not appear to be in the cards, at least not as part of the year-2038 disaster-prevention effort. Still, there is room for some simplification. The current plan is to use the 64-bit version of struct timespec (called, appropriately, struct timespec64 in the kernel, though user space will still see it as simply struct timespec) for almost all time values passed into or out of the kernel. The various system calls that use the other (older) time formats can generally be emulated in user space. So, for example, a call to gettimeofday() (which uses struct timeval) will be turned into a call to clock_gettime() before entry into the kernel. That reduces the number of system calls for which compatibility must be handled in kernel space.

Thus, in the future, a 32-bit system that is prepared to survive 2038 will use struct timespec64 for all time values exchanged with the kernel. That just leaves the minor problem of how to get there with a minimal amount of application breakage. The current plan can be seen in Arnd's patch set, which includes a number of steps to move the kernel closer to a year-2038-safe mode of operation.

Getting to a year-2038-safe system

The first of those steps is to prepare to support 32-bit applications while moving the kernel's internal time-handling code to 64-bit times in all situations. The internal kernel work has been underway for a while, but the user-space interfaces still need work, starting with the implementation of a set of routines that will convert between 32-bit and 64-bit values at the system-call boundary. The good news is that these routines already exist in the form of the "compatibility" system calls used by 32-bit applications running on a 64-bit kernel. In the future, all kernels will be 64-bit when it comes to time handling, so the compatibility functions are just what is needed (modulo a few spots where other data types must be converted differently). So the patch set causes the compatibility system calls to be built into 32-bit kernels as well as 64-bit kernels. These compatibility functions are ready for use, but will not be wired up until the end of the patch series.

The next step is the conversion of the kernel's native time-handling system calls to use 64-bit values exclusively. This process is done in two broad sub-steps, the first of which is to define a new set of types describing the format of native time values in user space. For example, system calls that currently accept struct timespec as a parameter will be changed to take struct __kernel_timespec instead. By default, the two structures are (nearly) the same, so the change has no effect on the built kernel. If the new CONFIG_COMPAT_TIME configuration symbol is set, though, struct __kernel_timespec will look like struct timespec64 instead.

The various __kernel_ types are used at the system-call boundary, but not much beyond that point. Instead, they are immediately converted to 64-bit types on all machines; on 64-bit machines, obviously, there is little conversion to do. Once each of the time-related system calls is converted in this manner, it will use 64-bit time values internally, even if user space is still dealing in 32-bit time values. Any time values returned to user space are converted back to the __kernel_ form before the system call returns. There is still no change visible to user space, though.

The final step is to enable the use of 64-bit time values on 32-bit systems without breaking existing 32-bit binaries. There are three things that must all be done together to make that happen:

  • The CONFIG_COMPAT_TIME symbol is set, causing all of the __kernel_ data structures to switch to their 64-bit versions.

  • All of the existing time-related system calls are replaced with the 32-bit compatibility versions. So, for example, on the ARM architecture, clock_gettime() is system call number 263. After this change, applications invoking system call 263 will get compat_sys_clock_gettime() instead. If the compatibility functions have been done correctly, binary applications will not notice the change.

  • The native 64-bit versions of the system calls are given new system call numbers; clock_gettime() becomes system call 388, for example. Thus, only newly compiled code that is prepared to deal with 64-bit time values will see the 64-bit versions of these calls.

And that is about as far as the kernel can take things. Existing 32-bit binaries will call the compatibility versions of the time-related system calls and will continue to work — until 2038 comes around, of course.

That leaves a fair amount of work to be done in user space. In a simplified view of the situation, the C libraries can be changed to use the 64-bit data structures and invoke the new versions of the relevant system calls. Applications can then be recompiled against the new library, perhaps with some user-space fixes required as well; after that, they will no longer participate in the year 2038 debacle. In practice, all of the libraries in a system and all applications may need to be rebuilt together to ensure that they have a coherent idea of how times are represented. The GNU C library uses symbol versioning, so it can be made to work with both time formats simultaneously, but many other libraries lack that flexibility. So converting a full distribution is likely to be an interesting challenge even once the work on the kernel side is complete.

Finishing the job

Even on the kernel side, though, there are a few pieces of the puzzle that have not yet been addressed. One significant problem is ioctl() calls; of the thousands of them supported by the kernel, a few deal in time_t values. They will have to be located and fixed one-by-one, a process that could take some time. The ext4 filesystem stores timestamps as 32-bit time_t values, though some variants of the on-disk format extend those fields to 34 bits. Ext3 does not support 34-bit timestamps, though, so the solution there is likely to be to drop it entirely in favor of ext4. NFSv3 has a similar problem, and may meet a similar fate; XFS also has some challenges to deal with. The filesystem issues, notably, affect 64-bit systems as well. There are, undoubtedly, many other surprises like this lurking in both the kernel and user space, so the task of making a system ready for 2038 goes well beyond migrating to 64-bit time values in system calls. Still, fixing the system calls is a start.

Once the remaining problems have been addressed, there is a final patch that can be applied. It makes CONFIG_COMPAT_TIME optional, but in a way that leaves the 64-bit paths in place while removing the 32-bit compatibility system calls. If this option is turned off, any binary using the older system calls will fail to run. This is thus a useful setting for testing year-2038 conversions or deploying long-lived systems that must survive past that date. As Arnd put it:

This is meant mostly as a debugging help for now, to let people build a y2038 safe distro, but at some point in the 2030s, we should remove that option and all the compat handling.

Presumably somebody will be paying attention and will remember to carry out this removal twenty years from now (if they are feeling truly inspired, they might just kill ktime_t while they are at it). At that point, they will likely be grateful to the developers who put their time into dealing with this problem before it became an outright emergency. The rest of us, instead, will just have to find some other way to fund our retirement.

(Thanks to Arnd Bergmann for his helpful comments and suggestions on an earlier draft of this article.)


The OrangeFS distributed filesystem

By Jake Edge
May 6, 2015

Vault 2015

There is no shortage of parallel, distributed filesystems available in Linux today. Each has its strengths and weaknesses, as well as its advocates and use cases. Orange File System (or OrangeFS) is another; it is targeted at providing high I/O performance on systems with up to several thousand multicore storage nodes, but the project is planning to support millions of cores eventually. The OrangeFS client code was proposed for the Linux kernel back in January. Walt Ligon, one of the principals behind the filesystem, gave a talk about OrangeFS at the Vault conference back in March.

At the beginning of the talk, Ligon noted that OrangeFS was similar "in some ways" to GlusterFS, which was the subject of an earlier Vault presentation. But OrangeFS grew out of a research project from 1993 called Parallel Virtual File System (PVFS). That filesystem (now in version 2, called PVFS2) is in use today by various commercial organizations as well as by universities. In 2008, the PVFS project was renamed to OrangeFS as part of changing its focus to a more general filesystem for "big data".

Overview

At its core, OrangeFS has a client-server architecture, most of which runs in user space. All of the code is available under the LGPL. There are multiple ways for client systems to use the PVFS protocol to access data on the servers. That includes libpvfs2 for low-level access, MPI-IO, Filesystem in Userspace (FUSE), web-related mechanisms (e.g. WebDAV, REST), and a Linux virtual filesystem (VFS) client implementation for mounting OrangeFS like any other filesystem in Linux. The latter is what is being proposed for upstream inclusion.

OrangeFS servers handle objects, called dataspaces, that can have both byte-stream and key-value components. The "Trove" subsystem determines how to store those components. Currently, the byte streams are stored as files on the underlying filesystem, while the key-value data is mostly stored in Berkeley DB files, though there is starting to be some use of LMDB.

[File structure diagram] As seen in the diagram at right, files are stored as a collection of objects: a metadata object and one or more distributed file ("Dfile") objects. Those are accessed from directory objects that include a metadata object. Each of those points to various DirData objects, which contain Dirent (directory entry) objects that point to the metadata object of a file.

Instead of blocks, OrangeFS is all about objects and leaves the block mapping to the underlying filesystems. There are no metadata servers, as all servers can handle all kinds of requests. It is possible to configure an OrangeFS filesystem to store its metadata separate from its data using parameters that govern how the objects should be distributed. Files are typically striped across multiple servers to facilitate parallel access.

OrangeFS provides a unified namespace, so that all files are accessible from a single mount point. It has a client protocol that supports lots of parallel clients and servers. That provides "high aggregate throughput", Ligon said.

In the past, users wanted MPI-IO access to files, but that has changed. Now, POSIX access is "what everyone wants to use". They want to be able to write Python scripts to access their data. But the POSIX API "can be a real limiting factor" because it doesn't understand parallel files, striping, and so on.

Another of the goals for OrangeFS is to "enable the future" by being flexible about the underlying technologies it uses. It wants to provide ways to swap in new redundancy, availability, and stability techniques. For example, OrangeFS is designed to allow users to use their own distribution equation, which is used to find and store data. That equation allows the system to determine which servers go with each object.

Another goal is to make OrangeFS grow to "exascale". One way to keep increasing storage is to add more disks to the computer, but that will eventually hit a wall. There is not enough bandwidth and compute power within a system to access all that data with reasonable performance; the solution to that problem is to add more computers into the mix.

That dramatically increases the number of cores accessing the data, but you can only increase the amount of storage per server to a certain point. Just as with the single computer system above, various limits will be hit, so a better solution is to add more servers with more network connections, but that can get costly. In an attempt to build a lower-cost alternative, Ligon has a new project to create, say, 500 storage servers, each using a Raspberry Pi with a disk. It will be much cheaper, but he thinks it will also be faster—though he still needs to prove that.

There are a number of planned OrangeFS attributes that are missing from the discussion so far, he said. For example, with a large enough number of servers and disks, there will be failures every day. Even if there are no failures, systems will need to be taken down to update the operating system or other software, so there is a need for features that provide availability.

Security is a "major issue" that has mostly been dealt with using "chewing gum and string", Ligon said. Data integrity is another important attribute, as the stored data must be periodically checked and repaired. There is also a need for ways to redistribute files and objects for load or space reasons, as well as a need for monitoring and administration tools.

OrangeFS V3

Some of the "core values" for the next major version of OrangeFS (3.0 or V3) are directly targeted at solving those problems. At the top of that list is "parallelism"; the filesystem should allow parallel access to files, directories, and metadata, while providing scalability through adding servers. The filesystem should also recognize that things are going to fail regularly. If a copy goes bad, throw it away and recreate it; if a node fails, simply discard and replace it.

OrangeFS V3 will minimize the dependencies between servers by not sharing state between them. That will allow servers to be added and removed as needed. Avoiding locks is key to providing better performance, which may require relaxing the semantics of some operations. Finally, 3.0 will target flexible site-customization policies for things like object placement, replication, migration, and so forth.

In order to do all of that, OrangeFS will change the PVFS handle that has been used to identify objects. It is a 64-bit value that encodes both the object and the server it lives on. That scheme has a number of limitations. Objects cannot migrate or be replicated and the collection of servers is static. That works well up to around 128 static servers, he said, but it won't work for OrangeFS V3.

The new handles will contain both an object ID and one or more server IDs, both of which will be 128-bit values. The number of server IDs will typically be somewhere between two and four that will be set when the filesystem is created; it can change, but in practice rarely will. These handles are internal-only, typically stored in metadata objects. By making this change, OrangeFS V3 will be able to do replication and migration.

This will allow all of the filesystem structure to be replicated, as well as the file data. A set of back references is also created, so that maintenance operations can find other copies of the structures. Each of those pieces and copies could be stored on different servers if that was desired. Another possibility is to use "file stuffing", which places the first data object on the same server as its metadata object.

Reads can be done from any server that has a copy of the object, while writes are done to the primary object. Its server then initiates the copy (or copies) needed for replication. The write will only complete and return to the client after a certain number of copies have completed. This is known as the "write confidence" required. For example, if one copy is sent to a much slower archive device, the write could complete after all or some of the non-archive copies have completed.

V3 adds a server ID database, rather than a fixed set of servers. That allows dynamic addition of servers with site-defined attributes (e.g. number, building, rack, etc.). A client doesn't have to know about all the servers, only the set it is using. Servers maintain a partial list of other servers that they tend to work with and there is a server resolution protocol to find others as needed.

The security model is already present in OrangeFS 2.9 (which is the current version of the filesystem). The model is based around capabilities that get returned based on the credentials presented when a metadata object is accessed. That capability is then passed when accessing the data objects. Certificates and public/private key pairs are used to authenticate clients and their credentials.

The final OrangeFS feature that Ligon described was the "parallel background jobs" (PBJs) that are used for maintenance and data integrity. They can be run to check the integrity of the data stored and to repair problems that are found. They can also handle tasks like rebalancing where data is stored to avoid access hotspots and the like.

As he said at the outset, Ligon's talk provided a high-level overview of the filesystem. It is not a particularly well-known filesystem, but it has some interesting attributes. Beyond just handling large data sets for parallel computation, it is also targeted as a research platform that can be used to test ideas for enhancements or broad restructuring. The kernel patches did not receive any comments, but they are also fairly small (less than 10,000 lines of code), so it seems plausible that we will see an OrangeFS client land in the mainline sometime in the future.

[I would like to thank the Linux Foundation for travel support to Boston for Vault.]


Improving kernel string handling

By Jonathan Corbet
May 6, 2015
The handling and parsing of string data has long been acknowledged as a fertile breeding ground for bugs and security issues; that is doubly true when the C language — whose string model leaves a bit to be desired — is in use. Various attempts have been made to improve C string handling, both in the kernel and in user space, but few think that the problem has been solved. A couple of current projects may improve the situation on the kernel side, though.

String copying

The venerable strcpy() family of functions has long been seen as error-prone and best avoided. In most settings, they are replaced with functions like strncpy() or strlcpy(). The last time your editor wrote about criticisms of strlcpy(), he was treated to a long series of incendiary emails from one of its supporters. So, for the purposes of this article, suffice to say that not all developers are fond of those functions. Even so, the kernel contains implementations of both, and there are over 1,000 call sites for each.

That doesn't mean that there isn't room for improvement, though. Chris Metcalf thinks he has an improvement in the form of the proposed strscpy() API, which provides two new functions:

    ssize_t strscpy(char *dest, const char *src, size_t count);
    ssize_t strscpy_truncate(char *dest, const char *src, size_t count);

As with similar functions, strscpy() will copy a maximum of count bytes from src to dest, but it differs in the details. The return value in this case is the number of bytes copied, unless the source string is longer than count bytes; in that case, the return value will be -E2BIG instead. Another difference is that, in the overflow case, dest will be set to the empty string rather than a truncated version of src.

This behavior is designed to make overflows as obvious as possible and to prevent code from blithely proceeding with a truncated string. When questioned on this behavior, Chris justified it this way:

1. A truncated string with an error return may still cause program errors, even if the caller checks for the error return, if the buffer is later interpreted as a valid string due to some other program error. It's defensive programming.

2. Programmers are fond of ignoring error returns. My experience with truncated strings is that in too many cases, truncation causes program errors down the line. It's better to ensure that no partial string is returned in this case.

In a perfect world, all error returns would be checked, and there would be no need for this, but we definitely don't live in that world :-)

For cases where the code can handle a truncated string, strscpy_truncate() can be used. Its return value convention is the same, but it will fit as much of the string as possible (null-terminated) in dest.

Integer parsing

The kernel must often turn strings into integer values; the interpretation of numbers written to sysfs files or found on the kernel command line are a couple of obvious examples. This parsing can be done with functions like simple_strtoul() (which decodes a string to an unsigned long), but they were marked as being obsolete in 2011. The checkpatch script complains about their use, but there are still about 1,000 call sites in the kernel. Current advice is to use kstrtoul() and the better part of a dozen variants, also added in 2011. There are almost 2,000 uses of these functions in the kernel, but Alexey Dobriyan thinks we can do better.

Alexey has a few complaints about the current APIs. One of the reasons for moving beyond the simple_strto*() functions was that they would silently stop conversion at a non-digit character — "123abc" would be successfully converted to 123. That is the sort of behavior for which PHP is roundly criticized, but, Alexey says, there are times when it is useful. He gives the parsing of device numbers (usually given in the "major:minor" format) as an example. The kstrto*() family cannot easily be used for that kind of parsing, but there are plenty of reasons to not go back to simple_strto*() for that kind of work.

His suggestion is the addition of a new function:

    int parse_integer(const char *s, unsigned int base, <type> *val);

In truth, parse_integer() is not a function; it is instead a rather unsightly macro that arranges to do the right thing for a wide variety of types for val. So, if val is an unsigned short, the decoding will be done on an unsigned basis and will be checked to ensure that the resulting value does not exceed the range of a short.

A successful decoding will cause the result to be placed in val; the number of characters decoded will come back as the return value. If it is expected that the entire string will be decoded, a quick check to see whether s[return_value] is a null byte can verify that. Otherwise, parsing of the string can continue from the indicated point. If the base is ORed with the undocumented value PARSE_INTEGER_NEWLINE, a final newline character will be skipped over — useful for parsing input to sysfs files. If no characters at all are converted, the return value will be -EINVAL; an overflow will return -ERANGE instead.

Alexey's patch set turns the kstrto*() functions into calls to parse_integer(); it also converts a number of simple_strto*() calls to direct parse_integer() calls. The end result is an apparent simplification of the code and net reduction in lines of code.

Whether either of these patch sets will find its way into the kernel is not entirely clear; kernel developers do not, in general, tend to get too excited about string-parsing functions. In both cases, though, the potential exists for improvements to the massive amounts of parsing code found in the kernel while simultaneously making it simpler. In the end, most developers will find it hard to argue against something like that.


Patches and updates

Kernel trees

Linus Torvalds: Linux 4.1-rc2
Jiri Slaby: Linux 3.12.42
Jiri Slaby: Linux 3.12.41


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds