
Kernel development

Brief items

Kernel release status

The current development kernel is 3.8-rc4, released on January 17. Linus was "late" a day in releasing it, which sent him on a mission to figure out which day was the most common for releases (Sunday). "Anyway, with that digression, I can happily report that -rc4 is smaller than -rc3 despite the extra day, although not by much. There's not really a whole lot that stands out: apart from one new wireless driver (the Atheros Wilocity driver) and some OMAP drm changes, the diffstat looks pretty flat and spread out. Which just means lots of small changes all over."

Stable updates were not in short supply this week. 3.7.3, 3.4.26, 3.0.59, and 2.6.34.14 were all released on January 17; the 2.6.34.14 announcement carried a warning that updates for this kernel will cease in the near future. 3.7.4, 3.4.27 and 3.0.60 were released on January 21.


Quotes of the week

I'm leaving the Linux world and Intel for a bit for family reasons. I'm aware that "family reasons" is usually management speak for "I think the boss is an asshole" but I'd like to assure everyone that while I frequently think Linus is an asshole (and therefore very good as kernel dictator) I am departing quite genuinely for family reasons and not because I've fallen out with Linus or Intel or anyone else.
— Best wishes, Alan Cox, we'll miss you

Yes, it's very unlikely, but we are in the business of dealing with the very unlikely. That's because in our business, the very unlikely is very likely. Damn, I need to buy a lotto ticket!
Steven Rostedt

About the only thing Kernel developers agree on is they use C and don't comment their code.
Tom St Denis

Documentation is generally considered a good thing, but few people can be bothered to write it, and few of the other people that should read it actually do.
Arnd Bergmann


Long-term support initiative 3.4 kernel released

The Long-Term Support Initiative helps to provide support for selected kernels for a two-year period. But the project has also intended to release additional kernels aimed at the needs of the consumer electronics industry. That has come about with the announcement of the release of the LTSI 3.4 kernel. It is based on 3.4.25, but with an improved CMA memory allocator, the out-of-tree AF_BUS protocol implementation, and a backport of the CoDel queue management algorithm, along with various hardware enablement patches and other useful bits of code.


Kernel development news

Supporting variable-sized huge pages

By Michael Kerrisk
January 23, 2013

Huge pages are an optimization technique designed to increase virtual memory performance. The idea is that instead of a traditional small virtual memory page size (4 kB on most architectures), an application can employ (much) larger pages (e.g., 2 MB or 1 GB on x86-64). For applications that can make full use of larger pages, huge pages provide a number of performance benefits. First, a single page fault can fault in a large block of memory. Second, larger page sizes equate to shallower page tables, since fewer page-table levels are required to span the same range of virtual addresses; consequently, less time is required to traverse page table entries when translating virtual addresses to physical addresses. Finally, and most significantly, since entries for huge pages in the translation lookaside buffer (TLB) span much greater address ranges, there is an increased chance that a virtual address already has a match in one of the limited set of entries currently cached in the TLB, thus obviating the need to traverse page tables.

Applications can explicitly request the use of huge pages when making allocations, using either shmget() with the SHM_HUGETLB flag (since Linux 2.6.0) or mmap() with the MAP_HUGETLB flag (since Linux 2.6.32). It's worth noting that explicit application requests are not needed to employ huge pages: the transparent huge pages feature merged in Linux 2.6.38 allows applications to gain much of the performance benefit of huge pages without making any changes to application code. There is, however, a limitation to these APIs: they provide no way to specify the size of the huge pages to be used for an allocation. Instead, the kernel employs the "default" huge page size.

Some architectures only permit one huge page size; on those architectures, the default is in fact the only choice. However, some modern architectures permit multiple huge page sizes, and where the system administrator has configured the system to provide huge page pools of different sizes, applications may want to choose the page size used for their allocation. For example, this may be useful in a NUMA environment, where a smaller huge page size may be suitable for mappings that are shared across CPUs, while a larger page size is used for mappings local to a single CPU.

A patch by Andi Kleen that was accepted during the 3.8 merge window extends the shmget() and mmap() system calls to allow the caller to select the size used for huge page allocations. These system calls have the following prototypes:

    void *mmap(void *addr, size_t length, int prot, int flags,
               int fd, off_t offset);
    int shmget(key_t key, size_t size, int shmflg);

Neither of those calls provides an argument that can be directly used to specify the desired page size. Therefore, Andi's patch shoehorns the value into some bits that are currently unused in one of the arguments of each call—in the flags argument for mmap() and in the shmflg argument for shmget().

In both system calls, the huge page size is encoded in the six bits from 26 through to 31 (i.e., the bit mask 0xfc000000). The value in those six bits is the base-two log of the desired page size. As a special case, if the value encoded in the bits is zero, then the kernel selects the default huge page size. This provides binary backward compatibility for the interfaces. If the specified page size is not supported by the architecture, then shmget() and mmap() fail with the error ENOMEM.
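The arithmetic behind this encoding can be sketched in a few lines of C. This is only an illustration of the calculation described above; the helper huge_size_flag() and the TOY_* constants are names invented for this example and are not part of Andi's patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HUGE_SHIFT 26      /* hypothetical name: lowest of the six bits */
#define TOY_HUGE_MASK  0x3fu   /* hypothetical name: the field is six bits wide */

/*
 * Return the flag bits that encode a power-of-two page size:
 * the base-two log of the size, shifted into bits 26..31.
 */
unsigned int huge_size_flag(size_t page_size)
{
    unsigned int log2 = 0;

    /* The size must be a nonzero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    while ((page_size >>= 1) != 0)
        log2++;

    return (log2 & TOY_HUGE_MASK) << TOY_HUGE_SHIFT;
}
```

For a 2 MB page the helper yields 21 shifted left by 26 bits, and for a 1 GB page it yields 30 shifted left by 26 bits; both values fit inside the 0xfc000000 mask.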

An application can manually perform the required base-two log calculation and bit shift to generate the required bit-mask value, but this is clumsy. Instead, an architecture can define suitable constants for the huge page sizes that it supports. Andi's patch defines two such constants corresponding to the available page sizes on x86-64:

    #define SHM_HUGE_SHIFT  26
    #define SHM_HUGE_MASK   0x3f
    /* Flags are encoded in bits (SHM_HUGE_MASK << SHM_HUGE_SHIFT) */

    #define SHM_HUGE_2MB    (21 << SHM_HUGE_SHIFT)   /* 2 MB huge pages */
    #define SHM_HUGE_1GB    (30 << SHM_HUGE_SHIFT)   /* 1 GB huge pages */

Corresponding MAP_* constants are defined for use in the mmap() system call.

Thus, to employ a 2 MB huge page size when calling shmget(), one would write:

    shmget(key, size, flags | SHM_HUGETLB | SHM_HUGE_2MB);

That is, of course, the same as this manually calculated version:

    shmget(key, size, flags | SHM_HUGETLB | (21 << SHM_HUGE_SHIFT));

In passing, it's worth noting that an application can determine the default page size by looking at the Hugepagesize entry in /proc/meminfo and can, if the kernel was configured with CONFIG_HUGETLBFS, discover the available page sizes on the system by scanning the directory entries under /sys/kernel/mm/hugepages.
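As a rough sketch of that discovery step, the following C code scans /sys/kernel/mm/hugepages and extracts the size from each entry. It assumes the entries are named in the form "hugepages-2048kB"; the function names parse_hugepage_dir() and list_hugepage_sizes() are invented for this example.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Parse a directory name of the form "hugepages-<N>kB".
 * Returns 1 and stores N (in kB) on success, 0 otherwise.
 */
int parse_hugepage_dir(const char *name, unsigned long *size_kb)
{
    assert(name != NULL && size_kb != NULL);
    return sscanf(name, "hugepages-%lukB", size_kb) == 1;
}

/*
 * Print each huge page size found under /sys/kernel/mm/hugepages.
 * Returns the number of sizes found, or -1 if the directory cannot
 * be opened (for example, on a kernel without hugetlbfs support).
 */
int list_hugepage_sizes(void)
{
    DIR *dir = opendir("/sys/kernel/mm/hugepages");
    struct dirent *ent;
    unsigned long kb;
    int count = 0;

    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        if (parse_hugepage_dir(ent->d_name, &kb)) {
            printf("%lu kB\n", kb);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

An application could call list_hugepage_sizes() at startup and fall back to the default huge page size when it returns -1 or 0.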

One concern raised by your editor when reviewing an earlier version of Andi's patch was whether the bit space in the mmap() flags argument is becoming exhausted. Exactly how many bits are still unused in that argument turns out to be a little difficult to determine, because different architectures define the same flags with different values. For example, the MAP_HUGETLB flag has the values 0x4000, 0x40000, 0x80000, or 0x100000, depending on the architecture. It turns out that before Andi's patch was applied, there were only around 11 bits in flags that were unused across all architectures; now that the patch has been applied, just six are left.

The day when the mmap() flags bit space is exhausted seems to be slowly but steadily approaching. When that happens, either a new mmap()-style API with a 64-bit flags argument will be required, or, as Andi suggested, unused bits in the prot argument could be used; the latter option would be easier to implement, but would also further muddy the interface of an already complex system call. In any case, concerns about the API design didn't stop Andrew Morton from accepting the patch, although he was prompted to remark "I can't say the userspace interface is a thing of beauty, but I guess we'll live."

The new API features will roll out in a few weeks' time with the 3.8 release. At that point, application writers will be able to select different huge page sizes for different memory allocations. However, it will take a little longer before the MAP_* and SHM_* page size constants percolate through to the GNU C library. In the meantime, programmers who are in a hurry will have to define their own versions of these constants.
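A program that wants to use the feature before the C library catches up might define the constants itself along the following lines. This is a minimal sketch: the MAP_HUGE_SHIFT and MAP_HUGE_2MB names follow the MAP_* pattern mentioned above and should be checked against the installed kernel headers, and the wrapper map_2mb_huge() is invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and MAP_HUGETLB */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback definitions, used only if the system headers lack them. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif

/*
 * Try to map `length` bytes of anonymous memory backed by 2 MB huge pages.
 * Returns MAP_FAILED if the request cannot be satisfied, for example
 * because no 2 MB huge page pool has been configured.
 */
void *map_2mb_huge(size_t length)
{
    assert(length > 0);
    return mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                -1, 0);
}
```

Callers should always compare the result against MAP_FAILED, since the availability of a given huge page size depends on the architecture and on how the administrator has configured the pools.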


GPIO in the kernel: future directions

By Jonathan Corbet
January 23, 2013

Last week's article covered the kernel's current internal API for general-purpose I/O (GPIO) lines. The GPIO API has seen relatively little change in recent years, but that situation may be about to change as the result of a couple of significant patch sets that seek to rework how the GPIO API works in the interest of greater robustness and better performance.

No more numbers

The current GPIO API relies on simple integers to identify specific GPIO lines. It works, but there are some shortcomings to this approach. Kernel code is rarely interested in "GPIO #37"; instead, it wants "the GPIO connected to the monitor's DDC line" or something to that effect. For well-defined systems where the use of GPIO lines never changes, preprocessor definitions can be used to identify lines, but that approach falls apart when the same GPIO can be put to different uses in different systems. As hardware gets more dynamic, with GPIOs possibly showing up at any time, there is no easy way to know which GPIO goes where. It can be easy to get the wrong one by mistake.

As a result, platform and driver developers have come up with various ways to locate GPIOs of interest. Even your editor once submitted a patch adding a gpio_lookup() function to the GPIO API, but that patch didn't pass muster and was eventually dropped in favor of a driver-specific solution. So the number-based API has remained — until now.

Alexandre Courbot's descriptor-based GPIO interface seeks to change the situation by introducing a new struct gpio_desc * pointer type. GPIO lines would be represented by one of these pointers; what lives behind the pointer would be hidden from GPIO users, though. Internally, gpiolib (the implementation of the GPIO API used by most architectures) is refactored to use descriptors rather than numbers, and a new set of functions is presented to users. These functions will look familiar to users of the current GPIO API:

    #include <linux/gpio/consumer.h>

    int gpiod_direction_input(struct gpio_desc *desc);
    int gpiod_direction_output(struct gpio_desc *desc, int value);
    int gpiod_get_value(struct gpio_desc *desc);
    void gpiod_set_value(struct gpio_desc *desc, int value);
    int gpiod_to_irq(struct gpio_desc *desc);
    int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
    int gpiod_export_link(struct device *dev, const char *name,
			  struct gpio_desc *desc);
    void gpiod_unexport(struct gpio_desc *desc);

In short: the gpio_ prefix on the existing GPIO functions has been changed to gpiod_ and the integer GPIO number argument is now a struct gpio_desc *. There is also a new include file for the new functions; otherwise the interfaces are identical. The existing, integer-based API still exists, but it has been reimplemented as a layer on top of the descriptor-based API shown here.

What is missing from the above list, though, is any way of obtaining a descriptor for a GPIO line in the first place. One way to do that is to get the descriptor from the traditional GPIO number:

    struct gpio_desc *gpio_to_desc(unsigned gpio);

There is also a desc_to_gpio() for going in the opposite direction. Using this function makes it easy to transition existing code over to the new API. Obtaining a descriptor in this manner will ensure that no code accesses a GPIO without having first properly obtained a descriptor for it, but it would be better to do away with the numbers altogether in favor of a more robust way of looking up GPIOs. The patch set adds this functionality in this form:

    struct gpio_desc *gpiod_get(struct device *dev, const char *name);

Here, dev should be the device providing the GPIO line, and "name" describes that line. The dev pointer is needed to disambiguate the name, and because code accessing a GPIO line should know which device it is working through in any case. So, for example, a video acquisition bridge device may need access to GPIO lines with names like "sensor-power", "sensor-reset", "sensor-i2c-clock" and "sensor-i2c-data". The driver could then request those lines by name with gpiod_get() without ever having to be concerned with numbers.

Needless to say, there is a gpiod_put() for releasing access to a GPIO line.

The actual association of names with GPIO lines can be done by the driver that implements those lines, if the names are static and known. In many cases, though, the routing of GPIO lines will have been done by whoever designed a specific system-on-chip or board; there is no way for the driver author to know ahead of time how a specific system may be wired. In this case, the names of the GPIO lines will most likely be specified in the device tree, or, if all else fails, in a platform data structure.

The response to this interface is generally positive; it seems almost certain that it will be merged in the near future. The biggest remaining concern, perhaps, is that the descriptor interface is implemented entirely within the gpiolib layer. Most architectures use gpiolib to implement the GPIO interface, but it is not mandatory; in some cases, the gpio_* functions are implemented as macros that access the device registers directly. Such an implementation is probably more efficient, but GPIO is not usually a performance-critical part of the system. So there may be pressure for all architectures to move to gpiolib; that, in turn, would facilitate the eventual removal of the number-based API entirely.

Block GPIO

The GPIO interface as described so far is focused on the management of individual GPIO lines. But GPIOs are often used together as a group. As a simple example, consider a pair of GPIOs used as an I2C bus; one line handles data, the other the clock. A bit-banging driver can manage those two lines together to communicate with connected I2C devices; the kernel contains a driver in drivers/i2c/busses/i2c-gpio.c for just this purpose.

Most of the time, managing GPIOs individually, even when they are used as a group, works fine. Computers are quite fast relative to the timing requirements of most of the serial communications protocols that are subject to implementation with GPIO. But there are exceptions, especially when the hardware implementing the GPIO lines themselves is slow; that can make it hard to change multiple lines simultaneously. Sometimes, though, the hardware can change lines simultaneously if properly asked; often the lines are represented by bits in the same device register and can all be changed together with a single I/O memory write operation.

Roland Stigge's block GPIO patch set is an attempt to make that functionality available in the kernel. Code that needs to manipulate multiple GPIOs as a group would start by associating them in a single block with:

    struct gpio_block *gpio_block_create(unsigned int *gpios, size_t size,
				     	 const char *name);

gpios points to an array of size GPIO numbers which are to be grouped into a block; the given name can be used to work with the block from user space. The GPIOs should have already been requested with gpio_request(); they also need to have their direction set individually. It's worth noting that the GPIOs need not be located on the same hardware; if they are spread out, or if the underlying driver does not implement the internal block API, the block GPIO interface will just access those lines individually as is done now.

Manipulation of GPIO blocks is done with:

    unsigned long gpio_block_get(struct gpio_block *block, unsigned long mask);
    void gpio_block_set(struct gpio_block *block, unsigned long mask,
		    	unsigned long values);

For both functions, block is a GPIO block created as described above, and mask is a bitmask specifying which GPIOs in the block are to be acted upon; each bit in mask enables the corresponding GPIO in the array passed to gpio_block_create(). This API implies that the number of bits in a long imposes an upper bound on the number of lines grouped into a GPIO block; that seems unlikely to be a problem in real-world use. gpio_block_get() will read the specified lines, simultaneously if possible, and return a bitmask with the result. The lines in a GPIO block can be set as a unit with gpio_block_set().
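The mask semantics may be easier to see in a small user-space model. The code below is not the kernel implementation and touches no hardware; it simulates the lines with an array, and every toy_* name is invented for this example. It also assumes that bit i of values applies to the same array entry as bit i of mask.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_LINES 64

/* Simulated line states, indexed by GPIO number (a stand-in for hardware). */
static int toy_lines[TOY_NUM_LINES];

/* Hypothetical stand-in for the opaque struct gpio_block. */
struct toy_block {
    const unsigned int *gpios;  /* GPIO numbers, in block order */
    size_t size;                /* number of entries in gpios */
};

/* Set the lines selected by mask; bit i refers to gpios[i]. */
void toy_block_set(const struct toy_block *b, unsigned long mask,
                   unsigned long values)
{
    /* A block cannot hold more lines than there are bits in a long. */
    assert(b->size <= sizeof(unsigned long) * 8);

    for (size_t i = 0; i < b->size; i++) {
        if (mask & (1UL << i))
            toy_lines[b->gpios[i]] = (int)((values >> i) & 1);
    }
}

/* Read the lines selected by mask into the matching result bits. */
unsigned long toy_block_get(const struct toy_block *b, unsigned long mask)
{
    unsigned long result = 0;

    for (size_t i = 0; i < b->size; i++) {
        if ((mask & (1UL << i)) && toy_lines[b->gpios[i]])
            result |= 1UL << i;
    }
    return result;
}
```

With a block built from GPIOs 5, 9, and 12, a call with mask 0x5 and values 0x7 changes only GPIOs 5 and 12; GPIO 9 is left alone because its mask bit is clear.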

A GPIO block is released with:

    void gpio_block_free(struct gpio_block *block);

There is also a pair of registration functions:

    int gpio_block_register(struct gpio_block *block);
    void gpio_block_unregister(struct gpio_block *block);

Registering a GPIO block makes it available to user space. There is a sysfs interface that can be used to query and set the GPIOs in a block. Interestingly, registration also creates a device node (using the name provided to gpio_block_create()); reading from that device returns the current state of the GPIOs in the block, while writing to it will set the GPIOs accordingly. There is an ioctl() operation (which, strangely, uses zero as the command number) to set the mask to be used with read and write operations.

This patch set has not generated as much discussion as the descriptor-based API patches (it is also obviously not yet integrated with the descriptor API). Most likely, relatively few developers have felt the need for a block-based API. That said, there are cases when it is likely to be useful, and there appears to be no opposition, so this API can eventually be expected to be merged as well.


Making EPERM friendlier

By Michael Kerrisk
January 19, 2013

Error reporting from the kernel (and low-level system libraries such as the C library) has been a primitive affair since the earliest UNIX systems. One of the consequences of this is that end users and system administrators often encounter error messages that provide quite limited information about the cause of the error, making it difficult to diagnose the underlying problem. Some recent discussions on the libc-alpha and Linux kernel mailing lists were started by developers who would like to improve this state of affairs by having the kernel provide more detailed error information to user space.

The traditional UNIX (and Linux) method of error reporting is via the (per-thread) global errno variable. The C library wrapper functions that invoke system calls indicate an error by returning -1 as the function result and setting errno to a positive integer value that identifies the cause of the error.

The fact that errno is a global variable is a source of complications for user-space programs. Because each system call may overwrite the global value, it is sometimes necessary to save a copy of the value if it needs to be preserved while making another system call. The fact that errno is global also means that signal handlers that make system calls must save a copy of errno on entry to the handler and restore it on exit, to prevent the possibility of overwriting an errno value that had previously been set in the main program.
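The save-and-restore pattern for signal handlers looks roughly like the following. This is a minimal sketch using standard POSIX calls; the function names are invented for this example, and close(-1) is used only as a convenient way to make a system call fail inside the handler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigaction() */
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t handler_runs;

/* Handler that makes a failing system call but preserves errno. */
static void on_sigusr1(int sig)
{
    int saved_errno = errno;    /* save on entry */

    (void)sig;
    close(-1);                  /* fails and sets errno to EBADF */
    handler_runs++;

    errno = saved_errno;        /* restore on exit */
}

/* Install the handler; returns 0 on success, -1 on error. */
int install_sigusr1_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Returns 1 if an errno value set before the signal survives the handler. */
int errno_survives_signal(void)
{
    int after;

    if (install_sigusr1_handler() != 0)
        return 0;

    handler_runs = 0;
    errno = ENOENT;
    raise(SIGUSR1);             /* the handler runs before raise() returns */
    after = errno;

    assert(handler_runs == 1);
    return after == ENOENT;
}
```

Without the two lines that save and restore errno, the main program would see EBADF from the handler's close() call instead of the value it had set itself.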

Another problem with errno is that the information it reports is rather minimal: one of somewhat more than one hundred integer codes. Given that the kernel provides hundreds of system calls, many of which have multiple error cases, the mapping of errors to errno values inevitably means a loss of information.

That loss of information can be particularly acute when it comes to certain commonly used errno values. In a message to the libc-alpha mailing list, Dan Walsh explained the problem for two errors that are frequently encountered by end users:

Traditionally, if a process attempts a forbidden operation, errno for that thread is set to EACCES or EPERM, and a call to strerror() returns a localized version of "Permission Denied" or "Operation not permitted". This string appears throughout textual uis and syslogs. For example, it will show up in command-line tools, in exceptions within scripting languages, etc.

Those two errors have been defined on UNIX systems since early times. POSIX defines EACCES as "an attempt was made to access a file in a way forbidden by its file access permissions" and EPERM as "an attempt was made to perform an operation limited to processes with appropriate privileges or to the owner of a file or other resource". These definitions were fairly comprehensible on early UNIX systems, where the kernel was much less complex, the only method of controlling file access was via classical rwx file permissions, and the only kind of privilege separation was via user and group IDs and superuser versus non-superuser. However, life is rather more complex on modern UNIX systems.

In all, EPERM and EACCES are returned by more than 3000 locations across the Linux 3.7 kernel source code. However, it is not so much the number of return paths yielding these errors that is the problem. Rather, the problem for end users is determining the underlying cause of the errors. The possible causes are many, including denial of file access because of insufficient (classical) file permissions or because of permissions in an ACL, lack of the right capability, denial of an operation by a Linux Security Module or by the seccomp mechanism, and any of a number of other reasons. Dan summarized the problem faced by the end user:

As we continue to add mechanisms for the Kernel to deny permissions, the Administrator/User is faced with just a message that says "Permission Denied" Then if the administrator is lucky enough or skilled enough to know where to look, he might be able to understand why the process was denied access.

Dan's mail linked to a wiki page ("Friendly EPERM") with a proposal on how to deal with the problem. That proposal involves changes to both the kernel and the GNU C library (glibc). The kernel changes would add a mechanism for exposing a "failure cookie" to user space that would provide more detailed information about the error delivered in errno. On the glibc side, strerror() and related calls (e.g., perror()) would access the failure cookie in order to obtain information that could be used to provide a more detailed error message to the user.

Roland McGrath was quick to point out that the solution is not so simple. The problem is that it is quite common for applications to call strerror() only some time after a failed system call, or to do things such as saving errno in a temporary location and then restoring it later. In the meantime, the application is likely to have performed further system calls that may have changed the value of the failure cookie.
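The pattern Roland describes is easy to reproduce. In the sketch below (the function name is invented for this example), the error from open() is saved, an unrelated system call then fails, and only afterward is the message formatted. The saved integer is all that survives; any per-thread state that the kernel updated on each failure might by then describe a different call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to open `path`; on failure, build the error message only after
 * another system call has failed in the meantime. Returns 0 on success,
 * or the errno value from open().
 */
int open_and_report_late(const char *path, char *msg, size_t msg_len)
{
    int fd = open(path, O_RDONLY);
    int saved_errno;

    assert(msg != NULL && msg_len > 0);
    if (fd >= 0) {
        close(fd);
        msg[0] = '\0';
        return 0;
    }
    saved_errno = errno;        /* the only record of what went wrong */

    close(-1);                  /* unrelated failure: errno is now EBADF */

    /* strerror() sees just the saved integer, long after the fact. */
    snprintf(msg, msg_len, "%s: %s", path, strerror(saved_errno));
    return saved_errno;
}
```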

Roland went on to identify some of the problems inherent in trying to extend existing standardized interfaces in order to provide useful error information to end users:

It is indeed an unfortunate limitation of POSIX-like interfaces that error reporting is limited to a single integer. But it's very deeply ingrained in the fundamental structure of all Unix-like interfaces.

Frankly, I don't see any practical way to achieve what you're after. In most cases, you can't even add new different errno codes for different kinds of permission errors, because POSIX specifies the standard code for certain errors and you'd break both standards compliance and all applications that test for standard errno codes to treat known classes of errors in particular ways.

In response, Eric Paris, one of the other proponents of the failure-cookie idea, acknowledged Roland's points, noting that since the standard APIs can't be extended, changes would be required to each application that wanted to take advantage of any additional error information provided by the kernel.

Eric subsequently posted a note to the kernel mailing list with a proposal on the kernel changes required to support improved error reporting. In essence, he proposes exposing some form of binary structure to user space that describes the cause of the last EPERM or EACCES error returned to the process by the kernel. That structure might, for example, be exposed via a thread-specific file in the /proc filesystem.

The structure would take the form of an initial field that indicates the subsystem that triggered the error—for example, capabilities, SELinux, or file permissions—followed by a union of substructures that provide subsystem-specific detail on the circumstances that triggered the error. Thus, for a file permissions error, the substructure might return the effective user and group ID of the process, the file user ID and group ID, and the file permission bits. At the user-space level, the binary structure could be read and translated to human-readable strings, perhaps via a glibc function that Eric suggested might be named something like get_extended_error_info().
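To make the shape of such a structure concrete, here is one way it might be laid out and translated to text. Everything in this sketch is hypothetical: no such kernel ABI exists, the field selection merely follows the description above, and all of the toy_* names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical: which subsystem triggered the error. */
enum toy_err_source {
    TOY_ERR_FILE_PERMS,
    TOY_ERR_CAPABILITY
};

/* Hypothetical record: a source field followed by a union of details. */
struct toy_err_info {
    enum toy_err_source source;
    union {
        struct {
            uid_t euid, file_uid;
            gid_t egid, file_gid;
            mode_t mode;        /* file permission bits */
        } file;
        struct {
            int cap;            /* number of the missing capability */
        } capability;
    } u;
};

/* Translate the binary record into a human-readable string. */
int toy_format_err_info(const struct toy_err_info *info,
                        char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);

    switch (info->source) {
    case TOY_ERR_FILE_PERMS:
        return snprintf(buf, len,
                        "file permissions: euid=%lu egid=%lu "
                        "owner=%lu group=%lu mode=%04lo",
                        (unsigned long)info->u.file.euid,
                        (unsigned long)info->u.file.egid,
                        (unsigned long)info->u.file.file_uid,
                        (unsigned long)info->u.file.file_gid,
                        (unsigned long)info->u.file.mode);
    case TOY_ERR_CAPABILITY:
        return snprintf(buf, len, "missing capability %d",
                        info->u.capability.cap);
    }
    return snprintf(buf, len, "unknown source");
}
```

The leading source field tells the reader which member of the union is valid, which is what would allow new subsystems to be added later without changing the meaning of existing records.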

Each of the kernel call sites that returned an EPERM or EACCES error would then need to be patched to update this information. But, patching all of those call sites would not be necessary to make the feature useful. As Eric noted:

But just getting extended denial information in a couple of the hot spots would be a huge win. Put it in capable(), LSM hooks, the open() syscall and path walk code.

There were various comments on Eric's proposal. In response to concerns from Stephen Smalley that this feature might leak information (such as file attributes) that could be considered sensitive in systems with a strict security policy (enforced by an LSM), Eric responded that the system could provide a sysctl to disable the feature:

I know many people are worried about information leaks, so I'll right up front say lets add the sysctl to disable the interface for those who are concerned about the metadata information leak. But for most of us I want that data right when it happens, where it happens, so It can be exposed, used, and acted upon by the admin trying to troubleshoot why the shit just hit the fan.

Reasoning that it's best to use an existing format and its tools rather than inventing a new format for error reporting, Casey Schaufler suggested that audit records should be used instead:

the string returned by get_extended_error_info() ought to be the audit record the system call would generate, regardless of whether the audit system would emit it or not. If the audit record doesn't have the information you need we should fix the audit system to provide it. Any bit of the information in the audit record might be relevant, and your admin or developer might need to see it.

Eric expressed concerns that copying an audit record to the process's task_struct would carry more of a performance hit than copying a few integers to that structure, concluding:

I don't see a problem storing the last audit record if it exists, but I don't like making audit part of the normal workflow. I'd do it if others like that though.

Jakub Jelinek wondered which system call Eric's mechanism should return information about, and whether its state would be reset if a subsequent system call succeeded. In many cases, there is no one-to-one mapping between C library calls and system calls, so that some library functions may make one system call, save errno, then make some other system call (that may or may not also fail), and then restore the first system call's errno before returning to the caller. Other C library functions themselves set errno. "So, when would it be safe to call this new get_extended_error_info function and how to determine to which syscall it was relevant?"

Eric's opinion was that the mechanism should return information about the last kernel system call. "It would be really neat for libc to have a way to save and restore the extended errno information, maybe even supply its own if it made the choice in userspace, but that sounds really hard for the first pass."

However, there are problems with such a bare-bones approach. If the value returned by get_extended_error_info() corresponds to the last system call, rather than the errno value actually returned to user space, this risks confusing user-space applications (and users). Carlos O'Donell, who had earlier raised some of the same questions as Jakub and pointed out the need to properly handle the extended error information when a signal handler interrupts the main program, agreed with Casey's assessment that get_extended_error_info() should always return a value that corresponds to the current content of errno. That implies the need for a user-space function that can save and restore the extended error information.

Finally, David Gilbert suggested that it would be useful to broaden Eric's proposal to handle errors beyond EPERM and EACCES. "I've wasted way too much time trying to figure out why mmap (for example) has given me an EINVAL; there are just too many holes you can fall into."

In the last few days, discussion in the thread has gone quiet. However, it's clear that Dan and Eric have identified a very real and practical problem (and one that has been identified by others in the past). The solution would probably need to address the concerns raised in the discussion—most notably the need to have get_extended_error_info() always correspond to the current value of errno—and might possibly also be generalized beyond EPERM and EACCES. However, that should all be feasible, assuming someone takes on the (not insignificant) work of fleshing out the design and implementing it. If they do, the lives of system administrators and end users should become considerably easier when it comes to diagnosing the causes of software error reports.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.8-rc4
Greg KH Linux 3.7.4
Greg KH Linux 3.7.3
Greg KH Linux 3.4.27
Greg KH Linux 3.4.26
Steven Rostedt 3.4.25-rt37
Steven Rostedt 3.2.37-rt55
Greg KH Linux 3.0.60
Greg KH Linux 3.0.59
Steven Rostedt 3.0.58-rt83
Paul Gortmaker Linux 2.6.34.14

Miscellaneous

Douglas Gilbert sg3_utils-1.35 available
Theodore Ts'o Release of E2fsprogs 1.42.7

Page editor: Jonathan Corbet


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds