Brief items
The current development kernel is 3.8-rc4,
released on January 17. Linus was a day "late"
in releasing it, which sent him on a mission to figure out which day
was the most common for releases (Sunday). "Anyway, with that
digression, I can happily report that -rc4 is smaller than -rc3 despite the
extra day, although not by much. There's not really a whole lot that stands
out: apart from one new wireless driver (the Atheros Wilocity driver) and
some OMAP drm changes, the diffstat looks pretty flat and spread out. Which
just means lots of small changes all over."
Stable updates were not in short supply this week.
3.7.3,
3.4.26,
3.0.59, and
2.6.34.14 were all released on
January 17; the 2.6.34.14 announcement carried a warning that updates
for this kernel will cease in the near future.
3.7.4, 3.4.27 and 3.0.60 were released on January 21.
Comments (none posted)
I'm leaving the Linux world and Intel for a bit for family
reasons. I'm aware that "family reasons" is usually management
speak for "I think the boss is an asshole" but I'd like to assure
everyone that while I frequently think Linus is an asshole (and
therefore very good as kernel dictator) I am departing quite
genuinely for family reasons and not because I've fallen out with
Linus or Intel or anyone else.
Best wishes,
Alan
— Alan Cox, we'll miss you
Yes, it's very unlikely, but we are in the business of dealing with
the very unlikely. That's because in our business, the very
unlikely is very likely. Damn, I need to buy a lotto ticket!
—
Steven Rostedt
About the only thing Kernel developers agree on is they use C and
don't comment their code.
—
Tom St Denis
Documentation is generally considered a good thing, but few people
can be bothered to write it, and few of the other people that
should read it actually do.
—
Arnd Bergmann
Comments (none posted)
The
Long-Term Support Initiative helps to
provide support for selected kernels for a two-year period. But the
project has also intended to release additional kernels aimed at the needs
of the consumer electronics industry. That has come about with the
announcement
of the release of the LTSI 3.4 kernel. It is based on 3.4.25, but with
an improved
CMA memory allocator, the
out-of-tree
AF_BUS protocol implementation,
and a backport of the
CoDel queue management
algorithm, along with various hardware enablement patches and other
useful bits of code.
Comments (14 posted)
Kernel development news
By Michael Kerrisk
January 23, 2013
Huge pages are an optimization
technique designed to increase virtual memory performance. The idea is that
instead of a traditional small virtual memory page size (4 kB on most
architectures), an application can employ (much) larger pages (e.g., 2 MB
or 1 GB on x86-64). For applications that can make full use of larger pages,
huge pages provide a number of performance benefits. First, a single page
fault can fault in a large block of memory. Second, larger page sizes
equate to shallower page tables, since fewer page-table levels are required
to span the same range of virtual addresses; consequently, less time is
required to traverse page table entries when translating virtual addresses
to physical addresses. Finally, and most significantly, since entries for
huge pages in the translation lookaside buffer (TLB) span much greater
address ranges, there is an increased chance that a virtual address already
has a match in one of the limited set of entries currently cached in the
TLB, thus obviating the need to traverse page tables.
Applications can explicitly request the use of huge pages when making
allocations, using either shmget() with the SHM_HUGETLB
flag (since Linux 2.6.0) or mmap() with the MAP_HUGETLB
flag (since Linux 2.6.32). It's worth noting that explicit application
requests are not needed to employ huge pages: the transparent huge pages feature merged in Linux
2.6.38 allows applications to gain much of the performance benefit of
huge pages without making any changes to application code. There is,
however, a limitation to these APIs: they provide no way to specify the
size of the huge pages to be used for an allocation. Instead, the kernel
employs the "default" huge page size.
Some architectures only permit one huge page size; on those
architectures, the default is in fact the only choice. However, some
modern architectures permit multiple huge page sizes, and where the system
administrator has configured the system to provide huge page pools of
different sizes, applications may want to choose the page size used for
their allocation. For example, this may be useful in a NUMA environment,
where a smaller huge page size may be suitable for mappings that are shared
across CPUs, while a larger page size is used for mappings local to a single
CPU.
A patch by Andi Kleen that was accepted
during the 3.8 merge window extends the shmget() and
mmap() system calls to allow the caller to select the size used
for huge page allocations. These system calls have the following
prototypes:
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
int shmget(key_t key, size_t size, int shmflg);
Neither of those calls provides an argument that can be directly used
to specify the desired page size. Therefore, Andi's patch shoehorns the
value into some bits that are currently unused in one of the arguments of
each call—in the flags argument for mmap() and in the
shmflg argument for shmget().
In both system calls, the huge page size is encoded in the six bits
from 26 through to 31 (i.e., the bit mask 0xfc000000). The value
in those six bits is the base-two log of the desired page size. As a special case, if the value encoded in the bits is
zero, then the kernel selects the default huge page size. This provides
binary backward compatibility for the interfaces. If the
specified page size is not supported by the architecture, then
shmget() and mmap() fail with the error
ENOMEM.
An
application can manually perform the base-two log calculation and bit
shift to generate the required bit-mask value, but this is
clumsy. Instead, an architecture can define suitable constants for the huge
page sizes that it supports. Andi's patch defines two such constants corresponding to the
available page sizes on x86-64:
#define SHM_HUGE_SHIFT 26
#define SHM_HUGE_MASK 0x3f
/* Flags are encoded in bits (SHM_HUGE_MASK << SHM_HUGE_SHIFT) */
#define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT) /* 2 MB huge pages */
#define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT) /* 1 GB huge pages */
Corresponding MAP_* constants are defined for use in
the mmap() system call.
Thus, to employ a 2 MB huge page size when calling shmget(), one
would write:
shmget(key, size, flags | SHM_HUGETLB | SHM_HUGE_2MB);
That is, of course, the same as this manually calculated version:
shmget(key, size, flags | SHM_HUGETLB | (21 << SHM_HUGE_SHIFT));
In passing, it's worth noting that an application can determine the
default page size by looking at the Hugepagesize entry in
/proc/meminfo and can, if the kernel was configured with
CONFIG_HUGETLBFS, discover the available page sizes on the system
by scanning the directory entries under /sys/kernel/mm/hugepages.
One concern raised by your editor when
reviewing an earlier version of Andi's patch was whether the bit space in
the mmap() flags argument is becoming exhausted. Exactly
how many bits are still unused in that argument turns out to be a little
difficult to determine, because different architectures define the same
flags with different values. For example, the MAP_HUGETLB flag has
the values 0x4000, 0x40000, 0x80000, or 0x100000, depending on the
architecture. It turns out that before Andi's patch was applied, there were
only around 11 bits in flags that were unused across all
architectures; now that the patch has been applied, just six are left.
The day when the mmap() flags bit space is exhausted
seems to be slowly but steadily approaching. When that happens, either a
new mmap()-style API with a 64-bit flags argument will be
required, or, as Andi suggested, unused
bits in the prot argument could be used; the latter option would
be easier to implement, but would also further muddy the interface of an
already complex system call. In any case, concerns about the API design
didn't stop Andrew Morton from accepting the patch, although he was
prompted to remark "I can't say the
userspace interface is a thing of beauty, but I guess we'll live."
The new API features will roll out in a few weeks' time with the 3.8
release. At that point, application writers will be able to select
different huge page sizes for different memory allocations. However, it
will take a little longer before the MAP_* and SHM_* page
size constants percolate through to the GNU C library. In the meantime,
programmers who are in a hurry will have to define their own versions of
these constants.
Comments (4 posted)
By Jonathan Corbet
January 23, 2013
Last week's article covered the kernel's
current internal API for general-purpose I/O (GPIO) lines. The GPIO API has seen
relatively little change in recent years, but that situation may be about
to change as the result of a couple of significant patch sets that
seek to rework how the GPIO API works in the interest of greater robustness
and better performance.
No more numbers
The current GPIO API relies on simple integers to identify specific GPIO
lines. It works, but there are some shortcomings to this approach. Kernel
code is rarely interested in "GPIO #37"; instead, it wants "the GPIO
connected to the monitor's DDC line" or something to that effect. For
well-defined systems where the use of GPIO lines never changes,
preprocessor definitions can be used to identify lines, but that approach
falls apart when the same GPIO can be put to different uses in different
systems. As hardware gets more dynamic, with GPIOs possibly showing up
at any time, there is no easy way to know which GPIO goes where. It can be
easy to get the wrong one by mistake.
As a result, platform and driver developers have come up with various ways
to locate GPIOs of interest. Even your editor once submitted a patch adding a
gpio_lookup() function to the GPIO API, but that patch didn't
pass muster and was eventually dropped in favor of a driver-specific
solution. So the number-based API has remained — until now.
Alexandre Courbot's descriptor-based GPIO
interface seeks to change the situation by introducing a new struct
gpio_desc * pointer type. GPIO lines would be represented by one
of these pointers; what lives behind the pointer would be hidden from GPIO
users, though. Internally, gpiolib (the implementation of the GPIO API
used by most architectures) is refactored to use descriptors rather
than numbers, and a new set of functions is presented to users. These
functions will look familiar to users of the current GPIO API:
#include <linux/gpio/consumer.h>
int gpiod_direction_input(struct gpio_desc *desc);
int gpiod_direction_output(struct gpio_desc *desc, int value);
int gpiod_get_value(struct gpio_desc *desc);
void gpiod_set_value(struct gpio_desc *desc, int value);
int gpiod_to_irq(struct gpio_desc *desc);
int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
int gpiod_export_link(struct device *dev, const char *name,
                      struct gpio_desc *desc);
void gpiod_unexport(struct gpio_desc *desc);
In short: the gpio_ prefix on the existing GPIO functions has been
changed to gpiod_ and the integer GPIO number argument is now a
struct gpio_desc *. There is also a new include file for the
new functions; otherwise the interfaces are identical.
The existing, integer-based API still exists, but it has been reimplemented
as a layer on top of the descriptor-based API shown here.
What is missing from the above list, though, is any way of obtaining a
descriptor for a GPIO line in the first place. One way to do that is to
get the descriptor from the traditional GPIO number:
struct gpio_desc *gpio_to_desc(unsigned gpio);
There is also a desc_to_gpio() for going in the opposite
direction. Using this function makes it easy to transition existing code
over to the new API. Obtaining a descriptor in this manner will ensure that no code
accesses a GPIO without having first properly obtained a descriptor for
it, but it would be better to do away with the numbers altogether in favor
of a more robust way of looking up GPIOs. The patch set adds this
functionality in this form:
struct gpio_desc *gpiod_get(struct device *dev, const char *name);
Here, dev should be the device providing the GPIO line, and "name"
describes that line. The dev pointer is needed to disambiguate
the name, and because code accessing a GPIO line should know which device
it is working through in any case. So, for example, a video acquisition
bridge device may need access to GPIO lines with names like "sensor-power",
"sensor-reset", "sensor-i2c-clock" and "sensor-i2c-data". The driver could
then request those lines by name with gpiod_get() without ever
having to be concerned with numbers.
Needless to say, there is a gpiod_put() for releasing access to a
GPIO line.
The actual association of names with GPIO lines can be done by the driver
that implements those lines, if the names are static and known. In many
cases, though, the routing of GPIO lines will have been done by whoever
designed a specific system-on-chip or board; there is no way for the driver
author to know ahead of time how a specific system may be wired. In this
case, the names of the GPIO lines will most likely be specified in the
device tree, or, if all else fails, in a platform data structure.
The response to this interface is generally positive; it seems almost
certain that it will be merged in the near future. The biggest remaining
concern, perhaps, is that the descriptor interface is implemented entirely
within the gpiolib layer. Most architectures use gpiolib to implement the
GPIO interface, but it is not mandatory; in some cases, the gpio_*
functions are implemented as macros that access the device registers
directly. Such an implementation is probably more efficient, but GPIO is
not usually a performance-critical part of the system. So there may be
pressure for all architectures to move to gpiolib; that, in turn, would
facilitate the eventual removal of the number-based API entirely.
Block GPIO
The GPIO interface as described so far is focused on the management of
individual GPIO lines. But GPIOs are often used together as a group. As a
simple example, consider a pair of GPIOs used as an I2C bus; one line
handles data, the other the clock. A bit-banging driver can manage those
two lines together to communicate with connected I2C devices; the kernel
contains a driver in drivers/i2c/busses/i2c-gpio.c for just this
purpose.
Most of the time, managing GPIOs individually, even when they are used as a
group, works fine. Computers are quite fast relative to the timing
requirements of most of the serial communications protocols that are
subject to implementation with GPIO. But there are exceptions, especially
when the hardware implementing the GPIO lines is itself slow; that can make
it hard to change multiple lines simultaneously. But, sometimes, the
hardware can change lines simultaneously if properly asked; often
the lines are represented by bits in the same device register and can all
be changed together with a single I/O memory write operation.
Roland Stigge's block GPIO patch set is an
attempt to make that functionality available in the kernel. Code that
needs to manipulate multiple GPIOs as a group would start by associating
them in a single block with:
struct gpio_block *gpio_block_create(unsigned int *gpios, size_t size,
                                     const char *name);
gpios points to an array of size GPIO numbers which are
to be grouped into a block; the given name can be used to work
with the block from user space. The GPIOs should have already been
requested with gpio_request(); they also need to have their
direction set individually. It's worth noting that the GPIOs need not be
located on the same hardware; if they are spread out, or if the underlying
driver does not implement the internal block API, the block GPIO
interface will just access those lines individually as is done now.
Manipulation of GPIO blocks is done with:
unsigned long gpio_block_get(struct gpio_block *block, unsigned long mask);
void gpio_block_set(struct gpio_block *block, unsigned long mask,
                    unsigned long values);
For both functions, block is a GPIO block created as described
above, and mask is a bitmask specifying which GPIOs in the block
are to be acted upon; each bit in mask enables the corresponding
GPIO in the array passed to gpio_block_create().
This API implies that the number of bits in a
long forces an upper bound on the number of lines grouped into a GPIO
block; that seems unlikely to be a problem in real-world use.
gpio_block_get() will read the specified lines,
simultaneously if possible, and return a bitmask with the result. The
lines in a GPIO block can be set as a unit with gpio_block_set().
A GPIO block is released with:
void gpio_block_free(struct gpio_block *block);
There is also a pair of registration functions:
int gpio_block_register(struct gpio_block *block);
void gpio_block_unregister(struct gpio_block *block);
Registering a GPIO block makes it available to user space. There is a
sysfs interface that can be used to query and set the GPIOs in a block.
Interestingly, registration also creates a device node (using the name
provided to gpio_block_create()); reading from that device returns
the current state of the GPIOs in the block, while writing it will set the
GPIOs accordingly. There is an ioctl() operation (which,
strangely, uses zero as the command number) to set the mask to be used with
read and write operations.
This patch set has not generated as much discussion as the descriptor-based
API patches (it is also obviously not yet integrated with the descriptor
API). Most likely, relatively few developers have felt the need for a
block-based API. That said, there are cases when it is likely to be
useful, and there appears to be no opposition, so this API can eventually
be expected to be merged as well.
Comments (7 posted)
By Michael Kerrisk
January 19, 2013
Error reporting from the kernel (and low-level system libraries such as
the C library) has been a primitive affair since the earliest UNIX
systems. One of the consequences of this is that end users and system
administrators often encounter error messages that provide quite limited
information about the cause of the error, making it difficult to diagnose
the underlying problem. Some recent discussions on the libc-alpha and Linux
kernel mailing lists were started by developers who would like to improve
this state of affairs by having the kernel provide more detailed error
information to user space.
The traditional UNIX (and Linux) method of error reporting is via the
(per-thread) global errno variable. The C library wrapper
functions that invoke system calls indicate an error by returning -1 as the
function result and setting errno to a positive integer value that
identifies the cause of the error.
The fact that errno is a global variable is a source of
complications for user-space programs. Because each system call may
overwrite the global value, it is sometimes necessary to save a copy of the
value if it needs to be preserved while making another system call. The
fact that errno is global also means that signal handlers that
make system calls must save a copy of errno on entry to the
handler and restore it on exit, to prevent the possibility of overwriting an
errno value that had previously been set in the main program.
Another problem with errno is that the information it reports
is rather minimal: one of somewhat more than one hundred integer
codes. Given that the kernel provides hundreds of system calls, many of
which have multiple error cases, the mapping of errors to
errno values inevitably means a loss of information.
That loss of information can be particularly acute when it comes to
certain commonly used errno values. In a message to the libc-alpha mailing list, Dan
Walsh explained the problem for two errors that are frequently encountered
by end users:
Traditionally, if a process attempts a forbidden operation, errno for that
thread is set to EACCES or EPERM, and a call to strerror() returns a
localized version of "Permission Denied" or "Operation not permitted". This
string appears throughout textual uis and syslogs. For example, it will
show up in command-line tools, in exceptions within scripting languages,
etc.
Those two errors have been defined on UNIX systems since early times. POSIX
defines
EACCES as "an attempt was made to access a file in a way
forbidden by its file access permissions" and EPERM as
"an attempt was made to perform an operation limited to processes
with appropriate privileges or to the owner of a file or other
resource." These definitions were fairly comprehensible on early
UNIX systems, where the kernel was much less complex, the only method of
controlling file access was via classical rwx file permissions,
and the only kind of privilege separation was via user and group IDs and
superuser versus non-superuser. However, life is rather more complex on
modern UNIX systems.
In all, EPERM and EACCES are returned by more than
3000 locations across the Linux 3.7 kernel source code. However, it is not
so much the number of return paths yielding these errors that is the
problem. Rather, the problem for end users is determining the underlying
cause of the errors. The possible causes are many, including denial of file
access because of insufficient (classical) file permissions or because of
permissions in an ACL, lack of the right capability, denial of an operation
by a Linux Security Module or by the seccomp
mechanism, and any of a number of other reasons. Dan summarized the
problem faced by the end user:
As we continue to add mechanisms for the Kernel to deny permissions, the
Administrator/User is faced with just a message that says "Permission Denied"
Then if the administrator is lucky enough or skilled enough to know where to
look, he might be able to understand why the process was denied access.
Dan's mail linked to a wiki page
("Friendly EPERM") with a proposal on how to deal with the
problem. That proposal involves changes to both the kernel and the GNU C
library (glibc). The kernel changes would add a mechanism for exposing a
"failure cookie" to user space that would provide more detailed information
about the error delivered in errno. On the glibc side,
strerror() and related calls (e.g., perror()) would
access the failure cookie in order to obtain information that could be used to
provide a more detailed error message to the user.
Roland McGrath was quick to point out
that the solution is not so simple. The problem is that it is quite common
for applications to call strerror() only some time after a failed
system call, or to do things such as saving errno in a temporary
location and then restoring it later. In the meantime, the application is
likely to have performed further system calls that may have changed the
value of the failure cookie.
Roland went on to identify some of the problems inherent in trying to extend
existing standardized interfaces in order to provide useful error information to
end users:
It is indeed an unfortunate limitation of POSIX-like interfaces that
error reporting is limited to a single integer. But it's very deeply
ingrained in the fundamental structure of all Unix-like interfaces.
Frankly, I don't see any practical way to achieve what you're after.
In most cases, you can't even add new different errno codes for different
kinds of permission errors, because POSIX specifies the standard code for
certain errors and you'd break both standards compliance and all
applications that test for standard errno codes to treat known classes of
errors in particular ways.
In response, Eric Paris, one of the other proponents of the
failure-cookie idea, acknowledged Roland's
points, noting that since the standard APIs can't be extended, changes
would be required to each application that wanted to take advantage of any
additional error information provided by the kernel.
Eric subsequently posted a note to the
kernel mailing list with a proposal on the kernel changes required to
support improved error reporting. In essence, he proposes exposing some
form of binary structure to user space that describes the cause of the last
EPERM or EACCES error returned to the process by the
kernel. That structure might, for example, be exposed via a thread-specific
file in the /proc filesystem.
The structure would take the form of an initial field that indicates
the subsystem that triggered the error—for example, capabilities,
SELinux, or file permissions—followed by a union of substructures
that provide subsystem-specific detail on the circumstances that triggered
the error. Thus, for a file permissions error, the substructure might
return the effective user and group ID of the process, the file user ID and
group ID, and the file permission bits. At the user-space
level, the binary structure could be read and translated to human-readable
strings, perhaps via a glibc function that Eric suggested might be named
something like get_extended_error_info().
Each of the kernel call sites that returned an EPERM or
EACCES error would then need to be patched to update this
information. But, patching all of those call sites would not be necessary
to make the feature useful. As Eric noted:
But just getting extended denial information in a couple of
the hot spots would be a huge win. Put it in capable(), LSM hooks, the
open() syscall and path walk code.
There were various comments on Eric's proposal. In response to concerns from
Stephen Smalley that this feature might leak information (such as
file attributes) that could be
considered sensitive in systems with a strict security policy (enforced by
an LSM), Eric responded
that the system could provide a sysctl to disable the feature:
I know many people are worried about information leaks, so I'll right up
front say lets add the sysctl to disable the interface for those who are
concerned about the metadata information leak. But for most of us I
want that data right when it happens, where it happens, so It can be
exposed, used, and acted upon by the admin trying to troubleshoot why
the shit just hit the fan.
Reasoning that it's best to use an existing format and its tools rather
than inventing a new format for error reporting, Casey Schaufler suggested that audit records should be used instead:
the string returned by get_extended_error_info()
ought to be the audit record the system call would generate, regardless
of whether the audit system would emit it or not.
If the audit record doesn't have the information you need we should
fix the audit system to provide it. Any bit of the information in
the audit record might be relevant, and your admin or developer might
need to see it.
Eric expressed concerns that copying an
audit record to the process's task_struct would carry more of a
performance hit than copying a few integers to that structure, concluding:
I don't see a problem storing the last audit record if it exists, but I
don't like making audit part of the normal workflow. I'd do it if others
like that though.
Jakub Jelinek wondered which system
call Eric's mechanism should return information about, and whether its
state would be reset if a subsequent system call succeeded. In many cases,
there is no one-to-one mapping between C library calls and system calls, so
that some library functions may make one system call, save errno,
then make some other system call (that may or may not also fail), and then
restore the first system call's errno before returning to the
caller. Other C library functions themselves set errno. "So,
when would it be safe to call this new get_extended_error_info function and
how to determine to which syscall it was relevant?"
Eric's opinion was that the mechanism
should return information about the last kernel system call. "It
would be really neat for libc to have a way to save and restore the
extended errno information, maybe even supply its own if it made the choice
in userspace, but that sounds really hard for the first pass."
However, there are problems with such a bare-bones approach. If the
value returned by get_extended_error_info() corresponds to the last
system call, rather than the errno value actually returned to user
space, this risks confusing user-space applications (and users). Carlos
O'Donell, who had earlier raised some of
the same questions as Jakub and pointed out the need to properly handle the
extended error information when a signal handler interrupts the main
program, agreed with Casey's assessment that
get_extended_error_info() should always return a value that
corresponds to the current content of errno. That implies the need
for a user-space function that can save and restore the extended error
information.
Finally, David Gilbert suggested that
it would be useful to broaden Eric's proposal to handle errors beyond
EPERM and EACCES. "I've wasted way too much time
trying to figure out why mmap (for example) has given me an EINVAL; there
are just too many holes you can fall into."
In the last few days, discussion in the thread has gone quiet. However,
it's clear that Dan and Eric have identified a very real and practical
problem (and one that has been identified
by others in the past). The solution would probably need to address the
concerns raised in the discussion—most notably the need to have
get_extended_error_info() always correspond to the current value
of errno—and might possibly also be generalized beyond
EPERM and EACCES. However, that should all be feasible,
assuming someone takes on the (not insignificant) work of fleshing out the
design and implementing it. If they do, the lives of system administrators
and end users should become considerably easier when it comes to diagnosing
the causes of software error reports.
Comments (90 posted)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet