Kernel development
Brief items
Kernel release status
The 3.14 merge window remains open, so there is no current development kernel release.

Stable updates: 3.12.9 and 3.10.28 were released on January 25, followed by 3.13.1 and 3.4.78 on January 29.
Quotes of the week
#define PERM__rw_r__r__ 0644
#define PERM__r________ 0400
#define PERM__r__r__r__ 0444
#define PERM__r_xr_xr_x 0555
Gorman: LSF/MM 2014 so far
Mel Gorman, chair of the 2014 Linux Storage, Filesystem, and Memory Management Summit, notes that the CFP deadline is approaching and that the event is shaping up nicely. "I am pleased to note that there are a number of new people sending in attend and topic mails. The long-term health of the community depends on new people getting involved and breaking through any perceived barrier to entry. At least, it has been the case for some time that there is more work to do in the memory manager than there are people available to do it. It helps to know that there are new people on the way." Anybody wanting to attend who has not yet sent in a proposal should not delay much further.
Kernel development news
3.14 Merge window part 2
As of this writing, almost 8,600 non-merge changesets have been pulled into the mainline repository for the 3.14 development cycle — 5,300 since last week's merge window summary. As can be seen from the list below, quite a bit of new functionality has been added to the kernel in the last week. Some of the more significant, user-visible changes merged include:
- The event triggers feature has been
added to the tracing subsystem. See this
commit for some information on how to use this feature.
- The user-space probes (uprobes) subsystem has gained support for a
number of "fetch methods" providing access to data on the stack, from
process memory, and more. See the
patchset posting for more information.
- The Xen paravirtualization subsystem has gained support for a "paravirtualization
in an HVM container" (PVH) mode which makes better use of hardware
virtualization extensions to speed various operations (page table
updates, for example).
- The ARM architecture can be configured to protect kernel module text
and read-only data from modification or execution. The help text for
this feature notes that it may interfere with dynamic tracing.
- The new SIOCGHWTSTAMP network ioctl() command allows an application to retrieve the current hardware timestamping configuration without changing it; a brief usage sketch appears after the hardware support list below.
- "TCP autocorking" is a new networking feature that will delay small
data transmissions in the hope of coalescing them into larger
packets. The result can be better CPU and network utilization. The
new tcp_autocorking sysctl knob can be used to turn off this
feature, which is on by default.
- The Bluetooth Low Energy support now handles connection-oriented
channels, increasing the number of protocols that can work over the LE
mode. 6LoWPAN emulation
support is also now available for Bluetooth LE devices.
- The Berkeley Packet Filter subsystem has acquired a couple of new
user-space tools: a debugger and a simple assembler. See the newly
updated Documentation/networking/filter.txt for
more information.
- The new "heavy-hitter filter" queuing discipline tries to distinguish
small network flows from the big ones, prioritizing the former. This
commit has
some details.
- The "Proportional Integral controller Enhanced" (PIE) packet scheduler
is aimed at eliminating bufferbloat problems. See this
commit for more information.
- The xtensa architecture code has gained support for multiprocessor systems.
- The Ceph distributed filesystem now has support for access control
lists.
- New hardware support includes:
- Processors and systems:
Marvell Berlin systems-on-chip (SoCs),
Energy Micro EFM32 SoCs,
MOXA ART SoCs,
Freescale i.MX50 processors,
Hisilicon Hi36xx/Hi37xx processors,
Snapdragon 800 MSM8974 SoCs,
Systems based on the ARM "Trusted Foundations" secure monitor,
Freescale TWR-P102x PowerPC boards, and
Motorola/Emerson MVME5100 single board computers.
- Clocks:
Allwinner sun4i/sun7i realtime clocks (RTCs),
Intersil ISL12057 RTCs,
Silicon Labs 5351A/B/C programmable clock generators,
Qualcomm MSM8660, MSM8960, and MSM8974 clock controllers, and
Haoyu Microelectronics HYM8563 RTCs.
- Miscellaneous:
AMD cryptographic coprocessors,
Freescale MXS DCP cryptographic coprocessors (replacement for an
older, unmaintained driver),
OpenCores VGA/LCD core 2.0 framebuffers,
generic GPIO-connected beepers,
Cisco Virtual Interface InfiniBand cards,
Active-Semi act8865 voltage regulators,
Maxim 14577 voltage regulators,
Broadcom BCM63XX HS SPI controllers, and
Atmel pulse width modulation controllers.
- Multimedia Card (MMC):
Arasan SDHCI controllers and
Synopsys DesignWare interfaces on Hisilicon K3 SoCs.
- Networking: Marvell 8897 WiFi and near-field communications (NFC) interfaces, Intel XL710 X710 Virtual Function Ethernet controllers, and Realtek RTL8153 Ethernet adapters.
Note also that the AIC7xxx SCSI driver, deprecated since the 2.4 days, has finally been removed from the kernel.
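As promised above, here is a minimal sketch of how an application might use the new SIOCGHWTSTAMP command to read back the hardware timestamping configuration without disturbing it. The interface name "eth0" is an assumption, error handling is kept to a minimum, and kernel headers new enough to define the ioctl are required:

    /* Minimal sketch: query the hardware timestamping configuration of
     * "eth0" without changing it.  Illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/net_tstamp.h>    /* struct hwtstamp_config */
    #include <linux/sockios.h>       /* SIOCGHWTSTAMP (3.14 headers) */

    int main(void)
    {
        struct hwtstamp_config cfg;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return 1;

        memset(&cfg, 0, sizeof(cfg));
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&cfg;

        /* Unlike SIOCSHWTSTAMP, this only reads the current state. */
        if (ioctl(fd, SIOCGHWTSTAMP, &ifr) < 0) {
            perror("SIOCGHWTSTAMP");
            close(fd);
            return 1;
        }

        printf("tx_type: %d, rx_filter: %d\n", cfg.tx_type, cfg.rx_filter);
        close(fd);
        return 0;
    }

Before this addition, the only way to learn the current settings was to set new ones with SIOCSHWTSTAMP, which requires privileges and perturbs the configuration.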
Changes visible to kernel developers include:
- The ARM architecture code can be configured to create a file
(kernel_page_tables) in the debugfs filesystem where the
layout of the kernel's page tables can be examined.
- The checkpatch script will now complain about memory allocations using
the __GFP_NOFAIL flag.
- There is a new low-level library for computing hash values in situations where speed is more important than the quality of the hash; see this commit for details.
At this point, the 3.14 merge window appears to be winding down. If the usual two-week standard applies, the window should stay open through February 2, but Linus has made it clear in the past that the window can close earlier if he sees fit. Either way, next week's Kernel Page will include a summary of the final changes pulled for this development cycle.
Preparing for large-sector drives
Back in the distant past (2010), kernel developers were working on supporting drives with 4KB physical sectors in Linux. That work is long since done, and 4KB-sector drives work seamlessly. Now, though, the demands on the hard drive industry are pushing manufacturers toward the use of sectors larger than 4KB. A recent discussion ahead of the upcoming (late March) Linux Storage, Filesystem and Memory Management Summit suggests that getting Linux to work on such devices may be a rather larger challenge requiring fundamental kernel changes — unless it isn't.

Ric Wheeler started the discussion by proposing that large-sector drives could be a topic of discussion at the Summit. The initial question — when such drives might actually become reality — did not get a definitive answer; drive manufacturers, it seems, are not ready to go public with their plans. Clarity increased when Ted Ts'o passed along a bit of information that he was able to share on the topic.
Larger sectors would clearly bring some inconvenience to kernel developers, but, since they can help drive manufacturers offer more capacity at lower cost, they seem almost certain to show up at some point.
Do (almost) nothing
One possible response, espoused by James Bottomley, is to do very little in anticipation of these drives. He pointed out that much of the work done to support 4KB-sector drives was not strictly necessary; the drive manufacturers said that 512-byte transfers would not work on such drives, but the reality has turned out to be different. Not all operating systems were able to adapt to the 4KB size, so drives have read-modify-write (RMW) logic built into their firmware to handle smaller transfers properly. So Linux would have worked anyway, albeit with some performance impact.
James's point is that the same story is likely to play out with larger sector sizes; even if manufacturers swear that only full-sector transfers will be supported, those drives will still, in the end, have to work with popular operating systems. To do that, they will have to support smaller transfers with RMW. So it comes down to what's needed to perform adequately on those drives. Large transfers will naturally include a number of full-sector chunks, so they will mostly work already; the only partial-sector transfers would be the pieces at either end. Some minor tweaks to align those transfers to the hardware sector boundary would improve the situation, and a bit of higher-level logic could cause most transfers to be sized to match the underlying sector size. So, James argued, the kernel is already most of the way there; in effect, a 99% solution.
But Martin Petersen, arguably the developer most on top of what manufacturers are actually doing with their drives, claimed that, while consumer-level drives all support small-sector emulation with RMW, enterprise-grade drives often do not. If the same holds true for larger-sector drives, the 99% solution may not be good enough and more will need to be done.
Larger blocks in the kernel
There are many ways in which large sector support could be implemented in the kernel. One possibility, mentioned by Chris Mason, would be to create a mapping layer in the device mapper that would hide the larger sector sizes from the rest of the kernel. This option just moves the RMW work into a low-level kernel layer, though, and does nothing to address the performance issues associated with that extra work.
Avoiding the RMW overhead requires that filesystems know about the larger sector size and use a block size that matches. Most filesystems are nearly ready to do that now; they are generally written with the idea that one filesystem's block size may differ from another. The challenges are, thus, not really at the filesystem level; where things get interesting is with the memory management (MM) subsystem.
The MM code deals with memory in units of pages. On most (but not all) architectures supported by Linux, a page is 4KB of memory. The MM code charged with managing the page cache (which occupies a substantial portion of a system's RAM) assumes that individual pages can easily be moved to and from the filesystems that provide their backing store. So a page fault may just bring in a single 4KB page, without regard for the fact that said page may be embedded within a larger sector on the storage device. If the 4KB page cannot be read independently, the filesystem code must read the whole sector, then copy the desired page into its destination in the page cache. Similarly, the MM code will write pages back to persistent store with no understanding of the other pages that may share the same hardware sector; that could force the filesystem code to reassemble sectors and create surprising results by writing out pages that were not, yet, meant to be written.
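To make that read-modify-write cost concrete, here is a rough userspace-flavored sketch. It is not taken from any filesystem; the 64KB sector size is purely an assumption, and dev_fd stands in for a descriptor open on the device. It shows why writing a single 4KB page drags the whole surrounding sector along with it:

    /* Sketch of the read-modify-write penalty: writing one 4KB page that
     * sits inside a (hypothetical) 64KB hardware sector means reading and
     * rewriting the entire sector. */
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_BYTES   4096
    #define SECTOR_BYTES 65536    /* assumed large-sector size */

    int write_page_rmw(int dev_fd, off_t page_index, const void *page_data)
    {
        unsigned char sector_buf[SECTOR_BYTES];
        off_t byte_off     = page_index * PAGE_BYTES;
        off_t sector_start = byte_off - (byte_off % SECTOR_BYTES);

        /* Read: fetch the whole hardware sector containing the page. */
        if (pread(dev_fd, sector_buf, SECTOR_BYTES, sector_start) != SECTOR_BYTES)
            return -1;

        /* Modify: replace only the 4KB page actually being written. */
        memcpy(sector_buf + (byte_off - sector_start), page_data, PAGE_BYTES);

        /* Write: the full sector goes back out, dragging its neighboring
         * pages along whether or not they were meant to be written yet. */
        if (pwrite(dev_fd, sector_buf, SECTOR_BYTES, sector_start) != SECTOR_BYTES)
            return -1;
        return 0;
    }

Whether this dance happens in drive firmware, in a filesystem, or in a device-mapper layer, the extra read and the full-sector write are the performance cost being discussed.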
Avoiding these problems almost certainly means teaching the MM code to manage pages in larger chunks. There have been some attempts to do so over the years; consider, for example, Christoph Lameter's large block patch set that was covered here back in 2007. This patch enabled variable-sized chunks in the page cache, with anything larger than the native page size being stored in compound pages. And that is where this patch ran into trouble.
Compound pages are created by grouping together a suitable number of physically contiguous pages. These "higher-order" pages have always been risky for any kernel subsystem to rely on; the normal operation of the system tends to fragment memory over time, making such pages hard to find. Any code that allocates higher-order pages must be prepared for those allocations to fail; reducing the reliability of the page cache in this way was not seen as desirable. So this patch set never was seriously considered for merging.
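That fragility shows in the shape such code has to take. The kernel-style sketch below is not from any posted patch; it simply illustrates the fallback that any page-cache user of compound pages would need, since a physically contiguous allocation of any order above zero can fail once memory becomes fragmented:

    /* Kernel-style sketch (not from the patch set): try to allocate a
     * compound page big enough for a large block, falling back to a
     * single 4KB page when fragmentation makes that impossible. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static struct page *get_block_pages(unsigned int order)
    {
        struct page *page;

        /* A higher-order, physically contiguous allocation can fail at
         * any time on a fragmented system... */
        page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NOWARN, order);
        if (page)
            return page;

        /* ...so every caller needs a fallback path; a single page is the
         * only allocation the kernel can (nearly) always satisfy. */
        return alloc_page(GFP_KERNEL);
    }

For the page cache, of course, "fall back to a single page" defeats the purpose: the whole point was to match the device's larger sector size.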
Nick Piggin's fsblock work, also started in 2007, had a different goal: the elimination of the "buffer head" structure. It also enabled the use of larger blocks when passing requests to filesystems, but at a significant cost: all filesystems would have had to be modified to use an entirely different API. Fsblock also needed higher-order pages, and the patch set was, in general, large and intimidating. So it didn't get very far, even before Nick disappeared from the development community.
One might argue that these approaches should be revisited now. The introduction of transparent huge pages, memory compaction, and more, along with larger memory sizes in general, has made higher-order allocations much more reliable than they once were. But, as Mel Gorman explained, relying on higher-order allocations for critical parts of the kernel is still problematic. If the system is entirely out of memory, it can push some pages out to disk or, if really desperate, start killing processes; that work is guaranteed to make a number of single pages available. But there is nothing the kernel can do to guarantee that it can free up a higher-order page. Any kernel functionality that depends on obtaining such pages could be put out of service indefinitely by the wrong workload.
Avoiding higher-order allocations
Most Linux users, if asked, would not place "page cache plagued by out-of-memory errors" near the top of their list of desired kernel features, even if it comes with support for large-sector drives. So it would seem that any scheme based on being able to allocate physically contiguous chunks of memory larger than the base allocation size used by the MM code is not going to get very far. The alternatives, though, are not without their difficulties.
One possibility would be to move to the use of virtually contiguous pages in the page cache. These large pages would still be composed of a multitude of 4KB pages, but those pages could be spread out in memory; page-table entries would then be used to make them look contiguous to the rest of the kernel. This approach has special challenges on 32-bit systems, where there is little address space available for this kind of mapping, but 64-bit systems would not have that problem. All systems, though, would have the problem that these virtual pages are still groups of small pages behind the mapping. So there would still be a fair amount of overhead involved in setting up the page tables, creating scatter/gather lists for I/O operations, and more. The consensus seems to be that the approach could be workable, but that the extra costs would reduce any performance benefits considerably.
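A minimal kernel-style sketch of that approach, using nothing beyond the existing vmap() interface and an assumed 64KB block built from sixteen 4KB pages, illustrates both the appeal and the overhead:

    /* Sketch: build a virtually contiguous 64KB block out of sixteen
     * independently allocated 4KB pages.  Sizes are assumptions made for
     * illustration, not anything settled in the discussion. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    #define NR_SUBPAGES 16    /* 16 x 4KB = 64KB */

    static void *alloc_virtual_block(struct page **pages)
    {
        void *addr;
        int i;

        /* The individual 4KB pages are allocated one at a time, so
         * memory fragmentation is not a problem... */
        for (i = 0; i < NR_SUBPAGES; i++) {
            pages[i] = alloc_page(GFP_KERNEL);
            if (!pages[i])
                goto fail;
        }

        /* ...but making them look contiguous costs page-table setup, and
         * I/O on the block still needs a 16-entry scatter/gather list. */
        addr = vmap(pages, NR_SUBPAGES, VM_MAP, PAGE_KERNEL);
        if (addr)
            return addr;

    fail:
        while (--i >= 0)
            __free_page(pages[i]);
        return NULL;
    }

The allocation itself is reliable; it is the mapping and the per-subpage I/O bookkeeping that eat into the expected performance gains.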
Another possibility is to increase the size of the base unit of memory allocation in the MM layer. In the early days, when a well-provisioned Linux system had 4MB of memory, the page size was 4KB. Now that memory sizes have grown by three orders of magnitude — or more — the page size is still 4KB. So Linux systems are managing far more pages than they used to, with a corresponding increase in overhead. Memory sizes continue to increase, so this overhead will increase too. And, as Ted pointed out in a different discussion late last year, persistent memory technologies on the horizon have the potential to expand memory sizes even more.
So there are good reasons to increase the base page size in Linux even in the absence of large-sector drives. As Mel put it, "It would get more than just the storage gains though. Some of the scalability problems that deal with massive amount of struct pages may magically go away if the base unit of allocation and management changes." There is only one tiny little problem with this solution: implementing it would be a huge and painful exercise. There have been attempts to implement "page clustering" in the kernel in the past, but none have gotten close to being ready to merge. Linus has also been somewhat hostile to the concept of increasing the base page size in the past, fearing the memory waste caused by internal fragmentation.
A number of unpleasant options
In the end, Mel described the available options in this way:
- major filesystem overhaul
- major vm overhaul
- use compound pages as they are today and hope it does not go completely to hell, reboot when it does
With that set of alternatives to choose from, it is not surprising that none have, thus far, developed an enthusiastic following. It seems likely that all of this could lead to a most interesting discussion at the Summit in March. Even if large-sector drives could be supported without taking any of the above options, chances are that, sooner or later, the "major VM overhaul" option is going to require serious consideration. It may mostly be a matter of when somebody feels the pain badly enough to be willing to try to push through a solution.
Supporting Intel MPX in Linux
Buffer overflows have long been a source of serious bugs and security problems at all levels of the software stack. Much work has been done over the years to eliminate unsafe library functions, add stack-integrity checking and more, but buffer overflow bugs still happen with great regularity. A recently posted kernel patch is one of the final steps toward the availability of a new tool that should help to make buffer overflow problems less common: Intel's upcoming "MPX" hardware feature.

MPX is, at its core, a hardware-assisted mechanism for performing bounds checking on pointer accesses. The hardware, following direction from software, maintains a table of pointers in use and the range of accessible memory (the "bounds") associated with each. Whenever a pointer is dereferenced, special instructions can be used to ensure that the program is accessing memory within the range specified for that particular pointer. These instructions are meant to be fast, allowing bounds checking to be performed on production systems with a minimal performance impact.
As one might expect, quite a bit of supporting software work is needed to make this feature work, since the hardware cannot, on its own, have any idea of what the proper bounds for any given pointer would be. The first step in this direction is to add support to the GCC compiler. Support for MPX in GCC is well advanced, and should be considered for merging into the repository trunk sometime in the near future.
When a file is compiled with the new -fmpx flag, GCC will generate code to make use of the MPX feature. That involves tracking every pointer created by the program and the associated bounds; any time that a new pointer is created, it must be inserted into the bounds table for checking. Tracking of bounds must follow casts and pointer arithmetic; there is also a mechanism for "narrowing" a set of bounds when a pointer to an object within another object (a specific structure field, say) is created. The function-call interface is changed so that when a pointer is passed to a function, the appropriate bounds are passed with it. Pointers returned from functions also carry bounds information.
With that infrastructure in place, it becomes possible to protect a program against out-of-bounds memory accesses. To that end, whenever a pointer is dereferenced, the appropriate instructions are generated to perform a bounds check first. See Documentation/x86/intel_mpx.txt, included with the kernel patch set (described below), for details on how code generation changes. In brief: the new bndcl and bndcu instructions check a pointer reference against the lower and upper limits, respectively. If those instructions succeed, the pointer is known to be within the allowed range.
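The kind of bug this catches is easy to show. The program below is ordinary C with nothing MPX-specific in its source; the assumption is that it is built with an MPX-enabled GCC using the -fmpx flag described above, linked against a bounds-instrumented C library (discussed next), and run on MPX-capable hardware. Under those conditions, the store one element past the end of the buffer would fail the upper-bound (bndcu) check and trap rather than silently corrupting the heap:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* With MPX instrumentation, the bounds of "buf" (its start and
         * start + 16 ints) are recorded when malloc() returns the pointer. */
        int *buf = malloc(16 * sizeof(int));
        if (!buf)
            return 1;

        /* Off-by-one: index 16 is one element past the end.  Without MPX
         * this quietly scribbles on the heap; with instrumentation, the
         * upper-bound check on the final store fails and the CPU traps. */
        for (int i = 0; i <= 16; i++)
            buf[i] = i;

        printf("%d\n", buf[7]);
        free(buf);
        return 0;
    }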
The next step is to prepare the C library for bounds checking. At a minimum, that means building the library with -fmpx, but there is more to it than that. Any library function that creates an object (malloc(), say) needs to return the proper bounds along with the pointer to the object itself. In the end, the C library will be the source for a large portion of the bounds information used within an application. The bulk of the work for the GNU C library (glibc) is evidently done and committed to the glibc git repository. Instrumentation of other libraries would also be desirable, of course, but the C library is the obvious starting point.
Then there is the matter of getting the necessary support code into the kernel; Qiaowei Ren has recently posted a patch set to do just that. Part of the patch set is necessarily management overhead: allowing applications to set up bounds tables, removing bounds tables when the memory they refer to is unmapped, and so on. But much of the work is oriented around the user-space interface to the MPX feature.
The first step is to add two new prctl() options: PR_MPX_INIT and PR_MPX_RELEASE. The first of those sets up MPX checking and turns on the feature, while the second cleans everything up. Applications can thus explicitly control pointer bounds checking, but that is not how the feature is expected to be used in most cases. Instead, the system runtime will probably turn on MPX as part of application startup, before the application itself begins to run. Current discussion on the linux-kernel list suggests that it may be possible to do the entire setup and teardown job within the user-space runtime code, making these prctl() calls unnecessary, so they may not actually find their way into the mainline kernel.
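For completeness, here is a sketch of how a program could exercise those proposed options directly. The PR_MPX_INIT and PR_MPX_RELEASE constants come from the posted patch set and, as noted above, may never reach mainline headers, so the sketch simply skips the calls when they are not defined:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
    #if defined(PR_MPX_INIT) && defined(PR_MPX_RELEASE)
        /* Turn bounds checking on for this process; normally the runtime
         * would do this before main() ever runs... */
        if (prctl(PR_MPX_INIT, 0, 0, 0, 0) != 0)
            perror("PR_MPX_INIT");

        /* ... application code runs with MPX enabled here ... */

        /* ...and tear the bounds tables down again when finished. */
        if (prctl(PR_MPX_RELEASE, 0, 0, 0, 0) != 0)
            perror("PR_MPX_RELEASE");
    #else
        fprintf(stderr, "PR_MPX_* not defined in these kernel headers\n");
    #endif
        return 0;
    }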
When a bounds violation is detected, the processor will trap into the kernel. The kernel, in turn, will turn the trap into a SIGSEGV signal to be delivered to the application, similar to other types of memory access errors. Applications that look at the siginfo structure passed to the signal handler from the kernel will be able to recognize a bounds error by checking the si_code field for the new SEGV_BNDERR value. The offending address will be stored in si_addr, while the bounds in effect at the time of the trap will be stored in si_lower and si_upper. But most programs, of course, will not handle SIGSEGV at all and will simply crash in this situation.
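A program that does want to catch bounds violations would use an ordinary SA_SIGINFO handler. This sketch assumes that the SEGV_BNDERR value and the si_lower/si_upper fields described above are present in the installed signal.h; it falls back to the default crash for any other kind of SIGSEGV:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Report MPX bounds violations; let other SIGSEGVs crash as usual. */
    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
    #ifdef SEGV_BNDERR
        if (info->si_code == SEGV_BNDERR) {
            fprintf(stderr, "bounds error: %p outside [%p, %p]\n",
                    info->si_addr, info->si_lower, info->si_upper);
            _exit(1);
        }
    #endif
        /* Not a bounds error: restore the default action and re-raise. */
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* ... MPX-instrumented application code would run here ... */
        return 0;
    }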
In summary, there is a fair amount of development work needed to make this hardware feature available to user applications. The good news is that, for the most part, this work appears to be done. Using MPX within the kernel itself should also be entirely possible, but no patches to that effect have been posted so far. Adding bounds checking to the kernel without breaking things is likely to present a number of interesting challenges; for example, narrowing would have to be reversed anytime the container_of() macro is used — and there are thousands of container_of() calls in the kernel. Finding ways to instrument the kernel would thus be tricky; doing this instrumentation in a way that does not make a mess out of the kernel source could be even harder. But there would be clear benefits should somebody manage to get the job done.
Meanwhile, though, anybody looking forward to MPX will have to wait for a couple of things: hardware that actually supports the feature and distributions built to use it. MPX is evidently a part of Intel's "Skylake" architecture, which is not expected to be commercially available before 2015 at the earliest. So there will be a bit of a wait before this feature is widely available. But, by the time it happens, Linux should be ready to take advantage of it.