Kernel development
Brief items
Kernel release status
The 4.5 merge window is still open, probably until January 24.

Stable updates: none have been released since December 14. The 4.1.16, 3.14.59, and 3.10.95 updates are in the review process as of this writing; they can be expected on or after January 22.
Quotes of the week
In fact our kernel configuration UI and workflow is still so bad that it's an effort to stay current even with a standalone and working .config, even for experienced kernel developers...
Kernel development news
4.5 merge window part 2
As of this writing, Linus has pulled 8,415 non-merge changesets into the mainline repository for the 4.5 development cycle; 5,300 of those have come in since last week's summary. Recent merge-window history (12,092 patches for 4.2, 10,756 for 4.3, 11,528 for 4.4) suggests that we probably have some merging to go still; a quick look at linux-next suggests that there is still a fair amount of unmerged work in the ARM tree in particular. It is probably fair to say, though, that the bulk of the significant features that we will see in 4.5 are in place now.

The most significant of those features include:
- There is a new restriction on access to memory via /dev/mem:
it can no longer access ranges of memory that have been claimed by a
device driver. The specific purpose is to protect non-volatile memory
arrays, which, due to their size, are relatively easy to hit by
accident, but there are other advantages as well. Someday, perhaps,
/dev/mem will go away entirely, but there are still a few
things that use it now. Note that the first 1MB of memory is
unaffected by this restriction; see the
commit changelog for some more information.
- The kernel's persistent-memory support has, until now, lacked the
ability to properly support direct I/O and DMA to persistent memory.
That has changed in 4.5 with the
merging of proper support for page structures backing up
persistent-memory arrays.
- The libnvdimm (non-volatile memory) layer has gained a bad-block
management layer borrowed from the MD RAID code.
- The XFS filesystem now performs checksum validation of all log entries
before applying them during recovery. That should greatly reduce the
chance of applying corrupted data.
- There is now more extensive accounting of kernel memory allocated via
the slab allocators. At the user level, users will see various kernel
allocations charged against their memory-control-group limits. At the
kernel level, the new SLAB_ACCOUNT and __GFP_ACCOUNT
flags are used to mark allocations that should be charged in this
way. Among others, mm_struct, vm_area_struct,
dentry, and inode structures are all tracked now.
- As described in this article, it is
now possible to increase the range of randomness used for
address-space layout randomization. That might increase the security
of the system, at the possible cost of making huge allocations fail.
- The MADV_FREE option to
madvise(), which has been under development for some
time, has finally been merged. MADV_FREE allows an
application to mark memory that it won't need immediately; the kernel
can then reclaim that memory preferentially if resources are tight.
- User-space mode-setting support, deprecated for years, has finally
been removed from the Radeon driver. With luck, all users have long
since switched to kernel mode-setting.
- New hardware support includes:
- Audio:
Cirrus Logic CS47L24 codecs,
Imagination Technologies audio controllers,
Rockchip rk3036 Inno codecs,
Dialog Semiconductor DA7217 and DA7218 audio codecs,
Texas Instruments pcm3168a codecs,
Pistachio SoC internal digital-to-analog converters,
Realtek RT5616 and 5659 codecs, and
AMD audio coprocessors.
- Graphics:
Panasonic VVX10F034N00 1920x1200 video mode panels and
Sharp LS043T1LE01 qHD video mode panels.
Notably, the "Etnaviv" driver, a free driver for Vivante GPUs,
has finally been merged. The AMD driver has gained PowerPlay
power-management support.
- Industrial I/O:
Memsic MXC6255 orientation-sensing accelerometers,
TI Palmas general-purpose analog-to-digital converters (ADCs),
TI ADS8688 ADCs,
TI INA2xx power monitors,
Freescale IMX7D ADCs,
Freescale MMA7455L/MMA7456L accelerometers,
Maxim MAX30100 heart rate and pulse oximeter sensors, and
AMS iAQ-Core VOC sensors.
- Input:
EETI eGalax serial touchscreens and
Technologic TS-4800 touchscreens.
- Miscellaneous:
STMicroelectronics STM32 DMA controllers,
Mediatek MT81xx SPI NOR flash controllers,
Ingenic JZ4780 NAND flash controllers,
HiSilicon SAS SCSI adapters,
TI LM363X voltage regulators,
TI TPS65086 power regulators,
Powerventure Semiconductor PV88060 and PV88090 voltage regulators,
Cirrus Logic Fractional-N Clock synthesizer/multipliers,
Qualcomm MSM8996 clock controllers,
Epson RX8010SJ realtime clocks, and
Intel P-Unit mailboxes.
- USB:
Mediatek MT65xx host controllers,
Renesas USB3.0 peripheral controllers,
Renesas R-Car generation 3 USB 2.0 PHYs,
Hisilicon hi6220 USB PHYs, and
Moxa UPORT 11x0 serial hubs.
- Watchdog: CSR CSRatlas7 watchdogs, Technologic TS-4800 watchdogs, Alphascale ASM9260 watchdogs, Zodiac RAVE watchdog timers, Sigma Designs SMP86xx/SMP87xx watchdogs, and Mediatek SoC watchdogs.
Changes visible to kernel developers include:
- A new version of the media controller API has been merged. As Mauro
Carvalho Chehab described this work in the pull request: "The goal is
to improve the media controller to allow proper support for other
types of Video4Linux devices (radio and TV ones) and to extend the
media controller functionality to allow it to be used by other
subsystems like DVB, ALSA and IIO." Parts of the user-space API remain
disabled, though, until 4.6 so some final points can be worked out.
- The extensive huge-page reference counting patch set has been merged.
The end goal (supporting transparent huge pages in the page cache) has
not yet been reached, though.
The most likely day for the closing of the merge window remains January 24. As usual, we'll cover any final changes that come in through this merge window in next week's edition.
Direct I/O and DMA for persistent memory
The last year or so has seen a great deal of work toward improving the kernel's support of persistent-memory (or "nonvolatile-memory") devices. Persistent memory looks like regular memory to the system in a number of ways, but it differs in others, most notably in that its contents persist across reboots and power cycles. The upcoming 4.5 kernel contains some core memory-management changes that address one of the biggest items left on the "to do" list for persistent memory: support for DMA and direct I/O. Getting there was a multi-step process, though.

One of the biggest areas of disagreement with regard to persistent-memory support has been whether that memory should be represented in the system memory map. Doing so means setting aside considerable amounts of memory for a page structure representing each persistent-memory page; with large persistent-memory arrays, those structures could occupy a significant percentage of the system's RAM, or not fit at all. But the lack of page structures makes persistent memory invisible to much of the low-level memory-management code and, as a result, rules out operations like direct I/O. Since some of the prominent use cases for persistent memory (serving as a fast cache for a huge disk array, for example) require DMA and direct I/O, this was seen as a significant problem.
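To get a feel for the scale of that overhead, the arithmetic can be sketched in a few lines of C. The helper name below is invented for illustration, and the 64-byte page structure and 4KB page size used in the note that follows are assumptions typical of x86-64 rather than figures taken from the discussion.

```c
#include <stdint.h>

/*
 * Rough estimate of the memory consumed by page structures.
 * Hypothetical helper: one structure of struct_size bytes is
 * assumed for every page_size bytes of memory described.
 */
static uint64_t sketch_page_struct_bytes(uint64_t mem_bytes,
                                         uint64_t page_size,
                                         uint64_t struct_size)
{
    return (mem_bytes / page_size) * struct_size;
}
```

Under those assumptions, a 1TB persistent-memory array would need 16GB of page structures, which is more RAM than many systems have in total.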
The solution, merged for 4.5, is evolved from the approaches described here in September 2015. At that point, there was a significant push to use page-frame numbers (PFNs) as a replacement for page structures in much of the memory-management subsystem. If all the memory in the system is seen as a huge array, a PFN is simply an index into that array for a specific page. Any memory that is addressable by the CPU will have an associated PFN, so using the PFN seems like a logical way to refer to that page. There is a catch, though: struct page, beyond just identifying a page, also contains crucial information about how that page is being used. So it's not possible to do without struct page entirely.
The approach found in the 4.5 kernel, implemented by Dan Williams, starts with some of the PFN-based ideas that have been passed around in the past, but does not stop there. There is a new type to represent a PFN and some associated information:
typedef struct {
    unsigned long val;
} pfn_t;
Adding this type required renaming a couple of pfn_t types already existing in other parts of the kernel. The val member contains the actual PFN, but the high-order bits are used to encode a few extra flags. Two of them, PFN_SG_CHAIN and PFN_SG_LAST, are meant to be used with scatter-gather lists for DMA that use PFNs rather than pointers to page structures, but the scatter-gather part has not (yet) been merged, so these flags are unused as of this writing. Beyond that, PFN_DEV indicates a page frame stored on special "device" memory that may not have an associated page structure, and PFN_MAP indicates that a page structure does, in fact, exist.
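The idea of packing flags into the high-order bits of a frame number can be illustrated with a small user-space sketch. Everything here is an assumption made for the example: the sketch_ names, the use of a 64-bit value, and the specific bit positions are not taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for pfn_t; not the kernel's definition. */
typedef struct {
    uint64_t val;
} sketch_pfn_t;

/* Hypothetical bit assignments: the top four bits carry flags. */
#define SKETCH_PFN_SG_CHAIN (UINT64_C(1) << 63)
#define SKETCH_PFN_SG_LAST  (UINT64_C(1) << 62)
#define SKETCH_PFN_DEV      (UINT64_C(1) << 61)
#define SKETCH_PFN_MAP      (UINT64_C(1) << 60)
#define SKETCH_PFN_FLAGS_MASK \
    (SKETCH_PFN_SG_CHAIN | SKETCH_PFN_SG_LAST | \
     SKETCH_PFN_DEV | SKETCH_PFN_MAP)

/* Combine a raw frame number with a set of flag bits. */
static inline sketch_pfn_t sketch_pfn_make(uint64_t pfn, uint64_t flags)
{
    sketch_pfn_t p = {
        (pfn & ~SKETCH_PFN_FLAGS_MASK) | (flags & SKETCH_PFN_FLAGS_MASK)
    };
    return p;
}

/* Strip the flags to recover the frame number itself. */
static inline uint64_t sketch_pfn_number(sketch_pfn_t p)
{
    return p.val & ~SKETCH_PFN_FLAGS_MASK;
}

/*
 * Ordinary memory (no DEV flag) is assumed to have a page structure;
 * device memory has one only if it is also marked with MAP.
 */
static inline bool sketch_pfn_has_page(sketch_pfn_t p)
{
    return !(p.val & SKETCH_PFN_DEV) || (p.val & SKETCH_PFN_MAP);
}
```

Wrapping the value in a structure, rather than using a bare integer, means the compiler will complain if a flagged value is passed where a plain frame number is expected.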
The kernel has had the ability to (easily) create page structures for persistent memory since 4.3, when devm_memremap_pages() was introduced by Christoph Hellwig:
/* The v4.3 version of this function */
void *devm_memremap_pages(struct device *dev, struct resource *res);
This function will map the region described by res into the kernel's virtual address space, allocating page structures for it along the way. It is not a complete solution to the problem, though, for a couple of reasons. One is that it lacks the reference-counting support needed to ensure that a persistent-memory device doesn't disappear while it is in use. The other, of course, is the same old problem: for a huge persistent-memory array, there just isn't room in RAM for all of those page structures.
The lack of reference counting matters for use cases like DMA and direct I/O; it would not do to have some persistent memory (or the mapping to it) disappear in the middle of an operation. In 4.5, this problem is fixed by requiring persistent-memory drivers to provide a percpu_ref structure to go with any memory array that is mapped into the kernel's address space. A pointer to this reference counter is then stored (with a level of indirection) in the already overloaded page structure; since persistent-memory page structures will never appear in the memory-management subsystem's LRU lists, the space occupied by the lru field is available for this purpose.
The 4.5 work introduces a new flag, _PAGE_DEVMAP, which is stored in the page-table entry itself when persistent memory is mapped into a process's address space. Code that creates references to this memory, get_user_pages() for example, will see that flag and respond by incrementing the percpu_ref counter associated with the persistent-memory array. As long as that counter remains elevated, it will not be possible to remove the memory from the system.
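The effect of that counter can be modeled in user space with a single atomic integer. This is a deliberately simplified sketch: the real percpu_ref keeps per-CPU counters for scalability, and the sketch_ref names are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified, single-counter model of a "may I remove this?" reference. */
struct sketch_ref {
    atomic_long count;
};

/* The owner (the driver, in this model) starts with one reference. */
static void sketch_ref_init(struct sketch_ref *r)
{
    atomic_init(&r->count, 1);
}

/*
 * Take a reference only if the count has not already dropped to zero,
 * so no new user can appear once teardown has completed.
 */
static bool sketch_ref_tryget(struct sketch_ref *r)
{
    long old = atomic_load(&r->count);

    while (old > 0) {
        /* On failure, old is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if it was the last one. */
static bool sketch_ref_put(struct sketch_ref *r)
{
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

In this model, a driver that wants to remove the array drops its initial reference; the memory only becomes removable once every outstanding user, such as an in-flight direct-I/O operation, has dropped its reference as well.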
The other problem — the size of all those page structures — has an obvious solution: store those structures in the persistent-memory array itself. This solution is not ideal; page structures can change frequently, which mixes poorly with the relatively high cost of writing to persistent memory. But it is better than having no page structures at all. So, in 4.5, drivers for persistent memory can set aside a chunk of each array for the storage of page structures. That is done by filling in a vmem_altmap structure:
struct vmem_altmap {
    const unsigned long base_pfn;
    const unsigned long reserve;
    unsigned long free;
    unsigned long align;
    unsigned long alloc;
};
The base_pfn field points to the base of the array. A driver can keep some of the memory for its own purposes by storing the amount in the reserve field; the free field should be set to the number of pages that can be used to hold page structures. A simple allocator built into the memory-management code will then use those pages (tracking them with the alloc field) to create page structures when mapping the array into kernel space.
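A simple allocator of that kind might look something like the following sketch. It is not the kernel's code: the sketch_altmap structure mirrors only the fields described above, alignment handling is left out, and the function name is hypothetical.

```c
#include <stdbool.h>

/* Cut-down stand-in for struct vmem_altmap (no align field). */
struct sketch_altmap {
    unsigned long base_pfn;  /* first page frame of the array */
    unsigned long reserve;   /* pages the driver keeps for itself */
    unsigned long free;      /* pages usable for page structures */
    unsigned long alloc;     /* pages handed out so far */
};

/*
 * Bump allocator: hand out nr_pages page frames from the region that
 * follows the driver's reserved pages. On success, *pfn is set to the
 * first frame of the allocation. Assumes alloc never exceeds free.
 */
static bool sketch_altmap_alloc(struct sketch_altmap *am,
                                unsigned long nr_pages,
                                unsigned long *pfn)
{
    if (nr_pages > am->free - am->alloc)
        return false;

    *pfn = am->base_pfn + am->reserve + am->alloc;
    am->alloc += nr_pages;
    return true;
}
```

Since page structures for an array are created once at mapping time and torn down all together, a bump allocator with no per-allocation free operation is enough for this job.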
All of these additions come together in the 4.5 version of devm_memremap_pages():
void *devm_memremap_pages(struct device *dev, struct resource *res,
                          struct percpu_ref *ref, struct vmem_altmap *altmap);
With this infrastructure in place, a persistent-memory driver can easily set up an array that is mapped into kernel memory and which has page structures behind it. That allows functions like get_user_pages() to work, and, as a consequence, direct I/O and DMA also work. An additional benefit (from a bit more work) is that huge-page mappings into persistent memory work properly.
Without doubt, work on supporting persistent memory will continue for some time; this memory represents a major change in how our systems work. But, as of the 4.5 kernel, it would appear that the important low-level pieces are in place. What remains now is figuring out the best ways to actually use terabytes of directly connected persistent memory, both within the kernel and at the application level. It will be interesting to see what developers come up with in the next few years.
Heading toward 2038-safe filesystems
It is a little hard to call the "year 2038" problem looming, given that it is still nearly 22 years off. But Linux is installed in lots of places where it may continue running past 2038—particularly in embedded systems. Kernel developers have done a fair amount of work to address the problem, much of which we have covered along the way. Attention is now turning to preparing the virtual filesystem (VFS) layer, along with all of the myriad filesystems supported by Linux, for 2038.
In a nutshell, the problem is that the representation of time on a Linux system—inherited from the original Unix systems—uses a 32-bit signed integer, at least on 32-bit systems. It stores the number of seconds since January 1, 1970, which is known as the "epoch". That value will wrap in January 2038. The fallout from the year 2000 problem was far smaller than expected, but that was largely a user-space issue. The year 2038 problem will affect all existing kernels, so getting ahead of the curve is certainly prudent.
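The wrap itself is easy to demonstrate. The sketch below uses invented helper names and plain integer arithmetic so that it does not depend on the width of the platform's own time_t; it assumes a two's-complement system, where converting an out-of-range unsigned value back to int32_t wraps (as it does with GCC and Clang).

```c
#include <stdint.h>

#define SKETCH_SECONDS_PER_DAY 86400

/*
 * Advance a signed 32-bit seconds counter by one. The addition is done
 * in unsigned arithmetic because signed overflow is undefined in C.
 */
static int32_t sketch_next_second_32(int32_t now)
{
    return (int32_t)((uint32_t)now + 1u);
}

/* Whole days since the epoch, for non-negative timestamps. */
static int32_t sketch_days_since_epoch(int32_t t)
{
    return t / SKETCH_SECONDS_PER_DAY;
}

/* Seconds elapsed within the current day, for non-negative timestamps. */
static int32_t sketch_second_of_day(int32_t t)
{
    return t % SKETCH_SECONDS_PER_DAY;
}
```

The largest representable value, INT32_MAX, falls 24,855 days and 11,647 seconds after the epoch, which is 03:14:07 UTC on January 19, 2038. One second later the counter becomes INT32_MIN, a date in December 1901.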
There are a number of facets to the filesystem side of the problem. Filesystems often store timestamps for each file (Unix filesystems store three), typically in 32-bit formats. That means those filesystems will need to change to a larger-sized timestamp at some point, but they will also need to be able to handle today's already-on-disk filesystems with their 32-bit timestamps. In addition, filesystems may want to handle on-disk timestamps in their own way, without converting to the 64-bit timestamp that is being used internally in the kernel moving forward.
The VFS layer, on the other hand, has its own timestamp handling for its in-memory inodes and other structures. It will need to change too, but there are various carts and horses that need to be aligned correctly before that can happen.
Deepa Dinamani recently posted a patch set that made an attempt at solving the problem in the VFS layer. Somewhat confusingly to some, it also included patches for some filesystems to try to show the scope of the changes needed. That part of the patch set had not been compile-tested, which was part of the confusion.
But the first seven (of fifteen) patches targeted VFS. Currently VFS uses a struct timespec to represent time. That structure suffers from the year 2038 problem because it uses a time_t for seconds, which is 32 bits on some systems. It also uses a long to store nanoseconds, which can vary in size as well. That means the structure has a different size on different systems. The replacement for that in a year-2038-compatible world is the struct timespec64, which has a 64-bit seconds field, but still has a long for nanoseconds, so it still will change size between systems.
Dinamani proposed using a new struct inode_timespec that is defined as a 64-bit seconds field and a 32-bit nanoseconds field everywhere. It is mainly introduced to prevent the need for a big "flag day" patch that converts everything to a timespec64 at once. She added macros to access the fields so that eventually inode_timespec could be turned into a timespec64. The inode_timespec would be aligned so that it only used 12 bytes, rather than 16 on 64-bit systems. But Dave Chinner called that a premature optimization.
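The size difference under discussion can be seen by laying out the two structures side by side. These are stand-ins written for this example, not the definitions from the patch set, and the packing attribute shown is a GCC/Clang extension chosen here as one way to get a 12-byte layout.

```c
#include <stdint.h>

/* Shaped like timespec64: the nanoseconds field is a long. */
struct sketch_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/*
 * Shaped like the proposed inode_timespec: fixed-width fields, packed
 * down to 4-byte alignment so that no tail padding is added.
 */
struct sketch_inode_timespec {
    int64_t tv_sec;
    int32_t tv_nsec;
} __attribute__((packed, aligned(4)));
```

With this layout, sizeof(struct sketch_inode_timespec) is 12, while the timespec64-shaped structure is typically 16 bytes on a 64-bit system. An inode carries three timestamps, so the saving amounts to 12 bytes per in-memory inode.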
As the conversation continued, there was a clear difference of opinion about how to attack the whole problem. The memory savings for 12 versus 16 bytes for timestamps in inodes in memory may not be that significant, as Arnd Bergmann pointed out. 32-bit systems will need larger inodes to handle post-2038 timestamps, so it is really a matter of how much they grow. Bergmann copied other architecture mailing lists to see if there were strong feelings about it, but so far there have been no replies.
But Dinamani also wanted feedback on other parts of the patch set. She summarized some of the outstanding questions that needed to be addressed before the problem can be solved. Essentially, there is a tension between the need to move everything to timespec64 and how that can be done without disrupting filesystem and VFS development. Dinamani sees the transitional inode_timespec as something of a necessary evil that will be eliminated once all of the filesystems have been converted.
Chinner, on the other hand, thinks that moving directly to timespec64 makes more sense. Both agreed that there are some preliminary steps that should be taken, such as ensuring that timestamps are range-checked and clamped to reasonable values on their way into and out of filesystems and VFS. There is also the matter of eliminating the use of the CURRENT_TIME macro in filesystems in favor of current_fs_time(), which references the filesystem superblock so that the proper time granularity and range can be enforced. Beyond that, the approaches diverge.
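The range checking that both sides agree on might look something like this sketch. The structure and function names are hypothetical; the idea is only that each filesystem describes the range its on-disk format can hold, and that timestamps are clamped to that range on the way in.

```c
#include <stdint.h>

/* Hypothetical description of what a filesystem can store. */
struct sketch_fs_time_range {
    int64_t min_sec;
    int64_t max_sec;
};

/* Clamp a 64-bit seconds value to the filesystem's supported range. */
static int64_t sketch_clamp_seconds(int64_t sec,
                                    const struct sketch_fs_time_range *r)
{
    if (sec < r->min_sec)
        return r->min_sec;
    if (sec > r->max_sec)
        return r->max_sec;
    return sec;
}
```

A filesystem with signed 32-bit on-disk timestamps would set the range to INT32_MIN through INT32_MAX; a post-2038 timestamp would then be stored as the latest representable time rather than wrapping around to 1901.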
Rather than go through an intermediate inode timestamp type, so that filesystems can be converted over time, Chinner would like to turn that on its head a bit. Start by ensuring that all filesystems that use timespec internally call a (for now empty) conversion function to change them to and from the VFS representation. That would eliminate all of the macro changes that were needed when using inode_timespec:
Internally, time handling in those filesystems could remain unchanged; it would just be a change at the boundary between the filesystem and the VFS. That would isolate the changes that need to be done for the VFS from those that need to be done for the filesystems. Chinner said that all filesystems will need an audit to determine what they need to support post-2038 timestamps, so this decoupling is useful:
Filesystems that have intermediate timestamp formats such as Lustre, NFS, CIFS, etc will need conversion at the vfs/filesystem entry points, and their internals will remain unchanged. Fixing the internals is outside the scope fo the VFS change - the 64 bit VFS inode support stops at the VFS inode/filesystem boundary.
But Dinamani and Bergmann are leery of an enormous patch set that touches lots of code all over the place. It is both "ugly and fragile" as Bergmann put it, though he suggested at least investigating that path. Both he and Dinamani have made various attempts to find the right approach and they have both run into various walls. Chinner's suggestion of how to handle a particular case for the FAT filesystem is not workable, they said. Bergmann elaborated:
So there seems to be an impasse at this point. Dinamani said that she would try to convert an example filesystem using the two different methods for comparison purposes. Hopefully that will help point the way toward a solution that leads to as little disruption as possible. A change of this sort is always going to lead to some upheaval, but finding a way to reduce it as much as possible will be good. So far, Dinamani and Bergmann haven't quite found the right approach—or haven't yet convinced Chinner—but it is good to see that kernel developers are thinking about this.
Page editor: Jonathan Corbet
