Kernel development
Brief items
Kernel release status
The current development kernel is 3.9-rc8, released on April 21. "Yes, I was really hoping (and originally planning) to release 3.9 final this weekend, but we had enough issues that I just didn't feel comfy about it. It was borderline, and none of the issues were huge, and maybe I could have called this just 3.9 and opened the merge window, but hey, another week won't hurt."
Stable updates: 3.6.11.2 was released on April 19, and 3.5.7.11 came out on April 22.
The 3.8.9, 3.4.42, and 3.0.75 updates are in the review process as of this writing; they can be expected on or after April 25.
Quote of the week
yes "" | make menuconfig
is a bad idea.
The danger of narrow bitmasks, ~, and wide types
Linus was recently playing with the sparse static analysis tool when he hit upon an interesting and possibly surprising source of bugs in kernel code. In truth, the problem should not be surprising once it's understood, but it still exists as a hazard for the unwary.
Imagine code that looks like this:
u32 bitmask = 0xff;
u64 value;
value &= ~bitmask;
One would expect this code to clear the bottom eight bits in value, and, indeed, it does. But the bitwise negation operator ~ is applied to a 32-bit type, so ~bitmask evaluates to 0xffffff00 — the upper 32 bits, seen from a 64-bit point of view, are zero. So the logical AND operation ends up clearing those upper 32 bits as well, which is almost certainly not the desired result.
Once one thinks about it, this behavior makes some sense, but it still seems like a likely source of bugs in the kernel. And, indeed, looking for this pattern has quickly turned up bugs in the MIPS code (which was what started Linus looking for these problems), in perf, and in the ext4 filesystem. Chances are that others exist as well. Linus is extending sparse to flag potential problems, but it is also a good idea for developers to be aware of this trap in general.
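The trap can be reproduced outside the kernel. What follows is a minimal user-space sketch, not kernel code: it assumes a platform where int is 32 bits wide, uses the <stdint.h> types uint32_t and uint64_t as stand-ins for the kernel's u32 and u64, and the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Buggy pattern: ~ is evaluated at the 32-bit width of bitmask, and the
 * result is then zero-extended to 64 bits for the AND. The upper 32 bits
 * of value are cleared along with the bits named in bitmask.
 */
uint64_t clear_bits_narrow(uint64_t value, uint32_t bitmask)
{
    return value & ~bitmask;
}

/*
 * One possible fix: widen the mask to 64 bits before inverting it, so
 * that the inverted mask has ones in its upper 32 bits.
 */
uint64_t clear_bits_wide(uint64_t value, uint32_t bitmask)
{
    return value & ~(uint64_t)bitmask;
}
```

With a value of 0x123456789abcdeff and a mask of 0xff, the first function returns 0x9abcde00, while the second returns 0x123456789abcde00. Declaring the mask with a 64-bit type in the first place would be another way to avoid the problem.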
Kernel development news
LFCS: Preparing Linux for nonvolatile memory devices
Since the demise of core memory, there has been a fundamental dichotomy in data storage technology: memory is either fast and ephemeral, or slow and persistent. The situation is changing, though, and that leads to some interesting challenges for the Linux kernel. How will we adapt to the coming world where nonvolatile memory (NVM) devices are commonplace? Ric Wheeler led a session at the 2013 Linux Foundation Collaboration Summit to discuss this issue.
In a theme that was to recur over the course of the hour, Ric noted that we have been hearing about NVM for some years. NVM devices have a number of characteristics that distinguish them from other technologies. They are byte-addressable, like ordinary RAM and unlike storage devices, which have always been block-oriented. They are persistent: they do not lose state when the power goes away. They are comparable to ordinary memory in speed, and also in price, so they will not be as large as hard drives anytime soon. They also are not yet available for most of us to play with at any reasonable price.
Early solid-state devices looked a lot like disks; they used normal protocols and were not so fast that the system could not keep up with them. That situation changed, though, with the next wave of devices, which were usually connected via PCI Express (PCIe). There is a lot of code in the I/O stack that sits between the system and the storage; as storage devices get faster, the overhead of all that code is increasingly painful. Much of that code is not useful in this situation, since it was designed for high-latency devices. As a result, Linux still can't get full performance out of bus-connected solid-state devices.
As an aside, Ric had a few suggestions to offer to anybody working to tune a Linux system to work with existing fast block devices. The relevant parameters are found under /sys/block/dev/queue, where dev is the name of the relevant block device (sda, for example). The rotational parameter is the most important; it should be set to zero for solid-state devices. The CFQ I/O scheduler (selected with the scheduler attribute) is not the best for solid-state devices; the deadline scheduler is a better choice. It is also important to pay attention to the block sizes of the underlying device and align filesystems accordingly; see this paper by Martin Petersen [PDF] for details.
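The two settings mentioned above could be applied from a small program along the following lines. This is a sketch under stated assumptions rather than a recommended tool: the helper names are hypothetical, writing to sysfs requires root privileges, and the scheduler name "deadline" is taken from the text above; the names a given kernel accepts can be read back from the scheduler attribute.

```c
#include <stdio.h>

/*
 * Build "/sys/block/<dev>/queue/<attr>" in buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
int queue_attr_path(char *buf, size_t len, const char *dev, const char *attr)
{
    int n = snprintf(buf, len, "/sys/block/%s/queue/%s", dev, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a string to the file at path. Returns 0 on success, -1 on error. */
int write_attr(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)   /* buffered write errors show up at close time */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Mark the device as non-rotational and select the deadline scheduler.
 * dev is a block device name such as "sda".
 */
int tune_for_ssd(const char *dev)
{
    char path[256];

    if (queue_attr_path(path, sizeof(path), dev, "rotational") != 0 ||
        write_attr(path, "0") != 0)
        return -1;
    if (queue_attr_path(path, sizeof(path), dev, "scheduler") != 0 ||
        write_attr(path, "deadline") != 0)
        return -1;
    return 0;
}
```

A call such as tune_for_ssd("sda") would attempt both writes and return -1 if either one fails. Settings made this way do not persist across reboots.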
Back to the topic at hand, Ric noted that, along with all the technical challenges, there are some organizational difficulties. Kernel developers tend to be quite specialized: at the storage layer, SCSI and SATA drives are handled by different groups. The block layer itself is maintained by a separate, very small group. There is yet another group for each filesystem, and we have a lot of filesystems. All of these groups will have to work together to make NVM devices work optimally on Linux systems.
Crawling first
Making the best use of NVM devices will require new programming models and new APIs. That kind of change takes time, but the hardware could be arriving soon. So, Ric said, we need to make them work as well as we can within the existing APIs; this is, he said, the "crawl phase." In this phase, NVM devices will be accessed through the same old block API, much like solid-state devices are now. The key will be to make those APIs work as quickly as possible. It is a shame, he said, but we need a block driver that will turn this cool technology into something boring. There is also a need for a lot of work to squeeze overhead out of the block I/O path.
Ted Ts'o suggested that, while it is hard to get applications to move to new APIs, it is easier to make libraries like sqlite use them. That should bring improved performance to applications with no code changes at all. It was pointed out, though, that users are often reluctant to even recompile applications, so it could still take quite a while for performance improvements to be seen by end users.
The current "crawl" status is that block drivers for NVM devices are being developed now. We're also seeing caching technologies that can use NVM devices to provide faster access to traditional storage devices. The dm-cache device mapper target was merged for 3.9, and the bcache mechanism is queued for 3.10. Ric said that various vendor-specific solutions are under development as well.
Getting to the "walk" phase involves making modifications to existing filesystems. One obvious optimization is to move filesystem journals to faster devices; frequently-used metadata can also be moved. Getting the best performance will require reworking the transaction logic to get rid of a lot of the currently-existing barriers and flush operations, though. At the moment, Btrfs has a bit of "dynamic steering" capability that is a start in that direction, but there is still a lot that needs to be done.
It is also time to start thinking about the creation of byte-level I/O APIs for new applications to use; the developers are currently looking for ideas about how applications would actually like to use NVM devices, though. Ric mentioned that the venerable mmap() interface will need to be looked at carefully and "might not be salvageable." Application developers will need to be educated on the capabilities of NVM devices, and hardware needs to be put into their hands.
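For reference, the closest thing to byte-level access to persistent data that applications have today is a shared file mapping followed by an explicit flush. The sketch below shows that existing POSIX pattern only; it is not one of the new APIs under discussion, the function name persist_bytes() is hypothetical, and nothing here is specific to NVM hardware.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store len bytes at byte offset off in the file at path by way of a
 * shared mapping, then ask the kernel to push the change to storage.
 * The file is created or extended as needed, never shrunk.
 * Returns 0 on success, -1 on any error.
 */
int persist_bytes(const char *path, off_t off, const void *src, size_t len)
{
    struct stat st;
    size_t map_len;
    char *base;
    int fd, rc;

    if (off < 0 || len == 0)
        return -1;
    map_len = (size_t)off + len;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 ||
        (st.st_size < (off_t)map_len && ftruncate(fd, (off_t)map_len) != 0)) {
        close(fd);
        return -1;
    }

    base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (base == MAP_FAILED)
        return -1;

    memcpy(base + off, src, len);   /* an ordinary store, no write() call */
    rc = msync(base, map_len, MS_SYNC);
    munmap(base, map_len);
    return rc == 0 ? 0 : -1;
}
```

The store itself is a plain memory copy; durability still depends on the msync() call, which is one example of the kind of flush operation that a byte-level API for NVM devices would have to account for in some way.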
That last part may prove difficult. Over the course of the session, a number of participants complained that these devices have been "just around the corner" for the last decade, but they never actually materialize. There is a bit of a credibility problem at this point. As Tejun Heo said, nothing is concrete; there is no way to know what the performance characteristics of these devices will be or how to optimize for them. The word is that this situation will change, with developers initially getting hardware under non-disclosure agreements. But, for the moment, it's hard to know what is the best way to support this class of hardware.
Eventually, Ric said, we'll arrive at the "run phase," where there will be new APIs at the device level that can be used by filesystems and storage. There will be new Linux filesystems designed just for NVM devices (in a later session, we were told that Fusion-IO had such a filesystem that would be released at some unspecified time in the future). The Storage Networking Industry Association has a working group dedicated to these issues. All told, the transition will take a while and will be painful, Ric said, much like the move to 64-bit systems.
Concerns
The subsequent discussion covered a number of topics, starting with a simple question: why not just use NVM devices as RAM that doesn't forget its contents when the power goes out? One problem with doing things that way is that, while NVM may perform like RAM, other aspects — such as lifespan — may be different. Excessive writes to an NVM device may reduce its useful lifetime considerably.
There was some talk about the difficulty of getting support for new types of devices into Linux in general. The development community goes way beyond the kernel; there are many layers of projects involved in the creation of a full system. This community seems mysterious to a lot of vendors. It can take many years to get features to the point that users can actually take advantage of them. An example that was raised was parallel NFS, which has been in development for at least ten years, but we're only now getting our first enterprise support — and that is client support only.
Another point of discussion was replication of data. With ordinary block devices, replication of data across multiple devices is relatively easy. With NVM devices that are directly accessed by user space, instead, the "interception point" is gone, so there is no way for the kernel to transparently replicate data on its way to persistent storage. It was pointed out that, since applications are going to have to be changed to take advantage of NVM devices anyway, it makes sense to add replication features to the new APIs at the same time.
The issue of how trustworthy these devices are came up briefly. Applications are not accustomed to dealing with memory errors; that may have to change in the future. So the new APIs will need to include features for checksumming and error checking as well. Boaz Harrosh pointed out that, until we know what the failure characteristics of these new devices are, we will not be able to defend against them. Martin Petersen responded that the hardware interfaces to these devices are intended to be independent of the underlying technology. There are, it seems, several technologies competing for a place in the "post-flash" world; the interfaces, hopefully, will hide the differences between those technologies.
In summary, we seem to be headed toward an interesting new world, but it's still not clear what that world will look like or when it will arrive. Chances are that we will have good kernel support for NVM devices by the time they are generally available, but higher-level software may take a while to catch up and take full advantage of this new class of hardware. It should be an interesting transition.
[Your editor would like to thank the Linux Foundation for assistance with travel to the event.]
The 2013 Linux Storage, Filesystem, and Memory Management Summit
The 2013 Linux Storage, Filesystem, and Memory Management Summit was held April 18 and 19 in San Francisco, California, immediately after the Linux Foundation's Collaboration Summit. This page will gather the coverage of this event, which was split into three separate tracks.
Plenary sessions
The following sessions involved the entire group of nearly 100 developers:
- Lock scaling: fine-grained locking
is often seen as the path to greater scalability, but what happens
when increasing the number of locks makes the system less scalable
instead?
- Page forking: might the performance
problems associated with stable pages
be better addressed by a switch to an entirely different solution to
the problem of implementing stable writes within filesystems?
- The shrinker API is the source of
a number of problems in memory management and beyond; here, those
problems were discussed in the context of a proposal for an improved
shrinker API.
- A storage technology update. What can
we expect from upcoming storage devices, and how will the kernel
handle them?
- FUSE and cloud storage; how can we make FUSE work better?
MM-only sessions
The memory management developers had a number of sessions where they closed themselves up in a tiny, refrigerated room for MM-specific discussions. Reports from these sessions include:
- mmap_sem and filesystems: complexities
around the use of the memory management semaphore are creating pain
for filesystem developers.
- In-kernel compression: a seeming
resolution to the ongoing debate between zswap and zcache.
- Various short topics including
hardware-initiated paging from coprocessors, process exit times, and
volatile ranges.
- Writeback latency: the inevitable
writeback discussion was focused on a handful of specific problems in
need of solution in the near future.
- Toward better swapping, especially when
the available swap devices have different performance characteristics.
- Improving the out-of-memory killer:
will we ever find a better way to kill off processes when the system
runs out of memory?
- Soft reclaim: making reclaim in control groups work better — though universal agreement on just how things should behave does not yet exist.
Filesystem and Storage sessions
The bulk of the non-plenary sessions were for both Filesystem and Storage developers. Here are the reports from those discussions:
- Storage data integrity: What are the
right interfaces for handling storage data integrity information?
- Unit attentions and thin provisioning
thresholds: When a storage array hits its "soft" threshold, it will
generate a "unit attention"; what does the kernel need to do to handle
that situation?
- I/O hints: Higher layers can provide
hints to the storage layer about how the stored data will be used and accessed,
but it is not clear what filesystems should do to pass along any hints they
get or to generate some of their own.
- Copy offload: How to support
offloading data copies to servers or storage arrays.
- dm-cache and bcache: the future of two
storage-caching technologies for Linux.
- Error returns: filesystems could use
better error information from the storage layer.
- Storage management: how do we ease the
task of creating and managing filesystems on Linux systems?
- O_DIRECT: the kernel's direct I/O code
is complicated, fragile, and hard to change. Is it time to start
over?
- Reducing io_submit() latency: submitting asynchronous I/O operations can potentially block for long periods of time, which is not what callers want. Various ways of addressing this problem were discussed, but easy solutions are not readily at hand.
Filesystem-only sessions
- NFS status: what is going on in the
NFS subsystem.
- Btrfs status: what has happened with
the next-generation Linux filesystem, and when will it be ready for
production use?
- User-space filesystem servers: what
can the kernel do to support user-space servers like Samba and
NFS-GANESHA?
- Range locking: a proposal to lock
portions of files within the kernel.
- FedFS: where things stand with the creation of a Federated Filesystem implementation for Linux.
Storage-only sessions
- Reducing SCSI latency. The SCSI
stack is having a hard time keeping up with the fastest drives; what
can be done to speed things up?
- SCSI testing. It would be nice to
have a test suite for SCSI devices; after this session, one may well
be in the works.
- Error handling and firmware updates: some current problems with handling failing drives, and how can we do online firmware updates on SATA devices?
Before anybody asks: the taking of the group picture was a somewhat confused event this year, and we were unable to take a picture of our own. So we have no such picture to post at this time.
The Linux Foundation has posted a set of photos from the event, including the group picture.
Page editor: Jonathan Corbet