Kernel development
Brief items
Kernel release status
The current development kernel is 4.4-rc3, released on November 29. Linus said: "I don't think there's anything particularly exciting, although that obviously depends on whether some particular issue ended up affecting you or not. Most of it is pretty tiny random fixups."
Previously, 4.4-rc2 came out on November 22.
Stable updates: none have been released in the last two weeks.
Quotes of the week
I'm still leading with three stupid mistakes over your one though.
Kernel development news
Post-init read-only memory
At the 2015 Kernel Summit, the assembled developers discussed the idea of incorporating more security-hardening patches into the kernel. As part of that effort, it was agreed that taking another look at the out-of-tree grsecurity patches made sense. The first fruit from this work would appear to be the post-init read-only memory patch set from Kees Cook. This work has been received well, but it also highlights some of the difficulties involved with hardening a general-purpose kernel.

The key to a successful exploit is often convincing the kernel to write to an unintended location. See, for example, this recent exploit, which uses a driver bug to overwrite a portion of the vDSO area; that, in turn, enables an attacker to run arbitrary code in kernel mode. One way to defend against such attacks is to minimize, to the greatest extent possible, the memory that the kernel is allowed to write to. A number of techniques, from simply marking data read-only to supervisor-mode access prevention, can be deployed toward that end. There is one class of data, identified by the grsecurity developers, that current techniques overlook, however.
When the kernel boots, it sets up a vast array of data structures describing the hardware it runs on and much more. In many cases, those data structures will never be changed again but, since they are resident in writable memory, they can still be changed by an errant write operation. The post-init read-only memory patch set, as posted by Kees, allows these data structures to be marked with a special __read_only annotation. That will cause them to be placed into a separate ELF section (".data..read_only"). Once the kernel has finished the initialization process, all data found in that section will be marked read-only, never to be changed again. At that point, exploits like the vDSO overwrite linked above will no longer work.
This change seems like an obvious win: unchanging data is marked read-only, blocking known exploits and, perhaps, minimizing the impact of simple bugs as well. As an added bonus, read-only data will be kept together, leading to better cache behavior. It would appear to be an obvious candidate for merging in the near future. That will probably come to pass, but, first, an important question has to be answered: what should happen when the hardware catches an attempt by the kernel to write, after initialization, to memory that has been marked __read_only?
When things go wrong
This question matters because there is a potential hazard whenever a data structure is marked __read_only: the developer involved may have overlooked the one case where, after a rare sequence of events on days with a waxing gibbous moon, that data structure must be changed. Or there may be a case where data structures are modified unnecessarily, perhaps storing data that is already there anyway. Such cases work in current kernels, but would break if the data being written were made read-only. Mathias Krause described one such experience, wherein the system would fail during the resume sequence. As he noted: "Debugging that kind of problem is sort of a PITA, you could imagine."
The ideal solution would be to have the compiler catch attempts to modify __read_only data outside of the initialization sequence, but that is not currently possible. Simply marking the relevant data structures const will not work; those data structures are written to during boot and, as PaX Team pointed out, making them const opens the door to all kinds of surprising, optimization-related behavior from the compiler. Where compilers are involved, surprising behavior is rarely a good thing. As an alternative, Mathias suggested the use of a special-purpose GCC module to detect inappropriate writes. There seems to be agreement that this is a good idea, but no such module exists and it will take time to create one. Holding this patch set until a checker module can be created seems undesirable.
But without such a checker, there will almost certainly be situations where the kernel tries to write to something marked __read_only, either because it was so marked in error or as the result of some other bug. There have been a number of ideas put forward on how such problems could be handled.
The most obvious thing to do is to simply oops the kernel, with the usual results for the process that was running and, perhaps, the machine as a whole. Andy Lutomirski supported this approach, saying: "We failed, we might be under attack, let's oops." The problem with this approach, of course, is that it takes the machine out of commission, possibly with an error that is less than fun to try to track down. Ingo Molnar also worried that the oops information would, in most desktop cases, never be seen by the user and, as a result, would never be reported to developers. That highlights an old problem with presenting such information on desktop systems, but that problem is unlikely to be fixed right now.
The alternative to oopsing the system would be to log the error and somehow try to continue. Ingo suggested simply skipping over the offending instruction and trying to continue, but that idea did not go far; as PaX Team pointed out, simply dropping an intended write operation could create no end of strange problems further down the line and may actually help exploit attempts. Linus suggested, instead, that the kernel could mark the relevant page writable and retry the instruction. That would, of course, remove the read-only protection from that page, but it would allow the system to continue to operate while generating diagnostic information for developers. One would probably not want things to work this way on a production system, but it could be an invaluable option for developers.
The final piece of the puzzle might be to have a kernel command-line operation to disable the read-only marking entirely. That would provide an option to users who run into a bug and need to be able to get their work done until a proper fix is available.
Kees has indicated that his current plan is to take the kill-the-machine approach by default. He has already implemented the command-line option, and said that Linus's "mark the page writable" suggestion would not be difficult to add. So the next version of the patch set should address most of the concerns expressed so far. Getting it merged may prove to be the easy part, though; the task of identifying and marking truly read-only data could be a long and error-prone affair, even when starting with the work that the grsecurity developers have already done. The good news is that this work should make the kernel more secure, provide a (perhaps imperceptible) performance improvement, and turn up a few bugs along the way.
TLS in the kernel
An RFC patch from Dave Watson at Facebook proposes moving the bulk of Transport Layer Security (TLS) processing into the kernel. There are a number of advantages he sees for doing so, but most of the commenters on the patch set seem a bit skeptical about the idea. TLS is, of course, the encryption layer that protects HTTPS and other internet protocols.
The patch set implements RFC 5288 encryption for TLS, which is based on the 128-bit advanced encryption standard (AES) using Galois counter mode (GCM)—also known as "gcm(aes)". That accounts for roughly 80% of the TLS connections that Facebook sees, Watson said. The idea is for the kernel to handle the symmetric encryption and decryption, while leaving the handshake processing to user space. The feature uses the user-space API to the kernel's crypto subsystem, which is accessed via sockets created using the AF_ALG address family.
The basic idea is that an AF_ALG socket and a regular TCP socket are both created. The TCP socket is used to do the handshake with the remote endpoint, which establishes keys and such. The keys (one each for sending and receiving) are passed to the crypto socket using setsockopt(). An operational socket is then created by making an accept() call on the crypto socket; that socket is used in further processing, including setting the initialization vectors (IVs), again one for each direction, using sendmsg() and control messages created using CMSG. In addition, the file descriptor for the TCP socket is passed to the operational socket in a control message; the application will then read and write data from the operational socket. Watson pointed to an example C program that uses the new facility.
That approach has a number of benefits, according to Watson. Using some additional code that was not part of his submission, he said the in-kernel TLS showed 2-7% better performance than the equivalent done in user space. The idea was inspired by some work [PDF] that Netflix did on FreeBSD to improve the performance of TLS. In addition, two other features could benefit from having TLS in the kernel, he said. The kernel connection multiplexer (KCM) needs access to unencrypted data in the kernel, which this would provide; offloading TLS encryption and decryption to NICs would also require TLS framing support in the kernel.
But Hannes Frederic Sowa questioned two of those advantages. He believes that the existing facilities provided by Linux already do less copying than those that FreeBSD provides, so he suggested comparing the in-kernel approach with a user-space implementation using mmap() and vmsplice() on the TCP socket. Beyond that, he noted that kernel developers have been strong opponents of TCP-offloading efforts. In order to provide TLS offloading, a NIC would also need to handle the TCP layer, so it would effectively be doing TCP offloading as well.
Crypto maintainer Herbert Xu was a bit surprised at the approach. While he can see that using AF_ALG makes sense as a way to export TLS functionality to user space, it's not the way he would have approached it.
But Watson noted that handling out-of-band (OOB) data is one reason to not just layer TLS on top of a TCP socket. TLS transfers data beyond just the data being sent by the application, for things like alerts or to change the cipher being used, but a TCP socket lacks an easy way to signal the reception of that kind of data. In Watson's patches, the crypto socket returns an error in that situation and user space can then read the OOB data from the TCP socket if it wishes.
But others also questioned the value of having TLS in the kernel at all. Modern processors provide user-space programs with access to accelerated crypto instructions directly, without a need for kernel intervention. There is some crypto-acceleration hardware out there, where there might be some benefit to having TLS in the kernel, but it has mostly fallen by the wayside because of better processor support for crypto. As Sowa put it:
"Since processors provide aesni and other crypto extensions as part of their instruction set architecture, this, of course, does not make sense any more."
Overall, it looks like it will take some more convincing arguments before putting TLS in the kernel will be seriously considered. For some specialized situations, it might make sense to do so, but even the limited version Watson posted adds more than 1200 lines of code to the kernel—for dubious gains. Over time, more and more crypto has been added to the kernel, though, so maybe TLS will eventually find its way in too.
SOCK_DESTROY: an old Android patch aims upstream
TCP is a patient protocol; if a remote peer stops responding, it will wait a long time (measured in minutes, by default) in the hope that connectivity will eventually return. Sometimes, however, that wait is undesirable; that is especially true when it is known that the connection will not be coming back, but that the establishment of a replacement connection may succeed. As it happens, mobile networking often presents such situations. The SOCK_DESTROY patch set from Lorenzo Colitti is an attempt to improve the user experience in such situations. It fills a clear need, but has run into some opposition anyway; it also shows that the rift between the Android and kernel projects has not yet been entirely closed.

Imagine, for a moment, a user streaming $SPORTING_EVENT on a phone handset over a WiFi connection. Said user walks out the door, away from the WiFi network's coverage; that will cause the stream to freeze, probably at the beginning of the bit of action that decides the entire game. The WiFi connection is gone and is not coming back, but the streaming application does not know that, so it will wait a long time, in vain, for data to show up on its network socket. After several minutes, the connection will time out. The application will then realize that it has been disconnected and will try to reconnect; that new connection, going over the phone's broadband interface, will succeed. Streaming recommences, and our poor user gets to watch the post-game sportscasters talking about the one-of-a-kind play that happened while the stream was frozen. The resulting handset-destroying rage could have been avoided if the application had not waited for the network timeout to occur.
There are other scenarios that can create similar problems; placing a system onto a virtual private network (VPN) is another example. When this kind of network change occurs, things would work better if applications knew immediately that their open connection was never going to produce another packet. There are a number of ways this information could be conveyed, but one of the more straightforward ways would be to simply close the socket, returning an error to the application. That is what the SOCK_DESTROY patch set makes possible.
In particular, it adds a SOCK_DESTROY operation to the netlink-based "socket diag" mechanism, first added to the kernel in the 3.10 development cycle. A suitably privileged process (CAP_NET_ADMIN is required) can use this operation to close an arbitrary socket owned by another process; that process will see an ETIMEDOUT error. That error is the same as the one returned when a socket times out, but the pain of actually waiting for the timeout has been taken away. Any application that is prepared for such errors (and applications running in mobile environments, at least, should be) should recover and reconnect with no changes required.
As it happens, the Android kernel has had this capability since 2008, though in a different form: Android currently supports an ioctl() command called SIOCKILLADDR. This patch set is an attempt to move this capability upstream, cleaning it up a bit along the way. The fact that this feature has been shipped with Android suggests that there is a real need for it, but a number of concerns were raised anyway.
Tom Herbert worried that this facility could be used by an administrator to close sockets for any reason and that the affected application would have no way to know that this had happened. He suggested that the error code returned could be changed to ENETRESET, so that an explicit action to close a socket would not be presented as if it were a passive timeout. A later version of the patch set changes the return code to ECONNABORTED, which was chosen to be compatible with what BSD systems do.
Hannes Frederic Sowa suggested that, in some cases, quickly closing a socket in this manner could cause old data to be delivered to the wrong socket. Networking maintainer David Miller agreed with that concern, and suggested an alternative: the closing of sockets could be handled by the operation that disconnects them from the network in the first place. So, for example, the removal of a route associated with a disappeared network could cause any sockets bound to that network to be closed. David made it clear that he wants to have the kernel, rather than user space, in charge of deciding which sockets should be closed.
The problem with that approach, according to Lorenzo, is that the kernel doesn't always have a way to know which sockets have been affected by a networking change. The VPN case, in particular, can muddy the waters considerably. Beyond that, it was pointed out that user space can also force sockets to be closed by killing applications directly or installing special firewall rules. The new operation just makes this kind of action easier to carry out. Lorenzo did, however, change the patch to send a reset (set the RST bit) to the peer when a socket is closed as a way of reducing the chances of protocol confusion.
Eric Dumazet came in with a request that the change be merged. He noted that: "Every time I make a change in linux TCP stack, this code breaks, and this a real pain because Android changes need to be carried over to vendors." Getting the SOCK_DESTROY patch merged would spare him the phone calls and allow him to get more work done on the rest of the networking code. He also noted that the commonly suggested alternative of having applications do their own keep-alive processing is not really viable in the mobile environment for a couple of reasons.
Finally, Eric pointed out that TCP is competing with the QUIC protocol in the mobile space. QUIC is based on UDP and can react quickly to changes in the networking environment; without a similar ability to react, he said, TCP is not competitive.
David then complained that the Android developers still do not really care about the upstream kernel — a complaint that your editor still occasionally hears over beer at conferences. The fact that Android has been carrying this patch for something like seven years does not, in his mind, constitute a reason to merge it quickly into the mainline. Indeed, he said, Android's developers should be prepared to wait for a while as the patch's merits are considered:
Lorenzo responded that he would like to see things change in this area, with more Android code going upstream. The posting of the SOCK_DESTROY patch set was a part of the effort to bring that about. Almost everything that the Android networking group has done in the last two years has been sent upstream, he said.
As was recently discussed at the 2015 Kernel Summit, Android-based devices run a lot of out-of-tree code; indeed, they may be running more out-of-tree code than upstream code. The portion of that code contained within the Android project's repositories is relatively low, though, and there does appear to have been an effort to reduce it in recent years. But it's clear that some resentment remains in the kernel development community. In the end, though, that resentment is unlikely to prevent the merging of needed functionality. By the time it gets upstream, this feature may or may not look like SOCK_DESTROY, but it can be expected to do something similar. Mobile devices are not going away and the kernel community, in the end, wants to support them as well as possible.
A journal for MD/RAID5
RAID5 support in the MD driver has been part of mainline Linux since 2.4.0 was released in early 2001. During this time it has been used widely by hobbyists and small installations, but there has been little evidence of any impact on the larger or "enterprise" sites. Anecdotal evidence suggests that such sites are usually happier with so-called "hardware RAID" configurations where a purpose-built computer, whether attached by PCI or fibre channel or similar, is dedicated to managing the array. This situation could begin to change with the 4.4 kernel, which brings some enhancements to the MD driver that should make it more competitive with hardware-RAID controllers.

While hardware-RAID solutions suffer from the lack of transparency and flexibility that so often come with closed devices, they have two particular advantages. First, a separate computer brings dedicated processing power and I/O-bus capacity, which take some load off the main system, freeing it for other work. At the very least, the system CPU will never have to perform the XOR calculations required to generate the parity block, and the system I/O bus will never have to carry that block from memory to a storage device. As commodity hardware has increased in capability and speed over the years, though, this advantage has been significantly eroded.
The second advantage is non-volatile memory (NVRAM). While traditional commodity hardware has not offered much NVRAM because it would hardly ever be used, dedicated RAID controllers nearly always have NVRAM as it brings real benefits in both performance and reliability. Utilizing NVRAM provides more than just the incremental benefits brought by extra processing components. It allows changes in data management that can yield better performance from existing devices.
With recent developments, non-volatile memory is becoming a reality on commodity hardware, at least on server-class machines, and it is becoming increasingly easy to attach a small solid-state storage device (SSD) to any system that manages a RAID array. So the time is ripe for MD/RAID5 to benefit from the ability to manage data in the ways that NVRAM allows. Some engineers from Facebook, particularly Shaohua Li and Song Liu, have been working toward this end; Linux 4.4 will be the first mainline release to see the fruits of that labor.
Linux 4.4 — closing the RAID5 write hole
RAID5 (and related levels such as RAID4 and RAID6) suffers from a potential problem known as the "write hole". Each "stripe" on such an array — meaning a set of related blocks, one stored on each active device — will contain data blocks and parity blocks; these must always be kept consistent. The parity must always be exactly what would be computed from the data. If this is not the case then reconstructing the data that was on a device that has failed will produce incorrect results.
In reality, stripes are often inconsistent, though only for very short intervals of time. As the drives in an array are independent (that is the "I" of RAID) they cannot all be updated atomically. When any change is made to a stripe, this independence will almost certainly result in a moment when data and parity are inconsistent. Naturally the MD driver understands this and would never try to access data during that moment of inconsistency ... unless....
Problems occur if a machine crash or power failure causes an unplanned shutdown. It is fairly easy to argue that the likelihood that an unclean shutdown would interrupt some writes but not others is extremely small. It's not easy to argue that such a circumstance could never happen, though. So when restarting from an unclean shutdown, the MD driver must assume that the failure may have happened during a moment of inconsistency and, thus, that the parity blocks cannot be trusted. If the array is still optimal (no failed devices), it will recalculate the parity on any stripe that could have been in the middle of an update. If, however, the array is degraded, the parity cannot be recalculated. If some blocks in a stripe were updated and others weren't, then the block that was on the failed device will be reconstructed based on inconsistent information, leading to data corruption. To handle this case, MD will refuse to assemble the array without the "--force" flag, which effectively acknowledges that data might be corrupted.
An obvious way to address this issue is to use the same approach that has worked so well with filesystems: write all updates to a journal before writing them to the main array. When the array is restarted, any data and parity blocks still in the journal are simply written to the array again. This ensures the array will be consistent whether it is degraded or not. This could be done with a journal on a rotating-media drive but the performance would be very poor indeed. The advent of large NVRAM and SSDs makes this a much more credible proposition.
The new journal feature
The functionality developed at Facebook does exactly this. It allows a journal device (sometimes referred to as a "cache" or "log" device) to be configured with an MD/RAID5 (or RAID4 or RAID6) array. This can be any block device and could even be a mirrored pair of SSDs (because you wouldn't want the journal device to become a single point of failure).
To try this out you would need Linux 4.4-rc1 or later, and the current mdadm from git://neil.brown.name/mdadm. Then you can create a new array with a journal using a command like:

    mdadm --create /dev/md/test --level=5 --raid-disks=4 \
          --write-journal=/dev/loop9 /dev/loop[0-3]
It is not currently possible to add a journal to an existing array, but that functionality is easy enough to add later.
With the journal in place, RAID5 handling will progress much as it normally does, gathering write requests into stripes and calculating the parity blocks. Then, instead of being written to the array, the stripe is intercepted by the journaling subsystem and queued for the journal instead. When write traffic is sufficiently heavy, multiple stripes will be grouped together into a single transaction and written to the journal with a single metadata block listing the addresses of the data and parity. Once this transaction has been written and, if necessary, flushed to stable storage, the core RAID5 engine is told to process the stripe again, and this time the write-out is not intercepted.
When the write to the main array completes, the journaling subsystem will be told; it will occasionally update its record of where the journal starts so that data that is safe on the array effectively disappears from the journal. When the array is shut down cleanly, this start-of-journal pointer is set to an empty transaction with nothing following. When the array is started, the journal is inspected and if any transactions are found (with both data and parity) they are written to the array.
The journal metadata block uses 16 bytes per data block and so can describe well over 200 blocks. Along with each block's location and size (currently always 4KB), the journal metadata records a checksum for each data block. This, together with a checksum on the metadata block itself, allows very reliable determination of which blocks were successfully written to the journal and so should be copied to the array on restart.
In general, the journal consists of an arbitrarily large sequence of metadata blocks and associated data and parity blocks. Each metadata block records how much space in the journal is used by the data and parity and so indicates where the next metadata block will be, if it has been written. The address of the first metadata block to be considered on restart is stored in the standard MD/RAID superblock.
The net result of this is that, while writes to the array might be slightly slower (depending on how fast the journal device is), a system crash never results in a full resync — only a short journal recovery — and there is no chance of data corruption due to the write hole.
Given that the write-intent bitmap already allows resynchronization after a crash to be fairly quick, and that write-hole corruption is, in practice, very rare, you may wonder whether this is all worth the cost. Undoubtedly different people will assess this tradeoff differently; now, at least, the option is available once that assessment is made. But this is not the full story. The journal can provide benefits beyond closing the write hole. That was a natural place to start, as it is conceptually relatively simple and provides a context for creating the infrastructure for managing a journal. The more interesting step comes next.
The future: writeback caching and more full-stripe writes
While RAID5 and RAID6 provide a reasonably economical way to combine multiple devices to provide large storage capacity with reduced chance of data loss, they do come at a cost. When the host system writes a full stripe worth of data to the array, the parity can be calculated from that data and all writes can be scheduled almost immediately, leading to very good throughput. When writing to less than a full stripe, though, throughput drops dramatically.
In that case, some data or parity blocks need to be read from the array before the new parity can be calculated. This read-before-write introduces significant latency to each request, so throughput suffers. The MD driver tries to delay partial-stripe writes a little bit in the hope that the rest of the stripe might be written soon. When this works, it helps a lot. When it doesn't, it just increases latency further.
It is possible for a filesystem to help to some extent, and to align data with stripes to increase the chance of a full-stripe write, but that is far from a complete solution. A journal can make a real difference here by being managed as a writeback cache. Data can be written to the journal and the application can be told that the data is safe before the RAID5 engine even starts considering whether some pre-reading might be needed to be able to update parity blocks.
This allows the application to see very short latencies no matter what data-block pattern is being written. It also allows the RAID5 core to delay writes even longer, hoping to gather full stripes, without inconveniencing the application. This is something that dedicated RAID controllers have (presumably) been doing for years, and hopefully something that MD will provide in the not-too-distant future.
There are plenty of interesting questions here, such as whether to keep all in-flight data in main memory, or to discard it after writing to the journal and to read it back when it is time to write to the RAID. There is also the question of when to give up waiting for a full stripe and to perform the necessary pre-reading. Together with all this, a great deal of care will be needed to ensure we actually get the performance improvements that theory suggests are possible.
This is just engineering, though. There is interest in this from both potential users of the technology and vendors of NVRAM, and there is little doubt that we will see the journal enhanced to provide very visible performance improvements to complement the nearly invisible reliability improvements already achieved.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet