Kernel development
Brief items
Kernel release status
The current development kernel is 3.15-rc4, released on May 4. According to Linus: "There's a few known things pending still (pending fix for some interesting dentry list corruption, for example - not that any remotely normal use will likely ever hit it), but on the whole things are fairly calm and nothing horribly scary. We're in the middle of the calming-down period, so that's just how I like it."
Stable updates: 3.14.3, 3.10.39, and 3.4.89 were released on May 6 with the usual set of important fixes.
Quotes of the week
Kernel Summit 2014 Call for Topics
The 2014 Kernel Summit will be held August 18 to 20 in Chicago, alongside LinuxCon North America. The call for topics (which is also a call for potential invitees) has gone out; there is a soft deadline of May 15 for topic suggestions.
GlusterFS 3.5 released
Version 3.5 of the GlusterFS cluster filesystem has been released. New features include better logging, the ability to take snapshots of individual files (full volumes cannot yet be snapshotted), on-the-wire compression, on-disk encryption, and improved geo-replication support.
The possible demise of remap_file_pages()
The remap_file_pages() system call is a bit of a strange beast; it allows a process to create a complicated, non-linear mapping between its address space and an underlying file. Such mappings can also be created with multiple mmap() calls, but the in-kernel cost is higher: each mmap() call creates a separate virtual memory area (VMA) in the kernel, while remap_file_pages() can get by with just one. If the mapping has a large number of discontinuities, the difference on the kernel side can be significant.
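For readers unfamiliar with the call, the following user-space sketch (assuming a hypothetical data file at least four pages long) shows the basic idea: a single mapping is rearranged in place, where achieving the same layout with mmap() alone would create an additional VMA.

```c
/* A minimal sketch of remap_file_pages() use; "datafile" is a hypothetical
 * file assumed to be at least four pages long. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("datafile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* One VMA covering four pages of the file, mapped linearly at first. */
    char *base = mmap(NULL, 4 * page, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Rearrange the view: put file page 3 at offset 0 of the mapping.
     * With mmap() alone this would require a second VMA; here the kernel
     * keeps a single (now non-linear) VMA. */
    if (remap_file_pages(base, page, 0, 3, 0)) {
        perror("remap_file_pages");
        return 1;
    }

    printf("first byte of file page 3: %c\n", base[0]);
    return 0;
}
```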
That said, there are few users of remap_file_pages() out there. So few that Kirill Shutemov has posted a patch set to remove it entirely, saying "Nonlinear mappings are pain to support and it seems there's no legitimate use-cases nowadays since 64-bit systems are widely available." The patch is not something he is proposing for merging yet; it's more of a proof of concept at this point.
It is easy to see the appeal of this change; it removes 600+ lines of tricky code from the kernel. But that removal will go nowhere if it constitutes an ABI break. Some kernel developers clearly believe that no users will notice if remap_file_pages() goes away, but going from that belief to potentially breaking applications is a big step. So there is talk of adding a warning to the kernel; Peter Zijlstra suggested going a step further and requiring that a sysctl knob be set to make the system call active. But it would also help if current users of remap_file_pages() would make themselves known; speaking now could save some trouble in the future.
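As an illustration only, a knob along the lines Peter suggested might look something like the following; the knob name, its location under /proc/sys/vm, and the error code are all invented here, not taken from any posted patch.

```c
/* Purely hypothetical sketch of the suggested sysctl knob:
 * remap_file_pages() would fail unless an administrator enables it. */
#include <linux/sysctl.h>
#include <linux/errno.h>
#include <linux/init.h>

static int remap_file_pages_enabled;            /* default: off */
static int zero;
static int one = 1;

static struct ctl_table remap_ctl_table[] = {
    {
        .procname     = "remap_file_pages_enabled",
        .data         = &remap_file_pages_enabled,
        .maxlen       = sizeof(int),
        .mode         = 0644,
        .proc_handler = proc_dointvec_minmax,
        .extra1       = &zero,
        .extra2       = &one,
    },
    { }
};

static int __init remap_sysctl_init(void)
{
    /* Would appear as /proc/sys/vm/remap_file_pages_enabled */
    if (!register_sysctl("vm", remap_ctl_table))
        return -ENOMEM;
    return 0;
}
late_initcall(remap_sysctl_init);

/* ...and, hypothetically, at the top of the system call itself:
 *
 *     if (!remap_file_pages_enabled)
 *         return -EPERM;
 */
```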
Kernel development news
The first kpatch submission
It is spring in the northern hemisphere, so a young kernel developer's thoughts naturally turn to … dynamic kernel patching. Last week saw the posting of SUSE's kGraft live-patching mechanism; shortly thereafter, developers at Red Hat came forward with their competing kpatch mechanism. The approaches taken by the two groups show some interesting similarities, but also some significant differences.
Like kGraft, kpatch replaces entire functions within a running kernel. A kernel patch is processed to determine which functions it changes; the kpatch tools (not included with the patch, but available in this repository) then use that information to create a loadable kernel module containing the new versions of the changed functions. A call to the new kpatch_register() function within the core kpatch code will use the ftrace function tracing mechanism to intercept calls to the old functions, redirecting control to the new versions instead. So far, it sounds a lot like kGraft, but that resemblance fades a bit once one looks at the details.
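As a rough illustration of the underlying mechanism (not kpatch's actual code), an ftrace handler registered on the old function can rewrite the saved instruction pointer so that the call lands in the replacement instead; the function names below are hypothetical and the ->ip manipulation is x86-specific.

```c
/* Simplified sketch: redirect calls from old_do_something() to
 * new_do_something() by rewriting the saved instruction pointer from an
 * ftrace handler (circa 3.15 ftrace callback signature). */
#include <linux/ftrace.h>
#include <linux/ptrace.h>
#include <linux/kernel.h>

extern void old_do_something(void);     /* function being patched (hypothetical) */
extern void new_do_something(void);     /* replacement from the patch module (hypothetical) */

static void notrace redirect_handler(unsigned long ip, unsigned long parent_ip,
                                     struct ftrace_ops *ops, struct pt_regs *regs)
{
    /* Make the traced call continue in the new function instead (x86). */
    regs->ip = (unsigned long)new_do_something;
}

static struct ftrace_ops redirect_ops = {
    .func  = redirect_handler,
    .flags = FTRACE_OPS_FL_SAVE_REGS,   /* pt_regs is needed to change ->ip */
};

/* Called from the patch module's init code. */
static int __init redirect_init(void)
{
    int ret;

    ret = ftrace_set_filter_ip(&redirect_ops,
                               (unsigned long)old_do_something, 0, 0);
    if (ret)
        return ret;
    return register_ftrace_function(&redirect_ops);
}
```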
KGraft goes through a complex dance during which both the old and new versions of a replaced function are active in the kernel; this is done in order to allow each running process to transition to the "new universe" at a (hopefully) safe time. Kpatch is rather less subtle: it starts by calling stop_machine() to bring all other CPUs in the system to a halt. Then, kpatch examines the stack of every process running in kernel mode to ensure that none are running in the affected function(s); should one of the patched functions be active, the patch-application process will fail. If things are OK, instead, kpatch patches out the old functions completely (or, more precisely, it leaves an ftrace handler in place that routes around the old function). There is no tracking of whether processes are in the "old" or "new" universe; instead, everybody is forced to the new universe immediately if it is possible.
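A heavily simplified sketch of that check might look like the following; the helper functions are hypothetical, but the shape (a stop_machine() callback that walks every task's kernel stack and bails out if a patched function is found there) follows the description above.

```c
/* Hedged sketch of a stop_machine()-based consistency check; helper
 * functions marked "hypothetical" are invented for illustration. */
#include <linux/stop_machine.h>
#include <linux/stacktrace.h>
#include <linux/sched.h>
#include <linux/kernel.h>

static int stack_contains_old_func(struct task_struct *t)
{
    unsigned long entries[32];
    struct stack_trace trace = {
        .max_entries = ARRAY_SIZE(entries),
        .entries     = entries,
    };
    int i;

    save_stack_trace_tsk(t, &trace);
    for (i = 0; i < trace.nr_entries; i++)
        if (address_in_patched_function(entries[i]))    /* hypothetical */
            return 1;
    return 0;
}

static int apply_patch(void *data)
{
    struct task_struct *g, *t;

    /* All other CPUs are spinning in stop_machine() at this point, so the
     * task list and stacks are stable. */
    do_each_thread(g, t) {
        if (stack_contains_old_func(t))
            return -EBUSY;      /* a patched function is live: give up */
    } while_each_thread(g, t);

    enable_ftrace_redirection();        /* hypothetical: see previous sketch */
    return 0;
}

/* ...elsewhere: ret = stop_machine(apply_patch, NULL, NULL); */
```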
There are some downsides to this approach. stop_machine() is a massive sledgehammer of a tool; kernel developers prefer to avoid it if at all possible. If kernel code is running inside one of the target functions, kpatch will simply fail; kGraft, instead, will work to slowly patch the system over to the new function, one process at a time. Some functions (examples would include schedule(), do_wait(), or irq_thread()) are always running somewhere in the kernel, so kpatch cannot be used to apply a patch that modifies them. On a typical system, there will probably be a few dozen functions that can block a live patch in this way — a pretty small subset of the thousands of functions in the kernel.
While kpatch, with its use of stop_machine(), may seem heavy-handed, there are some developers who would like to see it take an even stronger approach initially: Ingo Molnar suggested that it should use the process freezer (normally used when hibernating the system) to be absolutely sure that no processes have any running state within the kernel. That would slow live kernel patching even more, but, as he put it:
The hitch with this approach, as noted by kpatch developer Josh Poimboeuf, is that there are a lot of unfreezable kernel threads. Frederic Weisbecker suggested that the kernel thread parking mechanism could be used instead. Either way, Ingo thought, kernel threads that prevented live patching would be likely to be fixed in short order. There was not a consensus in the end on whether freezing or parking kernel threads was truly necessary, but opinion did appear to be leaning in the direction of being slow and safe early on, then improving performance later.
The other question that has come up has to do with patches that change the format or interpretation of in-kernel data. KGraft tries to handle simple cases with its "universe" mechanism but, in many situations, something more complex will be required. According to kGraft developer Jiri Kosina, there is a mechanism in place to use a "band-aid function" that understands both forms of a changed data structure until all processes have been converted to the new code. After that transition has been made, the code that writes the older version of the changed data structure can be patched out, though it may be necessary to retain code that reads older data structures until the next reboot.
On the kpatch side, instead, there is currently no provision for making changes to data structures at all. The plan for the near future is to add a callback that can be packaged with a live patch; its job would be to search out and convert all affected data structures while the system is stopped and the patch is being applied. This approach has the potential to work without the need for maintaining the ability to cope with older data structures, but only if all of the affected structures can be located at patching time — a tall order, in many cases.
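To make the idea concrete, a conversion callback of that sort might look vaguely like this; the structures, the list being walked, and the new field are all invented for illustration.

```c
/* Illustrative sketch of a data-conversion callback run while the system is
 * stopped: migrate every instance of a changed structure to the new layout.
 * Everything here is hypothetical; finding every instance is the hard part. */
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/errno.h>

struct widget_v1 { struct list_head node; int count; };
struct widget_v2 { struct list_head node; int count; u64 last_used; /* new field */ };

static int convert_widgets(struct list_head *all_widgets)
{
    struct widget_v1 *old, *tmp;

    list_for_each_entry_safe(old, tmp, all_widgets, node) {
        struct widget_v2 *new = kzalloc(sizeof(*new), GFP_ATOMIC);

        if (!new)
            return -ENOMEM;
        new->count = old->count;
        new->last_used = 0;             /* sensible default for the new field */
        list_replace(&old->node, &new->node);
        kfree(old);
    }
    return 0;
}
```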
The good news is that few patches (of the type that one would consider for live patching) make changes to kernel data structures. As Jiri put it:
So the question of safely handling data-related changes can likely be deferred for now while the question of how to change the code in a running kernel is answered. There have already been suggestions that this topic should be discussed at the 2014 Kernel Summit in August. It is entirely possible, though, that the developers involved will find a way to combine their approaches and get something merged before then. There is no real disagreement over the end goal, after all; it's just a matter of finding the best approach for the implementation of that goal.
Porting Linux to a new architecture
While it's certainly not an everyday occurrence, getting Linux running on a new CPU architecture needs to be done at times. To someone faced with that task, it may seem rather daunting—and it is—but, as Marta Rybczyńska described in her Embedded Linux Conference (ELC) talk, there are some fairly straightforward steps to follow. She shared those steps, along with many things that she and her Kalray colleagues learned as they ported Linux to the MPPA 256 processor.
When the word "porting" is used, it can mean one of three different things, she said. It can be a port to a new board with an already-supported processor on it. Or it can be a new processor from an existing, supported processor family. The third alternative is to port to a completely new architecture, as with the MPPA 256 (aka K1).
With a new architecture comes a new CPU instruction set. If there is a C compiler, as there was for her team, then you can recompile the existing (non-arch) kernel C code (hopefully, anyway). Any assembly pieces need to be rewritten. There will be a different memory map and possibly new peripherals. That requires configuring existing drivers to work in a new way or writing new drivers from scratch. Also, when people make the effort to create a new architecture, they don't do that just for fun, Rybczyńska said. There will be benefits to the new architecture, so there will be opportunities to optimize the existing system to take advantage of it.
There are several elements that are common to any port. First, you need build tools, such as GCC and binutils. Next, there is the kernel, both its core code and drivers. There are important user-space libraries that need to be ported, such as libc, libm, pthreads, etc. User-space applications come last. Most people start with BusyBox as the first application, then port other applications one by one.
Getting started
To get started, you have to learn about the new architecture, she said. The K1 is a massively multi-core processor with both high performance and high energy efficiency. It has 256 cores that are arranged in groups of sixteen cores sharing memory and an MMU. There are Network-on-Chip interfaces to communicate between the groups. Each core has the same very long instruction word (VLIW) instruction set, which can bundle up to five instructions to be executed in one cycle. The cores have advanced bitwise instructions, hardware loops, and a floating point unit (FPU). While the FPU is not particularly important for porting the kernel, it will be needed to port user-space code.
To begin, you create an empty directory (linux/arch/k1 in her case), but then you need to fill it, of course. The initial files needed are less than might be expected, Rybczyńska said. Code is needed first to configure the processor, then to handle the memory map, which includes configuring the zones and initializing the memory allocators. Handling processor mode changes is next up: interrupt and trap handlers, including the clock interrupt, need to be written, as does code to handle context switches. There is some device tree and Kconfig work to be done as well. Lastly, adding a console to get printk() output is quite useful.
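As one small example of that early work, a boot console for printk() output might be registered along these lines; the device, register address, and names are invented, and the sketch assumes ioremap() is already usable by the time console initcalls run.

```c
/* Minimal sketch of an early boot console for a hypothetical K1-style
 * memory-mapped UART; the address and names are invented. */
#include <linux/console.h>
#include <linux/init.h>
#include <linux/io.h>

#define K1_UART_TX_PHYS  0x1000a000UL   /* hypothetical TX register address */

static void __iomem *k1_uart_tx;

static void k1_early_write(struct console *con, const char *s, unsigned int n)
{
    while (n--)
        writeb(*s++, k1_uart_tx);
}

static struct console k1_early_console = {
    .name  = "k1early",
    .write = k1_early_write,
    .flags = CON_PRINTBUFFER | CON_BOOT,    /* replaced by the real console later */
    .index = -1,
};

static int __init k1_early_console_init(void)
{
    k1_uart_tx = ioremap(K1_UART_TX_PHYS, 4);
    if (!k1_uart_tx)
        return -ENOMEM;
    register_console(&k1_early_console);
    return 0;
}
console_initcall(k1_early_console_init);
```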
To create that code, there are a couple of different routes. There is not that much documentation on this early boot code, so there is a tendency to copy and paste code from existing architectures. Kalray used several as templates along the way, including MicroBlaze, Blackfin, and OpenRISC. If code cannot be found to fit the new architecture, it will have to be written from scratch. That often requires reading other architecture manuals and code—Rybczyńska can read the assembly language for several architectures she has never actually used.
There is a tradeoff between writing assembly code and C code for the port. For the K1, the team opted for as much C code as possible because it is difficult to properly bundle multiple instructions into a single VLIW word by hand. GCC handles it well, though, so the K1 port uses compiler built-ins in preference to inline assembly. She said that the K1 has less assembly code than any other architecture in the kernel.
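The difference she described might look something like this; the built-in is real GCC, while the assembly mnemonic is invented rather than an actual K1 instruction.

```c
/* The built-in lets GCC bundle the operation into a VLIW word freely; the
 * hand-written asm (hypothetical mnemonic) pins the instruction choice and
 * scheduling, which is what the K1 port tried to avoid. */
static inline int count_leading_zeros_builtin(unsigned int x)
{
    return x ? __builtin_clz(x) : 32;   /* __builtin_clz(0) is undefined */
}

static inline int count_leading_zeros_asm(unsigned int x)
{
    int r;

    asm("clz %0, %1" : "=r"(r) : "r"(x));   /* invented mnemonic */
    return r;
}
```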
Once that is all in place, at some point you will get the (sometimes dreaded) "Failed to execute /init" error message. This is actually a "big success", she said, as it means that the kernel has booted. Next up is porting an init, which requires a libc. For the K1, they ported uClibc, but there are other choices, of course. She suggested that the first versions of init be statically linked, so that no dynamic loader is required.
Porting a libc means that the kernel-to-user-space ABI needs to be nailed down. At program startup, which values will be in which registers? Where will the stack be located? And so on. Basically, it required work in both the kernel and libc "to make them work together". System calls also need attention: numbers must be assigned to the calls, and a convention chosen for how the arguments will be passed (registers? stack?). Signals will need some work as well, but if the early applications being ported don't use signals, only basic support needs to be added, which makes things much simpler.
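A hedged sketch of those decisions, for an imaginary architecture rather than the actual K1 ABI, might look like this: pick the system call numbers, then dispatch through a table once the low-level entry code has pulled the number and arguments out of the agreed-upon registers.

```c
/* Imaginary-architecture sketch of syscall numbering and dispatch;
 * nothing here reflects the real K1 ABI. */
#include <linux/kernel.h>
#include <linux/errno.h>

/* Shared between kernel and libc (e.g. a uapi unistd.h): */
#define __NR_exit    1
#define __NR_read    3
#define __NR_write   4

typedef long (*syscall_fn_t)(long, long, long, long, long, long);

extern long sys_exit(long, long, long, long, long, long);
extern long sys_read(long, long, long, long, long, long);
extern long sys_write(long, long, long, long, long, long);

static const syscall_fn_t sys_call_table[] = {
    [__NR_exit]  = sys_exit,
    [__NR_read]  = sys_read,
    [__NR_write] = sys_write,
};

/* Called from the trap handler once the entry code has pulled the call
 * number and arguments out of the agreed-upon registers (say, r0-r6). */
long k1_syscall_dispatch(long nr, long a0, long a1, long a2,
                         long a3, long a4, long a5)
{
    if (nr < 0 || nr >= ARRAY_SIZE(sys_call_table) || !sys_call_table[nr])
        return -ENOSYS;
    return sys_call_table[nr](a0, a1, a2, a3, a4, a5);
}
```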
Kalray created an instruction set simulator for the K1, which was helpful in debugging. The simulator can show every single instruction with the value in each register. It is "handy and fast", Rybczyńska said, and was a great help when doing the port.
Eventually, booting into the newly ported init will be possible. At that point, additional user-space executables are on the agenda. Again she suggested starting out with static binaries; getting the dynamic loader going required "lots of work on the compiler and binutils", at least for the K1. Also needed is porting or writing drivers for the main peripherals that will be used.
Testing
Rybczyńska stressed that testing is "easily forgotten", but is important to the process. When changes are made, you need to ensure you didn't break things that were already working. Her team started by trying to create unit tests from the kernel code, but determined that was hard to do. Instead, they created a "test init" that contained some basic tests of functionality. It is a "basic validation that all of the tools, libc, and the kernel are working correctly", she said.
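A "test init" in that spirit might be as simple as the following static program; the specific checks are illustrative guesses, not Kalray's actual tests.

```c
/* Minimal sketch of a test init: exercise a few kernel and libc paths and
 * report the results on the console. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

static int check(const char *name, int ok)
{
    printf("%-16s %s\n", name, ok ? "OK" : "FAIL");
    return ok ? 0 : 1;
}

int main(void)
{
    int failures = 0;
    int status, fd;
    pid_t pid;

    /* fork/exit/wait: basic process management and context switching */
    pid = fork();
    if (pid == 0)
        _exit(42);
    failures += check("fork/wait", pid > 0 &&
                      waitpid(pid, &status, 0) == pid &&
                      WIFEXITED(status) && WEXITSTATUS(status) == 42);

    /* open: VFS path lookup on the root directory */
    fd = open("/", O_RDONLY);
    failures += check("open /", fd >= 0);
    if (fd >= 0)
        close(fd);

    /* malloc: brk/mmap paths through the freshly ported libc */
    failures += check("malloc", malloc(1 << 20) != NULL);

    printf("test init finished: %d failure(s)\n", failures);
    for (;;)
        pause();        /* PID 1 must never exit */
}
```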
Further testing of the kernel is required as well, of course. The "normal idea" is to write your own tests, she said, but it would take months just to create tests for all of the system calls. Instead, the K1 team used existing tests, especially those from the Linux Test Project (LTP). It is a "very active project" with "tests for nearly everything", she said; using LTP was much better than trying to write their own tests.
Continuing on is just a matter of activating new functionality (e.g. a new kernel subsystem, filesystem, or driver), fixing things that don't compile, then fixing any functionality that doesn't work. Test-driven development "worked very well for us".
As an example, she described the process undertaken to port strace, which she called a nice debugging tool that is much less verbose than the instruction set simulator. But strace uses the ptrace() system call and requires support for signals. Up until that point, there had not been a need to support signals. The ptrace() tests in LTP were run first, then strace was tried. It compiled easily, but didn't work as there were architecture-specific pieces of the ptrace() code that still needed to be implemented.
Supporting a new architecture requires new code to enable the special features of the chip. For Kalray, the symmetric multi-processing (SMP) and MMU code required a fair amount of time to design and implement. The K1 also has the Network-on-Chip (NoC) subsystem, which is brand new to the kernel. Supporting that took a lot of internal discussion to create something that worked correctly and performed reasonably. The NoC connects the groups of cores, so its performance is integral to the overall performance of the system.
Once the port matures, building a distribution may be next up. One way is to "do it yourself", which is "fine if you have three packages", Rybczyńska said. But if you have more packages than that, it becomes a lot less fun to do it that way. Kalray is currently using Buildroot, which was "easy to set up". The team is now looking at the Yocto Project as another possibility.
Lessons learned
The team learned a number of valuable lessons in doing the port. To start with, it is important to break the work up into stages. That allows you to see something working along the way, which indicates progress being made, but it also helps with debugging. "Test, test, test", she said, and do it right from the beginning. There are subtle bugs that can be introduced in the early going and, if you aren't testing, you won't catch them early enough to easily figure out where they were introduced.
Wherever possible, use generic functionality already provided by the kernel or other tools; don't roll your own unless you have to. Adhere to the kernel coding style from the outset. She suggested using panic() and exit() in lots of places, including putting one in every unimplemented function. That helps avoid wasting time debugging problems that aren't actually problems. Code that refuses to compile when the architecture is unknown should be preferred: if an application has architecture dependencies, failing to compile is much easier to diagnose than some strange failure.
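Those two tips might translate into something like the following fragments; the function and macro names are illustrative only.

```c
/* Tip 1 (kernel side): make an unimplemented arch hook fail loudly rather
 * than silently misbehave. */
#include <linux/kernel.h>

void k1_setup_secondary_cpu(unsigned int cpu)      /* hypothetical hook */
{
    panic("%s: not implemented yet (cpu %u)", __func__, cpu);
}

/* Tip 2 (application side): refuse to build on an unknown architecture
 * rather than fail strangely at run time. */
#if defined(__x86_64__)
#  define CACHE_LINE_SIZE 64
#elif defined(__k1__)                              /* hypothetical compiler macro */
#  define CACHE_LINE_SIZE 32
#else
#  error "unknown architecture: please add its cache line size"
#endif
```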
Spend time developing advanced debugging techniques and tools. For example, they developed a visualization tool that showed kernel threads being activated during the boot process. Reading the documentation is important, as is reading the comments in the code. Her last tip was that reading code for other platforms is quite useful, as well.
With that, she answered a few questions from the audience. The port took about two months to get to the point of booting the first init, she said; the rest "takes much more time". The port is completely self-contained, as there are no changes to the generic kernel. Her hope is to submit the code upstream as soon as possible, noting that being out of the mainline can lead to problems (as they encountered with a pointer type in the tty functions when upgrading to 3.8). While Linux is not shipping yet for the K1, it will be soon. The K1 is currently shipping with RTEMS, which was easier to port, so it filled the operating-system role while the Linux port was being completed, she said.
Slides [PDF] from Rybczyńska's talk are available on the ELC slides page.
Networking on tiny machines
Last week's article on "Linux and the Internet of Things" discussed the challenge of shrinking the kernel to fit on to computers that, by contemporary standards, are laughably underprovisioned. Shortly thereafter, the posting of a kernel-shrinking patch set sparked a related discussion: what needs to be done to get the kernel to fit into tiny systems and, more importantly, is that something that the kernel development community wants to even attempt?
Shrinking the network stack
The patch set in question was a 24-part series from Andi Kleen adding an option to build a minimally sized networking subsystem. Andi is looking at running Linux on systems with as little as 2MB of memory installed; on such systems, the Linux kernel's networking stack, which weighs in at about 400KB for basic IPv4 support, is just too big to shoehorn in comfortably. By removing a lot of features, changing some data structures, and relying on the link-time optimization feature to remove the (now) unneeded code, Andi was able to trim things down to about 170KB. That seems like a useful reduction, but, as we will see, these changes have a rough road indeed ahead of them before any potential merge into the mainline.
Some of the changes in Andi's patch set include:
- Removal of the "ping socket" feature that allows a non-setuid
ping utility to send ICMP echo packets. It's a useful
feature in a general-purpose distribution, but it's possibly less
useful in a single-purpose tiny machine that may not even have a
ping binary. Nonetheless the change was
rejected: "
We want to move away from raw sockets, and making this optional is not going to help us move forward down that path
". - Removal of raw sockets, saving about 5KB of space. Rejected: "
Sorry, you can't have half a functioning ipv4 stack.
" - Removal of the TCP fast open feature.
That feature takes about 3KB to implement, but it also requires the
kernel to have the crypto subsystem and AES code built in. Rejected: "
It's for the sake of the remote service not the local client, sorry I'm not applying this, it's a facility we want to be ubiquitous and in widespread use on as many systems as possible.
" - Removal of the BPF packet filtering subsystem. Rejected: "
I think you highly underestimate how much 'small systems' use packet capturing and thus BPF.
" - Removal of the MIB statistics collection code (normally accessed via
/proc) when /proc is configured out of the kernel.
Rejected: "
Congratulations, you just broke ipv6 device address netlink dumps amongst other things
".
The above list could be made much longer, but the point should be apparent by now: this patch set was not welcomed by the networking community with open arms. This community has been working with a strong focus on performance and features on contemporary hardware; networking developers (some of them, at least) do not want to be bothered with the challenges of trying to accommodate users of tiny systems. As Eric Dumazet put it:
The networking developers also do not want to start getting bug reports from users of a highly pared-down networking stack wondering why things don't work anymore. Some of that would certainly happen if a patch set like this one were to be merged. One can try to imagine which features are absolutely necessary and which are optional on tiny systems, but other users solving different problems will come to different conclusions. A single "make it tiny" option has a significant chance of providing a network stack with 99% of what most tiny-system users need — but the missing 1% will be different for each of those users.
Should we even try?
Still, pointing out some difficulties inherent in this task is different from saying that the kernel should not try to support small systems at all, but that appears to be the message coming from the networking community. At one point in the discussion, Andi posed a direct question to networking maintainer David Miller: "What parts would you remove to get the foot print down for a 2MB single purpose machine?" David's answer was simple: "I wouldn't use Linux, end of story. Maybe two decades ago, but not now, those days are over." In other words, from his point of view, Linux should not even try to run on machines of that class; instead, some sort of specialty operating system should be used.
That position may come as a bit of a surprise to many longtime observers of the Linux development community. As a general rule, kernel developers have tried to make the system work on just about any kind of hardware available. The "go away and run something else" answer has, on rare occasion, been heard with regard to severely proprietary and locked-down hardware, but, even in those cases, somebody usually makes it work with Linux. In this case, though, there is a class of hardware that could run Linux, with users who would like to run Linux, but some kernel developers are telling them that there is no interest in adding support for them. This is not a message that is likely to be welcomed in those quarters.
Once upon a time, vendors of mainframes laughed at minicomputers — until many of their customers jumped over to the minicomputer market. Minicomputer manufacturers treated workstations, personal computers, and Unix as toys; few of those companies are with us now. Many of us remember how the proprietary Unix world treated Linux in the early days: they dismissed it as an underpowered toy, not to be taken seriously. Suffice to say that we don't hear much from proprietary Unix now. It's a classic Innovator's Dilemma story of disruptive technologies sneaking up on incumbents and eating their lunch.
It is not entirely clear that microscopic systems represent this type of disruptive technology; the "wait for the hardware to grow up a bit" approach has often worked well for Linux in the past. It is usually safe to bet on computing hardware increasing in capability over time, so effort put into supporting underpowered systems is often not worth it. But we may be dealing with a different class of hardware here, one where "smaller and cheaper" is more important than "more powerful." If these systems can be manufactured in vast numbers and spread like "smart dust," they may well become a significant part of the computing substrate of the future.
So the possibility that tiny systems could be a threat to Linux should certainly be considered. If Linux is not running on those devices, something else will be. Perhaps it will be a Linux kernel with the networking stack replaced entirely by a user-space stack like lwIP, or perhaps it will be some other free operating system whose community is more interested in supporting this hardware. Or, possibly, it could be something proprietary and unpleasant. However things go, it would be sad to look back someday and realize that the developers of Linux could have made the kernel run on an important class of machines, but they chose not to.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Janitorial
Memory management
Networking
Security-related
Page editor: Jonathan Corbet