
Leading items

Welcome to the LWN.net Weekly Edition for August 31, 2017

This edition contains the following feature content:

  • Remote imports for Python?: a proposal to import modules directly over HTTP(S) runs into security concerns.
  • A return-oriented programming defense from OpenBSD: the RETGUARD mechanism disrupts ROP chains with a simple return-address transformation.
  • Development statistics for the 4.13 kernel: where the code in this development cycle came from.
  • Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent(): two removals from the kernel's internal memory-management API.
  • printk() and KERN_CONT: a quiet change to continuation-line handling takes some developers by surprise.
  • Fedora's Boltron preview: a first look at the server side of Fedora's Modularity initiative.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Remote imports for Python?

By Jake Edge
August 30, 2017

Importing a module into a Python program is a pretty invasive operation; it directly runs code in the current process that has access to anything the process can reach. So it is not wildly surprising that a suggestion to add a way to directly import modules from remote sites was met with considerable doubt—if not something approaching hostility. It turns out that the person suggesting the change was not unaware of the security implications of the idea, but thought it had other redeeming qualities; others in the discussion were less sanguine.

In his first post to the python-ideas mailing list, security researcher John Torakis proposed that imports via HTTP and HTTPS be added as a core Python feature. He also filed an enhancement request; both the post and the bug pointed to his httpimport repository on GitHub, which contains a prototype implementation. The justification he cited was always likely to set off some alarm bells:

My proposal is that this module can become a core Python feature, providing a way to load modules even from Github.com repositories, without the need to "git clone - setup.py install" them.

Other languages, like golang, provide this functionality from their early days (day one?). Python development can be greatly improved if a "try before pip installing" mechanism gets in place, as it will add a lot to the REPL [read-eval-print loop] nature of the testing/experimenting process.

Chris Angelico vehemently opposed the feature, at least for core Python: "This is a security bug magnet; can you imagine trying to ensure that malicious code is not executed, in an arbitrary execution context?" If the feature is explicitly enabled (via, say, pip), it is much less worrisome, Angelico said. The idea of allowing imports over regular HTTP is one that should be dropped, he said; even HTTPS imports would require being "absolutely sure that your certificate chains are 100% dependable".

Oleg Broytman also opposed the idea, suggesting that it would require a Python Enhancement Proposal (PEP), instead of simply filing an enhancement request, to truly be considered. He also noted a key difference with Go's remote imports: those happen at compile time, while Python's would be done at runtime.

The README for Torakis's httpimport repository mentions it being used as a "staging protocol for covertutils backdoors". Torakis's covertutils project is described as a "framework for backdoor programming"; that also got Angelico's attention:

But I'm not entirely sure I want to support this. You're explicitly talking about using this with the creation of backdoors... in what, exactly? What are you actually getting at here?

Torakis responded to those complaints (in a posting with non-standard quoting). The backdoor work is evidently part of his day job and httpimport could be useful for that work, especially for rapid prototyping, testing, and debugging purposes. He did agree that HTTP imports are dangerous, but noted that doing so locally (i.e. only to localhost or perhaps trusted local systems) could be useful for testing. He also objected to the complaint about HTTPS, however, noting that certificate checks would be done to eliminate the man-in-the-middle threat. In another posting, he put it this way: "if you can't trust your certificate store now, and you are afraid of Remote code execution through HTTPS, stop using pip altogether".

While indicating that httpimport would make an excellent addition to the Python Package Index (PyPI), Paul Moore pointed out another flaw with the idea of making the feature a core part of the language: it removes the ability for an organization's security team to disallow it. On the other hand, the team could restrict the ability to install modules from PyPI or simply blacklist certain entries such as httpimport. He said:

[...] whereas with a core module, it's there, like it or not, and *all* Python code has to be audited on the assumption that it might be used. I could easily imagine cases where the httpimport module was allowed on development machines and CI servers, but forbidden on production (and pre-production) systems. That option simply isn't available if the feature is in the core.

This is not the first time this idea has come up, Guido van Rossum noted; it was first proposed (and rejected) in 1995 or so—before HTTPS was invented, he said. He may be misremembering either the date or the status of HTTPS, as Wikipedia gives 1994 for its creation, though it surely was not in widespread use by 1995. As with others in the thread, Van Rossum was happy to see a third-party httpimport available to those who need it, but it is too much of a security concern to ever consider adding to the standard library.

Torakis said that he was two years old at the time of that decision and that "times have changed". He said that he is willing to adapt httpimport to make it acceptable for the standard library. It would make working with Python much easier for him and others:

I'm talking about the need to rapidly test public code. I insist that testing code available on Github (or other repos), without the venv/clone/install hassle is a major improvement in my (and most sec researchers' I know) Python workflow. It makes REPL prototyping million times smoother. We all have created small scripts that auto load modules from URLs anyway.

There was no support for adding it as a core feature in the thread, though. A simple "pip install httpimport" is all that would be needed to get access to the feature (once Torakis gets it added to PyPI, anyway). So some thread participants wondered why it was so imperative that it become part of the standard library. As Stephen J. Turnbull put it: "It's an attractive nuisance unless you're a security person, and then pip is not a big deal." Echoing Moore to some extent, Nick Coghlan added another concern:

[...] donning my commercial redistributor hat: it already bothers some of our (and our customers') security folks that we ship package installation tools that access unfiltered third party package repositories by default (e.g. pip defaulting to querying PyPI).

As a result, I'm pretty sure that even if upstream said "httpimport is in the Python standard library now!", we'd get explicit requests asking us to take it out of our redistributed version and make it at most an optional install (similar to what we do with IDLE and Tcl/Tk support in general).

There is at least one popular scripting language that makes this feature available as part of the language: PHP. Originally, directives like include and require could contain a URL; code would be retrieved from that URL, then executed. Over time, the wisdom of that choice has been questioned; these days, two configuration options (allow_url_fopen and allow_url_include) govern the behavior and remote inclusion is disallowed by default. While many scripting-style languages have ways to accomplish remote inclusion, it is considered something of a trap for the unwary, thus not elevated to be a top-level language feature.

Something that bears noting from the discussion is that installing code over HTTPS from PyPI is no more (or less) dangerous than doing so from GitHub—at least from a man-in-the-middle perspective. There is still the danger of using code from a malicious GitHub repository but, in truth, the same problem exists for PyPI. There is no active vetting of either GitHub repositories or packages uploaded to PyPI. HTTPS can ensure that you are connecting to the server holding the proper key, but it cannot protect you from asking for the wrong thing from that server.

As Torakis's age post indicates, there may be something of a generation gap surrounding this issue. The GitHub-centric, rapid-fire development style meets the grizzled graybeards who still sport some of the scars of security issues past. It is certainly true that it is easy enough to add remote imports to a program (via httpimport or something hand-rolled), but the idea is that programmers will have reached a certain level of understanding when they get to that point—hopefully enough to recognize the dangers of doing so. In any case, by not making it a top-level, supported feature, abuse of it is not the responsibility of the Python core team. Avoiding that kind of "attractive nuisance" (and the bugs it can spawn) is another lesson that the Python graybeards have learned along the way.

Comments (30 posted)

A return-oriented programming defense from OpenBSD

By Jonathan Corbet
August 30, 2017
Stack-smashing attacks have a long history; they featured, for example, as a core part of the Morris worm back in 1988. Restrictions on executing code on the stack have, to a great extent, put an end to such simple attacks, but that does not mean that stack-smashing attacks are no longer a threat. Return-oriented programming (ROP) has become a common technique for compromising systems via a stack-smashing vulnerability. There are various schemes out there for defeating ROP attacks, but a mechanism called "RETGUARD" that is being implemented in OpenBSD is notable for its relative simplicity.

In a classic stack-smashing attack, the attack code would be written directly to the stack and executed there. Most modern systems do not allow execution of on-stack code, though, so this kind of attack will be ineffective. The stack does affect code execution, though, in that the call chain is stored there; when a function executes a "return" instruction, the address to return to is taken from the stack. An attacker who can overwrite the stack can, thus, force a function to "return" to an arbitrary location.
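To make that concrete, here is a textbook illustration of the kind of bug that enables stack smashing; the function and buffer size are hypothetical, but the pattern is the classic one:

    #include <string.h>

    /*
     * Hypothetical vulnerable function. The fixed-size buffer lives on
     * the stack near the saved return address; strcpy() does no bounds
     * checking, so a long enough 'input' overwrites that address. When
     * parse_request() returns, execution continues wherever the
     * attacker pointed it.
     */
    void parse_request(const char *input)
    {
        char buf[64];

        strcpy(buf, input);    /* no length check: the overflow */
        /* ... process buf ... */
    }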

That alone can be enough to carry out some types of attacks, but ROP adds another level of sophistication. A search through a body of binary code will turn up a great many short sequences of instructions ending in a return instruction. These sequences are termed "gadgets"; a large program contains enough gadgets to carry out almost any desired task — if they can be strung together into a chain. ROP works by locating these gadgets, then building a series of stack frames so that each gadget "returns" to the next.

This technique allows the construction of arbitrary programs on the stack without the need for execute permission on the stack itself. It is worth noting that, on a complex-instruction-set architecture like x86, unexpected gadgets can be created by jumping into the middle of a multi-byte instruction, a phenomenon termed "polymorphism". Needless to say, there are tools out there that can be used by an attacker to locate gadgets and string them together into programs.

The RETGUARD mechanism, posted by Theo de Raadt on August 19, makes use of a simple return-address transformation to disrupt ROP chains and prevent them from executing as intended. It takes the form of a patch to the LLVM compiler adding a new -fret-protector flag. When code is compiled with that flag, two things happen:

  • The prologue to each function (the code that runs before the body of the function itself) exclusive-ORs the return address on the stack with the value of the stack pointer itself.

  • The epilogue, run just before the function returns, repeats the operation to restore the return address to its initial value.

The exclusive-OR operation changes the return address into something that is effectively random, especially when address-space layout randomization is used to place the stack at an unpredictable location. With this change, the first gadget used by a ROP sequence will, when it attempts the second step above, transform the return address into something unpredictable and, most likely, useless to an attacker. That will stop the chain and thwart the attack.
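The transformation itself is just a reversible exclusive-OR. The user-space sketch below (with made-up address values) illustrates only the arithmetic; the real work is done by compiler-generated prologue and epilogue instructions rather than C code:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Made-up values standing in for a saved return address and
         * the stack-pointer value at function entry. */
        uintptr_t ret_addr  = 0x12345678;
        uintptr_t stack_ptr = 0x9abc0de0;

        /* Prologue: mangle the on-stack return address. */
        uintptr_t mangled = ret_addr ^ stack_ptr;

        /* A ROP gadget that skips the epilogue "returns" to 'mangled',
         * which is effectively a random address. */

        /* Epilogue: the same XOR restores the original value. */
        uintptr_t restored = mangled ^ stack_ptr;

        printf("restored == original: %s\n",
               restored == ret_addr ? "yes" : "no");
        return 0;
    }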

There is, of course, a significant limitation here: a ROP chain made up of exclusively polymorphic gadgets will still work, since those gadgets were not (intentionally) created by the compiler and do not contain the return-address-mangling code. De Raadt acknowledged this limitation, but said: "we believe once standard-RET is solved those concerns become easier to address separately in the future. In any case a substantial reduction of gadgets is powerful".

Using the compiler to insert the hardening code greatly eases the task of applying RETGUARD to both the OpenBSD kernel and its user-space code. At least, that is true for code written in a high-level language. Any code written in assembly must be changed by hand, though, which is a fair amount of work. De Raadt and company have done that work; he reports that: "We are at the point where userland and base are fully working without regressions, and the remaining impacts are in a few larger ports which directly access the return address (for a variety of reasons)". It can be expected that, once these final issues are dealt with, OpenBSD will ship with this hardening enabled.

It makes sense to ask whether this relatively straightforward hardening technique could be applied to the Linux kernel as well. Using LLVM to build the kernel is not yet a viable option, but it should be possible to reimplement the RETGUARD transformations as a GCC plugin module. The tiresome task of fixing up the assembly code would also need to be done; the objtool utility could probably be pressed into service to help with this task. But the patch that emerged would not be small.

If any benchmarks have been run to determine the cost of using RETGUARD, they have not been publicly posted. The extra code will make the kernel a little bigger, and the extra overhead on every function is likely to add up in the end. But if this technique can make the kernel that much harder to exploit, it may well justify the extra execution overhead that it brings with it. All that's needed is somebody to actually do the work and try it out.

Comments (29 posted)

Development statistics for the 4.13 kernel

By Jonathan Corbet
August 24, 2017
As of this writing, the 4.13 kernel appears headed toward release on September 3, after a nine-week development cycle. It must, therefore, be about time for a look at the statistics for this development cycle. The picture that results shows a fairly typical kernel cycle with, as usual, few surprises.

Midway between 4.13-rc6 and 4.13-rc7, 12,677 non-merge changesets had found their way into the mainline. That makes 4.13 the smallest cycle since 4.7, which finished with 12,283 changesets. Chances are, though, that this cycle will surpass 4.11 (12,724) by the time it is done. So, while there may be signs of a (northern hemisphere) summer slowdown, 4.13 remains generally comparable with its predecessors with respect to patch volume.

1,634 developers have contributed during this cycle, a significant drop from the record set with 4.12 (1,825) but comparable with 4.10 (1,647). The most active of those developers were:

Most active 4.13 developers

By changesets
  Christoph Hellwig        252    2.0%
  Mauro Carvalho Chehab    184    1.5%
  Thomas Gleixner          151    1.2%
  Arnd Bergmann            138    1.1%
  Takashi Iwai             134    1.1%
  Chris Wilson             130    1.0%
  Colin Ian King           123    1.0%
  Arvind Yadav             123    1.0%
  Al Viro                  117    0.9%
  Masahiro Yamada          113    0.9%
  Kuninori Morimoto        102    0.8%
  Jakub Kicinski            99    0.8%
  Johannes Berg             98    0.8%
  Dan Carpenter             93    0.7%
  Vivien Didelot            90    0.7%
  Paul E. McKenney          83    0.7%
  Geert Uytterhoeven        82    0.6%
  Andy Shevchenko           77    0.6%
  Kees Cook                 76    0.6%
  Nicholas Piggin           72    0.6%

By changed lines
  Alex Deucher          279567   29.9%
  Mauro Carvalho Chehab  32256    3.5%
  Robert Bragg           22511    2.4%
  Steve Longerbeam       12486    1.3%
  Stanimir Varbanov      11236    1.2%
  Christoph Hellwig      10187    1.1%
  Michal Kalderon         9818    1.1%
  Yuval Mintz             9373    1.0%
  Lionel Landwerlin       8960    1.0%
  Igor Mitsyanko          8485    0.9%
  John Johansen           7806    0.8%
  Mika Westerberg         7004    0.7%
  Chris Wilson            6723    0.7%
  Ben Skeggs              6305    0.7%
  Hans de Goede           5975    0.6%
  Geert Uytterhoeven      5722    0.6%
  Gilad Ben-Yossef        5580    0.6%
  Al Viro                 5478    0.6%
  Ilan Tayari             5215    0.6%
  Serge Semin             4978    0.5%

The top contributor of changesets this time around was Christoph Hellwig, who made significant improvements all over the filesystem and block I/O layers. Mauro Carvalho Chehab continues to be a relentless generator of patches in his role as the media subsystem maintainer; many of his changes touched the documentation directory as well. Thomas Gleixner was busy in the interrupt-handling and timer code, Arnd Bergmann (as usual) contributed fixes all over the tree, and Takashi Iwai made many changes as the maintainer of the audio subsystem.

Once again, Alex Deucher topped the "lines changed" column by adding yet another massive set of AMD GPU register definitions. Robert Bragg, instead, added a bunch of i915 register configurations. Steve Longerbeam and Stanimir Varbanov both added media subsystem drivers.

As has been the case in recent cycles, the developers appearing in these lists are generally not working on the staging tree. That is a significant change from a few years ago, when staging work was the source of many of the changesets going into the mainline kernel. One might almost be tempted to believe that the staging tree has done what it was meant to do, and the bulk of those out-of-tree drivers have now been merged. More likely, though, is that this is just a lull in staging work; substandard drivers are in anything but short supply.

A minimum of 203 employers supported work on the code that was merged for 4.13, a fairly normal number (though, once again, a significant drop from 4.12, which had support from 233). The most active of those employers were:

Most active 4.13 employers

By changesets
  Intel                   1474   11.6%
  (None)                   887    7.0%
  (Unknown)                756    6.0%
  Red Hat                  750    5.9%
  IBM                      537    4.2%
  SUSE                     495    3.9%
  Linaro                   475    3.7%
  Google                   416    3.3%
  AMD                      410    3.2%
  (Consultant)             389    3.1%
  Renesas Electronics      331    2.6%
  Samsung                  323    2.5%
  Mellanox                 281    2.2%
  Oracle                   274    2.2%
  ARM                      265    2.1%
  Free Electrons           232    1.8%
  Canonical                203    1.6%
  Cavium                   201    1.6%
  Broadcom                 178    1.4%
  linutronix               172    1.4%

By lines changed
  AMD                   296975   31.8%
  Intel                  79179    8.5%
  (None)                 53207    5.7%
  Red Hat                40166    4.3%
  Samsung                36962    4.0%
  Cavium                 32397    3.5%
  Linaro                 30870    3.3%
  (Unknown)              30295    3.2%
  IBM                    21185    2.3%
  Mellanox               19441    2.1%
  Renesas Electronics    17946    1.9%
  (Consultant)           14005    1.5%
  Free Electrons         13043    1.4%
  Mentor Graphics        12768    1.4%
  SUSE                   12742    1.4%
  Google                 12288    1.3%
  ARM                    11466    1.2%
  Texas Instruments      10149    1.1%
  ST Microelectronics     9062    1.0%
  Broadcom                8945    1.0%

Once again, there are few surprises here; these lists don't change much from one cycle to the next.

One thing we have occasionally commented on over the years is a perceived decrease in the contributions from developers working on their own time. The 887 changes known to be from volunteers in 4.13 make up 7% of the total, a relatively low number. But perhaps percentages are not the right unit here. Looking at the absolute count of changesets from volunteers since 3.0 was released in July 2011 reveals a trend like this:

[Changesets from volunteers]

That plot does suggest an overall decrease in the number of patches received from developers working on their own time. But it may not be an entirely accurate picture. The table above also shows 756 changes coming from developers with unknown affiliation. There were 263 such developers participating in the 4.13 development cycle, contributing an average of just under three patches each; 165 of them contributed a single patch. One could well argue that the bulk of this group is highly likely to fit into the "volunteers" category. Some of them may well be doing kernel patches at work, but it's clearly not a significant part of their job.

If one plots the number of changesets coming from both known volunteers and shadowy mysterious developers, the result is:

[Changesets from volunteers]

That line looks rather more level, suggesting that the number of changes contributed by volunteers has remained roughly the same over the last six years. Note that the overall changeset volume has increased significantly over this period; the 3.0 development cycle had 9,153 changesets, for example. So, while the volume of changes going into the kernel is increasing, the volume from volunteer developers cannot be said to be increasing with it — but, perhaps, it is not shrinking either.

Overall, the kernel-development machine continues to hum along, cranking out a new kernel every nine or ten weeks. The predictability of the process may lead to relatively boring statistics articles, but predictability is a good thing in a critical low-level system component.

Comments (12 posted)

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

By Jonathan Corbet
August 28, 2017
Like most actively developed programs, the kernel grows over time; there have only been two development cycles ever (2.6.36 and 3.17) where the kernel as a whole was smaller than its predecessor. The kernel's internal API tends to grow in size and complexity along with the rest. The good thing about the internal API, though, is that it is completely under the control of the development community and can be changed at any time. Among other things, that means that parts of the kernel's internal API can be removed if they are no longer needed — or if their addition in the first place is deemed to be a mistake. A pair of pending removals in the memory-management area shows how this process can work.

GFP_TEMPORARY

One of the many challenges faced by the kernel's memory-management subsystem is fragmentation. If allocations are not placed carefully, the system's free memory can end up split into many small chunks that cannot be coalesced; that can lead to allocations failing even though much of the system's memory is idle. This is particularly true of memory allocations for use by the kernel itself. Those allocations can be long-lived and there is usually no way to relocate them if they are in the way. A single small allocation can prevent the reuse of an entire page; that, in turn, can block the creation of larger chunks of memory around that page.

It has long been understood that not all kernel memory allocations are equal. Some data structures are critical to the operation of the system and cannot be removed; consider the structures describing a mounted filesystem or a running process, for example. Others, though, exist to improve the system's performance and can be dropped if needed; the inode and dentry caches in the virtual filesystem layer are perhaps the biggest examples of this type of structure. The latter type of structure is called "reclaimable".

A key heuristic used within the memory-management subsystem is to try to separate reclaimable and non-reclaimable allocations. A page full of reclaimable allocations can, in theory at least, be recovered for other uses when memory is tight. But a single non-reclaimable allocation will prevent the reuse of the entire page. Separating the two types increases the probability that pages containing reclaimable allocations can, in truth, be reclaimed should the need arise.

Back in 2007, Mel Gorman added the GFP_TEMPORARY allocation type in an attempt to make memory allocation more flexible. The reasoning was this: some memory allocations last a long time, while others are highly transient. A structure allocated to represent a newly added device may persist for the lifetime of the system, while memory allocated to satisfy a system call may be returned within milliseconds. When an allocation is short-lived, it doesn't matter whether it is reclaimable or not; since it will be returned shortly regardless, it is unlikely to hold up the reclaim of a page full of otherwise reclaimable allocations. So GFP_TEMPORARY allocations were allowed to draw from the reclaimable pool, even though there was no mechanism by which they could be reclaimed.
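A typical call site looked like an ordinary short-lived allocation. The fragment below is a hypothetical example of the pattern (the function and buffer are invented for illustration), with GFP_KERNEL being the obvious substitution once the flag is gone:

    #include <linux/errno.h>
    #include <linux/slab.h>

    /*
     * Hypothetical example of the GFP_TEMPORARY pattern: a scratch
     * buffer allocated to service a single request and freed before
     * returning. Because the allocation is transient, it was allowed
     * to sit among reclaimable allocations even though nothing could
     * actually reclaim it. With the flag removed, GFP_KERNEL is the
     * direct replacement.
     */
    static int handle_request(size_t len)
    {
        char *buf = kmalloc(len, GFP_TEMPORARY);    /* now: GFP_KERNEL */

        if (!buf)
            return -ENOMEM;
        /* ... fill and use buf for this request only ... */
        kfree(buf);
        return 0;
    }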

Earlier this year, GFP_TEMPORARY was the subject of an extensive discussion that was, at the start, focused on a seemingly simple question: what does "temporary" mean? Is there a limit on how long the allocation can be held? Is the holder of a GFP_TEMPORARY allocation allowed to block or take locks? It turns out that this was not specified when GFP_TEMPORARY was added. The discussion failed to fill that void, and a review of the GFP_TEMPORARY call sites in the kernel revealed some decidedly non-temporary uses. It became clear that nobody really knew what a "temporary" allocation was supposed to be.

There was talk of trying to nail down that definition, but Michal Hocko pushed a different approach: remove GFP_TEMPORARY entirely. The current uses, he said, did not justify keeping it around:

I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning.

There were a handful of complaints about the loss of the flag, but no serious opposition to the change. Other developers, including Neil Brown, agreed with the change:

If we have a flag that doesn't have a well defined meaning that actually affects behavior, it will not be used consistently, and if we ever change exactly how it behaves we can expect things to break. So it is better not to have a flag, than to have a poorly defined flag.

He suggested improving the kernel's notion of reclaimability of allocations instead. That may happen in the future, but the removal of GFP_TEMPORARY is set to happen more quickly. The patches are in linux-next now, meaning they are on track to hit the mainline during the 4.14 merge window. Should that happen, GFP_TEMPORARY will itself prove to have been temporary — for a "ten years" value of "temporary".

dma_alloc_noncoherent()

The allocation of memory for direct memory access (DMA) operations is not as simple as it might seem. Devices often have a different view of memory than the CPU does, and allocations for DMA must bridge that gap. These allocations must usually be physically contiguous, and they have to be in a region of memory that the target device is able to access, for example. An interesting additional requirement is handled by dma_alloc_noncoherent():

    void *dma_alloc_noncoherent(struct device *dev, size_t size,
                                dma_addr_t *dma_handle, gfp_t flag);

A call to dma_alloc_noncoherent() is an explicit request to allocate a DMA buffer in a noncoherent region of memory. Memory that is cache-coherent looks the same to both the CPU and I/O devices. If the CPU writes to that memory, its writes will be visible to the device; similarly, if the device writes a region of memory, the CPU will immediately see the new data. Noncoherent memory lacks that guarantee; if the CPU wants to read data placed into memory via a DMA operation, it must take care to invalidate its own memory caches after the completion of the I/O operation, but before its first access.

Noncoherent memory is clearly trickier to work with; without sufficient care, it is easy to end up with corrupted data. So one might wonder why anybody would want to ask for it specifically. The answer is that on architectures where cache coherence doesn't come naturally (ARM, for example), coherent memory is far slower. Turning on coherence generally involves turning off caching, with a predictable effect on performance. For situations involving any significant data processing, using coherent memory is just not an option.

It is thus important to be able to allocate noncoherent memory for DMA buffers, which raises the question of why Christoph Hellwig is working to remove dma_alloc_noncoherent(). The answer is that, on any reasonably current system, control over memory access modes is more sophisticated than simply turning caching on or off. Memory can be configured to allow write combining (where multiple write operations can be grouped by the hardware for performance), for example, or it can be set to allow operations to be reordered. Many of these features can be configured together. Creating a new allocation function for each combination is clearly unlikely to lead to joy, so the kernel developers added a new set of functions in the 3.4 development cycle, including:

    void *dma_alloc_attrs(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag,
                          unsigned long attrs);

The attrs field can be used to specify a whole range of attributes, including DMA_ATTR_NON_CONSISTENT to obtain a noncoherent mapping. This function clearly fills the same role as dma_alloc_noncoherent() and a lot more besides, so there is little reason to keep dma_alloc_noncoherent() around. Hellwig has been working to remove it, which means updating all of its callers to use dma_alloc_attrs() instead. Much of that work went in during the 4.13 merge window; only three call sites remain. His current patches remove those last three, along with the function itself. Any out-of-tree drivers using dma_alloc_noncoherent() will have to be updated separately, of course.
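For such a driver, the conversion is mechanical; a sketch of what the update looks like (the wrapper function and its parameters are hypothetical) might be:

    #include <linux/dma-mapping.h>

    /*
     * Hypothetical wrapper showing the conversion: DMA_ATTR_NON_CONSISTENT
     * requests the same noncoherent mapping that dma_alloc_noncoherent()
     * used to provide.
     */
    static void *alloc_dma_buffer(struct device *dev, size_t size,
                                  dma_addr_t *dma_handle)
    {
        /* Old call, on its way out of the tree:
         *    return dma_alloc_noncoherent(dev, size, dma_handle, GFP_KERNEL);
         */
        return dma_alloc_attrs(dev, size, dma_handle, GFP_KERNEL,
                               DMA_ATTR_NON_CONSISTENT);
    }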

In both cases, the kernel's internal API is getting (slightly) smaller, but no functionality is being lost. This work is an example of the sort of cleanups that are possible when there is no need to maintain API compatibility. Interfaces exposed to user space must be preserved, but the ability to evolve internally is a big part of why the kernel remains maintainable despite having just celebrated its 26th birthday.

Comments (7 posted)

printk() and KERN_CONT

By Jake Edge
August 30, 2017

A nearly year-old "fix" to the main logging function used in the kernel, printk(), changed the appearance of some log messages in an unexpected way, at least for some. Messages that had appeared on a single line will now be spread over multiple lines as each call to printk() begins a new line in the output unless the KERN_CONT flag is used. That is how a comment in the kernel code says it should work, but the change was made by Linus Torvalds without any discussion or fanfare, so it took some by surprise.

The printk() function is the workhorse of kernel output, for critical messages, warnings, information, and debugging. It is used in much the same way as printf() but there are some differences. For one thing, "log levels" can be prepended to the format string to specify the severity of the message. These range from KERN_EMERG to KERN_DEBUG and can be used as follows:

    printk(KERN_ALERT "CPU on fire: %d\n", cpu_num);

The log levels are simply strings that get concatenated with the format string, thus there is no comma between them. Another difference from printf() is in how a format string without a newline is treated, which is what has changed. The KERN_CONT "log level" is meant to indicate a continuation line; a printk() that lacks that flag is supposed to start a new line in the log—though that hasn't always been enforced.
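Put concretely, assembling one logged line from several calls requires the continuation flag on every call after the first; the helper below is invented purely for illustration:

    #include <linux/kernel.h>

    /*
     * Build one log line from several printk() calls: every call after
     * the first must carry KERN_CONT, or each fragment starts a new
     * line. The function and its arguments are hypothetical.
     */
    static void log_temperatures(const int *temps, int nr_cpus)
    {
        int i;

        printk(KERN_INFO "CPU temperatures:");
        for (i = 0; i < nr_cpus; i++)
            printk(KERN_CONT " %d", temps[i]);
        printk(KERN_CONT "\n");
    }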

Pavel Machek posted a query about that behavior on August 28. He noted that "printk("foo"); printk("bar"); seems to produce foo\nbar", which was both surprising and unwelcome. That led to a bit of a rant from Torvalds, who had made the change:

If you want to continue a line, you NEED to use KERN_CONT.

That has always been true. It hasn't always been enforced, though.

If you do two printk's and the second one doesn't say "I'm a continuation", the printk logic assumes you're just confused and wanted two lines.

But, as several pointed out, that behavior only changed relatively recently (for the 4.9 kernel released in December 2016); prior to that Machek's example would produce "foobar" as he expected. Lots of places in the kernel use printk() without KERN_CONT and expect to get output on a single line, Joe Perches pointed out. Perches also complained that Torvalds had, in fact, changed longstanding behavior and was not just enforcing something that had "always been true". But Torvalds pointed to a commit from 2007 that added the KERN_CONT flag, along with the following comment:

    /*
     * Annotation for a "continued" line of log printout (only done after a
     * line that had no enclosing \n). Only to be used by core/arch code
     * during early bootup (a continued line is not SMP-safe otherwise).
     */

While 2007 is not exactly "always", the comment certainly documents the intent of KERN_CONT, so not using it and expecting multiple calls to printk() to end up on the same line has not been right for nearly ten years. Torvalds was unapologetic about this recent change:

So yes, we're enforcing it now, and we're not going back to the unenforced times, because a decade of shit has shown that people didn't do it without being forced to.

In fact, he would like to get rid of the whole idea of continuation lines. They made some amount of sense when the output was just sent to a circular character buffer, he said, but printk() now has a log-based structure so continuation lines do not really work well in that environment. Beyond that, there is always the chance that some asynchronous action (e.g. an interrupt) outputs something that interferes with the single line of output. Instead, users should be marshaling their own output into single-line chunks and passing those to printk(), he said.

He went on to suggest that some helper functions be added to assist in places where that marshaling is needed. Users would provide their own buffer to these routines that would then call printk() when they have a full line.

That avoids the whole nasty issue with printk - printk wants to show stuff early (because _maybe_ it's critical) and printk wants to make log records with timestamps and loglevels. And printk has serious locking issues that are really nasty and fundamental.

That set off a discussion on various ways to implement what had been suggested. Various schemes to replace printk() with something "smarter" were batted down quickly by Torvalds. Steven Rostedt recommended using the kernel's seq_buf facility that is used for tracing and implementing /proc files. That idea seemed to gain traction among the other thread participants (including, crucially, Torvalds). So far, no patch set along those lines has been proposed, but it seems like a promising direction.
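A rough sketch of the kind of helper being discussed, using the existing seq_buf API to marshal a complete line before handing it to printk() (the wrapper function itself is hypothetical), might look like this:

    #include <linux/kernel.h>
    #include <linux/seq_buf.h>

    /*
     * Hypothetical line-marshaling helper: assemble a full line in a
     * caller-provided buffer using seq_buf, then emit it with a single
     * printk() call so that no partial-line output is ever logged.
     */
    static void print_cpu_list(const int *cpus, int nr)
    {
        char buf[128];
        struct seq_buf s;
        int i;

        seq_buf_init(&s, buf, sizeof(buf));
        seq_buf_printf(&s, "online CPUs:");
        for (i = 0; i < nr; i++)
            seq_buf_printf(&s, " %d", cpus[i]);

        printk(KERN_INFO "%.*s\n", (int)seq_buf_used(&s), buf);
    }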

No matter what happens, there are going to be multiple changes to fix the output in places where KERN_CONT was not used but should have been. If the seq_buf interface is going to be used, it would make sense to do that directly, rather than add a bunch of KERN_CONT flags in various places. Once that is done, perhaps the existing uses of KERN_CONT could be tackled to get rid of as many of those as possible—leaving only those used at boot time as was originally planned.

Clearly Torvalds doesn't think twice about breaking things internal to the kernel in order to enforce something he sees as important. Doing so silently, though, as happened here, might not have been the best approach. Had there been discussion of the patch on the mailing list, it would at least have given folks a chance to realize what was up. That might have eliminated Machek's query and perhaps reduced Torvalds's blood pressure a bit.

Comments (10 posted)

Fedora's Boltron preview

By Jake Edge
August 30, 2017

In many ways, distributions shackle their users to particular versions of tools, libraries, and frameworks. Distributions do not do that to be cruel, of course, but to try to ensure a consistent and well-functioning experience across all of the software they ship. But users have often chafed at these restrictions, especially for the fast-moving environments surrounding various web frameworks and their dependencies. Fedora has been making an effort to make it easier for a single system to support these kinds of environments with its Modularity initiative. In late July, Fedora announced a preview release of the server side of the Modularity equation, Boltron, which is a version of the distribution that supports the initiative.

In order to be able to support multiple versions of various packages, those packages need to be created and made available to Fedora installations. That's where Boltron comes in. First, the packages need to be decoupled from a Fedora release, which is how they are delivered today. This was done by adjusting the existing tools rather than starting over. The Modularity working group did so to try to reduce the effects of the changes:

The Working Group also took on the requirement to impact the Fedora Infrastructure, user base, and packager community as little as possible. The group also wanted to increase quality and reliability of the Fedora distribution, and drastically increase the automation, and therefore speed, of delivery.

As a result, the group didn't treat this as a greenfield experiment that would take years to harden and trust. Instead, they kept the warts and wrinkles with the toolset, and implemented tools and procedures that slightly adjusted the existing systems to provide something new. Largely, the resultant package set can be thought of as virtualized, separate repositories. In other words, the client tooling (dnf) treats the traditional flat repo as if it was a set of repos that are only enabled when you want that version of the component.

This was done by modifying the existing tooling to support arbitrary branching within the package database so the service level (SL) and end of life (EOL) dates can be separated from a specific Fedora release. More information about arbitrary branching—how it would work and why it is needed—can be found in the focus document.

That document uses the example of the Django web framework and the Python Requests module. The packages in the database would no longer only have branches for each release (e.g. Fedora 25, EPEL 7), but would also have branches corresponding to the upstream release versions. That way, a Django module could have multiple versions that depended on particular versions of its dependencies. So, for example, Django module 1.9 could depend on Requests 2.12, while Django module 1.10 could depend on the master branch for Requests (or, perhaps, 2.13 if following the upstream master is deemed too risky).

There is another side to Modularity, though. There needs to be a way to allow incompatible modules to coexist on a single system. There are a number of ways to support that, each with its own set of tradeoffs. Instead of coming up with yet another solution, the working group is focused on supporting existing mechanisms, such as Flatpak and containers of various sorts (Open Container Initiative (OCI) containers and System containers).

As the announcement notes, the Boltron preview is "somewhere between a Spin and a preview for the future of Fedora Server Edition". There is a wealth of information on the Modularity page for those interested in trying it out or simply for curiosity's sake. Boltron is shipped as a container image that can be downloaded from the Fedora Docker registry as described in the "getting started" guide. There is also a guided walkthrough that serves both as a guide to using Boltron and as a way for the working group to collect feedback on the changes.

A module is simply a collection of related packages that work together to perform some task or provide some service. That could be a web server and a selected group of modules or add-ons, a particular version of a web framework and its dependencies, or a particular language version with any needed runtime components. Multiple versions of those modules can be created, each with its own stability level and lifetime. Each of those versions is handled as a different "stream" of the module.

In Boltron, an updated version of the now familiar dnf command is used to manage modules. For example:

    # dnf install httpd
    # dnf install @httpd
    # dnf module install httpd

The first of these would install the httpd module, if one exists, or fall back to installing the httpd package. The latter two are requesting the httpd module only (the middle reuses the existing "group" syntax for dnf). Individual streams can be chosen for installation by appending the stream name (e.g. httpd-2.4). When a dnf update is done, updates are only taken from the appropriate stream. There are ways to install different profiles of modules (for a database server versus client, say) or to simply install certain packages from a module. Eventually, packages will only come from modules, so a module must be enabled before individual packages can be installed from it.

As the "preview" term would imply, there's not much more to Boltron at this point. There are some 25 modules that have the same stream as that of the regular packages for Fedora 26. So far, the only module with multiple streams is for Node.js, with version 8 being available in the nodejs-8 stream. The intent is that more modules and streams will be added so that Fedora 27 servers can be composed by picking and choosing modules and streams to fit their intended use cases. Containers would presumably be used to manage multiple conflicting modules. There is, clearly, plenty more to be worked on.

The Modularity effort is a bold rethinking of how Fedora is built, used, and managed, as we have noted in some previous articles along the way. For a year or more, Modularity has largely just been an idea and a few, somewhat confusing diagrams, at least from the perspective of Fedora users. We are finally starting to see some of the behind-the-scenes efforts bear fruit. It will be interesting to watch and see where it all leads.

Comments (9 posted)

Page editor: Jonathan Corbet


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds