
LWN.net Weekly Edition for December 21, 2017

Welcome to the LWN.net Weekly Edition for December 21, 2017

This edition contains the following feature content:

  • A 2017 retrospective: a look back at our January predictions and at LWN's year.
  • Python 3, ASCII, and UTF-8: two PEPs bound for Python 3.7 aim to reduce Unicode errors in misconfigured environments.
  • The current state of kernel page-table isolation: the KPTI patches approach their final form.
  • Shrinking the kernel with link-time garbage collection: the first in a series on making Linux fit into tiny systems.
  • Demystifying container runtimes: what container runtimes are, what they do, and how to choose among them.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

This is the final LWN Weekly Edition for 2017; we will, as usual, be taking the final week of the year off. We look forward to returning on January 4, 2018.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A 2017 retrospective

By Jonathan Corbet
December 20, 2017
The December 21 LWN Weekly Edition will be the final one for 2017; as usual, we will take the last week of the year off and return on January 4. It's that time of year where one is moved to look back over the last twelve months and ruminate on what happened; at LWN, we also get the opportunity to mock the predictions we made back in January. Read on for the scorecard and a year-end note from LWN.

Your editor led off with a prediction that group maintainer models would be adopted by more projects over the course of the year; this prediction was partly motivated by the Debian discussion on the idea of eliminating single maintainers. Debian appears to have dropped the idea; Fedora, meanwhile, has seen some strong pushback from maintainers who resent others touching "their" packages. Group maintainership may have made a few gains here and there, but it has not yet succeeded in taking over the free-software world.

The prediction that the vendor kernels shipped on Android devices would move closer to the mainline was not a complete failure. Google has made some efforts to push vendors toward less-ancient kernels, and efforts to get those vendors to work more closely with the mainline are beginning to bear fruit. It will be a long and slow process, though.

There is no easier way to come up with a safe prediction than to suggest that security problems will get worse. Sure enough, we had plenty of vulnerabilities at all levels. What the prediction arguably missed was the increasing number of vulnerabilities turning up at the level that most of us consider to be "hardware". The Intel management engine problems are perhaps the highest-profile vulnerabilities of this type that turned up in 2017, but they aren't the only ones. There are clear indications that more issues of this type will come to light in the near future. What we think of as "hardware" is increasingly made up of software with all of the problems that afflict software at higher levels.

A related prediction said that the free-software community's security story would improve over the course of the year. That has certainly come true; we have seen an increase in development on hardening technologies, use of fuzzers to find vulnerabilities, and so on. We have a long way to go, but things are heading in the right direction.

With regard to the disintermediation of distributions: your editor has noted an apparent increase in applications that can only be installed by way of language-specific package managers or some sort of container system. There appears to be momentum behind technologies like Snap or Flatpak, and some distributions appear to be working to disintermediate themselves. The role of distributions remains strong for most Linux desktop and server users, though, and that won't change anytime soon.

And yes, as predicted, we're tired of hearing about containers, but we're far from done on that front.

Beyond that, the predicted ruling in the VMware GPL-infringement suit has not yet come out. Protocol ossification remains an issue, and it is pushing the net toward new protocols that can masquerade as old protocols. Thus, for example, QUIC now accounts for a significant fraction of Internet traffic thanks to its use in the Chrome browser. See also this article (discussed on LWN in December) on how various other protocols are changing. Meanwhile, as predicted, the kernel community is coming to a better understanding of the changes required to perform properly on systems with persistent memory which, we're told, will be widely available sometime soon, honest.

LWN in 2017 and beyond

In 2017, LWN put out 50 weekly editions containing 258 feature articles written by LWN staff and 91 articles from 26 outside authors. We had coverage from 25 different conferences this year, for which we would like to acknowledge the Linux Foundation which, as LWN's travel sponsor, made all of that conference coverage possible. This support has been made available with no strings or even hints that we should cover certain events, and we appreciate it.

This year saw the first significant reorganization of the LWN Weekly Edition since 2002. Some readers clearly like the new arrangement; others are ... less pleased. Running all of our feature content when it's ready (rather than burying it in the edition) seems like a clear improvement. The organization of the edition itself probably has not found its final form, but we don't really know what that final form might be yet. That will take a while (and probably some additional staff) to work out. Which leads to the final topic...

One of the things we did not foresee in 2017 is the sad demise of Linux Journal, which has been covering our community since the earliest days. That publication joins a long list of others — The H, NewsForge, KernelTrap, LinuxWorld, Kernel Traffic, NTK, Linux Gazette, RootPrompt, Linux Action Show, LinuxDevices, Linux Voice, and so on — that are no longer active. The publication market is a challenging place to be, even if you're not trying to keep up with a fast-moving, geographically dispersed, highly technical community like ours. Many who have tried to run a free-software-oriented publication have failed to thrive in the long run.

LWN, happily, is still here and is not in danger of going away. We will not be announcing a pivot to video anytime soon. Things could always be better, of course, but LWN is on solid financial ground, thanks to you, our readers, who have continued to support us for all these years. Our biggest problem, instead, is staffing. It takes a lot of effort to keep an operation like LWN going, and we are short-handed; we would still very much like to hire additional writers/editors to work with us. If anybody knows somebody with skills in this area, please encourage them to contact us.

Meanwhile, we'll muddle along into 2018 — the year in which LWN will celebrate its 20th anniversary. We wish the best of holidays for all of our readers and look forward to rejoining you after the new year. It is an honor to write for such an audience, and we thank you all for staying with us.

Comments (15 posted)

Python 3, ASCII, and UTF-8

By Jake Edge
December 17, 2017

The dreaded UnicodeDecodeError exception is one of the signature "features" of Python 3. It is raised when the language encounters a byte sequence that it cannot decode into a string; strictly treating strings differently from arrays of byte values was something that came with Python 3. Two Python Enhancement Proposals (PEPs) bound for Python 3.7 look toward reducing those errors (and the related UnicodeEncodeError) for environments where they are prevalent—and often unexpected.
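
For example, any attempt to decode bytes that are not valid in the assumed encoding raises the exception. A quick illustration (traceback trimmed), assuming a Python 3 interpreter installed as python3:

    $ python3 -c 'print(b"caf\xc3\xa9".decode("ascii"))'
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)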

Two related problems are being addressed by PEP 538 ("Coercing the legacy C locale to a UTF-8 based locale") and PEP 540 ("Add a new UTF-8 Mode"). The problems stem from the fact that locales are often incorrectly specified and that the default locale (the "POSIX" or "C" locale) specifies an ASCII encoding, which is often not what users actually want. Over time, more and more programs and developers are using UTF-8 and are expecting things to "just work".

PEP 538

PEP 538 was first posted by its author, Nick Coghlan, back in January 2017 on the Python Linux special interest group mailing list; after a stop on python-ideas, it made its way to the python-dev mailing list in May. Naoki Inada had been designated as the "BDFL-delegate" for the PEP (and for PEP 540); a BDFL-delegate sometimes is chosen to make the decision for PEPs that Guido van Rossum, Python's benevolent dictator for life (BDFL), doesn't have the time, interest, or necessary background to pass judgment on. Inada accepted PEP 538 for inclusion in Python 3.7 back in May.

As demonstrated in the voluminous text of PEP 538, many container images do not set the locale of the distribution, which means that Python defaults to the POSIX locale, thus to ASCII. This is unexpected behavior. Developers may well find that their local system handles UTF-8 just fine:

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

However, running that in a container on the same system (using a generic distribution container image) may fail with a UnicodeEncodeError. The LC_CTYPE locale environment variable can fix the problem, but it must be set to C.UTF-8 (or variants that are available on other Unix platforms) inside the container. The PEP notes that new application distribution formats (e.g. Flatpak, Snappy) may suffer from similar problems.
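
As a rough sketch of both the failure and the workaround (the image name some-image is purely hypothetical, the output is abbreviated, and the fix assumes the image's C library provides the C.UTF-8 locale):

    $ docker run --rm some-image python3 -c 'print("ℙƴ☂ℌøἤ")'
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character ...
    $ docker run --rm -e LC_CTYPE=C.UTF-8 some-image python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ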

So Python 3.7 and beyond will determine whether the POSIX/C locale is active at startup time and will switch LC_CTYPE to an appropriate UTF-8 setting if one is available. It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a defaulted POSIX/C locale with UTF-8. That setting will be inherited by any subprocesses or other components that get run from the initial interpreter.

This change could, of course, cause problems for those who are actually expecting (and wanting) to get ASCII-only behavior. The PEP changes the interpreter so it will emit a warning noting that it has changed the locale if it does so. In addition, it will also emit a warning if it does not change the locale because it has been explicitly set to the (legacy) POSIX/C locale. These warnings do not go through the regular warnings module, so that -Werror (which turns warnings into errors) will not cause the program to exit; it was found during testing that doing so led to various problems.

This idea rests on two fundamental assumptions that are laid out in the PEP:

  • in desktop application use cases, the process locale will already be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer
  • in network service development use cases (especially those based on Linux containers), the process locale may not be configured at all, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does

Beyond that, locale coercion will default to the surrogateescape error handler for sys.stdin and sys.stdout for the new coerced UTF-8 locales. As described in PEP 383, surrogateescape effectively turns strings containing characters that cannot be decoded using the current encoding into their equivalent byte values, rather than raising a decoding exception as the strict error handler would.
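
A minimal interactive sketch of what that handler does, independent of the locale machinery (interpreter banner omitted): bytes that cannot be decoded become lone surrogate code points, which can later be encoded back into the original bytes.

    $ python3
    >>> b = b"caf\xc3\xa9"
    >>> s = b.decode("ascii", "surrogateescape")
    >>> s
    'caf\udcc3\udca9'
    >>> s.encode("ascii", "surrogateescape") == b
    True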

The effect of the PEP is meant to do what is actually generally expected by users. The warnings should help alert users (and distributors and the like) if the assumptions are not correct in their environment.

PEP 540

The second PEP has only recently been accepted for 3.7; though it began around the same time as the first, it languished for some time. It was brought back up by its author, Victor Stinner, in early December 2017. But that version was rather large and unwieldy, leading Van Rossum to say:

I am very worried about this long and rambling PEP, and I propose that it not be accepted without a major rewrite to focus on clarity of the specification. The "Unicode just works" summary is more a wish than a proper summary of the PEP.

Stinner acknowledged that and posted a much shorter version a day later. Inada expressed interest in accepting it, but there were some things that still needed to be worked out.

The basic idea is complementary to PEP 538. In fact, there are fairly subtle differences between the two, which has led to some confusion along the way. PEP 540 creates a "UTF-8 mode" that is decoupled from the locale of the system. When UTF-8 mode is active, the interpreter acts much the same as if it had been coerced into a new locale (à la PEP 538), except that it will not export those changes to the environment. Thus, subprocesses will not be affected.

In some ways, UTF-8 mode is more far-reaching than locale coercion. In environments where there is no locale support at all, or where no suitable UTF-8 locale can be found, locale coercion will not work. But UTF-8 mode will be available to fill the gap.

UTF-8 mode will be disabled by default, unless the POSIX/C locale is active. It can be enabled by way of the "-X utf8" command line option or by setting PYTHONUTF8=1 in the environment (the latter will affect subprocesses, of course). Since the POSIX locale has ASCII encoding, UTF-8 (which is ASCII compatible at some level) is seen as a better encoding choice, much like with PEP 538.
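
As a quick sketch (assuming a Python 3.7 interpreter installed as python3.7 and a UTF-8-capable terminal), the two ways of enabling the mode explicitly look like this:

    $ python3.7 -X utf8 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ PYTHONUTF8=1 python3.7 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ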

The confusion between the two PEPs led to both referring to the other in order to make it clear why both exist and why both are needed. PEP 538 explains the main difference:

PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other locale-aware components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like those for Go, Node.js (V8), and Rust.

Beyond that, PEP 538 goes into an example of the Python read-eval-print loop (REPL), which uses GNU Readline. In order for Readline to properly handle UTF-8 input, it must have the locale propagated in LC_CTYPE, as locale coercion does. On the other hand, some situations may not want to push the locale onto subprocesses, as PEP 540 notes:

The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python's encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings.

The two earliest versions of PEP 540 (it went through four versions on the way to acceptance) proposed changing to the surrogateescape error handler for files returned by open(), but Inada was concerned about that. A common mistake that new Python programmers make is to open a binary file without using the "b" option to open(). Currently, that would typically generate lots of decoding errors, which would lead the programmer to spot their error. Using surrogateescape would silently mask the problem; it is also inconsistent with the locale coercion behavior, which does not change open() at all. After some discussion, Stinner changed the PEP to keep the strict error handler for files returned by open().
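
The mistake Inada was concerned about looks something like the following sketch, which reads an arbitrary binary file in text mode under a UTF-8 locale; the exact byte and position reported will vary, but the strict handler makes the error visible right away:

    $ python3 -c 'open("/bin/ls").read()'
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte ...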

So Python 3.7 will work better than previous versions with respect to Unicode character handling. While the switch to Unicode was one of the most significant features of Python 3, there have been some unintended and unexpected hiccups along the way. In addition, the computing landscape has changed in some fundamental ways over that time, especially with the advent of containers. Strict ASCII was once a reasonable choice in many environments, but that has been steadily changing, even before Python 3 was released just over nine years ago.

Comments (83 posted)

The current state of kernel page-table isolation

By Jonathan Corbet
December 20, 2017
At the end of October, the KAISER patch set was unveiled; this work separates the page tables used by the kernel from those belonging to user space in an attempt to address x86 processor bugs that can disclose the layout of the kernel to an attacker. Those patches have seen significant work in the weeks since their debut, but they appear to be approaching a final state. It seems like an appropriate time for another look.

This work has since been renamed to "kernel page-table isolation" or KPTI, but the objective remains the same: split the page tables, which are currently shared between user and kernel space, into two sets of tables, one for each side. This is a fundamental change to how the kernel's memory management works and is the sort of thing that one would ordinarily expect to see debated for years, especially given its associated performance impact. KPTI remains on the fast track, though. A set of preparatory patches was merged into the mainline after the 4.15-rc4 release — when only important fixes would ordinarily be allowed — and the rest seems destined for the 4.16 merge window. Many of the core kernel developers have clearly put a lot of time into this work, and Linus Torvalds is expecting it to be backported to the long-term stable kernels.

KPTI, in other words, has all the markings of a security patch being readied under pressure from a deadline. Just in case there are any smug ARM-based readers out there, it's worth noting that there is an equivalent patch set for arm64 in the works.

51 patches and counting

As of this writing, the x86 patch series is at version 163. It contains 51 patches, so we can all be grateful that most of the intervening versions were not posted publicly. The initial patch set, posted by Dave Hansen, has been extensively reworked by Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, and Hugh Dickins, with suggestions from many others. Any bugs that remain in this work won't be there as the result of a lack of experienced eyeballs on the code.

Page tables on contemporary systems are organized in a tree-like structure that makes for efficient storage of a sparse memory map and supports the huge pages feature; see this 2005 article for more details and a diagram of how it works. On a system with four levels of page tables (most largish systems, these days), the top level is the page global directory (PGD). Below that come the page upper directory (PUD), page middle directory (PMD), and page-table entries (PTE). Systems with five-level page tables insert a level (called the P4D) just below the PGD.

Page-fault resolution normally traverses this entire tree to find the PTE of interest, but huge pages can be represented by special entries at the higher levels. For example, a 2MB chunk of memory could be represented by either a single huge-page entry at the PMD level or a full page of single-page PTE entries.

In current kernels, each process has a single PGD; one of the first steps taken in the KPTI patch series is to create a second PGD. The original remains in use when the kernel is running; it maps the full address space. The second is made active (at the end of the patch series) when the process is running in user space. It points to the same directory hierarchy for pages belonging to the process itself, but the portion describing kernel space (which sits at the high end of the virtual address space) is mostly absent.

Page-table entries contain permission bits describing how the memory they describe can be accessed; these bits are, naturally, set to prevent user space from accessing kernel pages, even though those pages are mapped into the address space. Unfortunately, a number of hardware-level bugs allow a user-space attacker to determine whether a given kernel-space address is mapped or not, regardless of whether any page mapped at that address is accessible. That information can be used to defeat kernel address-space layout randomization, making life much easier for a local attacker. The core idea behind KPTI is that switching to a PGD lacking a kernel-space mapping will defeat attacks based on these vulnerabilities, of which we have apparently not yet seen the last.

Details

The idea is simple but, as is so often the case, there are a number of troublesome details that turn a simple idea into a 51-part patch series. The first of those is that, if the processor responds to a hardware interrupt while running in user mode, the kernel code needed to deal with the interrupt will no longer exist in the address space. So there must be enough kernel code mapped in user mode to switch back to the kernel PGD and make the rest available. A similar situation exists for traps, non-maskable interrupts, and system calls. This code is small and can be isolated from the rest, but there are a number of tricky details involved in handling that switch safely and efficiently.

Another complication comes in the form of the x86 local descriptor table (LDT), which can be used to change how the user-space memory layout looks. It can be tweaked with the little-known modify_ldt() system call. The early POSIX threads implementation on Linux used the LDT to create a thread-local storage area, for example. On current Linux systems, the LDT is almost unused but some applications (Wine, for example) still need it. When it is used, the LDT must be available to both kernel and user space, but it must live in kernel space. The KPTI patch set shuffles kernel memory around to reserve an entire entry at the PGD level for the LDT; the space available for vmalloc() calls shrinks to a mere 12,800TB as a result. That allows space for a large number of LDTs, needed on systems with many CPUs. One result of this change is that the location of the LDT is fixed and known to user space — a potential problem, since the ability to overwrite the LDT is easily exploited to compromise the system as a whole. The final patch in the series maps the LDT read-only in an attempt to head off any such attacks.

Another potential vulnerability comes about if the kernel can ever be manipulated into returning to user space without switching back to the sanitized PGD. Since the kernel-space PGD also maps user-space memory, such an omission could go unnoticed for some time. The response here is to map the user-space portion of the virtual address space as non-executable in the kernel PGD. Should user space ever start running with the wrong page tables, it will immediately crash as a result.

Finally, while all existing x86 processors are seemingly affected by information-disclosure vulnerabilities, future processors may not be. KPTI comes with a measurable run-time cost, estimated at about 5%. That is a cost that some users may not want to pay, especially once they get newer processors that lack these problems. There will be a nopti command-line option to disable this mechanism at boot time. The patch series also adds a new "feature" flag (X86_BUG_CPU_INSECURE) to indicate vulnerable CPUs; it is set on all x86 CPUs currently, but might not be on future hardware. In the absence of this feature flag, page-table isolation will automatically be turned off.
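
As a sketch of what disabling it would look like on a GRUB-based system (paths and tooling vary by distribution, and the option is only meaningful once a KPTI-enabled kernel is installed):

    # In /etc/default/grub, append nopti to the kernel command line:
    GRUB_CMDLINE_LINUX="... nopti"

    # Then regenerate the GRUB configuration, for example:
    $ sudo grub-mkconfig -o /boot/grub/grub.cfg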

Approximately one month remains before the opening of the 4.16 merge window. During that time, the KPTI patch set will undoubtedly go through a number of additional revisions as the inevitable glitches come to light. Once things settle down, though, it would appear that this code will be merged and backported to stable kernels in a relative hurry. Apparently, we can look forward to slower — but more secure — kernels as a new-year's present.

Comments (19 posted)

Shrinking the kernel with link-time garbage collection

December 15, 2017

This article was contributed by Nicolas Pitre

One of the keys to fitting the Linux kernel into a small system is to remove any code that is not needed. The kernel's configuration system allows that to be done on a large scale, but it still results in the building of a kernel containing many smaller chunks of unused code and data. With a bit of work, though, the compiler and linker can be made to work together to garbage-collect much of that unused code and recover the wasted space for more important uses.

This is the first article of a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. Several approaches will be examined, from the straightforward to the daring.

It is a fact that Linux has conquered the high end of the computing spectrum. Since Linux started small 26 years ago, there has been great emphasis on scaling up, which eventually made it the platform of choice for server rooms, data centers, and the cloud infrastructure. As of November 2017, all of the top 500 supercomputers run Linux-based operating systems. No more to say there.

On the desktop, this is not as clear. Linux market share has only now surpassed the three-percent mark (according to Net Applications) even though it has been available and working well for years. That is, unless we consider mobile to be the new desktop, in which case the presence of at least two billion active Android devices should make up for it and cover the middle range of the spectrum pretty well. But mobile devices, while physically small, are far from matching the "tiny" Linux definition. Mobile is simply desktop software (more or less) that we can carry in our pockets, but that only 20 years ago required a big computer tower to run.

Small often implies "embedded", and embedded Linux is often considered as being at the low end of the computing spectrum. Linux is successful in this area too; small, embedded Linux examples are plentiful, from wireless routers to smart light bulbs, from airline in-flight entertainment to car GPS systems, etc. Linux is everywhere, and most of the time its presence is unsuspected. But those systems still have a relatively generous amount of resources to work with, e.g. typically 32MB of RAM to start with. That's not really tiny yet.

So what does "tiny" actually mean? Let's define it as a sub-megabyte system or thereabout. Battery-powered operation is a given and batteries are expected to last for months or years. Power consumption has to be extremely low, which implies static RAM (or SRAM). Because SRAM is expensive, it is typically deployed in minimal quantities. We're talking small, cheap, and ubiquitous IoT devices that are increasingly being connected to the Internet with all its perils. The software in this space is fragmented, and Linux has next to no presence at all.

Operating systems carrying an open-source license for the tiny space are many. A few random examples are: Contiki, FreeRTOS, Mbed OS, NuttX RTOS, RIOT OS, or even Fuzix OS for the nostalgic. There is also an effort to consolidate this tiny space with the Zephyr Project. Diversity is also prominent in the proprietary world. So why bother with Linux in this space?

The open-source gravitational field

Successful open-source projects may be compared to celestial objects. Some of them grow big, as universal gravitation works to pull them together into stars and planets. There are relatively few such objects in the end since the gravity field from large objects absorbs any other objects in their vicinity. Sometimes, however, objects that fail to consolidate with others remain numerous, roughly shaped, and sparsely distributed like asteroids.

The tiny computing space is just like an asteroid field; numerous projects exist, but they lack the required center of gravity for effective and self-sustained communities to form naturally around them. Consolidation efforts are moving slowly because of that. The end result is a highly fragmented space with relatively few developers per project and, therefore, fewer resources to rely upon when issues come up. Vulnerabilities are more likely to turn into a security nightmare.

The Linux ecosystem, instead, reached planet status a long time ago. It has a lot of knowledgeable people around it, a lot of testing infrastructure and tooling available already, etc. If a security issue turns up on Linux, it has a greater chance of being caught early, or fixed quickly, and finding people with the right knowledge is easier with Linux than it would ever be on any other tiny operating system out there. Leveraging that ecosystem could be a big plus for the tiny computing space, and it would truly mean world domination for Linux.

Scaling Linux down

With this short rationale in hand, let's dive into the actual technical stuff. Of course, the biggest obstacle to a tiny Linux kernel is its size. It is unrealistic to expect Linux to ever run in 64KB of RAM. Such targets are best left to the likes of Zephyr. But maybe "640K ought to be enough for anybody" as someone once said. Many modern microcontrollers capable of running Linux do have about that amount of on-chip SRAM which would make them nice single-chip Linux targets. So let's see how this can be achieved.

Automatic kernel size reduction

Wouldn't it be nice if computers could be leveraged to do the job automatically for us? After all, if there is one thing that computers are good at, it is finding unused code and optimizing the resulting binary. There are two ways in which currently available tools can achieve automatic size reduction: garbage collection of unused input sections and link-time optimization (LTO). This article will focus on the first of these two approaches.

Please note that most examples presented here were produced using the ARM architecture; the principles themselves, however, are not architecture-dependent.

Linker section garbage collection

The linker is already able to omit stuff from a linked binary given extra input data. Let's think about the libraries used to produce an executable for example: if the whole of a library (say libc) were always linked into the final executable, then that executable would be rather big and inefficient. That's assuming static linking of course, but the kernel is a statically linked executable for the most part, so let's forget about dynamic linking and modules for the purpose of this demonstration.

Let's consider the following code as test.c:

   int foo(void)  { return 1; }

   int bar(void)  { return foo() + 2; }

   int main(void) { return foo() + 4; }

The compiler generates the following (simplified) assembly output for that code:

        .text

        .type   foo, %function
    foo:
        mov     r0, #1
        bx      lr

        .type   bar, %function
    bar:
        push    {r3, lr}
        bl      foo
        adds    r0, r0, #2
        pop     {r3, pc}

        .type   main, %function
    main:
        push    {r3, lr}
        bl      foo
        adds    r0, r0, #4
        pop     {r3, pc}

Despite bar() not being called, it is still part of the compiled output because there is no way for the compiler to actually know if some other file might call it. Only the linker can know, once it gathered all the object files to be linked together, whether bar() is referenced from another object file or not.

And even then, the linker has no knowledge of the object file content other than the different sections it contains (.text, .data, etc.), where named things start (symbol table), and how to patch in final addresses (relocation table). So all the linker can do is to pull an object file into the final link if it contains a symbol that is referenced by another object file. The linker simply cannot carve out a piece of the .text section to drop the unused bar() code.

In the Linux kernel, this happens quite often when the core kernel API provides functions that are not called when some unwanted feature is configured out. It could be argued that the unused core function should be #ifdef'd in or out along with its user, but this gets hairy when multiple features sharing the same core API may be configured in and out independently. Tracking this kind of dependency can be done manually in the Kconfig language, but there is a point where too many configuration symbols and #ifdefs in the code become a maintenance burden.

Alternatively, we could move every core function into a separate source file, causing each to be compiled into its own object file. The linker could work out what is used and what is not like it does with libraries, but that wouldn't sit well with kernel developers either. Fortunately, there is a GCC flag to request the creation of a separate code section for every function, namely -ffunction-sections. When recompiling our test code above with -ffunction-sections the output becomes:

        .section .text.foo,"ax",%progbits
        .type   foo, %function
    foo:
        mov     r0, #1
        bx      lr

        .section .text.bar,"ax",%progbits
        .type   bar, %function
    bar:
        push    {r3, lr}
        bl      foo
        adds    r0, r0, #2
        pop     {r3, pc}

        .section .text.main,"ax",%progbits
        .type   main, %function
    main:
        push    {r3, lr}
        bl      foo
        adds    r0, r0, #4
        pop     {r3, pc}

Now, we get three distinct sections rather than the single .text section we had initially, one per function, named after the function they contain plus some attributes to indicate that they contain executable code. Separate sections are just as good as separate object files to the linker, since it can drop unreferenced code at a per-function granularity, as long as the linker is also passed the -gc-sections flag.

Let's try it out, first without any special flags:

    $ gcc -O2 -o test test.c
    $ ./test
    $ echo $?
    5
    $ nm test | grep "foo\|bar"
    00008520 T bar
    000084fc T foo

The code works as expected and, since we asked for nothing special, the bar symbol is still present. Let's add our special flags now (using the "-Wl" prefix to pass options through to the linker):

    $ gcc -ffunction-sections \
    >     -Wl,-gc-sections -Wl,-print-gc-sections \
    >     -O2 -o test test.c
    ld: Removing unused section '.text.bar' in file 'test.o'
    $ ./test
    $ echo $?
    5
    $ nm test | grep "foo\|bar"
    000084fc T foo

Now bar() is gone. We get an automatic size reduction and we didn't have to modify the source code at all. Instant gratification! The attentive reader will have noticed the extra -print-gc-sections linker flag, which may be used to confirm which sections are actually removed.

In addition to -ffunction-sections, GCC also accepts -fdata-sections to perform the same split-section trick with global data variables. It is typical to see both flags used together.
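
As a small sketch of the data-side equivalent (hypothetical data.c; linker output trimmed to the relevant line), an unreferenced global variable gets the same treatment:

    $ cat data.c
    int used = 1;
    int unused = 2;

    int main(void) { return used; }
    $ gcc -ffunction-sections -fdata-sections \
    >     -Wl,-gc-sections -Wl,-print-gc-sections \
    >     -O2 -o data data.c
    ld: Removing unused section '.data.unused' in file 'data.o'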

Can we just add those flags to the kernel build? Unfortunately, things aren't that simple. Just adding those flags will produce a non-booting kernel. The problem comes from various sections that the kernel code creates to let the linker gather scattered pieces of information together into a table that can be used by the running kernel. For example, let's consider this macro that closely resembles what's used to implement put_user() on ARM:

    #define __put_user_asm_word(x, __pu_addr, err)              \
        __asm__ __volatile__(                                   \
        "1:     strt    %1, [%2]\n"                             \
        "2:\n"                                                  \
        "       .pushsection .text.fixup,\"ax\"\n"              \
        "       .align  2\n"                                    \
        "3:     mov     %0, %3\n"                               \
        "       b       2b\n"                                   \
        "       .popsection\n"                                  \
        "       .pushsection __ex_table,\"a\"\n"                \
        "       .align  3\n"                                    \
        "       .long   1b, 3b\n"                               \
        "       .popsection"                                    \
        : "+r" (err)                                            \
        : "r" (x), "r" (__pu_addr), "i" (-EFAULT)               \
        : "cc")

For those who might wonder what those "1b", "2b" and "3b" symbols might be: those are backward references to the respective local labels without the "b" suffix. The "f" suffix is also available for forward references. See the binutils documentation for more details.

The code above adds to the currently active section a single strt instruction, a special instruction that performs a user-space store after testing user access permissions. It also adds some code to the .text.fixup section, and finally records in the __ex_table section the location of the strt instruction and the .text.fixup code.

To better illustrate what's happening, let's consider this code:

    int foobar(int __user *p)
    {
        return put_user(0x5a, p);
    }

The assembly result is:

        .section .text.foobar,"ax"
    foobar:
        mov     r3, #0
        mov     r2, #0x5a
    1:  strt    r2, [r0]
    2:  mov     r0, r3
        bx      lr

        .section .text.fixup,"ax"
    3:  mov     r3, #-EFAULT
        b       2b

        .section __ex_table,"a"
        .long   1b, 3b

In the example above, the code preloads zero into r3, performs the user-space access and, if an exception occurs, r3 is loaded with -EFAULT by the fixup code and execution is resumed past the faulting instruction.

What's important to remember here is that the linker gathers the __ex_table sections from every put_user() instance into a single section to form a table. That table is searched by the kernel exception-handling code to decide what to do if the faulting instruction matches a table entry when an exception occurs.

The problem with these __ex_table sections is that nothing has an actual reference to them. They are just a bunch of address values pulled together. So, when the linker is passed the -gc-sections flag, it is free to drop all of them because nothing references them and they don't define any symbols of their own. So we end up with a final kernel that has an empty exception table. This is true for many such kernel tables created by the linker, such as the list of kernel command-line argument parsers or the initcall pointer table. And a kernel without any initcalls won't boot very far.

Of course there is a linker script directive that allows for overriding the -gc-sections effect on a per section basis, namely the KEEP() directive. So the kernel linker script has gained entries that look like this:

    __ex_table : {
        __start___ex_table = .;
        KEEP(*(__ex_table))
        __stop___ex_table = .;
    }

With the appropriate sprinkling of KEEP() annotations in the linker script, the kernel does eventually boot properly. Yay! So now the extent of our modifications consists of a few extra flags to the compiler/linker and a couple KEEP() annotations in the linker script. That is, in fact, what the mainline kernel already offers since v4.10 with the CONFIG_LD_DEAD_CODE_DATA_ELIMINATION configuration option.
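
For architectures that support it, turning this on is just a matter of the corresponding configuration fragment (a sketch; availability depends on the architecture and kernel version):

    CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y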

The "backward reference" problem

Maybe we shouldn't celebrate just yet. Let's consider multiple functions like the previous example, each with a call to put_user(). We'd end up with something like the following assembly representation after the final link:

        .section .text.foo1,"ax"
    foo1:
        ...
        mov     r3, #0
    1:  strt    ...
    2:  ...

        .section .text.foo2,"ax"
    foo2:
        ...
        mov     r3, #0
    3:  strt    ...
    4:  ...

        .section .text.foo3,"ax"
    foo3:
        ...
        mov     r3, #0
    5:  strt    ...
    6:  ...

        .section .text.fixup,"ax"
    7:  mov     r3, #-EFAULT
        b       2b
    8:  mov     r3, #-EFAULT
        b       4b
    9:  mov     r3, #-EFAULT
        b       6b

        .section __ex_table,"a"
        .long   1b, 7b
        .long   3b, 8b
        .long   5b, 9b

Here we clearly see the __ex_table section containing a table of tuples, each with the address of a potentially exception-raising instruction and the address of the code to execute in that case. Of course our linker script has a KEEP() on the individual __ex_table entries to pull them into the final binary, otherwise the linker would discard them. But despite the fact that the __ex_table entries don't define symbols of their own, they do reference other symbols, namely the location of the strt instructions, illustrated by backward references to labels 1, 3 and 5. References to the fixup code also pull in the corresponding section, which also has yet more references to those individual functions.

That means all those functions with a put_user() call in them, and all other functions using similar constructs that create table entries using the same mechanism, are always pulled into the final binary even if they are never referenced by anything else. And those functions will pull in all the functions they call, and so on, down to leaf functions. This makes the whole idea of dropping unused code by garbage-collecting unreferenced sections rather ineffective in this case.

What can we do about that? We could apply the same trick already applied to those original functions themselves and create separate sections for each of the exception and fixup entries so the linker can link some of them and drop the others. However, in the function case, it is the compiler that does the section splitting for us with -ffunction-sections. Here we're providing our own assembly stubs.

One could suggest something like __put_user(val, ptr, __func__). The compiler provides the __func__ identifier that holds the name of the current function as a string. That could be used with the .pushsection directive to create section names after the function where this is invoked:

    #define __put_user_asm_word(x, __pu_addr, err)              \
        __asm__ __volatile__(                                   \
        "1:     strt    %1, [%2]\n"                             \
        "2:\n"                                                  \
        "       .pushsection .text.fixup." __func__ ",\"ax\"\n" \
        "       .align  2\n"                                    \
        "3:     mov     %0, %3\n"                               \
        "       b       2b\n"                                   \
        "       .popsection\n"                                  \
        "       .pushsection __ex_table." __func__ ",\"a\"\n"   \
        "       .align  3\n"                                    \
        "       .long   1b, 3b\n"                               \
        "       .popsection"                                    \
        : "+r" (err)                                            \
        : "r" (x), "r" (__pu_addr), "i" (-EFAULT)               \
        : "cc")

The problem is that __func__ is not a string literal. It is a string pointer and therefore cannot be used to construct a string for the asm() statement. What about __FUNCTION__ then? That used to work. However, the GCC documentation says:

These identifiers are not preprocessor macros. In GCC 3.3 and earlier, in C only, __FUNCTION__ and __PRETTY_FUNCTION__ were treated as string literals; they could be used to initialize char arrays, and they could be concatenated with other string literals. GCC 3.4 and later treat them as variables, like __func__. In C++, __FUNCTION__ and __PRETTY_FUNCTION__ have always been variables.

What about __put_user(val, ptr, __FILE__, __LINE__)? That can work to some extent, as __FILE__ is a string literal and __LINE__ can be stringified. But this scheme falls flat when invoking static inline functions as the file and line information correspond to the function definition location and not to where it is inlined. That means multiple instances would end up with the same section name, which is precisely what we're trying to avoid.

The ultimate and simplest solution requires some involvement from the assembler. A section name substitution sequence is possible when using binutils version 2.26 or later. With that feature, the previous .pushsection directives would simply become:

    #define __put_user_asm_word(x, __pu_addr, err)              \
        __asm__ __volatile__(                                   \
        "1:     strt    %1, [%2]\n"                             \
        "2:\n"                                                  \
        "       .pushsection %S.fixup,\"ax\"\n"                 \
        "       .align  2\n"                                    \
        "3:     mov     %0, %3\n"                               \
        "       b       2b\n"                                   \
        "       .popsection\n"                                  \
        "       .pushsection __ex_table%S,\"a\"\n"              \
        "       .align  3\n"                                    \
        "       .long   1b, 3b\n"                               \
        "       .popsection"                                    \
        : "+r" (err)                                            \
        : "r" (x), "r" (__pu_addr), "i" (-EFAULT)               \
        : "cc")

Given the function foobar() invoking put_user(), the active section name would be .text.foobar. Then the fixup code would end up in section .text.foobar.fixup and the exception table entry in __ex_table.text.foobar. The linker now has the ability to include only the relevant parts of the exception table and fixup code.

Still, we didn't fix anything, did we?

The "missing forward reference" problem

At this point we have managed to create separate sections for functions, fixup code per function, and exception table entries per function. But we still need a KEEP(__ex_table.*) in the linker script or we'll still end up with all our exception sections discarded like before. Having separate sections with pretty names still doesn't create any reference to them.

What we need is some kind of explicit reference from the invoking code to the corresponding exception table entry so the linker will pull it in along the function when it is needed without having to forcefully KEEP() them. Something illustrated by a hypothetical .tug assembly directive like this:

        .section .text.foobar,"ax"
    foobar:
        mov     r3, #0
        mov     r2, #0x5a
    1:  strt    r2, [r0]
        .tug    4f
    2:  mov     r0, r3
        bx      lr

        .section .fixup.text.foobar,"ax"
    3:  mov     r3, #-EFAULT
        b       2b

        .section __ex_table.text.foobar,"a"
    4:  .long   1b, 3b

Here, a simple call to foobar() from some other code will prompt the linker to pull the foobar() code in. That code, in turn, contains a reference to its exception entry, prompting the linker to pull that in too, and the exception entry has a reference to the fixup code which would be pulled into the link as well. Now we're going somewhere!

But how can we create such a reference? The most obvious way is to replace .tug with .long above, which would store the address of the exception table entry at that location. However this would require the code to branch over that value, which has no use other than creating a reference, wasting memory and making the code less optimal.

Turns out that the GNU assembler already has the necessary feature to create an explicit reference without allocating any space in the code, at least on ARM:

        .section .text.foobar,"ax"
    foobar:
        mov     r3, #0
        mov     r2, #0x5a
    1:  strt    r2, [r0]
        .reloc  ., R_ARM_NONE, 4f
    2:  mov     r0, r3
        bx      lr

        .section .fixup.text.foobar,"ax"
    3:  mov     r3, #-EFAULT
        b       2b

        .section __ex_table.text.foobar,"a"
    4:  .long   1b, 3b

Yes, our .tug directive can be defined in terms of the .reloc directive with a no-op relocation type. What .reloc means and how relocations work is beyond the scope of this article though.

Conclusion

We've seen how the linker -gc-sections feature can be exploited to its full potential on the Linux kernel. This requires more changes to the source code than one might have hoped, involving some obscure (or exotic, depending on your taste) assembler tricks. However, this still does not unveil the full extent of the automatic size reduction that current tools are capable of; for that we need link-time optimization (LTO). So, in the next article in this series, we'll look at LTO.

Still, there is a longstanding issue with kernel code annotated with the __exit marker that can be fixed with what we've seen already. Because exception table entries hold references to the originating code, that code has to be pulled into the final link or unresolved symbol errors will occur. This is why the built-in __exit code is currently linked into the .init section to be dropped at runtime rather than being discarded upfront at link time. However, with the section name substitution feature, it is possible to create something that looks like:

        .text
    foobar:
    1:  ...

        .section .init.text
    foobar_init:
    2:  ...

        .section .exit.text
    foobar_exit:
    3:  ...

        .section .text.fixup
        do_fix  1b

        .section .init.text.fixup
        do_fix  2b

        .section .exit.text.fixup
        do_fix  3b

This way, it would be possible to discard the __exit code as originally intended, along with the exception entries that reference it. Any takers?


Comments (24 posted)

Demystifying container runtimes

December 20, 2017

This article was contributed by Antoine Beaupré


KubeCon+CloudNativeCon NA

As we briefly mentioned in our overview article about KubeCon + CloudNativeCon, there are multiple container "runtimes", which are programs that can create and execute containers that are typically fetched from online images. That space is slowly reaching maturity both in terms of standards and implementation: Docker's containerd 1.0 was released during KubeCon, CRI-O 1.0 was released a few months ago, and rkt is also still in the game. With all of those runtimes, it may be a confusing time for those looking at deploying their own container-based system or Kubernetes cluster from scratch. This article will try to explain what container runtimes are, what they do, how they compare with each other, and how to choose the right one. It also provides a primer on container specifications and standards.

What is a container?

Before we go further in looking at specific runtimes, let's see what containers actually are. Here is basically what happens when a container is launched:

  1. A container is created from a container image. Images are tarballs with a JSON configuration file attached. Images are often nested: for example this Libresonic image is built on top of a Tomcat image that depends (eventually) on a base Debian image. This allows for content deduplication because that Debian image (or any intermediate step) may be the basis for other containers. A container image is typically created with a command like docker build.
  2. If necessary, the runtime downloads the image from somewhere, usually some "container registry" that exposes the metadata and the files for download over a simple HTTP-based protocol. It used to be only Docker Hub, but now everyone has their own registry: for example, Red Hat has one for its OpenShift project, Microsoft has one for Azure, and GitLab has one for its continuous integration platform. A registry is the server that docker pull or push talks with, for example.
  3. The runtime extracts that layered image onto a copy-on-write (CoW) filesystem. This is usually done using an overlay filesystem, where all the container layers overlay each other to create a merged filesystem. This step is not generally directly accessible from the command line but happens behind the scenes when the runtime creates a container.
  4. Finally, the runtime actually executes the container, which means telling the kernel to assign resource limits, create isolation layers (for processes, networking, and filesystems), and so on, using a cocktail of mechanisms like control groups (cgroups), namespaces, capabilities, seccomp, AppArmor, SELinux, and whatnot. For Docker, docker run is what creates and runs the container, but underneath it actually calls the runc command.
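
As a rough sketch of how those steps map onto everyday commands (the Docker CLI is shown, and the image name example/webapp:1.0 is only an example):

    $ docker build -t example/webapp:1.0 .    # step 1: build a layered image
    $ docker push example/webapp:1.0          # publish it to a registry
    $ docker pull example/webapp:1.0          # step 2: fetch the image and its metadata
    $ docker run --rm example/webapp:1.0      # steps 3 and 4: unpack the layers, then
                                              # hand off to runc to create and run the container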

Those concepts were first elaborated in Docker's Standard Container manifesto which was eventually removed from Docker, but other standardization efforts followed. The Open Container Initiative (OCI) now specifies most of the above under a few specifications:

  • the OCI image specification (image-spec), which defines the format of container images
  • the OCI runtime specification (runtime-spec), which describes the configuration, execution environment, and lifecycle of a container

Implementation of those standards varies among the different projects. For example, Docker is generally compatible with the standards except for the image format. Docker has its own image format that predates standardization and it has promised to convert to the new specification soon. Implementation of the runtime interface also differs as not everything Docker does is standardized, as we shall see.

The Docker and rkt story

Since Docker was the first to popularize containers, it seems fair to start there. Originally, Docker used LXC but its isolation layers were incomplete, so Docker wrote libcontainer, which eventually became runc. Container popularity exploded and Docker became the de facto standard to deploy containers. When it came out in 2014, Kubernetes naturally used Docker, as Docker was the only runtime available at the time. But Docker is an ambitious company and kept on developing new features on its own. Docker Compose, for example, reached 1.0 at the same time as Kubernetes and there is some overlap between the two projects. While there are ways to make the two tools interoperate using tools such as Kompose, Docker is often seen as a big project doing too many things. This situation led CoreOS to release a simpler, standalone runtime in the form of rkt, which was explained this way:

Docker now is building tools for launching cloud servers, systems for clustering, and a wide range of functions: building images, running images, uploading, downloading, and eventually even overlay networking, all compiled into one monolithic binary running primarily as root on your server. The standard container manifesto was removed. We should stop talking about Docker containers, and start talking about the Docker Platform. It is not becoming the simple composable building block we had envisioned.

One of the innovations of rkt was to standardize image formats through the appc specification, something we covered back in 2015. CoreOS doesn't yet have a fully standard implementation of the runtime interfaces: at the time of writing, rkt's Kubernetes compatibility layer (rktlet) doesn't pass all of the Kubernetes integration tests and is still under development. Indeed, according to Brandon Philips, CTO of CoreOS, in an email exchange:

rkt has initial support for OCI image-spec, but it is incomplete in places. OCI support is less important at the moment as the support for OCI is still emerging in container registries and is notably absent from Kubernetes. OCI runtime-spec is not used, consumed, nor handled by rkt. This is because rkt execution is based on pod semantics, while runtime-spec only covers single container execution.

However, according to Dan Walsh, head of the container team at Red Hat, in an email interview, CoreOS's efforts were vital to the standardization of the container space within the Kubernetes ecosystem: "Without CoreOS we probably would not have CNI, and CRI and would be still fighting about OCI. The CoreOS effort is under-appreciated in the market." Indeed, according to Philips, the "CNI project and specifications originated from rkt, and were later spun off and moved to CNCF. CNI is still widely used by rkt today, both internally and for user-provided configuration." At this point, however, CoreOS seems to be gearing up toward building its Kubernetes platform (Tectonic) and image distribution service (Quay) rather than competing in the runtime layer.

CRI-O: the minimal runtime

Seeing those new standards, some Red Hat folks figured they could make a simpler runtime that would only do what Kubernetes needed. That "skunkworks" project was eventually called CRI-O and implements a minimal CRI interface. During a talk at KubeCon Austin 2017, Walsh explained that "CRI-O is designed to be simpler than other solutions, following the Unix philosophy of doing one thing and doing it well, with reusable components."

Started in late 2016 by Red Hat for its OpenShift platform, the project also benefits from support by Intel and SUSE, according to Mrunal Patel, lead CRI-O developer at Red Hat who hosted the talk. CRI-O is compatible with the CRI (runtime) specification and the OCI and Docker image formats. It can also verify image GPG signatures. It uses the CNI package for networking and supports CNI plugins, which OpenShift uses for its software-defined networking layer. It supports multiple CoW filesystems, like the commonly used overlay and aufs, but also the less common Btrfs.

One of CRI-O's most notable features, however, is that it supports mixed workloads between "trusted" and "untrusted" containers. For example, CRI-O can use Clear Containers for stronger isolation promises, which is useful in multi-tenant configurations or to run untrusted code. It is currently unclear how that functionality will trickle up into Kubernetes, which considers all backends to be the same.

CRI-O has an interesting architecture (see the diagram below from the talk slides [PDF]). It reuses basic components like runc to start containers, and software libraries like containers/image and containers/storage, created for the skopeo project, to pull container images and create container filesystems. A separate library called oci-runtime-tool prepares the container configuration. CRI-O introduces a new daemon to handle containers called conmon. According to Patel, the program was "written in C for stability and performance" and takes care of monitoring, logging, TTY allocation, and miscellaneous hazards like out-of-memory conditions.

[CRI-O architecture]

The conmon daemon is needed here to do all of the things that systemd doesn't (want to) do. But even though CRI-O doesn't use systemd directly to manage containers, it assigns containers to systemd-compatible cgroups, so that regular systemd tools like systemctl have visibility into the container resources. Since conmon (and not the CRI daemon) is the parent process of the container, it also allows parts of CRI-O to be restarted without stopping containers, which promises smoother upgrades. This is a problem for Docker deployments right now, where a Docker upgrade requires restarting all of the containers. This is usually not much trouble for Kubernetes clusters, however, because it is easy to roll out upgrades progressively by moving containers around.

CRI-O is the first implementation of the OCI standards suite that passes all Kubernetes integration tests (apart from Docker itself). Patel demonstrated those capabilities by showing a Kubernetes cluster backed by CRI-O, in what seemed to be a routine demonstration of cluster functionality. Dan Walsh described CRI-O's approach in a blog post about how CRI-O interacts with Kubernetes:

Our number 1 goal for CRI-O, unlike other container runtimes, is to never break Kubernetes. Providing a stable and rock-solid container runtime for Kubernetes is CRI-O's only mission in life.

According to Patel, performance is comparable to a normal Docker-based deployment, but the team is working on optimizing performance to go beyond that. Debian and RPM packages are available, and deployment tools like minikube or kubeadm also support switching to the CRI-O runtime. On existing clusters, switching runtimes is straightforward: just a single environment variable changes the runtime socket, which is what Kubernetes uses to communicate with the runtime.
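
Deployment tools wrap the underlying kubelet settings; as a rough sketch of what ends up being configured (the flag names below are the ones the kubelet used in the Kubernetes 1.8/1.9 era, and the CRI-O socket path has varied between releases, so treat both as assumptions rather than exact instructions):

    # point the kubelet at a remote CRI runtime instead of the built-in
    # Docker integration, using CRI-O's socket
    kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///var/run/crio/crio.sock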

CRI-O 1.0 was released in October 2017 with support for Kubernetes 1.7. Since then, CRI-O 1.8 and 1.9 were released to follow the Kubernetes 1.8 and 1.9 releases (and sync version numbers). Patel considers CRI-O to be production-ready and it is already available in beta in OpenShift 3.7, released in November 2017. Red Hat will mark it as stable in the upcoming OpenShift 3.9 release and is looking at using it by default with OpenShift 3.10, while keeping Docker as a fallback. Next steps include integrating the new Kata Containers virtual-machine-based runtime, kube-spawn support, and more storage backends like NFS or GlusterFS. The team also discussed how it could support casync or libtorrent to optimize synchronization of images between nodes.

containerd: Docker's runtime gets an API

While Red Hat was working on its implementation of OCI, Docker was also working on the standard, which led to the creation of another runtime, called containerd. The new daemon is a refactoring of internal Docker components to combine the OCI-specific bits like execution, storage, and network interface management. It was already featured in the 1.12 Docker release, but wasn't completed until the containerd 1.0 release announced at KubeCon, which will be part of the upcoming Docker 17.12 (Docker has moved to version numbers based on year and month). And while we call containerd a "runtime", it doesn't directly implement the CRI interface, which is covered by another daemon called cri-containerd. So containerd needs more daemons than CRI-O for Kubernetes (five, versus three for CRI-O). Also, at the time of writing, the cri-containerd component is marked as beta but containerd itself is already used in numerous production environments through Docker, of course.

During the Node special interest group (SIG) meeting at KubeCon, Stephen Day described [Speaker Deck] containerd as "designed as a tight core of decoupled components". Unlike CRI-O, however, containerd supports workloads outside the Kubernetes ecosystem through a Go API. The API is not considered stable yet, although containerd defines a clear release process for making changes to the API and command-line tools. Like CRI-O, containerd is feature complete and passes all Kubernetes tests, but it does not interoperate with systemd's cgroups.

The next steps for the project are to develop more tests and to improve performance in areas like memory usage and latency. The containerd developers are also working hard on improving stability. They want to provide Debian and RPM packages for easier installation, and to integrate containerd with minikube and kops as well. There are also plans to integrate Kata Containers more smoothly: runc can already be replaced by Kata for basic integration, but cri-containerd integration is not implemented yet.

Interoperability and defaults

All of those options are causing a certain amount of confusion in the community. At KubeCon, which runtime to use was a recurring question for speakers. Kubernetes will likely change from Docker to a different runtime because it doesn't need all of the features Docker provides, but there are concerns that the switch could cause compatibility issues, since the new runtimes do not implement exactly the same interface as Docker. Log files, for example, are different in the CRI standard. Some programs also monitor the Docker socket directly, which has some non-standard behaviors that the new runtimes may implement differently, or not at all. All of this could cause some breakage when switching to a different runtime.

The question of which runtime Kubernetes will switch to (if it changes) is also open, which leads to some competition between the runtimes. There was a slight controversy related to that question at KubeCon because CRI-O wasn't mentioned during the CNCF keynote, something Vincent Batts, a senior engineer at Red Hat, mentioned on Twitter:

It is bonkers that CRI implementations containerd and rktlet are covered in KubeCon keynote, but zero mention of CRI-O, which is a Kubernetes project that's been 1.0 and actually used in production.

When I prompted him for details about this at KubeCon, Batts explained that:

Healthy competition is good, the problem is unhealthy competition. The CNCF should be better stewards of the projects under their umbrella and shouldn't fold under pressure to promote certain projects over others.

Batts explained further that Red Hat "may be at a tipping point where some apps could start being deployed as containers instead of RPMs", citing "security concerns (namely with [security] advisory tracking, which is lacking in containers as a packaging format) as the main blocker for such transitions". With Project Atomic, Red Hat seems to be pivoting toward containers, so the stakes are high for the Linux distribution.

When I talked to CNCF's COO Chris Aniszczyk at KubeCon, he explained that the CNCF's "current policy is to prioritize the marketing of top-level projects":

Projects like CRIO and Helm are part of Kubernetes in some fashion and therefore part of CNCF, we just don't market them as heavily as our top level projects which have passed the CNCF TOC [Technical Oversight Committee] approval bar.

Aniszczyk added that "we want to help, we've heard the feedback and plan to address things in 2018", suggesting that one solution could be that CRI-O applies to graduate to a top-level project in CNCF.

During a container runtimes press meeting, Philips explained that the community would decide, through consensus, which runtime Kubernetes would run by default. He compared runtimes to web browsers and suggested that the OCI specifications for containers be compared to the HTML5 and JavaScript standards: those are evolving standards that get pushed forward by the different implementations. He argued that this competition is healthy and means more innovation.

A lot of the people involved in writing those new runtimes were originally Docker contributors: Patel was the initial maintainer of the original OCI implementation that led to runc, while Philips was also a core Docker contributor before starting the rkt project. Those people actively collaborate, along with Docker developers, on standardizing those interfaces and all want to see Kubernetes stabilize and improve. The goal, according to Patrick Chanezon from Docker Inc., is to "make container runtimes boring and stable to have people innovate on top of it". The developers present at the press meeting were happy and proud of what they have accomplished: they have managed to create a single, interoperable specification for containers, and that specification is expanding.

Consolidation and standardization will continue in 2018

The current big topic in container standardization is not runtimes as much as image distribution (i.e. container registries), which is likely to standardize, again, on a specification built around Docker's distribution system. There is also work needed to follow Linux kernel changes such as cgroups v2.

The reality is that each runtime has its own advantages: containerd has an API so it can be used to build customized platforms, while CRI-O is a simpler runtime specifically targeting Kubernetes. Docker and rkt are on another level, providing more than simply the runtime: they also provide ways of building containers or pushing to registries, for example.

Right now, most public cloud infrastructure still uses Docker as a runtime. In fact, even CoreOS uses Docker instead of rkt in its Tectonic platform. According to Philips, this is because "there is a substantial integration ecosystem around Docker Engine that our customers rely on and it is the most well-tested option across all existing Kubernetes products." Philips says that CoreOS may consider supporting alternate runtimes for Tectonic, "if alternative container runtimes provide significant improvements to Kubernetes users":

At this point containerd and CRI-O are both very young projects due to the significant amount of new code each project developed this year. Further, they need to reach the maturity of third-party integrations across the ecosystem from logging, monitoring, security, and much more.

Philips further explained CoreOS's position in this blog post:

So far the primary benefits of the CRI for the Kubernetes community have been better code organization and improved code coverage in the Kubelet itself, resulting in a code base that's both higher quality and more thoroughly tested than ever before. For nearly all deployments, however, we expect the Kubernetes community will continue to use Docker Engine in the near term.

During the discussion in the press meeting, Patel likewise said that "we don't want Kubernetes users to know what the runtime is". Indeed, as long as it works, users shouldn't care. Besides, OpenShift, Tectonic, and other platforms abstract runtime decisions away and pick their own best default matching their users' requirements. The question of which runtime Kubernetes chooses by default is therefore not really a concern for those developers, who prefer working on building consensus on standard specifications. In a world of conflict, seeing those developers working together cordially was definitely a breath of fresh air.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend KubeCon + CloudNativeCon.]

Comments (14 posted)

Containers without Docker at Red Hat

December 20, 2017

This article was contributed by Antoine Beaupré


KubeCon+CloudNativeCon NA

The Docker (now Moby) project has done a lot to popularize containers in recent years. Along the way, though, it has generated concerns about its concentration of functionality into a single, monolithic system under the control of a single daemon running with root privileges: dockerd. Those concerns were reflected in a talk by Dan Walsh, head of the container team at Red Hat, at KubeCon + CloudNativeCon. Walsh spoke about the work the container team is doing to replace Docker with a set of smaller, interoperable components. His rallying cry is "no big fat daemons" as he finds them to be contrary to the venerated Unix philosophy.

The quest to modularize Docker

As we saw in an earlier article, the basic set of container operations is not that complicated: you need to pull a container image, create a container from the image, and start it. On top of that, you need to be able to build images and push them to a registry. Most people still use Docker for all of those steps but, as it turns out, Docker isn't the only name in town anymore: an early alternative was rkt, which led to the creation of various standards like CRI (runtime), OCI (image), and CNI (networking) that allow backends like CRI-O or Docker to interoperate with, for example, Kubernetes.

These standards led Red Hat to create a set of "core utils" like the CRI-O runtime that implements the parts of the standards that Kubernetes needs. But Red Hat's OpenShift project needs more than what Kubernetes provides. Developers will want to be able to build containers and push them to the registry. Those operations need a whole different bag of tricks.

It turns out that there are multiple tools to build containers right now. Apart from Docker itself, a session from Michael Ducy of Sysdig reviewed eight image builders, and that's probably not all of them. Ducy identified the ideal build tool as one that would create a minimal image in a reproducible way. A minimal image is one where there is no operating system, only the application and its essential dependencies. Ducy identified Distroless, Smith, and Source-to-Image as good tools to build minimal images, which he called "micro-containers".

A reproducible container is one that you can build multiple times and always get the same result. For that, Ducy said you have to use a "declarative" approach (as opposed to "imperative"), which is understandable given that he comes from the Chef configuration-management world. He gave the examples of Ansible Container, Habitat, nixos-container, and Smith (yes, again) as being good approaches, provided you were familiar with their domain-specific languages. He added that Habitat ships its own supervisor in its containers, which may be superfluous if you already have an external one, like systemd, Docker, or Kubernetes. To complete the list, we should mention the new BuildKit from Docker and Buildah, which is part of Red Hat's Project Atomic.

Building containers with Buildah

[Buildah logo]

Buildah's name apparently comes from Walsh's colorful Boston accent; the Boston theme permeates the branding of the tool: the logo, for example, is a Boston terrier dog (seen at right). This project takes a different approach from Ducy's decree: instead of enforcing a declarative configuration-management approach to containers, why not build simple tools that can be used by your favorite configuration-management tool? If you want to use regular command-line commands like cp (instead of Docker's custom COPY directive, for example), you can. But you can also use Ansible or Puppet, OS-specific or language-specific installers like APT or pip, or whatever other system to provision the content of your containers. This is what building a container looks like with regular shell commands and simply using make to install a binary inside the container:

    # pull a base image, equivalent to a Dockerfile's FROM command;
    # "buildah from" prints the name of the new working container
    ctr=$(buildah from redhat)

    # mount the working container to work on its root filesystem
    crt=$(buildah mount $ctr)
    cp foo $crt
    make install DESTDIR=$crt

    # then make a snapshot: commit the working container as a new image
    buildah commit $ctr

An interesting thing with this approach is that, since you reuse normal build tools from the host environment, you can build really minimal images because you don't need to install all the dependencies in the image. Usually, when building a container image, the target application build dependencies need to be installed within the container. For example, building from source usually requires a compiler toolchain in the container, because it is not meant to access the host environment. A lot of containers will also ship basic Unix tools like ps or bash which are not actually necessary in a micro-container. Developers often forget to (or simply can't) remove some dependencies from the built containers; that common practice creates unnecessary overhead and attack surface.

The modular approach of Buildah means you can run at least parts of the build as non-root: the mount command still needs the CAP_SYS_ADMIN capability, but there is an issue open to resolve this. However, Buildah shares the same limitation as Docker in that it can't build containers inside containers. For Docker, you need to run the container in "privileged" mode, which is not possible in certain environments (like GitLab Continuous Integration, for example) and, even when it is possible, the configuration is messy at best.

The manual commit step allows fine-grained control over when to create container snapshots. While in a Dockerfile every line creates a new snapshot, with Buildah the commit checkpoints are explicitly chosen, which reduces unnecessary snapshots and saves disk space. This also makes it easier to isolate sensitive material like private keys or passwords, which sometimes mistakenly end up in public images.

While Docker builds non-standard, Docker-specific images, Buildah produces standard OCI images among other output formats. For backward compatibility, it has a command called build-using-dockerfile, or buildah bud, that parses normal Dockerfiles. Buildah has an enter command to inspect images from the inside directly and a run command to start containers on the fly. It does all of this work without any "fat daemon" running in the background and uses standard tools like runc.
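
For instance, rebuilding an existing Docker-based project might look like this (a sketch: the -t option names the resulting image and "myapp" is just a placeholder):

    # build an image from the Dockerfile in the current directory
    buildah bud -t myapp .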

Ducy's criticism of Buildah was that it was not declarative, which made it less reproducible. When allowing shell commands, anything can happen: for example, a shell script might download arbitrary binaries without any way of subsequently retracing where those came from, and shell command effects may vary according to the environment. In contrast to shell-based tools, configuration-management systems like Puppet or Chef are designed to "converge" over a final configuration that is more reliable, at least in theory: in practice, you can call shell commands from configuration-management systems too. Walsh, however, argued that existing configuration management can be used on top of Buildah; it simply doesn't force users down that path. This fits well with the classic "separation" principle of the Unix philosophy ("mechanism not policy").

At this point, Buildah is in beta and Red Hat is working on integrating it into OpenShift. I have tested Buildah while writing this article and, short of some documentation issues, it generally works reliably. It could use some polishing in error handling, but it is definitely a great asset to add to your container toolbox.

Replacing the rest of the Docker command-line

Walsh continued his presentation by giving an overview of another project that Red Hat is working on, tentatively called libpod. The name derives from a "pod" in Kubernetes, which is a way to group containers inside a host, to share namespaces, for example.

Libpod includes the kpod command to inspect and manipulate container storage directly. Walsh explained this can be useful if, for example, dockerd hangs or if a Kubernetes cluster crashes. kpod is basically an independent re-implementation of the docker command-line tool. There is a command to list running containers (kpod ps) or images (kpod images). In fact, there is a translation cheat sheet documenting all Docker commands with a kpod equivalent.
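
A couple of entries from that cheat sheet give the flavor (assuming a kpod binary built from the libpod repository):

    kpod ps        # list running containers, the equivalent of "docker ps"
    kpod images    # list local images, the equivalent of "docker images"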

One of the nice things with the modular approach is that when you run a container with kpod run, the container is directly started as a subprocess of the current shell, instead of a subprocess of dockerd. In theory, this allows running containers directly from systemd, removing the duplicate work dockerd is doing. It enables things like socket-activated containers, which is something that is not straightforward to do with Docker, or even with Kubernetes right now. In my experiments, however, I have found that containers started with kpod lack some fundamental functionality, namely networking (!), although there is an issue in progress to complete that implementation.

A final command we haven't covered is push. While the above commands provide a good process for working with local containers, they don't cover remote registries, which allow developers to actively collaborate on application packaging. Registries are also an essential part of a continuous-deployment framework. This is where the skopeo project comes in. Skopeo is another Atomic project that "performs various operations on container images and image repositories", according to the README file. It was originally designed to inspect the contents of container registries without actually downloading the sometimes voluminous images as docker pull does. Docker refused patches to support inspection, suggesting the creation of a separate tool, which led to Skopeo. After pull, push was the logical next step and Skopeo can now do a bunch of other things like copying and converting images between registries without having to store a copy locally. Because this functionality was useful to other projects as well, a lot of the Skopeo code now lives in a reusable library called containers/image. That library is in turn used by Pivotal, Google's container-diff, kpod push, and buildah push.
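
A quick sketch of those two operations (the registry and image names below are only examples):

    # inspect an image's metadata without pulling its layers
    skopeo inspect docker://registry.fedoraproject.org/fedora:27

    # copy an image from one registry to another without keeping a local copy
    skopeo copy docker://docker.io/library/busybox:latest \
        docker://registry.example.com/mirror/busybox:latest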

kpod is not directly tied to Kubernetes, so the name might change in the future — especially since Red Hat legal has not cleared the name yet. (In fact, just as this article was going to "press", the name was changed to podman.) The team wants to implement more "pod-level" commands that would allow operations on multiple containers, a bit like what docker compose might do. But at that level, a better tool might be Kompose, which can deploy Compose YAML files into a Kubernetes cluster. Some Docker commands (like swarm) will never be implemented, on purpose, as they are best left for Kubernetes itself to handle.

It seems that the effort to modularize Docker that started a few years ago is finally bearing fruit. While, at this point, kpod is under heavy development and probably should not be used in production, the design of those different tools is certainly interesting; a lot of it is ready for development environments. Right now, the only way to install libpod is to compile it from source, but we should expect packages coming out for your favorite distribution eventually.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend KubeCon + CloudNativeCon.]

Comments (27 posted)

HarfBuzz brings professional typography to the desktop

December 19, 2017

This article was contributed by Bruce Byfield

By their nature, low-level libraries go mostly unnoticed by users and even some programmers. Usually, they are only noticed when something goes wrong. However, HarfBuzz deserves to be an exception. Not only does the adoption of HarfBuzz mean that free software's ability to convert Unicode characters to a font's specific glyphs is as advanced as any proprietary equivalent, but its increasing use means that professional typography can now be done from the Linux desktop as easily as at a print shop.

"HarfBuzz" is a transliteration of the Persian for "open type." Partly, the name reflects that it is designed for use with OpenType, the dominant format for font files. Equally, though, it reflects the fact that the library's beginnings lie in the wish of Behdad Esfahbod, HarfBuzz's lead developer, to render Persian texts correctly on a computer.

"I grew up in a print shop," Esfahbod explained during a telephone interview. "My father was a printer, and his father was a printer. When I was nine, they got a PC, so my brother and I started learning programming on it." In university, Esfahbod tried to add support for Unicode, the industry standard for encoding text, to Microsoft Explorer 5. "We wanted to support Persian on the web," he said. "But the rendering was so bad, and we couldn't fix that, so we started hacking on Mozilla, which back then was Netscape."

Esfahbod's early interest in rendering Persian was the start of a fifteen-year effort to bring professional typography to every Unicode-supported script (writing system). It was an effort that led through working on the GNOME desktop for Red Hat to working on Firefox development at Mozilla and Chrome development at Google, with Esfahbod always moving on amiably to wherever he could devote the most time to perfecting HarfBuzz. The first general release was reached in 2015, and Esfahbod continues to work on related font technologies to this day.

At the beginning of the project, three text renderers were in use on the Linux desktop. The original HarfBuzz code evolved from FreeType, with some borrowing from the Pango and Qt font-rendering systems. The problem with these strands of development was that none of the three was consistent with either of the others. The same characters in the same fonts could be displayed differently in different applications and might look different when printed than on the screen. Esfahbod recalls reading bug reports on Red Hat's internationalization efforts in which it was often impossible to know whether the source of the problems was the fonts or the different implementations.

Eventually, the code for the three implementations was merged under an MIT license, producing what is known as Old HarfBuzz. In the spirit of openness, Old HarfBuzz is still available from the project web site, but for historical reference only. The HarfBuzz code in use today is sometimes known as New HarfBuzz, or (formerly) as harfbuzz-ng. Today, it is available for all major operating systems, and incorporated in Firefox, GNOME, ChromeOS, Android, and KDE, reflecting the companies where Esfahbod has worked and the projects to which he has contributed over the years. In 2017, LibreOffice 5.3 joined the list of HarfBuzz users, and most other free-software projects are following HarfBuzz closely, especially those dealing with graphics and layout. For example, Inkscape has switched to HarfBuzz, and GIMP, Krita, and Scribus are planning to switch in upcoming releases. Yet while HarfBuzz is a significant development, its adoption is happening mostly unnoticed.

The importance of shaping engines

Esfahbod defines HarfBuzz as a shaping engine rather than a layout engine. That is, HarfBuzz is concerned with the consistent graphical representation of Unicode characters in fonts. By contrast, a layout engine is concerned with bodies of text and such issues as the co-existence of different scripts or writing systems, how text breaks at the end of the line, and whether text flows from left to right or right to left.

Because different writing systems display text differently, each script supported by Unicode requires its own shaping engine. In some languages, the shaping engine is straightforward. In Latin-based scripts, such as English, French, or German, and in Greek, Chinese, Japanese, and Korean, one glyph follows another in a set order, and the placement of a new glyph does not affect the positioning of the previous glyphs. Even the addition of accents and diacritical marks, or the use of ligatures [Ligatures] (glyphs that are redrawings of letter combinations that would otherwise be poorly spaced) do not greatly complicate rendering. For this reason, Esfahbod describes the shaping engines of these scripts as "simple."

Esfahbod contrasts simple scripts with what he calls "complex" scripts. Examples of complex scripts include Arabic and Persian, in which the shape and position of each glyph is determined by those around it, and in which some glyphs are connected while others are written separately. Even more complicated are Indic scripts, Javanese, and Southeast Asian scripts. In some complex scripts, Esfahbod explains, ten glyphs can form a single syllable, and in some instances, the last glyph can be placed above the others, so that it is viewed first. Necessarily, the shaping engines of complex scripts must be much more detailed than those of simple scripts. In fact, to the users of simple scripts such as English, complex scripts can initially seem overwhelmingly bewildering. Yet with the internationalization of computing, the need to render complex scripts correctly is greater than ever before.

What HarfBuzz does is provide the shaping engines for all 139 writing systems included in Unicode. Esfahbod has joked in presentations that he asked his supervisors at Google to download the entire Internet for him so that the project could obtain enough samples to derive the rules required for the shaping engine for each script. In the end, he settled for the more manageable sample provided by different localizations of Wikipedia.

What HarfBuzz does

The greatest change brought about by HarfBuzz is that it places all Unicode-supported scripts on an equal footing on all computing platforms. With consistent rendering, minority and non-European languages are now much easier to use in computing, so long as a font supporting them is installed. In addition, when necessary, scripts can be mixed more easily in a single passage. However, the new ease of incorporating Unicode scripts also means some changes to Western European languages like English, which have long had a privileged position in computing.

To start with, the growing use of HarfBuzz means that free software matches the Microsoft Universal Shaping Engine, which Esfahbod suspects drew at least inspiration from HarfBuzz. Together, the two systems have also de-emphasized Graphite, which SIL International, a Christian missionary organization, developed to produce texts in minority languages. Graphite still exists in open source, but the fact that it required each font to support it means that only a handful of fonts, such as Gentium and Linux Biolinum, ever supported it. By contrast, HarfBuzz works with all OpenType fonts, which may mean that Graphite will soon become obsolete.

For long-time designers, who have collected fonts over a couple of decades, another change caused by HarfBuzz is the dropping of Type 1 or Postscript fonts. Type 1 fonts were a format popular in the 1990s, when they competed with Microsoft's TrueType format. Both have been largely superseded by the OpenType format in the last fifteen years, but for some, a consequence of applications incorporating HarfBuzz is the unannounced loss of the use of their Type 1 fonts. However, those still using Type 1 fonts can install FontForge and run a script to batch convert them to OpenType quickly and without any loss of detail in the glyphs – a handy trick to know, since Esfahbod has stated that "I don't think we will ever support Type 1 in HarfBuzz" – although he adds that "I'd be happy to work on a converter."
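
One way to do that batch conversion is through FontForge's scripting interface; here is a minimal sketch, assuming the Type 1 fonts are .pfb files in the current directory (spot-check the results, since hinting and metrics occasionally need manual attention):

    # convert each Type 1 font to an OpenType (.otf) file
    for f in *.pfb; do
        fontforge -lang=ff -c 'Open($1); Generate($1:r + ".otf")' "$f"
    done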

Any drawbacks caused by these changes, though, are outweighed by HarfBuzz's advantages. To start with, according to Esfahbod, HarfBuzz renders fonts much more quickly than previous technologies. While HarfBuzz's finishing touches were being added, Esfahbod was also working on making HarfBuzz work well with Google's Chrome web browser. As Esfahbod tells it, a basic policy in Chrome development was that enhancements should never slow the application. As a result, porting HarfBuzz to Chrome required numerous optimizations. In Chrome's Windows version, the speed of rendering increased by four hundred percent, he claims. As a result, loading and scrolling through long documents in other applications should also be enhanced, although hard figures for any improvement are not available.

At first, Latin-based scripts like English might seem to be affected least by HarfBuzz. Yet, on closer examination, they, too, can benefit. For centuries, printing has been in the hands of professionals. The typewriter reduced that monopoly of expertise, but was unable to reproduce many of the features of professional printing. Full justification, for instance, was impossible on most typewriters and, for the most part, italics could only be indicated by underlining or, on an IBM Selectric, stopping to change the type ball.

[Capitals vs. small capitals]

In the 1980s, the word processor reduced the monopoly of expertise still further, bringing users many of the features that the typewriter either lacked or could only produce with difficulty. Yet advanced features were still lacking, or could only be added with makeshifts such as special character dialogs or LibreOffice's Typography Toolbar extension – which many users could not be bothered with. In English in particular, communication was mostly in ASCII. The increasing availability of diverse keyboard layouts helped to make computing more versatile, especially for Western European languages, yet the finishing touches of professional print shops remained unavailable.

Now, with HarfBuzz, those finishing touches are applied automatically. Using ligatures no longer requires extra effort – when available, they are inserted automatically, eliminating unsightly letter combinations. True small capitals, which are smaller and different in design from the standard capital letters of the font so that they fit into other text more elegantly, can be used without switching to a special font or being indicated by clumsy small capitals manufactured – inevitably, incorrectly – by word processors. Similarly, instead of arranging numbers on a common baseline, [Ranging figures vs. old-style figures] HarfBuzz inserts old-style figures, in which each numeral has its own baseline, which fits them more aesthetically into blocks of text. With HarfBuzz, the only limitation is whether the font actually includes such features – which, increasingly, they do. At last, the revolution that began over three decades ago with the introduction of word processors is complete, and applications like LibreOffice can produce publication-ready copy as easily as a print shop.

Moreover, HarfBuzz also gives users control over which advanced features to use, rather than being dependent on what the font designers choose to give them. To customize any feature, users can add a small code snippet in the field that defines what font to use. For example, typing :smcp or :smcp=on directly after the font name converts regular capital letters to small capitals. If you want to turn off this feature in a font that enables it automatically, type :-smcp or :smcp=off. If you want to enter more than one snippet, add the second one directly after the first, with no space between them, so that :liga:onum:cpsp after the font name would force the use of standard ligatures, old-style figures, and kerning for upper-case capitals, always assuming that the font supports these features.
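
As a hypothetical illustration (the font name is only a placeholder, and whether it actually implements these features is an assumption), the resulting entries in a font-name field would look something like this, with the parenthetical notes added here purely as explanation:

    Gentium Plus:smcp              (small capitals on)
    Gentium Plus:-smcp             (small capitals off)
    Gentium Plus:liga:onum:cpsp    (ligatures, old-style figures, capital spacing)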

Make these changes in a paragraph style in a word processor like LibreOffice Writer, and you have a custom font without having to define it each time. You might also create a character style for a feature like kerning (the adjustment of the space between letters), and apply it only when you think it necessary.

Which of these features are available in a font depends on the designer. Some features may be included in a font file, but turned off by default. You may also need to experiment with some of the options to see if they make a visible or significant difference with a given font. The features are based on those used with OpenType in LaTeX, and a complete list is available on Wikipedia.

Further enhancements

Since HarfBuzz reached general release in 2015, Esfahbod has concentrated on related projects. One such project is Noto, a font whose goal is to include all of the approximately 137,000 Unicode characters, enabling the use of all the supported scripts. Currently, Noto includes all the characters in Unicode 6.1 or so but, since Unicode is currently at version 10.0, completion of Noto is likely to remain a continuously moving target. In addition, Esfahbod is involved in the cross-platform development of OpenType Font Variations, [Font Variations] which are an open-source implementation of Adobe's proprietary multiple master font technology. Font Variations gives users the ability to change the weight of a font with a slider.

Font designers are already in the habit of designing fonts in multiple weights, ranging from Thin to Regular to Bold, or even Extra Bold, because changing the weight often requires redesigning the font slightly. The free-licensed Lato, for example, is available in no less than seven weights. Now, with Font Variations, users can select a font weight that automatically changes designs as necessary. Where multiple master fonts support only a handful of design changes with weight, a variations font can incorporate almost a hundred, allowing for more subtle and elaborate designs. The change to the toolchain for both font designers and users is creating a lot of excitement at gatherings of typographers. At one recent conference, according to Esfahbod, over half the papers were about Font Variations.

Ever since the SIL Open Font License and the GPL Font Exception were released just over a decade ago, providing licenses acceptable to font designers, high-quality open-source fonts have been released by the hundreds, and made available for both print and online use on sites like Google Fonts and The League of Movable Type. Some are near-replicas of popular proprietary fonts. Others are revivals of forgotten fonts. Still others are original designs. Together, these free fonts mean that modern graphic designers can work entirely with free software to an extent that was never previously possible.

Today HarfBuzz and its related technologies are promising users control over free-licensed OpenType fonts as never before. Moreover, it is a sign of how far free software has come that it is no longer playing catch up with its proprietary counterparts, but keeping pace with them, collaborating with them, and even sometimes leading development.

Comments (22 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Code integrity & PGP; Fedora modularity redesign; Thunderbird news; Net neutrality; Quotes; ...
  • Announcements: Newsletters; events; security advisories; kernel patches; ...

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds