
LWN.net Weekly Edition for May 16, 2019

Welcome to the LWN.net Weekly Edition for May 16, 2019

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A panel with the new Python steering council

By Jake Edge
May 15, 2019

PyCon

Over the past year, Python has moved on from the benevolent dictator for life (BDFL) governance model since Guido van Rossum stepped down from that role. In February, a new steering council was elected based on the governance model that was adopted in December. At PyCon 2019 in Cleveland, Ohio, the five members of the steering council took the stage for a keynote panel that was moderated by Python Software Foundation (PSF) executive director Ewa Jodlowska.

As is traditional for panels of this sort, the participants introduced themselves at the start—going left to right across the stage. Barry Warsaw works for LinkedIn on its "Python Foundation" team, supporting Python development at the company. He was sporting a t-shirt from "Guido van Rossum's world tour" in 1994; he went to that gathering in Maryland and "fell in love with Python" ("and Guido, of course"). He contrasted the 20 participants at that meeting with the thousands in the audience at PyCon; "pretty amazing". It is "kind of miraculous" how much the language and community have grown in the 25 years since that meeting—Python has changed many lives, including his.

Brett Cannon is a development manager for the Python extension to Visual Studio Code at Microsoft. He got started by fixing a problem that he saw in the Python Cookbook in 2002; when asked, the editor told him to post his fix to the python-dev mailing list. He kept at it and at some point Van Rossum had someone give him commit rights; "I think I was just bothering everyone a little too much". He is partly known for the phrase "I came for the language and stayed for the community"; the people on stage are a big part of that, he said.

[Python steering council]

Carol Willing said that back in 2016 she gave a keynote in the Philippines where she called Python "the people's programming language". When she started using Python in 2012, she found it to be a fun language to program in. After that, she got involved with the Jupyter project and recognized that it provided a path to learn Python, but also to learn about various scientific disciplines. She wanted to get more involved, so she started out "working on the community side" with the PSF and OpenHatch, then learned about the workflow and processes for doing core Python development before becoming a core developer.

Van Rossum said that when he created Python, there wasn't a language that he liked, "so I made one out of pieces I had learned previously". Since he can't write code and not have other people use it, he made Python open source, he said to loud applause. Last year, he had a case of burnout and decided it was time for someone else or some other structure to shepherd Python after doing that for nearly 30 years. He likened that to sending a child off to college; you may no longer be directly involved in their lives, but "you never stop worrying", which is why he nominated himself for the steering council. "And here I am."

Nick Coghlan came to Python as a "hardware and C/C++ guy" because he needed to test some signal-processing code. Python had the unittest module and it had SWIG, which allowed him to wrap the C++ code under test. Now he does that kind of work for fast electric-vehicle chargers for Tritium. Python is a language that allows him to "reach out and touch the real world without having to worry about all the messy details".

Questions

Jodlowska asked Van Rossum for his thoughts on Python governance and its evolution from the perspective of the former BDFL and now steering council member. Van Rossum noted that it was "pretty stressful" to be the final arbiter of anything controversial in the language; he is glad that responsibility is now distributed among the council members. He thinks that the council will have the trust of the community because it was voted in, rather than someone becoming the leader "by happenstance".

[Carol Willing, Guido van Rossum & Nick Coghlan]

Going forward, he said that the council is structuring things differently. Instead of being the deciders of all Python Enhancement Proposals (PEPs), the council will defer most of those decisions to a chosen expert among the core developers (or sometimes outside of that group). It has only been a few months under that process, but "so far, I think that is going great". Jodlowska suggested reconvening the council at PyCon in, say, three years to see how things have gone.

The importance of having representation on the council for the scientific Python community was a question directed at Willing. She said that the council should be open to hearing from all of the different parts of the Python community. The needs of different constituencies—web, embedded, education, science, data science, and others—are going to be different at times. Having a diversity of opinions and experiences on the council will help ensure that those needs do not get lost.

Next up was the status of PEP 581 ("Using GitHub Issues for CPython"), which was still undecided at the time of the panel, but has since been accepted by Warsaw as the "BDFL delegate" for the PEP. The question was directed at Cannon, who has been a major player in the transition of various Python projects to GitHub. He noted that the implementation parts of the PEP had been split out into PEP 588 ("GitHub Issues Migration Plan"). There was a presentation on the PEPs at the Python Language Summit that was held earlier in the week and there has been some discussion with the PSF about possibly funding a "PM" (project or product manager) kind of role to help with that transition and others.

Coghlan is a member of the Python packaging working group in addition to his council role; Jodlowska asked him what the next steps for that group might be after the new Python Package Index (PyPI) was rolled out last year and once the upcoming security and accessibility work for PyPI is finished. Coghlan would like to see improvements on the "publisher side"; there have been lots of improvements on the consumer side recently, but releases to PyPI are getting more complicated.

How the PEP process has changed was Jodlowska's next question, which she directed at Warsaw, who was one of the authors of PEP 1 ("PEP Purpose and Guidelines"), which describes how PEPs are meant to work. He said that the idea came out of the time when he and Van Rossum were working at the Corporation for National Research Initiatives (CNRI), which ran the Internet Engineering Task Force (IETF). He wanted a format somewhat similar to the RFC process that the IETF uses, but more lightweight. It would "just be enough process for us to get our work done without being too much process". He doesn't know how successful that last part has been, he said with a laugh.

The idea behind PEPs was that Van Rossum would not have to wade through thousands of emails to try to figure out what the proposal was; he could simply read one document that would lay out both the pros and the cons of the proposal and then make a decision. The PEP would record the decision-making process. After a while, there were parts of the ecosystem where Van Rossum didn't have the expertise to pronounce on PEPs, which is where the BDFL delegate idea came from. When Van Rossum was the BDFL, delegating was kind of the last resort for things he didn't want to decide upon; under the new governance model, though, delegates are really the first resort, Warsaw said. The council wants to "allow other people to become engaged with shaping where Python is going to go in the next 25 years".

Python 2

Jodlowska asked attendees to indicate how many were still using Python 2; both Warsaw and Cannon said that the number of hands was lower than they expected. What is the game plan after Python 2 is retired on January 1, 2020, Jodlowska asked. Cannon suggested a party, to laughter and applause.

[Barry Warsaw, Brett Cannon & Carol Willing]

The plans for Python 2 are moving forward and are not going to change ("thank goodness"), Cannon said. The PM role that is under discussion would also help "figure out all the little minute details" that the Python 2 sunset entails. For example, search engines often default to sending people to the Python 2 documentation, which is not the right version to go to starting in 2020. Places in the documentation where it talks about "Python 2 and 3" will need adjustment because starting in 2020 "there's just 'Python'". And so on.

Coghlan pointed attendees at a talk from PyCon AU 2017 called "Python 3 for People Who Haven't Been Paying Attention". The talk looks at some of the options going forward; the core team does expect that some commercial vendors will be offering support for Python 2 beyond 2020. The deadline is effectively saying "we're not doing this [supporting Python 2] for free anymore".

Willing said that if she looks back a few years, the transition to Python 3 was in a "very different space" than it is now. All of the top packages for Python now have versions for Python 3, for example. The scientific Python community has been using Python 3 for a long time, she said, though there is still plenty of legacy Python 2 code out there. She pointed to a keynote [YouTube video] from Instagram at PyCon 2017, on how it made the switch, for a good example of how a large company can make the business case to do the transition. A PM would hopefully help distill these "best practices" and disseminate them to help others follow that path.

The last question for the group, before taking questions from the audience, was about plans for growing and sustaining diversity within core development. Willing said that recognition that progress has been made should be first up. It wasn't that long ago, in 2017, that the first woman became a core developer. But there is "still a long way to go", she said. The council is trying to ensure that the processes and tools for contributing to Python make it easier for everyone to join in. In addition, many get their start in contributing by working on PyPI packages, so the council is trying to strengthen that ecosystem.

Warsaw added that he believed anyone in the audience could become a Python core developer; you don't have to be a C programmer, you just have to care about Python and want to learn how to become a core developer. There are many on stage and in the community who are mentoring people with the goal of bringing in more core developers. "Think about diversity in all the axes", he said. "We need everybody; there's so much work to do and so many interesting things to do that I think everybody can really contribute to Python."

Audience questions

An audience member asked what the first steps are toward becoming a core developer. Cannon suggested starting with the Python Developer's Guide, which should give people a reasonable "lay of the land" that will help them decide if it is something they want to pursue.

The biggest gaps in the language was the next question. Warsaw said that he did not see any major gaps in the language itself, but that the CPython interpreter is 28 years old. The state of the art in virtual machines has come a long way in that time, so it may be time to look at making CPython faster and able to go into even more environments than it is in now.

Coghlan agreed with Warsaw, but said that he was jealous of JavaScript's source maps as way of getting better tracebacks from modules that have been compiled or transformed in some way. Source maps allow JavaScript to essentially undo any transformations so that debuggers and other tools can see the original source code.

Another attendee wondered about core-developer burnout; does the council have any plans on how to reduce it? And is there anything the community can do to help? Cannon said that the intent is to make the job of being a core developer easier, which should help. The community can help by "being nice to core devs", he said, pointing to his keynote from the previous year. "Be nice online and that would help immensely", he said to applause.

Going back to her "people's programming language" comment, Willing noted that the core developers are all people. They have feelings, they make mistakes, they do good things, and sometimes they get things wrong; when the latter happens, "tell us—kindly". She is proud of how the core developers and greater community came together, rather than pulling apart, in the face of Van Rossum's resignation and the process of coming up with a governance model and then electing the council. She suggested that people remember that when they write and say things, there is a person on the other side of that communication.

[Python steering council and more]

Over a fairly short span, Python has made a transition that few other large projects have ever made. It certainly seems to have landed the project in a place that is better for Van Rossum and, in truth, for the entire Python community. With luck, there will be another steering council panel in 2022 or so and we can look at how things have gone in the interim.

A YouTube video of the talk is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Cleveland for PyCon.]

Comments (1 posted)

BPF: what's good, what's coming, and what's needed

By Jonathan Corbet
May 9, 2019

LSFMM
The 2019 Linux Storage, Filesystem, and Memory-Management Summit differed somewhat from its predecessors in that it contained a fourth track dedicated to the BPF virtual machine. LWN was unable to attend most of those sessions, but a couple of BPF-related talks were a part of the broader program. Among those was a plenary talk by Dave Miller, described as "a wholistic view" of why BPF is successful, its current state, and where things are going.

Years ago, Miller began, Alexei Starovoitov showed up at a netfilter conference promoting his ideas for extending BPF. He described how it could be used to efficiently implement various types of switching fabric — any type, in fact. Miller said that he didn't understand the power of this idea until quite a bit later.

What's good now

[Dave Miller]

BPF, he said, is well defined and useful, solving real problems and allowing users to do what they want with their systems. It is increasingly possible to look at what is going on inside the kernel or to modify its behavior without having to change the source or reboot the system. BPF provides strict limits on what programs can do, running them inside the kernel but in a sandboxed mode. BPF programs operate on a specific object (called the "context"), and there are many places to attach them. They execute in finite time; that is assured because they are not currently allowed to contain loops (though that will change to some extent eventually). BPF maps provide data structures for programs and can be used to share access to data.

The BPF verifier, Miller said, is "the only thing between us and extreme peril". It is the last line of defense preventing dangerous code from getting into the system. It is so good, he said, that it often frustrates BPF authors, who have to massage their programs to get them to a point where the verifier will accept them.

The real value of BPF lies in the fact that "we are all arrogant". System designers tend to think that they know what their users want to do, so they make boxes to enable that one thing. But users don't want to be in a box; those users are a constant source of new ideas, and developers often don't know what they want. BPF allows users to escape the box created by system designers, who may be sad about that, but they'll get over it.

BPF has been growing slowly, by word of mouth, Miller said, because there is no "advertising machine" for this technology. Users are still learning about it. The good news is that, once technical people get into a new technology, they tend to spread it around. That has happened with BPF, to the point that people are now writing books about it.

What's improving

BPF is mostly useful now for solving simple problems, Miller said, but it is rapidly gaining the ability to deal with "real programs". One step in that direction is increasing the size limit for BPF programs from its current value of 4096 instructions to 1 million. The prohibition on loops forces developers to unroll loops in their programs, which is unfortunate; the in-development support for bounded loops will fix that problem someday.

BPF programs can perform tail calls now; they function a lot like continuations. Tail calls are a great way to build an execution pipeline, where each step in the pipeline performs a tail call to the next. BPF is now able to support real function calls as well, though a given program still cannot use both due to limitations in the verifier.

One area where BPF has seen some improvement is introspection; it can be hard for developers to understand why their program is not doing what they want. Indeed, in current kernels, it's hard even to determine which BPF programs have been loaded into the system, or to verify that a loaded program is the one that is wanted. The bpftool utility is improving in its introspection support, as is the BTF format for describing data types, which will help to increase the portability of BPF programs. BTF turns out to be good for annotating BPF programs and how they work. The perf utility is also gaining the ability to drill down into BPF programs. Users cannot complain, Miller said, that visibility into BPF is not being provided.

What's needed

There is no shortage of opportunities for improvement still, he said. For example, BPF does not currently support code reuse all that well; there are a lot of people out there writing their own Ethernet header parsers. There are systems with thousands of redundant BPF programs loaded into them; that is not the way to do software development. Support for function calls will help, but BPF needs libraries, and it will need access control for those libraries once they can be loaded. BTF will help, since it will make it easy to see which libraries are available in a given system.

BPF development is still harder in general than it should be; Miller would like to see the development of a "type and go" environment that makes writing and loading a BPF program as easy as on the Arduino. Unskilled people should be able to get stuff done; that is part of the goal of wresting control away from arrogant system developers.

BPF programs should have "trivial debuggability", he continued. It should be possible to single-step through programs and examine context data. He would like the ability to record a program's execution or state so that it could be stepped through outside of the kernel. Perhaps even live, in-kernel single-stepping could be supported in development environments. The most important thing for the near future, though, is the ability to snapshot the current state of a BPF program.

Finally, he said, BPF needs better access control. Almost all BPF functionality is root-only now, but things will not be that way forever. Much more granular control to BPF functionality is required — or we could always control access to BPF with a BPF program, he said. A file like /dev/bpf could be used for access control, but that's still pretty coarse; perhaps what is needed is a hierarchy of files describing the different program types and their access permissions. BPF also needs better memory accounting, since maps can get quite large.

At this point, Miller concluded his talk and accepted questions. Matthew Wilcox started things off by saying that he will not be impressed by BPF until it becomes possible to play Zork in the kernel. The original Zork was less than 1 million instructions, Wilcox said, so that should be possible.

ABI compatibility

The first actual question was about whether there are any inherent limits on what BPF will eventually be able to do. Early on, Miller answered, it was used for tasks like packet analysis, and current usage still reflects that. BPF will not be usable to implement a proprietary TCP stack in the kernel, for example; that is not a goal. Among other things, there are no timers available to BPF programs and no plans to add them.

Some people do try to push the limits, Miller said. Steve Hemminger tried to convert a packet scheduler to BPF, for example, but eventually ran into the timer issue. Somebody else, though, managed to create a complete implementation of Open vSwitch, but that sort of project really misses the point of BPF. The real value is not in doing everything, but in being able to do exactly what you need and no more.

Ted Ts'o said that he did not expect to see device drivers in BPF, but Miller responded that those already exist. He was referring to the ability to perform infrared protocol decoding in a BPF program. That eliminates the need to support hundreds of infrared devices in a kernel driver and allows support for new devices to be easily added to older kernels. Ts'o conceded that point, but said that it was unlikely that there would be an NVIDIA GPU driver written in BPF anytime soon.

Another attendee asked about ABI compatibility; will the kernel have to support existing BPF programs forever? Miller responded that BPF exists in an "ambiguous plane" between kernel ABI and the "wild west" of the kernel's internals. Tools like BTF will help to make things more compatible over time. Meanwhile, the BPF developers have taken liberties to break things in the early stages; the community is still learning how all of this stuff should work. But that should happen less often over time. That said, he doesn't think it will ever be possible to write a BPF program and expect that it will work on every future kernel.

The discussion turned to the powertop episode, where a change to a tracepoint broke the powertop utility and had to be reverted. As a result of that, some maintainers still refuse to allow the addition of tracepoints in their subsystems. The problem is that powertop was useful, so users complained when it broke. BPF programs, too, will be useful, and are likely to suffer from the same problems. Brendan Gregg may have said earlier in the week that occasional breakage was OK, but someday some user will complain and Linus Torvalds will revert a BPF-visible change. Miller responded that, whenever a new facility like this is added, there is always a period in which things break. We'll never get away from that, but it will get a lot better.

Ts'o worried about how bad the ABI pain would be; some BPF interfaces will not be changeable, he said. At least, it will not be possible to change them without a ten-year deprecation period while old programs are fixed. Miller said that, with BPF, users are often happy when things break, because it usually indicates that new information is available for them to work with.

Gregg said that, in the absence of tracepoints, current BPF tools are using a lot of kprobes. There are a lot of kernel-version checks that go with them, but they still break with every kernel release. If the kernel moves to tracepoints that only break once every five years, that will be fantastic. Ts'o wondered whether the breakage of a kprobe-based tool that is seen as being as useful as powertop would cause Torvalds to revert a change. He does not know the answer to that.

Security

Dave Hansen asked about security and side channels; BPF was one way in which the Spectre vulnerabilities could be exploited early on. These issues have been mitigated one at a time as they are found, but has any thought been given to broader mitigations? Miller acknowledged that programs can be written to exploit speculative execution vulnerabilities; the verifier can often detect and block such attempts. On the other hand, BPF can also improve security. He mentioned an episode where a bug in a custom hash computation could be exploited to crash the kernel; it was possible to move the computation to BPF and block exploits until the kernel was fixed. Hansen continued, saying that the kernel-hardening efforts are trying to address problems proactively; work in the BPF area, he said, is more reactive. Miller conceded that point, but said that, hopefully, the kernel is becoming sufficiently hardened that it will no longer be necessary to worry about these issues all the time.

The final question came from Ts'o, who wondered about how BPF will interact with Linux security modules. With the advent of stackable security modules, it should be possible to implement more flexible access control for BPF programs. He also suggested that perhaps some verifier policies should include interaction with the security-module subsystem.

Miller answered that the verifier has a set of operations specific to each program type; it should be possible to add a security-module hook there somehow. He also observed, with amusement, that SELinux is using classic BPF now for a few things. It would be great, he said, to use BPF to create new security policies; it could be the "universal security policy engine". That would allow for the immediate addition of new policies without the need to wait for the next kernel release.

Comments (18 posted)

The first half of the 5.2 merge window

By Jonathan Corbet
May 10, 2019
When he released the 5.1 kernel, Linus Torvalds noted that he had a family event happening in the middle of the 5.2 merge window and that he would be offline for a few days in the middle. He appears to be trying to make up for lost time before it happens: over 8,300 non-merge changesets have found their way into the mainline in the first four days. As always, there is a wide variety of work happening all over the kernel tree.

Architecture-specific

  • On x86-64 systems, crash-dump kernels could only be placed in memory below 896MB; in 5.2, that limit has been removed. This will break with ancient versions of kexec-tools, but it appears that those versions are unable to work with current kernels anyway.
  • A lot of work has been done to eliminate the final places where the kernel might execute code from a writable mapping, closing a number of potential holes that could be exploited in an attack.
  • The s390 architecture now supports kernel address-space layout randomization and signature verification in kexec_file_load().
  • The PA-RISC architecture now supports the KGDB kernel debugger, jump labels, and kprobes.
  • The MIPS32 architecture has gained a just-in-time compiler for the eBPF virtual machine.

Core kernel

  • The clone() system call has a new CLONE_PIDFD flag. When it is present, the return value to the parent will be a file descriptor representing the newly created child process; this descriptor (a "pidfd") can be used for race-free process signaling among other things.
  • The kernel now exports the attributes of the memory attached to each node in sysfs, allowing user space to understand how different memory banks on heterogeneous-memory systems will perform. See this commit and this commit for details and documentation. Note that this work appears to be independent of the heterogeneous memory work discussed at the 2019 Linux Storage, Filesystem, and Memory-Management Summit.
  • The BPF verifier has seen some optimization work that yields a 20x speedup on large programs. That has enabled an increase in the maximum program size (for the root user) from 4096 instructions to 1,000,000.
  • BPF programs may now access global data; see this commit changelog for some details.
  • It is now possible to install a BPF program to control changes to sysctl knobs. See Documentation/bpf/prog_cgroup_sysctl.rst for information on the API for these programs.

Filesystems and block layer

  • The XFS filesystem has gained a health-tracking infrastructure and a new ioctl() command to query the health status of a filesystem. It has not, however, gained any documentation describing this feature or how to use it.
  • The BFQ I/O scheduler has seen another set of significant performance improvements.
  • The io_uring mechanism has a new operation, IORING_OP_SYNC_FILE_RANGE, which performs the equivalent of a sync_file_range() system call. It is also now possible to register an eventfd with an io_uring and get notifications when operations complete.
  • The new system calls for filesystem mounting have finally made it into the mainline kernel. This commit contains a sample program showing how to use them.
  • The ext4 filesystem has gained support for case-insensitive lookups. As part of that work, the kernel now has generic support for UTF-8 string handling.
  • The CIFS filesystem now supports the FIEMAP ioctl() operation for efficient extent mapping.

Hardware support

  • Counters: there is now a generic interface for devices that count things; see this commit for interface documentation and this one for sysfs documentation. Supported devices include ACCES 104-QUAD-8, STM32 Timer encoders, STM32 LP Timer encoders, and FlexTimer module quadrature decoders.
  • Fieldbus: The kernel now supports the Fieldbus protocol and, in particular, the HMS Anybus-S, Arcx Anybus-S, and HMS Profinet IRT controllers.
  • Graphics: the kernel finally has support for ARM Mali GPUs. Two new drivers have been merged: Lima for older GPUs and Panfrost for the more recent ones. Also added was support for Ronbo Electronics RB070D30 panels, Feiyang FY07024DI26A30-D MIPI-DSI LCD panels, Rocktech JH057N00900 MIPI touchscreen panels, and ASPEED BMC display controllers.
  • Industrial I/O: MaxSonar I2CXL family ultrasonic sensors, Maxim MAX31856 thermocouple temperature sensors, NXP FXAS21002C gyro sensors, and Texas Instruments ADS8344 analog-to-digital converters.
  • Input: users of Logitech devices with non-unifying receivers should notice an improvement of support for various device features; the input layer now interacts directly with the devices rather than relying on the HID emulation in the receiver.
  • Miscellaneous: ARM SMMUv3 performance monitor counter groups, Cirrus Logic Lochnagar2 temperature, voltage and current sensors, Infineon IR38064 voltage regulators, Intersil ISL68137 PWM controllers, Xilinx Zynq quad-SPI controllers, Daktronics KPC DMA controllers, Aspeed ast2400/2500 HOST P2A VGA MMIO to BMC bridges, STMicroelectronics STM32 factory-programmed memory, Texas Instruments LM3532 backlight controllers, Amlogic G12a-based MDIO bus multiplexers, Milbeaut USIO/UART serial ports, SiFive UARTs, STMicroelectronics MIPID02 CSI-2 to parallel bridges, Amlogic Meson G12A AO CEC controllers, and Amazon elastic fabric adapters.
  • Networking: MediaTek HCI MT7663S and MT7668S SDIO Bluetooth interfaces, NXP SJA1105 Ethernet switches, Realtek 802.11ac wireless interfaces, and MediaTek MT7615E wireless interfaces.
  • Pin control: Cirrus Logic Lochnagar pin controllers, Mediatek MT8516 pin controllers, and Bitmain BM1880 pin controllers.
  • Sound: Support for audio devices running Intel's Sound Open Firmware has landed in the mainline. Also supported are Microchip inter-IC sound controllers.
  • USB: Broadcom Stingray USB PHYs, Amlogic G12A USB PHYs, MediaTek UFS M-PHYs, Texas Instruments AM654 SerDes PHYs, and Hisilicon HI3660 USB PHYs.
  • Note also that the support for legacy IDE devices has been deprecated, with an eye toward removal in 2021. If there is anybody out there still using IDE devices that have not been converted over to libata support, now is the time to start saying something.

Security-related

  • The new mitigations= command-line option provides simplified control over which speculative-execution vulnerability defenses are enabled. Setting it to off disables mitigations entirely. The default option of auto turns mitigations on, but will not affect whether hyperthreading is enabled; auto,nosmt will also disable hyperthreading if a mitigation requires that.
  • The Russian elliptic-curve digital signature algorithm (GOST R 34.10-2012, RFC 7091, ISO/IEC 14888-3) is now supported.
  • The work to mark all implicit fall-through cases in switch statements is almost complete in 5.2, with only 32 cases left to be addressed. Once they are done, it will be possible to enable the -Wimplicit-fallthrough option on kernel builds to prevent them from coming back.

Internal kernel changes

  • The objtool utility now tracks code that disables supervisor mode access protection (SMAP, which prevents the kernel from accessing user-space data) to ensure that it is re-enabled before calling any other functions. It is relatively easy to end up in surprising parts of the kernel with SMAP disabled, leading to potential security holes; this change should prevent that from happening.
  • The interrupt and exception stacks on x86-64 systems now have guard pages, allowing stack overflows to be reliably caught and dealt with. The older probabilistic stack-overflow checking option has been removed, since it is no longer needed.
  • The new VM_FLUSH_RESET_PERMS VMA flag will cause the kernel to immediately clear TLB entries and direct-map permissions for memory with execute permissions. This flag can be set with set_vm_flush_reset_perms() or at allocation time.
  • The mmiowb() primitive, which inserts a barrier for memory-mapped I/O operations, has been removed in favor of infrastructure that handles barriers automatically when they are needed.
  • The new inode method free_inode() serves as a better version of destroy_inode() when RCU is involved; see this commit for some details. Most filesystems have been converted over to this function.
  • Device tree authors may want to have a look at the new dos and don'ts document on how to write bindings.

The usual schedule would have the 5.2 merge window closing on May 19, with the final 5.2 release happening in the first half of July. It seems like the list of changes for the second half of this merge window will be smaller than the first, but we'll catch up with it regardless once the window has closed.

Comments (7 posted)

DAX semantics

By Jake Edge
May 13, 2019

LSFMM

In the filesystems track at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Ted Ts'o led a discussion about an inode flag to indicate DAX files, which is meant to be applied to files that should be directly accessed without going through the page cache. XFS has such a flag, but ext4 and other filesystems do not. The semantics of what the flag would mean are not clear to Ts'o (and probably others), so the intent of the discussion was to try to nail those down.

Dan Williams said that the XFS DAX flag is silently ignored if the device is not DAX capable. Otherwise, the file must be accessed with DAX. Ts'o said there are lots of questions about what turning on or off a DAX flag might mean; does it matter whether there are already pages in the page cache, for example. He said that he did not have any strong preference but thought that all filesystems should stick with one interpretation.

While Christoph Hellwig described things as "all broken", Ts'o was hoping that some agreement could be reached among the disparate ideas of what a DAX flag would mean. A few people think there should be no flag and that it should all be determined automatically, but most think the flag is useful. He suggested starting with something "super conservative", such as only allowing the flag to be set on zero-length files or on empty directories, where files created within would inherit the flag. Those constraints could be relaxed later if there was a need.

[Ted Ts'o]

Boaz Harrosh wondered why someone might want to turn DAX off for a persistent memory device. Hellwig said that the performance "could suck"; Williams noted that the page cache could be useful for some applications as well. Jan Kara pointed out that reads from persistent memory are close to DRAM speed, but that writes are not; the page cache could be helpful for frequent writes. Applications need to change to fully take advantage of DAX, Williams said; part of the promise of adding a flag is that users can do DAX on smaller granularities than a full filesystem.

When he developed DAX, he added the DAX flag as a "chicken bit", Matthew Wilcox said. The intent was that administrators could control the use of DAX on their systems; he would like to preserve that ability going forward. It may not only depend on whether the application is DAX aware or not, but also on the workload that the application is handling. Ts'o said that there may be applications that want to use persistent memory in its "full unexpurgated form"; requiring administrators to set a flag on a file to enable that is not particularly friendly. Wilcox agreed, saying that he did not want to make administrators' jobs harder, but he did want to preserve the ability to override what an application developer chose.

But Chuck Lever wondered why these DAX-aware applications would want to use a filesystem at all; wouldn't they rather simply get access to a chunk of persistent memory directly via mmap() or similar? Williams said that is exactly what DAX does; if you mmap() a DAX file, you get a chunk of the persistent memory mapped in. The problem is that other applications using the same filesystem may not be ready to get that same kind of access; they may be relying on filesystem semantics for file-backed memory.

One way to look at it, Ts'o said, is that if someone buys a chunk of persistent memory for a single application that is going to use all of it, they don't need a filesystem at all. They can just point the application directly at the block device. But if someone wants to share that persistent memory with multiple applications, user IDs, and such, then a filesystem makes sense.

Lever asked why some kind of namespace would not be used to make that distinction. Williams said that administrators are used to the tools to deal with filesystems and files, so that would make a better interface. For the "I know what I want" users, device DAX solves their problems, but for other users, some other interface is needed and file-oriented interfaces make the most sense. In addition, Ts'o pointed out that a namespace or bind mount is not sufficient since a file either needs to use the page cache or it needs to stay completely out of it; trying to do both will lead to problems. Partitioning a block device into DAX and non-DAX portions would be another way to make it all work, but that lacks flexibility.

An attendee asked what it meant for an application to be DAX aware. Williams said that DAX-aware applications want to do in-place updates of their data and want to manage the CPU cache themselves; these applications want to do accesses on data that is smaller than a page in size directly in memory versus having some kind of buffering or the page cache.

Ts'o said that there are a number of papers out there that describe libraries that can use persistent memory to, for example, update a B-Tree in place. These algorithms do the operations in the right order and flush the caches in such a way that a crash at any point will leave the data structure in a consistent state. It is important to note that these are libraries that would be used by applications because Ts'o said he would not normally trust application authors to get this kind of thing right. But Hellwig expressed skepticism that any non-academic filesystem author would actually trust the CPU's memory subsystem to always get this right; that was met with a fair amount of laughter.

Amir Goldstein asked how allowing DAX and non-DAX access to a file was different from allowing buffered and direct I/O access to the same file. Hellwig said that, while mixing buffered and direct I/O is allowed, it does not give the results that users expect. Goldstein also asked about mixing direct I/O and DAX, but that does not work, Kara said; mixing buffered and direct I/O doesn't really work either, but the kernel pretends that it does. For DAX, kernel developers decided not to repeat that mistake.

Direct I/O and DAX do not work together, but they could if someone wanted to rework the existing implementation, Hellwig said. It would be useful to be able to read persistent memory without using the page cache and to write via the page cache, but it would be tremendously complex to handle the CPU cache coherency correctly, which is probably what has scared everyone away. Another problem is when a DAX-aware application gets surprised by having the page cache between it and persistence; it believes that the cache-flush instructions it issues are making the data persistent when they are not.

But Hellwig said that there is an API problem if applications are issuing "random weird instructions" and expecting them to work; there are too many other layers potentially in between the application and the storage. He is not entirely sure that making these kinds of programming models more widespread is a good idea, but if that's the path that will be taken, there should be some kind of interface at the VDSO level that applications can call where the kernel will ensure that they do the right thing. The kernel will issue the proper cache-flush instructions or whatever else is necessary. The existing model for applications "can't ever work", he said.

Ts'o said that there can be a debate about how reliable these models and cache-flush instructions truly are, but he is reasonably confident that if they don't work, the hardware vendors will fix them when customers put pressure on them. In any case, though, that is orthogonal to the question of having a per-file DAX flag and what its semantics should be.

Williams said he was uncomfortable calling it a "DAX flag", though he acknowledged that could lead directly to the bike shed. He thought that perhaps a MAP_SYNC flag on the inode would be better. Hellwig suggested that the name "DAX" should be retired because it is confusing at this point; he, of course, had his own suggestion for a name for the flag ("writethrough"), as did several others, though no real conclusion was reached.

The discussion moved on to how the flag, however named, could be set. Ts'o said he was uncomfortable restricting it to empty directories, and then having all of the files in it inheriting the attribute, due to hard links and renames. If that is really going to be the way forward, then filesystems need to look at restricting hard links and renames. This is why he wants to nail down the semantics before implementing it in filesystems.

Lever is uncomfortable with a "permanent sticky bit" in the filesystem that is set by administrators, however. He is concerned that administrators will turn it off when it needs to be on, or the reverse; he wondered if a flag to open() was a better path, since the application should know what it needs. But Hellwig pointed out that open() flags are not checked to see if unsupported options are being passed; applications could not be sure to get the behavior they asked for.

Williams pointed out that there already is a dax mount option, so that ship has already sailed to some extent. Ts'o also noted that open() is not the right time to specify this; it needs to be a property of the file itself. If the file is opened twice, once with DAX and once without, what would that mean? One way to handle that might be to fail an open() with the "wrong" mode if the file is already open; the "real disaster" for buffered versus direct I/O was in allowing both types of open() to succeed. Beyond that, though, Hellwig was emphatic that open() flags should never be used for data integrity purposes.

Sprinkled throughout the latter part of the discussion were more suggestions of different names for the flag, but Ts'o thinks they are stuck with the DAX name. There were also questions about how per-file flags interact with the global mount option, including whether a nodax mount option was required. Most seemed to think that option was not needed, however.

In summary, Ts'o said that he thought the overall consensus was to have a flag for empty directories that would be inherited on files created there. The flag could also be set for zero-length files and he heard no enthusiasm for allowing the flag to be cleared once it was set. He plans to summarize the discussion in a post to the relevant mailing lists (fsdevel and DAX) for further discussion.

Comments (23 posted)

A filesystem for virtualization

By Jake Edge
May 14, 2019

LSFMM

A new filesystem aimed at sharing host filesystems with KVM guests, virtio-fs, was the topic of a session led by Miklos Szeredi at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. The existing solution, which is based on the 9P filesystem from Plan 9, has some shortcomings, he said. Virtio-fs is a prototype that uses the Filesystem in Userspace (FUSE) interface.

The existing 9P-based filesystem does not provide local filesystem semantics and is "pretty slow", Szeredi said. The FUSE-based virtio-fs (RFC patches) is performing "much better". One of the ideas behind the new filesystem is to share the page cache between the host and guests, so there would be no data duplication for multiple guests accessing the same files from the host filesystem.

There are still some areas that need work, however. Metadata and the directory entry cache (dcache) cannot be shared, because data structures cannot be shared between the host and guests. There are two ways to handle that. Either there can be a round trip from the guest to the host for each operation to ensure the coherence of the metadata cache and dcache, or the guest can cache that information and somehow revalidate the cache on each operation without going to the host kernel.

[Miklos Szeredi]

The question is what the best solution would be, he said. For example, if a file has changed on the host, the modification time is updated and a stat() on the guest should indicate that. There have been some discussions on how to get notifications from the host kernel to the guest; the notifications would be propagated via a ring buffer in memory. When the guest caches an inode, it could tell the host that it wants notifications for that inode. When it gets a notification, the guest can revalidate its cache. If the ring buffer overflows for some reason, the guest will need to revalidate all of its caches.

Amir Goldstein asked if that mechanism could also be used by Samba to implement its own dcache. Trond Myklebust said that what Szeredi was talking about was an asynchronous notification mechanism, while Samba needs something synchronous. The problem with doing synchronous notifications, Szeredi said, is that the guest should not be able to block operations in the host kernel.

Another topic is POSIX file locking, he said. It is difficult to write a user-space filesystem that allows POSIX locking to work consistently with the host filesystem. The kernel NFS server (knfsd) uses kernel-internal functions to do its locking, but he is not sure what user-space NFS servers do.

The traditional way to handle that is with a user-space lock manager that takes the standard POSIX locks as needed, Myklebust said. Szeredi asked if it would make sense to add a kernel interface for the kernel-internal locking used by knfsd. Boaz Harrosh said that the Ganesha NFS server had a similar problem; it used open file description locks (OFD locks), which put the lock on the struct file so that multiple threads can use the locks successfully, unlike POSIX locks.

Szeredi said the idea was to have POSIX locks that work across guests and the host. Steve French said that Samba also uses OFD locks, which is what he recommended. They have easier semantics, in part because they are not dropped when an unrelated file descriptor for the same file is closed, as POSIX locks are. It is a solution that was added partly for NFS, he said. Szeredi said that it sounded like the conclusion is that it is not worth it to make a new kernel interface for POSIX locks.

Another area that needs attention is the handling of the ctime and mtime timestamps stored for files. They record the time of the last metadata update (ctime) and file data update (mtime). If writes to the file are going to a shared page cache, it will cause the timestamps to be updated on the host filesystem, but only sometimes. That could lead to inconsistencies with the guests' metadata caches.

He is thinking about adding a flag to open() to turn off the updating of these timestamps, which would partially solve the problem. XFS already has a flag like this, but it is not exported to user space. That kind of flag may well have security implications, he said. Goldstein said that he thought the flag was added for Data Management API (DMAPI) support in XFS so that it could make changes to files without updating the timestamps. But DMAPI has been deprecated for XFS, which is probably why the flag is not exported.

The worry about such a flag is that changes can be made to a file's contents without anyone noticing, Myklebust said. That is why it was not added to POSIX, he believes. The solution to the problem is to implement a proper version field that gets exported from the inode.

Comments (10 posted)

NFS topics

By Jake Edge
May 14, 2019

LSFMM

Trond Myklebust and Bruce Fields led a session on some topics of interest in the NFS world at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. Myklebust discussed the intersection of NFS and containers, as well as adding TLS support to NFS. Fields also had some container changes to discuss, along with a grab bag of other areas that need attention.

Myklebust began with TLS support for the RPC layer that underlies NFS. One of the main issues is how to do the upcall from the RPC layer to a user-space daemon that would handle the TLS handshake. There is kernel support for doing TLS once the handshake is complete; hardware acceleration of TLS was added in the last year based on code from Intel and Mellanox, he said. RPC will use that code, but there is still the question of handling the handshake.

[Trond Myklebust]

There are a few different options for the handshake. It could use the same kind of upcall that rpc.gssd does. That would require a daemon listening for handshake requests; once the handshake is complete it would hand the connection back to the kernel to use for the RPC traffic.

Another option would be to do something based on netlink. That would be more generic, but is more appropriate to discuss with the networking developers. He plans to determine if those developers are interested in using netlink for that and will fall back to using the rpc.gssd approach if not. He knows the latter works well with containers and the types of applications in use.

For containers, he said that a fair number of patches have gone into the kernel recently; more are queued up in Fields's and Anna Schumaker's trees. The main issue that still needs to be dealt with is how to handle user namespaces, he said. There are two problems there; the first is that the NFSv4 ID mapper (rpc.idmapd) has been using the keyring upcall interface, which does not support user namespaces at all. The kernel has not been able to map the kernel user ID (kuid) to the user ID inside the user namespace where rpc.idmapd is running. There are patches queued up to rectify that.

The other issue is which IDs get put on the wire. When using NFSv3 and some configurations of NFSv4, raw user and group IDs are sent by clients to the server. Should those be the kernel user/group ID or those of the container? The plan is to use the IDs from inside the namespace, as there is "no real point" in hiding them or translating them to something other than what is seen in the container.

Fields asked if there were other container gaps that Myklebust knew of. He said that there is still an issue with DNS lookups from within a container, because that also uses the keyring upcall interface. That should be fixed, but is not critical because it is not used for a lot of things.

The NFSv4 state in different containers should be different so tenants on the same system don't share the same client ID, Fields said, and wondered if that problem had been solved. Myklebust said that many containers do not set a hostname, which makes it difficult to determine what ID to use to set up a lease when talking to the NFS server. He has looked into a way to create an ID in a generic way so that other filesystems could also use it rather than doing something in an NFS-specific daemon.

Chuck Lever asked if he was referring to "clientid4" (which is part of the NFSv4 protocol). Myklebust said that he was; he has something working using a udev upcall into the container namespace, but hasn't yet published the patches. It will also be used for non-containerized NFS clients because there are a number of Linux distributions that do not require setting a hostname. That leads to a lot of "localhost.localdomain" leases, which is not desirable. The patches will require some cleanup, so it will be Linux 5.3 or later before they will be upstream.

At that point, Fields stepped up to the lectern. There is some amount of state that the server needs to store to track clients across server reboots so that clients can reclaim their locks, he said. The way that was being done did not work well for containers, but that has been fixed, though there are both kernel and user-space parts, so it may take a little while for it all to roll out.

[Bruce Fields]

He is concerned that the duplicate reply cache is global to the server, which means that it is shared with all containers. He is fairly certain that malicious clients could snoop the cache, but it could also be a source of bugs since clients could get the wrong cache entry. Either creating separate caches for each client or keying the cache entries using the network namespace should take care of that problem.

Lever asked about containerizing the performance metrics for the server. That does need to be looked at, Fields said. Some of the metrics should be global, but others may not be. What is needed is for more people to be using NFS from containers because there may be more corner cases that need to be handled, he said.

Server-to-server copy offload is something that is being worked on. When those patches were posted, Dave Chinner noticed some problems in the filesystem and VFS layers; Chinner sent out patches to fix those problems, but has not pushed them further, possibly due to lack of time. It requires someone picking them up and getting them upstream, Fields said.

Next up was delegations, which are a mechanism where the server can grant exclusive read or write access to a client for a file. Multiple clients can have read delegations for a file, but if another client opens the file for writing, the server needs to revoke any read delegations. But if there is only one client that has a read delegation and it is the one that writes to the file, there is no need to revoke the delegation.

In order to implement that, there is a need to track the client ID all the way through the VFS. Trying to plumb that through all of the VFS was not realistic. The second attempt, which was to put something in the task structure, was not popular with other developers. The latest attempt is to use the thread-group ID (tgid). In order to do that, he had to make some small modifications to the kernel thread daemon (kthreadd) so that the NFS daemon could run a private version to get all of its threads into the same thread group. Nobody seemed to have objections to that approach, but he is not sure who is supposed to review code in that area.

Steve French asked about adding file attributes, such as those now available using the statx() system call, to the protocol. There are some attributes that are being considered as part of the internet draft, such as the archive bit, Lever said. Others, like birth time, have already been added to the specification, Myklebust said, so they could be added to the Linux NFS implementation.

Ted Ts'o referred to the support for case insensitivity that has recently been added for ext4. He asked if there is anything that needs to be done so that NFS can also use it. Myklebust said there is work needed on both the NFS client and server in order to support that. He is waiting to see what lands in the kernel (the feature was merged for Linux 5.2 after LSFMM) before looking into all that needs to be done. He said that there will at least be changes needed for file name lookups and in managing the directory entry cache (dcache).

Comments (none posted)

Common needs for Samba and NFS

By Jake Edge
May 15, 2019

LSFMM

Amir Goldstein led a discussion on things that the two major network filesystems for Linux, Samba and NFS, could cooperate on at the end of day one of the 2019 Linux Storage, Filesystem, and Memory-Management Summit. In particular, are there needs that both filesystems have that the kernel is not currently providing? He had some ideas of areas that might be tackled, but was looking for feedback from the assembled filesystem developers.

He has only recently started looking at the kernel NFS daemon (knfsd), as it is a lesser use case for the customers of his company's NAS device; most use Samba (i.e. SMB). He would like to see both interoperate better with other operating systems, though.

[Amir Goldstein]

He noted that he had asked some Samba developers why it has been hard to get the features they need into the kernel. The "vibe" he got is that they were rather intimidated by the kernel community. But he believes that the last time they tried, "they did it wrong". They wanted Samba-specific features, such as preventing writes or reads to a specific file, to be enforced by the VFS. Those kinds of changes were not acceptable to the VFS maintainers, however.

In talking with some NFS developers, it was agreed that Samba and NFS should "talk amongst themselves" to find areas where both needed kernel support. There are likely things that either can only be done in the kernel or are far better done in the kernel. If there is a minimal set of infrastructure that the kernel could provide to help solve those problems, it may be possible to get it added.

He started with opportunistic locks (OpLocks). In SMB, these locks can be requested when the client takes a lease on a file. If granted, that means the client does not need to flush its changes to the server as long as it holds the lock. If another client is accessing the file, the server can revoke the OpLock. There are large performance gains that OpLocks provide, so it is important to be able to fully support them.

Samba will use OpLocks, but only if it is configured in the mode that it will be the only user of the filesystem it is serving. If, say, an NFS server could also be touching the filesystem, Samba does not use OpLocks at all. There are also two levels of OpLocks; Samba uses both when it is configured as the only thing touching the filesystem.

Steve French said there is more to it than that. OpLocks are identified with a key, which would allow level-1 locks to be upgraded to level 2, but doing an upgrade is not implemented for Linux. He was not sure whether NFS could upgrade its locks or not. Trond Myklebust said that the NFS protocol allows upgrading, but that it is not implemented for Linux.

So Goldstein wondered what Samba and NFS need in order to be able to implement level 2. He plans to go to the Samba conference (sambaXP) in June and will gather more information there. One thing he knows needs to be done is to be able to track clients through the filesystem operations in order to manage their leases, which is similar to what Bruce Fields described for NFS in an earlier talk. That would take care of file leases.

For directory leases, Goldstein has been working on a way to get notifications of directory changes. He has hooks for doing synchronous notifications, but there are sensible concerns about exporting those, he said. He thinks that perhaps exposing them in a way similar to leases, with a timeout in case there is no reply, might be possible.

One of the problems that NFS has with leases, Myklebust said, is that it doesn't detect that they have timed out. If you miss your notification window, ideally the file descriptor would be closed or something else would "fence off" access to the file. That would be an indication that some kind of recovery action needs to be taken. Goldstein said that he is not tied to any specific solution; he is just bringing up areas that might need to be addressed.

Next up was share modes that can be specified when a file is opened. Those modes can request that any other opens of the file, for say read, write, or both, will fail. There is a patch set from five years ago that implemented a flavor of mandatory lock to support these modes, but it took the approach of enforcing the modes in the VFS.

Goldstein thinks that Samba and NFS could cooperate on a new flavor of lock that only they would use. It would be sort of similar to the BSD flock() system call, he said. New open() flags would be added to request this exclusive access (e.g. O_DENYWRITE and O_DENYREAD), but they would only be implemented for filesystems that want to enforce that.

Ted Ts'o said that some kind of document outlining the various problems is going to be needed. He suggested that it would be helpful to explain what the downsides to not implementing those features would be. Samba works pretty well, so explaining why kernel developers should care about these additional features may help smooth their path.

Goldstein said that he was mostly just trying to inform the other filesystem developers of his plans. Overall, he does not expect all that much to come of this effort, but "maybe we can move the needle". His list is not comprehensive by any means; he does not know if such a thing exists. Mainly, he wants to start the conversation about these network filesystem needs.

Comments (41 posted)

The future of Docker containers

May 15, 2019

This article was contributed by Sean Kerner

Michael Crosby is one of the most influential developers working on Docker containers today, helping to lead development of containerd as well as serving as the Open Container Initiative (OCI) Technical Oversight Chair. At DockerCon 19, Crosby led a standing-room-only session, outlining the past, present and — more importantly — the future of Docker as a container technology. The early history of Docker is closely tied with Linux and, as it turns out, so too is Docker's future.

Crosby reminded attendees that Docker started out using LXC as its base back in 2013, but it has moved beyond that over the past six years, first with the Docker-led libcontainer effort and more recently with the multi-stakeholder OCI effort at the Linux Foundation, which has developed an open specification for a container runtime. That specification is implemented by the runc container runtime, which is at the core of the open-source containerd project that Crosby helps to lead. Containerd is a hosted project at the Cloud Native Computing Foundation (CNCF) and is one of only a handful of projects that, like Kubernetes, have "graduated", putting it in the top tier of the CNCF hierarchy in terms of project stability and maturity.

Docker 19.03

Docker has both a slow-moving enterprise edition and a more rapidly released community edition. At DockerCon 19, Docker Enterprise 3.0 was announced, based on the Docker Community Edition (CE) 18.09 milestone. Docker developers are currently working on finalizing the next major release of Docker CE, version 19.03.

The time-based nomenclature for Docker CE releases would imply that 19.03 should have been a March release, but that's not the case. Docker's numbered releases have been somewhat delayed of late, with release dates not matching up with actual general availability. For example, the current Docker CE 18.09 milestone became generally available in November 2018, not September 2018 as the version number would seem to imply. The number is, however, more closely aligned with the feature-freeze date for releases. The GitHub repository for Docker CE notes that the feature freeze for the 19.03 release did not occur until March 22. The beta 4 release is currently scheduled for May 13, with the final general-availability date listed as "to be determined" sometime in May 2019.

Crosby said that among the big new features set to land in Docker CE 19.03 is full support for NVIDIA GPUs, marking the first time that Docker has had seamlessly integrated GPU support. Crosby added that NVIDIA GPU support will enable container workloads to take full advantage of the additional processing power offered by those GPUs, which is often needed for artificial-intelligence and machine-learning use cases.

Containerd is also getting a boost, advancing to version 1.2 inside Docker CE. Containerd 1.2 benefits from multiple bug fixes and performance gains. Among the new capabilities that have landed in this release is an updated runtime that integrates a gRPC interface that is intended to make it easier to manage containers. Overall, Crosby commented that many of the common foundational elements of Docker have remained the same over time.

"Even though we've had kind of the same primitives from back in 2013 in Docker, they've been optimized and matured," Crosby said.

The future of Docker

Docker containers were originally all about making the best use possible of Linux features. Just as Docker containers started out based on a collection of Linux kernel features, the future of Docker is about making the best use of newer kernel features. "Containers are made up of various kernel features, things like cgroups, namespaces, LSMs, and seccomp," he said. "We have to tie all those things together to create what we know of now as a container."

Looking forward to what's next for containers and Docker, Crosby said that it's all about dealing with the different requirements that have emerged in recent years. Among those requirements is the need to take advantage of modern kernel features in Linux 5.0 and beyond, as well as dealing with different types of new workloads, including stateful workloads, which require a degree of persistence that is not present in stateless workloads. Edge workloads for deployments at the edge of the network, rather than just within a core cloud, are another emerging use case. Internet of Things (IoT) and embedded workloads in small footprint devices and industrial settings are also an important use case for Docker in 2019.

One of the Linux kernel features that Docker will take full advantage of in the future is eBPF, which will someday be usable to write seccomp filters. Crosby explained that seccomp and BPF allow for flexible system call interception within the kernel, which opens the door for new control and security opportunities for containers.
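Crosby did not show code, but the interception mechanism is easy to sketch. Today's seccomp filters are classic-BPF programs that inspect struct seccomp_data for each system call; an eBPF-based seccomp would let such policies be expressed far more flexibly. The hypothetical Python snippet below builds (but does not install) a minimal filter that denies one system call and allows all others; the constants come from the kernel's BPF and seccomp headers:

```python
import struct

# Classic BPF opcode constants (from <linux/bpf_common.h>) and seccomp
# return values (from <linux/seccomp.h>).
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_JMP, BPF_JEQ, BPF_K = 0x05, 0x10, 0x00
BPF_RET = 0x06
SECCOMP_RET_ALLOW = 0x7fff0000
SECCOMP_RET_ERRNO = 0x00050000   # fail the call with an errno, don't kill

def bpf_stmt(code, k):
    # struct sock_filter: u16 code, u8 jt, u8 jf, u32 k (8 bytes each)
    return struct.pack("HBBI", code, 0, 0, k)

def bpf_jump(code, k, jt, jf):
    return struct.pack("HBBI", code, jt, jf, k)

def deny_syscall_filter(nr):
    """Filter that returns EPERM for syscall number `nr`, allows the rest."""
    return b"".join([
        bpf_stmt(BPF_LD | BPF_W | BPF_ABS, 0),            # load seccomp_data.nr
        bpf_jump(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),    # match? fall through
        bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1), # deny with EPERM (1)
        bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),     # otherwise allow
    ])

prog = deny_syscall_filter(41)   # 41 is socket() on x86-64
```

Installing the program would take a prctl(PR_SET_NO_NEW_PRIVS) followed by seccomp(SECCOMP_SET_MODE_FILTER); the point here is only the shape of the policy that a runtime like Docker generates from its (much larger) default profile.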

Control groups (cgroups) v2 is another Linux feature that Docker will soon benefit from. Cgroups v2 has been in the kernel since the Linux 4.5 release, though it wasn't immediately adopted as a supported technology by Docker. The project isn't alone in its slow adoption; Red Hat's Fedora community Linux distribution also has not yet integrated cgroups v2, though it plans to for the Fedora 31 release that is currently scheduled for November. Crosby said that cgroups v2 will give Docker better resource-isolation and management capabilities.
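The practical difference for a runtime is the layout of the control interface: v1 spreads a container across one hierarchy per controller, while v2 gives each group a single directory of interface files (memory.max, cpu.max, and so on) under one unified tree. As a rough sketch, assuming a hypothetical cgroup path, the mapping a runtime performs from resource limits to file writes looks like this:

```python
def cgroup2_limit_files(cgroup, mem_max=None, cpu_max=None):
    """Map container resource limits to the cgroup v2 files a runtime
    would write. Unlike v1 (/sys/fs/cgroup/memory/..., /sys/fs/cgroup/cpu/...),
    v2 keeps every controller in one directory per group."""
    base = "/sys/fs/cgroup/" + cgroup
    writes = {}
    if mem_max is not None:
        writes[base + "/memory.max"] = str(mem_max)       # bytes
    if cpu_max is not None:
        quota, period = cpu_max                            # microseconds
        writes[base + "/cpu.max"] = "%d %d" % (quota, period)
    return writes

# e.g. half a CPU and 512MB for a container's group:
limits = cgroup2_limit_files("docker/abc123",
                             mem_max=512 * 1024 * 1024,
                             cpu_max=(50000, 100000))
```

The file names memory.max and cpu.max are real cgroup v2 interface files; the "docker/abc123" group path is invented for illustration.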

Enhanced user namespace support is also on the roadmap for Docker as part of a broader effort for rootless containers; it will help to improve security by not over-provisioning permissions by default to running containers. The idea of running rootless Docker containers with user namespaces is not a new one, but it's one that is soon to be a technical reality. "Finally, after all these years, user namespaces are in a place where we can really build on top of them and enable unprivileged containers," he said.
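The security gain comes from the UID translation that user namespaces perform: root inside the container is just an unprivileged user on the host. The kernel expresses this as (inside, outside, count) ranges written to /proc/&lt;pid&gt;/uid_map; the following sketch (the mapping values are illustrative, not anything Crosby presented) shows how a typical rootless mapping translates IDs:

```python
def map_uid(uid, uid_map):
    """Translate a UID seen inside a user namespace to the host UID, using
    the (inside, outside, count) ranges from /proc/<pid>/uid_map."""
    for inside, outside, count in uid_map:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return 65534   # the kernel's overflow UID ("nobody") for unmapped IDs

# A typical rootless setup: in-container root is unprivileged host uid 1000,
# and uids 1..65535 come from a subordinate range (see /etc/subuid).
rootless = [(0, 1000, 1), (1, 100000, 65535)]
```

So a process running as root in the container can, at worst, act with the privileges of host UID 1000, which is what makes unprivileged containers interesting from a security standpoint.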

More kernel security support is also headed to Docker in the future. Crosby said that SELinux and AppArmor are no longer the only Linux Security Modules (LSMs) that developers want. Among the new and emerging LSMs that Docker developers are working to support in the future is Landlock. Crosby added that developers will also have the ability to write their own custom LSMs with eBPF. Additionally, he highlighted the emergence of seccomp BPF.

Making containers more stateful

One of the areas that Crosby is most interested in improving is the stateful capabilities of Docker, which in his view are currently rather limited. Better stateful capabilities would include backup, restore, clone, and migrate operations for individual containers. Crosby explained that stateful management in Docker today typically relies on storage volumes rather than the containers themselves.

"We kind of understand images now as being portable, but I also want to treat containers as an object that can be moved from one machine to another," Crosby said. "We want to make it such that the RW [read/write] layer can be moved along with the container, without having to rely on storage volumes." Crosby added that he also wants to make sure that not only the container's filesystem data is linked, but also the container configuration, including user-level data and networking information.

Rethinking container image delivery

Container images today are mostly delivered via container registries, like Docker Hub for public access, or an internal registry deployment within an organization. Crosby explained that Docker images are identified with a name, which is basically a pointer to content in a given container registry. Every container image comes down to a digest, which is a content address hash for the JSON files and layers contained in the image. Rather than relying on a centralized registry to distribute images, what Crosby and Docker are now thinking about is an approach whereby container images can also be accessed and shared via some form of peer-to-peer (P2P) transfer approach across nodes.
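That content addressing is what would make P2P distribution safe: a blob's digest is simply the hash of its bytes, so it can be verified no matter which peer delivered it. A minimal sketch (the manifest fields are a simplification of the real image-manifest format):

```python
import hashlib
import json

def digest(blob):
    """Content address as used by registries: algorithm prefix + hex hash."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

# A name like docker.io/library/alpine:3.9 ultimately resolves to the digest
# of a manifest, which in turn lists the digests of the config and layers.
manifest = json.dumps({
    "schemaVersion": 2,
    "layers": [{"digest": digest(b"layer-0 tar+gzip bytes")}],
}, sort_keys=True).encode()

ref = digest(manifest)   # an image can also be pulled as name@<ref>
```

Because the name-to-digest step is the only part that needs trust, only that lookup has to involve the registry; the blobs themselves can come from anywhere, which is exactly the split Crosby described.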

Crosby explained that a registry would still be needed to handle the naming of images, but the content address blobs could be transferred from one machine to another without the need to directly interact with the registry. In the P2P model for image delivery, a registry could send a container image to one node, and then users could share and distribute images using something like BitTorrent sync. Crosby said that, while container development has matured a whole lot since 2013, there is still work to be done. "From where we've been over the past few years to where we are now, I think we'll see a lot of the same type of things and we'll still focus on stability and performance," he said.

A video of this talk is available.

Comments (18 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: ZombieLoad; MDS reading list; Kernel Summit planning; Firefox add-ons; PHP; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds