
LWN.net Weekly Edition for February 14, 2019

Welcome to the LWN.net Weekly Edition for February 14, 2019

This edition contains the following feature content:

  • France enters the Matrix: Matthew Hodgson's FOSDEM report on Matrix's road to 1.0 and its adoption by the French government.
  • Avoiding the coming IoT dystopia: Bradley Kuhn on how copyleft enforcement can keep Internet-of-Things devices hackable.
  • Blacklisting insecure filesystems in openSUSE: a proposal to stop automatically loading old, unmaintained filesystem modules.
  • Concurrency management in BPF: BPF spinlocks and the question of a BPF memory model.
  • io_uring, SCM_RIGHTS, and reference-count cycles: how file reference-count cycles are created and broken.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

France enters the Matrix

February 11, 2019

This article was contributed by Tom Yates


FOSDEM

Matrix is an open platform for secure, decentralized, realtime communication. Matthew Hodgson, the Matrix project leader, came to FOSDEM to describe Matrix and report on its progress. Attendees learned that it was within days of having a 1.0 release and found out how it got there. He also shed some light on what happened when the French government reached out to see whether Matrix could meet the internal messaging requirements of an entire national government.

From a client's viewpoint, Matrix is a thin set of HTTP APIs for publish-subscribe (pub/sub) data synchronization; from a server's viewpoint, it's a rich set of HTTP APIs for data replication and identity services. On top of these APIs, application servers can provide any service that benefits from running on Matrix. Principally, that has meant interoperable chat, but Hodgson noted that any kind of JSON data could be passed, including voice over IP (VoIP), virtual or augmented reality communications, and IoT messaging. That said, Matrix is independent of the transport used; although current Matrix-hosted services are built around HTTP and JSON, more exotic transports and data formats can be used and, at least in the laboratory, have been.
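
As a concrete, if simplified, illustration of how thin that client-side API is, sending a text message to a room with the r0 client-server API is a single authenticated HTTP request. This example is not from the talk; the homeserver name, room ID, transaction ID, and access token are placeholders:

    curl -X PUT \
         -H 'Content-Type: application/json' \
         -d '{"msgtype": "m.text", "body": "Hello from the HTTP API"}' \
         'https://matrix.example.org/_matrix/client/r0/rooms/!someroom:example.org/send/m.room.message/txn1?access_token=SECRET'

The rest of the client-side API (syncing, room membership, and so on) has the same flavor: JSON documents moved over ordinary HTTP.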

Because Matrix is inherently decentralized, no single server "owns" the conversations; all traffic is replicated across all of the involved servers. If you are using your server to talk to someone on, say, a gouv.fr server, and your server goes down, then because their server also has the whole conversation history, when your server comes back up, it will resync so that the conversation can continue. This is because the "first-class citizen" in Matrix is not the message, but the conversation history of the room. That history is stored in a big data structure that is replicated across a number of participants; in that respect, said Hodgson, Matrix is more like Git than XMPP, SIP, IRC, or many other traditional communication protocols.

[Matthew Hodgson]

Matrix is also end-to-end encrypted. Because your data is replicated across all participating servers, said Hodgson, if it's not end-to-end encrypted, it's a privacy train wreck. The attack surface grows each time a new participant enters a conversation and gets a big chunk of conversation history synchronized to their server. In particular, admitting a new participant to your conversation should not mean disclosing the conversation history to whoever administers their server; end-to-end encryption removes that possibility.

Matrix was started back in May 2014. By Hodgson's admission, the first alpha release in September 2014 was put together far too quickly, in a process where "everybody threw lots of Python at the wall, to see what would stick". In 2015, federation became usable, Postgres was added as an internal database engine alongside SQLite, and IRC bridging was added. Later that year the project released Vector as its flagship client, which meant also releasing the first version of the client-server API.

In 2016, much hard work was done on scaling, the Vector client was rebranded as Riot (which it remains today), and end-to-end encryption was added. This latter feature turned out to be difficult in a decentralized world: if you have a stable decentralized chat room, and someone's server goes offline for a few hours, during which time a couple more devices are added to the server, then when the server comes back into federation, should those new devices be in the conversation or not? To Hodgson, this is as much a philosophical question as it is a technical one, but it had to be answered before end-to-end encryption could be implemented.

In 2017, the project added a lot of shiny user-interface whizziness: widgets, stickers, and the like, then in 2018 tried to stabilize all this by feature-freezing and pushing hard towards version 1.0. This push has included the necessary step of setting up a foundation to act as a neutral guardian of the standard. That will allow others to build on it knowing that it's stable and won't be changed at a moment's notice to suit Hodgson and the Matrix developers. Creating that stable base to hand off to the foundation meant nailing down all the protocol APIs, which has not been without pain. Some of them, particularly the federation API, have needed significant changes to correct design errors in the original specifications.

Rolling these changes out has been harder than it should have been because the Matrix developers didn't include protocol versioning in everything from the outset. Hodgson pleaded with the audience, should any of us ever build a protocol, to "make damn sure you can ratchet the version of everything from the outset, otherwise you just paint yourself into a corner. One day you discover a bug in your federation API, and then, before you can fix it, you have to retrofit a whole versioning system".

The move to 1.0 has also meant a complete rethink of certificate management. Back in 2014 the Matrix project decided to abjure traditional certificate authorities; instead, self-signed certificates would be used. To democratize decisions about who to trust, notary servers would be implemented to build a consensus about the trustability of a given TLS certificate. It was, in Hodgson's words, a disaster. While the developers were trying to fix it, Let's Encrypt came along, which made the process of getting properly-signed certificates trivial, so the self-signing experiment has been abandoned. The 0.99 version of the home server code is ACME-capable and can get a certificate directly from Let's Encrypt; the 1.0 version removes support for self-signed certificates.

At 2am on the day of Hodgson's talk, February 2, the server-server API was released at version 0.1. All five core APIs are now stable, he said, which is a necessary precondition for a 1.0 release. Two of them (the client-server and identity APIs) still need final tweaks, but those tweaks are expected to have landed by the time this article is published, at which point Matrix will officially exit its beta period.

Matrix has already been pretty successful. Hodgson presented two graphs (active users, publicly-visible servers) both with pleasing-looking exponential curves on them. The matrix.org server has about half of the seven million globally visible accounts, so it was interesting that Hodgson announced the long-term intention to turn that server off. Apparently, once it becomes common for other organizations to offer a federated Matrix server, there will be no need for a "default" server for new users to land on. It's clear that Hodgson really does regard Matrix as a fully federated system, rather than something centralized, or centralized-plus-hangers-on.

The French connection

In early 2018, DINSIC, which Hodgson described as the "French Ministry of Digital", reached out to one of the developers working on the Riot client for Android to ask if they might get a copy for their own purposes. On investigation, it turned out that what the ministry wanted was end-to-end encrypted, decentralized communication across the entire French government, running on systems that France could provision and control. The agency thought that Matrix might be an excellent platform for providing this.

One might, said Hodgson, question why a government wanted a decentralized system. It turns out that governments, at least in France, only look centralized; in real life they're made up of ministries, departments, sub-departments, schools, hospitals, universities, and so forth. Each of these organizations will have its own operational requirements, security model and policies, and so on; a federated solution allows server operation to be decentralized to whatever extent makes the most sense for each organization and for the user community. In France, the user community turns out to be about 5.5 million users — that is, about 9% of the population of the country. In addition, although this was to be a standalone deployment, DINSIC wanted the ability to federate publicly, to be able to connect to other governments, suppliers, contractors, etc., using their shiny new system but without all those external people needing accounts on it.

Because of the federated nature of what was being sought, end-to-end encryption was a requirement, which Matrix was in a position to provide. But DINSIC also wanted enterprise-grade anti-virus (AV) support, so that the ability to share documents, images, and other data through the system didn't present an exciting new infection vector. As Hodgson pointed out, AV is pretty much entirely incompatible with end-to-end encryption: if you've built a system that prevents anyone except sender and receiver from knowing what's in a file, how's a third party going to intercept it en route to scan it for known-harmful content?

In the end, this required adding the ability for files to be exfiltrated from Matrix to an arbitrary external scanning service. This service is provided with the URL for a piece of encrypted content, plus the encryption keys for that content — not for the message in which the content appears, or for the room in which the message appears — encrypted specifically for the content-scanning service. The scanning service then retrieves the content, decrypts and scans it, and proxies the result back into Matrix. Having acquired this ability, Matrix scans content both on upload and download, for extra security; the code to do all this will be making its way back into mainstream Matrix in the near future.

Having committed to using Matrix, DINSIC started work in May 2018 on Tchap, its fork of the Riot client. The current state of the work can be found on GitHub; user trials started in June. The French National Cybersecurity Agency has audited the system, as has an external body. As of January 2019, it's being rolled out across all French ministries, which involves a great deal of Ansible code. Hodgson gave a quick demo of Tchap, noting that at the moment he feels that Tchap is probably more usable than the mainstream Riot client, not least because DINSIC has had a professional user-experience (UX) agency working hard on it.

It's no secret that FOSDEM suffers from having a fixed selection of rooms, of fixed sizes, which often don't correspond to the demand for the activities in those rooms. This year, the queue for entry to the JavaScript devroom, for example, was so enormous that the door itself was quite invisible for most of the weekend. On the flip side, the huge Janson auditorium that is used for the keynotes often has to host talks given to audiences that, though they'd fill most other rooms, look a little lost in that vast cavern. Hodgson gave his talk to a Janson auditorium with virtually no free seats — the largest audience I have ever seen in there for a non-keynote talk.

Clearly, there is some interest in the project. Some of this will be curiosity about the French connection, but much will be because Matrix is a fully-working professional system that can be used productively right now. If you're in one of those organizations that uses Slack for nearly everything, there is no good reason not to start looking at a migration to Matrix, because if the migration is successful, you can bring your messaging data back in-house. Free software is about the user having control, and Matrix honors that promise.

For anyone who'd like to see the whole talk, the video can be found here.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Brussels for FOSDEM.]

Comments (23 posted)

Avoiding the coming IoT dystopia

By Jake Edge
February 12, 2019

LCA

Bradley Kuhn works for the Software Freedom Conservancy (SFC) and part of what that organization does is to think about the problems that software freedom may encounter in the future. SFC worries about what will happen with the four freedoms as things change in the world. One of those changes is already upon us: the Internet of Things (IoT) has become quite popular, but it has many dangers, he said. Copyleft can help; his talk is meant to show how.

It is still an open question in his mind whether the IoT is beneficial or not. But the "deep trouble" that we are in from IoT can be mitigated to some extent by copyleft licenses that are "regularly and fairly enforced". Copyleft is not the solution to all of the problems, all of the time—no idea, no matter how great, can be—but it can help with the dangers of IoT. That is what he hoped to convince attendees of in his talk.

A joke that he had seen at least three times at the conference (and certainly before that as well) is that the "S" in IoT stands for security. As everyone knows by now, the IoT is not about security. He pointed to some recent incidents, including IoT baby monitors that were compromised by attackers in order to verbally threaten the parents. This is "scary stuff", he said.

[Bradley Kuhn]

The IoT web cameras that he uses at his house to monitor his dogs have a "great way" to avoid any firewalls they may be installed behind. They simply send the footage to the manufacturer's servers in China so that he can monitor his dogs from New Zealand or wherever else he might be. So there may be Chinese "hackers" watching his dogs all day; he hopes they'll call if they notice the dogs are out of water, he said to laughter.

As a community, we are quite good at identifying problems, which is why that joke has been repeated so often, for example. And even though we know that the IoT ecosystem is a huge mess, we haven't really done anything about it. Maybe the task just seems too large and too daunting for the community to be able to do something, but Linux has always provided the ultimate counter-example.

Many of us got started in Linux as hobbyists. There was no hardware with Linux pre-installed, so we had to get it running on our desktops and laptops ourselves. That is how he got his start in the Linux community; when Softlanding Linux System (SLS) didn't work on his laptop in 1992, he was able to comment out some code in the kernel, rebuild it, and get it running. Now it is not impossible to find pre-installed Linux on laptops but, in truth, most people still install Linux on their laptops themselves, much as they did in 1992.

The hobbyist culture, where getting Linux on your laptop meant putting it there yourself, is part of what made the Linux community great, Kuhn said. The fact that nearly everyone at his talk was running Linux on their laptop (and only a tiny fraction of those had it come pre-installed) is a testament to the hard work that many have done over the years to make sure we can buy off-the-shelf hardware and run Linux on it. It was the hobbyist culture that made that happen and they did so in spite of, not with the help of, the manufacturers.

If he said that he didn't think it was important that users be able to install Linux on their laptops, no one would think that was a reasonable position—"not at LCA, anyway". But that is exactly what we are being told in IoT-land. The ironic thing is that most of those devices do come with Linux pre-installed. So you still can't buy a laptop with Linux on it, for the most part, but you can't buy a web camera (or baby monitor) without Linux on it.

If, in 1992, we had heard that 90+% of small devices would come with Linux pre-installed in 2019, we would have been ecstatic, he said. But the problem is that to a large extent, there are no re-installs. Some people do install alternative firmware on their devices, but the vast majority do not. And, in fact, most IoT device makers seek to make it impossible for users to re-install Linux.

Help from the GPL

But the GPL is "one of the most forward-looking documents ever written—at least in software", Kuhn said. While the GPL did not anticipate the advent of IoT, it already contains the words needed to help solve the problems that come with IoT: "the scripts used to control compilation and installation of the executable". The freedom to study the code is not enough; the GPL requires more than just the source code so that users can exercise their other software freedoms, which includes the freedom to modify and to install modified versions of the code.

The Linksys WRT54G series of home WiFi routers is one of the first IoT devices to his way of thinking. These devices brought a major change to people's homes. Like Rusty Russell (whose keynote earlier that day clearly resonated with Kuhn), he remembers going to conferences that had no WiFi. He was not surprised to hear that someone printed out the front page of Slashdot at the first LCA.

The WRT54G was the first device where "we did things right": a large group of people and organizations got together and enforced the GPL. That led to the OpenWrt project that is still thriving today. The first commit to the project's repository was the source code that the community had pried out of the hands of Linksys (which, by then, had been bought by Cisco).

But the worry is that OpenWrt is something of a unicorn. There are few devices that even have an alternative firmware project these days. It is, he believes, one of the biggest problems we face in free software. We need to have the ability to effectively use the source code of the firmware running in our devices.

Another example of a project that came about from GPL-enforcement efforts is SamyGO, which creates alternative firmware for Samsung televisions. That project is foundering at this point, he said, which is sad. He does not understand why the manufacturers don't get excited by these projects; at the very least, they will sell a few more devices. As an example, the WRT54G series is the longest-lived router product ever made; it may, in fact, be the longest-lived digital device of any kind.

GPL enforcement can make these kinds of projects happen. But there is a disconnect between two different parts of our community. The Linux upstream is too focused on big corporations and their needs, to the detriment of the small, hobbyist users, he said to applause. Multiple key members of the Linux leadership take the GPL "cafeteria style"; they really only want the C files that have changed and do not care if the code builds or installs.

But all of those leaders, and lots of others, got their start by hacking on Linux on devices they had handy. Kuhn pointed to a 2016 post by Matthew Garrett that described this problem well:

[...] do you want 4 more enterprise clustering filesystems, or another complete rewrite of the page allocator for a 3% performance improvement under a specific database workload, or do you want a bunch of teenagers who grow up hacking this stuff because it's what powers every device they own? Because honestly I think it's the latter that's helped get you where you are now, and they're not going to be there if the thing that matters to you most is making sure that large companies don't feel threatened rather than making sure that the next 19 year old in a dorm room can actually hack the code on their phone and build something better as a result. It's what brought me here in the first place, and I'm hardly the only one.

The next generation of developers will come from the hobbyist world, not from IBM and other corporations, Kuhn said. If all you need to hack on Linux is a laptop and an IoT device, that will allow lots of people, from various income levels, to participate.

Users and developers

In the software world, there is a separation between users and developers, but free software blurs that distinction. It says that you can start out as a user but become a developer if you ever decide you want to; the source code is always there waiting. He worries that our community will suffer if, someday, the only places you can install Linux are in big data centers and, perhaps, in laptops. If all the other devices that run Linux can only be re-installed by their manufacturers, where does the next crop of upstream developers come from? That scenario would be a dystopia, he said; we may have won the battle on laptops, but we're losing it for IoT.

Linux is the most important GPL program ever written, Kuhn said; it may be the most important piece of software ever written, in fact. It was successful because of, not in spite of, its GPL license; he worries that history is being rewritten on that point. Linux was also successful because users could install it on the devices they had access to. He believes that the leaders of upstream Linux are making a mistake by not helping to ensure that users can install their modified code on any and every device that runs Linux. "Tinkering is what makes free software great"; without being able to tinker, we don't have free software.

Upstream does matter, but it does not matter as much as downstream. He said he was sorry to any upstream developers of projects in the audience, but that their users are more important than they are. "It's not about you", he said with a grin.

It is amazing to him, and was unimaginable to him in 1992, that there are many thousands of upstream Linux developers today. But he really did not anticipate that two billion people would have Linux on their devices; that is even more astounding than the number of developers. But most of those two billion potential Linux developers aren't actually able to put code on their devices because those devices are locked down, the process of re-installing is too difficult, or the alternative firmware projects haven't found the time to do the reverse engineering needed to be able to do so.

The upstream Linux developers are important; they are our friends and colleagues. But upstream and downstream can and do disagree on things. Kuhn strongly believes that there is a silent plurality and a loud minority that really want to see IoT devices be hackable, re-installable, changeable, and "study-able". That last is particularly important so that we can figure out what the security and privacy implications of these devices are. Once again, that was met with much applause.

Being able to see just the kernel source is not going to magically solve all of these problems, but it is a "necessary, even if not sufficient condition". We have to start with the kernel; if some device maker wants to put spyware in, the kernel is an obvious place to do so. The kernel's license is fixed, and extremely hard to change as we frequently hear, but it is the GPLv2, which gives downstream users the rights they need. Even if upstream does not prioritize those rights the same way as its downstream users do, it has been quite nice and licensed its code in a way that assures software freedom.

There is no need for a revolution to get what we need for these IoT devices. The GPL already has the words that we need to ensure our ability to re-install Linux on these devices. We simply need to enforce those words, he said.

Call to action

Kuhn put out a call to action for attendees and others in our community. There are some active things that people can do to help this process along. The source code offer needs to be tested for every device that has one. Early on, companies figured out that shipping free software without an offer for the source code was an obvious red flag indicating a GPL violation. So they started putting in these offers with no intention of actually fulfilling them. They believe that almost no one will actually make the request.

That needs to change. He would like everyone to request the source code using the offer made in the manual of any device they buy. It is important that these companies start to learn that people do care about the source code, so even if you do not have the time or inclination to do anything with it, it should still be requested. Every time you buy a Linux-based device, you should have the source code "or something that looks like it" or you should request it.

If you have the time and skills, try to build and install the code on the device. If you don't, see if a friend can help or ask for help on the internet. Re-installing can sometimes brick your device, but that probably means that the company is in violation of the GPL. The idea is to create a culture within our community that publicizes people getting source releases and trying to use them; if they do not work, which is common, that will raise the visibility of these incomplete source releases.

"If it doesn't work, they violated the GPL, it's that simple", Kuhn said. You were promised a set of rights in the GPL and you did not get them. You can report the violation but, unfortunately, SFC is the only organization that is doing enforcement for the community at this point. It has a huge catalog of reports, so it may not be able to do much with the report, but the catalog itself is useful. It shows the extent of the problem and it helps the community recognize that there is a watchdog for the GPL out there; for better or worse, he said, that watchdog is SFC.

But he had an even bigger ask, he said. He is hoping that at least one person in the audience would step up to be the leader of a project to create an alternative firmware for some IoT device. It will have to be done as a hobbyist, because no company will want to fund this work, but it is important that every class of device has an alternative firmware project. Few people are working on this now, but if you are interested in the capabilities of some device, you could become the leader of a project to make it even better. "Revolutions are run by people who show up."

It feels like an insurmountable problem, even to him most days, but it does matter. It only requires that we exercise the rights that we already have, rights that were given to us by the upstream developers by way of the GPL.

Being able to rebuild and re-install Linux on these devices won't magically fix the numerous privacy and security flaws that they have—but it is a start. The OpenWrt project began by getting the source code it had been provided running; it then started adding features. Things like VPN support and bufferbloat fixes have improved those devices immeasurably. We can restore the balance of power by putting users back in charge of the operating systems on their devices; we did it once with laptops and servers and we can do it again for IoT devices.

A WebM format video of the talk is available, as is a YouTube version.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Christchurch for linux.conf.au.]

Comments (81 posted)

Blacklisting insecure filesystems in openSUSE

By Jonathan Corbet
February 8, 2019
The Linux kernel supports a wide variety of filesystem types, many of which have not seen significant use — or maintenance — in many years. Developers in the openSUSE project have concluded that many of these filesystem types are, at this point, more useful to attackers than to openSUSE users and are proposing to blacklist many of them by default. Such changes can be controversial, but it's probably still fair to say that few people expected the massive discussion that resulted, covering everything from the number of OS/2 users to how openSUSE fits into the distribution marketplace.

On January 30, Martin Wilck started the discussion with a proposal to add a blacklist preventing the automatic loading of a set of kernel modules implementing (mostly) old filesystems. These include filesystems like JFS, Minix, cramfs, AFFS, and F2FS. For most of these, the logic is that the filesystems are essentially unused and the modules implementing them have seen little maintenance in recent decades. But those modules can still be automatically loaded if a user inserts a removable drive containing one of those filesystem types. There are a number of fuzz-testing efforts underway in the kernel community, but it seems relatively unlikely that any of them are targeting, say, FreeVxFS filesystem images. So it is not unreasonable to suspect that there just might be exploitable bugs in those modules. Preventing modules for ancient, unmaintained filesystems from automatically loading may thus protect some users against flash-drive attacks.
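
The mechanism behind such a blacklist is ordinary modprobe configuration. The exact file names and entries that openSUSE would ship are not reproduced here, but a sketch of how a single filesystem module can be kept from loading automatically looks something like this:

    # Illustrative only; the file name and exact entries openSUSE ships
    # may differ.
    # /etc/modprobe.d/60-blacklist-fs.conf

    # Prevent alias-based autoloading, as happens when the kernel requests
    # "fs-jfs" while trying to mount a newly inserted JFS volume:
    blacklist jfs

    # Optionally make even an explicit "modprobe jfs" fail, for a harder
    # default-off policy:
    install jfs /bin/false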

If there were to be a fight over a proposal like this, one would ordinarily expect it to be concerned with the specific list of unwelcome modules. But there was relatively little of that. One possible exception is F2FS, the presence of which raised some eyebrows since it is under active development, having received 44 changes in the 5.0 development cycle, for example. Interestingly, it turns out that openSUSE stopped shipping F2FS in September. While the filesystem is being actively developed, it seems that, with rare exceptions, nobody is actively backporting fixes, and the filesystem also lacks a mechanism to prevent an old F2FS implementation from being confused by a filesystem created by a newer version. Rather than deal with these issues, openSUSE decided to just drop the filesystem altogether. As it happens, the blacklist proposal looks likely to allow F2FS to return to the distribution since it can be blacklisted by default.

The core of the debate, though, was over whether openSUSE should be blacklisting filesystem modules at all. Various participants made the claim that broad filesystem support was part of what attracted them to Linux in the first place, and that they would be upset if things stopped working. The plan is to continue to ship the blacklisted kernel modules, so getting support back would be a simple matter of (manually) deleting an entry from the list, but many users, it was argued, are not capable of making such a change. Blacklisting filesystem types, thus, makes the distribution less friendly for a certain class of users.

As others noted, though, that class of users is likely to be quite small. One would not expect to encounter many users who are running a system that dual-boots between Linux and, say, OS/2, but who also lack the skills to edit a configuration file. That is likely to be the case regardless of which Linux distribution is installed, but openSUSE in particular is not generally viewed as a beginner's Linux. On the other hand, blacklisting risky filesystems could potentially provide a layer of protection for millions of users — a rather larger group.

One commonly heard theme was that disabling these filesystems by default would put openSUSE at a competitive disadvantage with regard to other distributions that have not made this change. Users would note that some other distribution is able to mount their obscure filesystem while openSUSE is not (by default), so they would gravitate to competing distributions. This, Liam Proven said, would be bad:

To thrive, Linux distros have to attract users from other Linux distros. If you only get income from your existing customers, then you are going to die, because sometimes your customers will die. They will fail, or go bankrupt, or get bought out, or switch providers, or something.

This is a law of the market.

It is fair to say that many members of the openSUSE community do not see their situation that way and do not see maximizing the number of openSUSE users as being an important goal in its own right. Richard Brown asserted that "openSUSE is a community project, therefore one of our first concerns should always remain ensuring our Project is self-sustaining"; trying to support users whose interests diverge significantly from those of the community runs counter to that goal. Michal Kubecek agreed, saying:

We have limited resources and we have to weigh carefully what we use them for. Focusing on the kind of users you want to attract (unwilling to think about problems, unwilling to learn things, unwilling to invest their time and energy) means getting a lot of users who will need a lot of help even with the basic tasks. That means that skilled users will either spend a lot of time helping them or (more likely) will simply stop helping.

In the end, the argument that making this change would cause openSUSE to lose users did not carry the day. Another "helpful" participant repeatedly told the community to implement a microkernel model for filesystems so that security problems could be contained. As of this writing, that suggestion, too, has failed to find a critical mass of support.

The part of the discussion that did go somewhere had to do with what happens when a filesystem fails to mount because the requisite module has been blacklisted. In response, there has been some work done to improve the documentation and the messages emitted by the system when that happens so that users know what is going on and what to do about it. There were also concerns about existing systems that would fail to boot after an update — an outcome that all were able to agree is not entirely optimal. The solution to this problem appears to be a scheme whereby filesystem modules would be removed from the blacklist if they are loaded in the kernel that is running when the update is installed. So a system that is actively using one of the blacklisted filesystem types would continue to have access to it by default.
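
A minimal sketch of that idea, with hypothetical module names, file path, and update hook, could be as simple as:

    # Hypothetical post-update hook: keep any blacklisted filesystem module
    # that the currently running kernel already has loaded.
    for mod in jfs minix affs freevxfs; do
        if grep -qw "^$mod" /proc/modules; then
            sed -i "/^blacklist $mod\$/d" /etc/modprobe.d/60-blacklist-fs.conf
        fi
    done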

As of this writing, a pair of proposals for the actual changes are under consideration. One of them adds the blacklist and the mechanism described above; the other improves the output from a failed mount attempt, noting that the blacklist may be involved. The long discussion has mostly wound down with the proposal looking mostly like it did at the outset. In the end, it seems, the openSUSE community feels that protecting users against attacks on old and unmaintained code is more important than having ancient filesystems work out of the box.

Comments (41 posted)

Concurrency management in BPF

By Jonathan Corbet
February 7, 2019
In the beginning, programs run on the in-kernel BPF virtual machine had no persistent internal state and no data that was shared with any other part of the system. The arrival of eBPF and, in particular, its maps functionality, has changed that situation, though, since a map can be shared between two or more BPF programs as well as with processes running in user space. That sharing naturally leads to concurrency problems, so the BPF developers have found themselves needing to add primitives to manage concurrency (the "exchange and add" or XADD instruction, for example). The next step is the addition of a spinlock mechanism to protect data structures, which has also led to some wider discussions on what the BPF memory model should look like.

A BPF map can be thought of as a sort of array or hash-table data structure. The actual data stored in a map can be of an arbitrary type, including structures. If a complex structure is read from a map while it is being modified, the result may be internally inconsistent, with surprising (and probably unwelcome) results. In an attempt to prevent such problems, Alexei Starovoitov introduced BPF spinlocks in mid-January; after a number of quick review cycles, version 7 of the patch set was applied on February 1. If all goes well, this feature will be included in the 5.1 kernel.

BPF spinlocks

BPF spinlocks can only be placed inside structures that, in turn, are stored in BPF maps. Such structures should contain a field like:

    struct bpf_spin_lock lock;

A BPF spinlock inside a given structure is meant to protect that structure in particular from concurrent access. There is no way to protect other data, such as an entire BPF map, with a single lock.

From the point of view of a BPF program, this lock will behave much like an ordinary kernel spinlock. An example provided with the patch-set cover letter starts by defining a structure containing a counter:

    struct hash_elem {
    	int cnt;
    	struct bpf_spin_lock lock;
    };

Code that is compiled to BPF could then increment the counter atomically with something like the following:

    struct hash_elem *val = bpf_map_lookup_elem(&hash_map, &key);
    if (val) {
    	bpf_spin_lock(&val->lock);
    	val->cnt++;
    	bpf_spin_unlock(&val->lock);
    }

BPF programs run in a restricted environment, so there is naturally a long list of rules that regulate the use of spinlocks. Only certain types of maps (hash, array, and control-group local storage) support spinlocks at all, and only one spinlock is allowed within any given map element. BPF programs can only acquire one lock at a time (to head off deadlock worries), cannot call external functions while holding a lock, and must release a lock prior to returning. Direct access to the struct bpf_spin_lock field by the BPF program is disallowed. A number of other rules apply as well; see this patch for a more complete list.

Access to BPF spinlocks from user space is naturally different; a user-space process cannot be allowed to hold BPF spinlocks for unbounded periods of time since that would be an easy way to lock up a kernel thread. The complex bpf() system call thus does not get the ability to manipulate BPF spinlocks directly. Instead, it gains a new flag (BPF_F_LOCK) that can be added to the BPF_MAP_LOOKUP_ELEM and BPF_MAP_UPDATE_ELEM operations to cause the spinlock contained within the indicated element to be acquired for the duration of the operation. Reading an element does not reveal the contents of the spinlock field, and updating an element will not change that field.
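
As an illustration (this is a sketch, not code from the patch set), a user-space program could copy an element out under its lock using the raw bpf() system call. It assumes a 5.1-era linux/bpf.h that defines BPF_F_LOCK, and a map_fd referring to a map whose value layout, including the spinlock, was described to the kernel with BTF when the map was created (libbpf does that when it loads a program):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct hash_elem {
        int cnt;
        struct bpf_spin_lock lock;
    };

    static long sys_bpf(int cmd, union bpf_attr *attr)
    {
        return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
    }

    /* Copy one element out atomically with respect to BPF-side updates. */
    int read_elem_locked(int map_fd, int key, struct hash_elem *out)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key = (unsigned long)&key;
        attr.value = (unsigned long)out;
        attr.flags = BPF_F_LOCK;    /* hold the element's lock during the copy */

        /* The lock field in *out comes back zeroed; user space never sees
           or modifies the lock's state directly. */
        return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
    }

An update with BPF_F_LOCK works similarly in the other direction: the new value is stored under the element's lock, and whatever the caller placed in the lock field is ignored.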

One implication of this design is that user space cannot use BPF spinlocks to protect complex changes to structures stored in BPF maps; even the simple counter-incrementing example shown above would not be possible, since the lock cannot be held over the full operation (reading the counter, incrementing it, and storing the result). The implicit assumption seems to be that such manipulations will be done on the BPF side, so the locking functionality serves mostly to keep user space from accessing a structure that a BPF program has partially modified. For example, a test program included with the patch set has a BPF portion that repeatedly picks a random value, then sets every element of an array to that value while holding the lock. The user-space side reads that array under lock and verifies that all elements are the same, thus showing that the element was not read in the middle of an update operation.

The patch set has seen a number of changes as the result of review comments. One significant added restriction is that BPF spinlocks cannot be used in tracing or socket-filter programs due to preemption-related issues. Those restrictions seem likely to be lifted in the future, but other types of BPF programs (including most networking-related programs) should be able to use BPF spinlocks once the feature goes upstream.

The BPF memory model

In the conversation around version 4 of the patch set, Peter Zijlstra asked about the overall memory model for BPF. In contemporary systems, there is a lot more to concurrency control than spinlocks, especially when the desire is to minimize the cost of that control. Access to shared data can be complicated by the tendency of modern hardware to cache and reorder memory accesses, with the result that changes made on one CPU can appear in a different order elsewhere. Concurrency-aware code may have to make careful use of memory barriers to ensure that changes are globally visible in the right order.

Such code tends to be tricky when written for a single architecture, but it is further complicated by the fact that, naturally, every CPU type handles these concurrency issues differently. Kernel developers have done a lot of work to hide those differences to the greatest extent possible; details of that work can be found in Documentation/memory-barriers.txt and the formal kernel memory-model specification. All of that work refers to kernel code running natively on the host processor, though, not code running under the BPF virtual machine. As BPF programs run in increasingly concurrent environments, the need to specify the memory model under which they run will grow.

Starovoitov, who remains the leader of the kernel's BPF efforts, has proved resistant to defining the memory model under which BPF programs run:

What I want to avoid is to define the whole execution ordering model upfront. We cannot say that BPF ISA is weakly ordered like alpha. Most of the bpf progs are written and running on x86. We shouldn't twist bpf developer's arm by artificially relaxing memory model. BPF memory model is equal to memory model of underlying architecture. What we can do is to make it bpf progs a bit more portable with smp_rmb instructions, but we must not force weak execution on the developer.

This approach concerns the developers who have gone to a lot of effort to specify what the kernel's memory model should be in general; Will Deacon said outright that "I don't think this is a good approach to take for the future of eBPF". Paul McKenney has suggested that BPF should simply follow the memory model used by the rest of the kernel. Starovoitov doesn't want to do that, though, saying "tldr: not going to sacrifice performance".

That part of the conversation ended without any conclusions beyond a suggestion to talk further about the issue, either in a phone call or at an upcoming conference. It's not clear whether this off-list discussion has happened as of this writing. What seems clear, though, is that these issues are better worked out soon rather than having to be managed in an after-the-fact manner later on. Concurrency issues are hard enough when the underlying rules are well understood; they become nearly impossible when different developers are assuming different rules and code accordingly.

Comments (14 posted)

io_uring, SCM_RIGHTS, and reference-count cycles

By Jonathan Corbet
February 13, 2019
The io_uring mechanism that was described here in January has been through a number of revisions since then; those changes have generally been fixing implementation issues rather than changing the user-space API. In particular, this patch set seems to have received more than the usual amount of security-related review, which can only be a good thing. Security concerns became a bit of an obstacle for io_uring, though, when virtual filesystem (VFS) maintainer Al Viro threatened to veto the merging of the whole thing. It turns out that there were some reference-counting issues that required his unique experience to straighten out.

The VFS layer is a complicated beast; it must manage the complexities of the filesystem namespace in a way that provides the highest possible performance while maintaining security and correctness. Achieving that requires making use of almost all of the locking and concurrency-management mechanisms that the kernel offers, plus a couple more implemented internally. It is fair to say that the number of kernel developers who thoroughly understand how it works is extremely small; indeed, sometimes it seems like Viro is the only one with the full picture.

In keeping with time-honored kernel tradition, little of this complexity is documented, so when Viro gets a moment to write down how some of it works, it's worth paying attention. In a long "brain dump", Viro described how file reference counts are managed, how reference-count cycles can come about, and what the kernel does to break them. For those with the time to beat their brains against it for a while, Viro's explanation (along with a few corrections) is well worth reading. For the rest of us, a lighter version follows.

Reference counts for file structures

The Linux kernel uses the file structure to represent an open file. Every open file descriptor in user space is represented by a file structure in the kernel; in essence, a file descriptor is an index into a table in struct files_struct, where a pointer to the file structure can be found. There is a fair amount of information kept in the file structure, including the current position within the file, the access mode, the file_operations structure, a private_data pointer for use by lower-level code, and more.

Like many kernel data structures, file structures can have multiple references to them outstanding at any given time. As a simple example, passing a file descriptor to dup() will allocate a second file descriptor referring to the same file structure; many other examples exist. The kernel must keep track of these references to be able to know when any given file structure is no longer used and can be freed; that is done using the f_count field. Whenever a reference is created, by calling dup(), forking the process, starting an I/O operation, or any of a number of other ways, f_count must be increased. When a reference is removed, via a call to close() or exit(), for example, f_count is decreased; when it reaches zero, the structure can be freed.
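
That reference counting is easy to observe from user space. In this small sketch (with error handling omitted), both descriptors refer to the same file structure, so they share one file position and the file stays open until the last reference is dropped:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char c;
        int fd1 = open("/etc/hostname", O_RDONLY);
        int fd2 = dup(fd1);     /* same struct file; f_count goes up by one */

        read(fd1, &c, 1);       /* advances the file position shared by both */
        printf("offset seen via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));

        close(fd1);             /* drops one reference; the file stays open */
        read(fd2, &c, 1);       /* still works; f_count only reaches zero at
                                   the second close() */
        close(fd2);
        return 0;
    }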

Various operations within the kernel can create references to file structures; for example, a read() call will hold a reference for the duration of the operation to keep the file structure in existence. Mounting a filesystem contained within a file via the loopback device will create a reference that persists until the filesystem is unmounted again. One important point, though, is that references to file structures are not, directly or indirectly, contained within file structures themselves. That means that any given chain of references cannot be cyclical, which is a good thing. Cycles are the bane of reference-counting schemes; once one is created, none of the objects contained within the cycle will ever see their reference count return to zero without some sort of external intervention. That will prevent those objects from ever being freed.

Enter SCM_RIGHTS

Unfortunately for those of us living in the real world, the situation is not actually as simple as portrayed above. There are indeed cases where cycles of references to file structures can be created, preventing those structures from being freed. This is highly unlikely to happen in the normal operation of the system, but it is something that could be done by a hostile application, so the kernel must be prepared for it.

Unix-domain sockets are used for communication between processes running on the same system; they behave much like pipes, but with some significant differences. One of those is that they support the SCM_RIGHTS control message, which can be used to transmit an open file descriptor from one process to another. This feature is often used to implement request-dispatching systems or security boundaries; one process has the ability to open a given file (or network socket) and make decisions on whether another process should get access to the result. If so, SCM_RIGHTS can be used to create a copy of the file descriptor and pass it to the other end of the Unix-domain connection.

SCM_RIGHTS will obviously create a new reference to the file structure behind the descriptor being passed. This is done when the sendmsg() call is made, and a structure containing pointers to the file structure being passed is attached to the receiving end of the socket. This allows the passing side to immediately close its file descriptor after passing it with SCM_RIGHTS; the reference taken when the operation is queued will keep the file open for as long as it takes the receiving end to accept the new file and take ownership of the reference. Indeed, the receiving side need not have even accepted the connection on the socket yet; the kernel will stash the file structure in a queue and wait until the receiver gets around to asking for it.
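
In user-space terms, the sending side attaches a control message of type SCM_RIGHTS to an ordinary sendmsg() call. The sketch below (error handling omitted, socket setup assumed) passes one descriptor over an already-connected Unix-domain socket:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send fd_to_pass over the connected Unix-domain socket "sock". */
    int send_fd(int sock, int fd_to_pass)
    {
        char dummy = 'x';   /* at least one byte of real data travels along */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {             /* ensures proper alignment for the cmsghdr */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        /* The kernel takes its own reference to the file here; the caller
           can close fd_to_pass as soon as sendmsg() returns. */
        return sendmsg(sock, &msg, 0);
    }

The cycle described below amounts to creating a socketpair(), calling a helper like this on each end to send the other end's descriptor, and then closing both descriptors without ever calling recvmsg().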

Queuing SCM_RIGHTS messages in this way makes things work the way application developers would expect, but it has an interesting side effect: it creates an indirect reference from one file structure to another. The file structure representing the receiving end of an SCM_RIGHTS message, in essence, owns a reference to the file structure transferred in that message until the application accepts it. That has some important implications.

Suppose some process connects to itself via a Unix-domain socket, so it has two file descriptors, call them FD1 and FD2, one corresponding to each end of the connection. It then proceeds to use SCM_RIGHTS to send FD1 to FD2 and the reverse; each file descriptor is sent to the opposite end. We now have a situation where the file structure at each end of the socket indirectly holds a reference to the other — a cycle, in other words. This can work just fine; if the process then accepts the file descriptor sent to either end (or both), the cycle will be broken and all will be well.

If, however, the process closes FD1 and FD2 without accepting the transferred file descriptors, it will remove the only two references to the underlying file structures — except for those that make up the cycle itself. Those file structures will have a permanently elevated reference count and can never be freed. If this happens once as the result of an application bug, there is no great harm done; a small amount of kernel memory will be leaked. If a hostile process does it repeatedly, though, those cycles could eventually consume a great deal of memory.

There are other ways of using SCM_RIGHTS to create this kind of cycle as well. The problem always involves descriptor-passing datagrams that have never been received, though; this fact is used by the kernel to detect and break cycles. When a file structure corresponding to a Unix-domain socket gains a reference from an SCM_RIGHTS datagram, the inflight field of the corresponding unix_sock structure is incremented. If the reference count on the file structure is higher than the inflight count (which is the normal state of affairs), that file has external references and is thus not part of an unreachable cycle.

If, instead, the two counts are equal, that file structure might be part of an unreachable cycle. To determine whether that is the case, the kernel finds the set of all in-flight Unix-domain sockets for which all references are contained in SCM_RIGHTS datagrams (for which f_count and inflight are equal, in other words). It then counts how many references to each of those sockets come from SCM_RIGHTS datagrams attached to sockets in this set. Any socket that has references coming from outside the set is reachable and can be removed from the set. If it is reachable, and if there are any SCM_RIGHTS datagrams waiting to be consumed attached to it, the files contained within those datagrams are also reachable and can be removed from the set.

At the end of an iterative process, the kernel may find itself with a set of in-flight Unix-domain sockets that are only referenced by unconsumed (and unconsumable) SCM_RIGHTS datagrams; at this point, it has a cycle of file structures holding the only references to each other. Removing those datagrams from the queue, releasing the references they hold, and discarding them will break the cycle.
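
That set-reduction pass can be modeled outside the kernel. The toy program below is purely an illustration of the counting logic, not kernel code; queued[i] holds the number of references a socket's unreceived SCM_RIGHTS datagrams have on socket i, and the sample data reproduces the two-socket cycle described above plus one socket that is still held open by a process:

    /* Toy model of the cycle detection described above; an illustration,
     * not the kernel's actual garbage-collection code. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NSOCK 3

    struct tsock {
        int f_count;        /* total references to the socket's file */
        int inflight;       /* references held by unreceived SCM_RIGHTS messages */
        int queued[NSOCK];  /* queued[i]: refs this socket's queue holds on socket i */
    };

    int main(void)
    {
        struct tsock s[NSOCK] = {
            /* Sockets 0 and 1 were sent over each other and then closed,
               so the only reference to each lives in the other's queue. */
            { .f_count = 1, .inflight = 1, .queued = { 0, 1, 0 } },
            { .f_count = 1, .inflight = 1, .queued = { 1, 0, 0 } },
            /* Socket 2 is still held open by a process. */
            { .f_count = 1, .inflight = 0, .queued = { 0, 0, 0 } },
        };
        bool candidate[NSOCK];
        bool changed = true;

        /* Step 1: candidates are sockets whose every reference is in flight. */
        for (int i = 0; i < NSOCK; i++)
            candidate[i] = s[i].f_count && s[i].f_count == s[i].inflight;

        /* Step 2: repeatedly drop any candidate that has a reference coming
           from outside the remaining candidate set; dropping one may expose
           others as reachable on the next pass. */
        while (changed) {
            changed = false;
            for (int i = 0; i < NSOCK; i++) {
                int internal = 0;

                if (!candidate[i])
                    continue;
                for (int j = 0; j < NSOCK; j++)
                    if (candidate[j])
                        internal += s[j].queued[i];
                if (internal < s[i].f_count) {
                    candidate[i] = false;   /* reachable from outside the set */
                    changed = true;
                }
            }
        }

        /* Whatever remains holds only references from unreceivable datagrams;
           the kernel breaks such cycles by discarding those datagrams. */
        for (int i = 0; i < NSOCK; i++)
            if (candidate[i])
                printf("socket %d is part of an unreachable cycle\n", i);
        return 0;
    }

When the loop finishes, only the two sockets whose references come exclusively from each other's unread queues remain in the set; those are exactly the ones whose queued datagrams get discarded.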

As one might imagine, given that the VFS is involved, there is more complexity than has been described above and some gnarly locking issues involved in carrying out these operations. See Viro's message for the gory details.

Fixing io_uring

Among the features provided by io_uring is the ability to "register" one or more files with an open ring; that speeds I/O operations by eliminating the need to acquire and release references to the registered files every time. When a file is registered with an io_uring, the kernel will create and hold a reference for the duration of that registration. This is a useful feature but it contained a problem that, seemingly, only somebody with a Viro-level understanding of the VFS could spot, describe, and fix; it is a new variant on the cycle problem described above. In short: a process could create a Unix-domain socket and register both ends with an io_uring. If it were then to pass the file descriptor corresponding to the io_uring itself over that socket, then close all of the file descriptors, a cycle would be created. The io_uring code was unprepared for that eventuality.

Viro proposed a solution that involves making the file registration mechanism set up the SCM_RIGHTS data structures as if the registered file descriptor were being passed over a Unix-domain socket. There is a useful analogy here; registering a file can be thought of as passing it to the kernel to be operated on directly. Once the setup has been done, the same cycle-breaking logic will find (and fix) cycles created using io_uring structures.

Jens Axboe, the author of io_uring, implemented the solution and verified that it works. With that issue resolved, it appears that the path to merging io_uring in the 5.1 development cycle may be clear. In the process, a bit of light has been shed on a corner of the VFS that few people understand. The problem of a lack of people with a wide understanding of the VFS layer as a whole, though, is likely to come up again; it rather looks like a cycle that we have not yet gotten out of.

Comments (14 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: runc container breakout; ClusterFuzz; GTK+ renamed GTK; LibreOffice 6.2; Plasma 5.15; PyPy 7.0; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds