
Leading items

Welcome to the LWN.net Weekly Edition for October 17, 2019

This edition contains the following feature content:

  • WireGuard and the crypto API: the WireGuard VPN takes a turn toward the kernel's existing crypto code on its way to the mainline.
  • FPGAs and free software: an XDC talk on what the FPGA world could learn from the graphics stack.
  • BPF at Facebook (and beyond): how BPF is used at Facebook, and where it is headed.
  • Finding race conditions with KCSAN: a new tool for finding data races in the kernel.
  • Calibrating your fear of big bad optimizing compilers: using the Linux kernel memory model to analyze code containing plain C-language accesses.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

WireGuard and the crypto API

By Jake Edge
October 16, 2019

When last we looked in on the progress of the WireGuard VPN tunnel toward the mainline kernel, it seemed like the main sticking point had been overcome. The Zinc cryptography API used by WireGuard was generally seen as a duplication of effort with the existing kernel cryptographic algorithms, so an effort to rework Zinc to use that existing code seemed destined to route around that problem and bring WireGuard to the mainline. In the six months since then, though, things have gone fairly quiet in WireGuard-land; that all changed based on a conversation at the recent Kernel Recipes conference in Paris.

WireGuard developer Jason A. Donenfeld posted a message from the conference describing a conversation he had there that included kernel networking maintainer David Miller. In the message, Donenfeld announced that WireGuard would be ported to use the existing crypto API in the interests of getting it upstream—based on Miller's advice. Donenfeld said that he was generally opposed to the idea for a few reasons, but now thinks it would make sense to go that route "and afterwards work evolutionarily to get Zinc into Linux piecemeal". He outlined his concerns about the kernel crypto API:

I've long resisted the idea of porting to the existing crypto API, because I think there are serious problems with it, in terms of primitives, API, performance, and overall safety. I didn't want to ship WireGuard in a form that I thought was sub-optimal from a security perspective, since WireGuard is a security-focused project.

In the message, though, he apparently overstated Miller's opinions somewhat. Miller complained that he was being misquoted and was concerned about Donenfeld's announcement:

I didn't say "must" anything, I suggested this as a more [smooth] and efficient way forward.

I'm also a bit disappointed that you felt the need to so quickly make such an explosive posting to the mailing list when we've just spoken about this amongst ourselves only 20 minutes ago.

Donenfeld said that "explosiveness" was not his intent, but that informing the project and others interested in it was what he was aiming for. It turns out that Ard Biesheuvel, who has been critical of the approach tying WireGuard to Zinc along the way, has been working on a patch series to "incorporate WireGuard into the kernel without relying on a wholesale replacement of the existing crypto stack". He posted the series to the linux-crypto mailing list—seemingly in response to Donenfeld's announcement.

Donenfeld's reply was positive overall, but there were still some fairly strong criticisms of the approach. To start with, he is concerned with the performance of using indirect function calls as part of the handshake process:

In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT [round-trip time] to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective. The protocol succeeds at this goal, but in many ways, just by a hair when at scale, and so I’m really quite loathe to decrease handshake performance.

He is also unhappy with the use of the asynchronous-oriented parts of the crypto API, which was a complaint first raised by Linus Torvalds. Both Torvalds and Donenfeld think that avoiding the asynchronous interface is best, at least for the initial merge. Donenfeld said:

I’d recommend leaving this synchronous as it exists now, as Linus suggested, and we can revisit that later down the road. There are a number of improvements to the async API we could make down the line that could make this viable in WireGuard.

But he definitely wanted to make it clear that he would like to work with Biesheuvel on getting something ready for the mainline:

And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues.

For his part, Biesheuvel seems pleased to get things moving along again. He believes that the crypto API as a whole could benefit from moving away from the dynamic dispatch approach:

This is one of the issues in the 'fix it for everyone else as well' category. If we can improve the crypto API to be less susceptible to these issues (e.g., using static calls), everybody benefits. I'd be happy to collaborate on that.

"Static calls" are a technique that turns indirect function calls into fixed jump instructions, which has performance and other benefits. Peter Zijlstra recently posted a static_call() patch set, which may be getting closer to being merged.

Biesheuvel was a bit surprised that the handshake timing is so sensitive. He suggested that it was simply a performance issue, rather than a security problem. "But the security of any VPN protocol worth its salt should not hinge on the performance delta between the reference C code and a version that was optimized for a particular CPU." Donenfeld, however, is adamant that the fast primitives are required in order to avoid denial-of-service security problems. Since there are other reasons that algorithmic flexibility is needed for WireGuard (though without the indirect function call overhead), the problem needs to be solved anyway, he said.

Based on the feedback on that first approach, Biesheuvel came back with another RFC patch set. It reworked the use of the crypto API so that Torvalds's and Donenfeld's concerns about using the asynchronous interfaces were addressed:

Linus has made it abundantly clear that using the abstract AEAD interface is not acceptable for instantiating a transformation that is known at compile time, so I will abandon that approach for the time being. If anyone turns up with appropriate h/w to run WireGuard in async mode, we might revisit this, but for sync s/w algorithms, a concrete library interface is clearly preferred.

There were few comments on those patches, so Biesheuvel followed up with a v2 of that approach. This time, Donenfeld had some concerns about some of the architecture-specific optimized implementations of the ChaCha20 cipher, particularly with regard to changes made from the Zinc versions. Those were largely hashed out in the thread, but a bigger question surrounded Biesheuvel's switch away from potentially using static calls. In the patch cover letter, he said:

As it turns out, we don't actually need static calls for this. Instead, we can simply permit weak symbol references to go unresolved between modules (as we already do in the kernel itself, due to the fact that ELF permits it), and have the accelerated code live in separate modules that may not be loadable on certain systems, or be blacklisted by the user.
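
For readers unfamiliar with the technique, a weak reference in plain C looks roughly like the following; this is a generic illustration with made-up function names, not the kernel's module-loading mechanics.

    /* If no strong definition of chacha20_neon() is present at link (or,
     * in the kernel case, module-load) time, the weak reference resolves
     * to NULL and the caller falls back to the generic code. */
    void chacha20_generic(unsigned char *dst, const unsigned char *src, unsigned int len);
    void chacha20_neon(unsigned char *dst, const unsigned char *src, unsigned int len) __attribute__((weak));

    void chacha20(unsigned char *dst, const unsigned char *src, unsigned int len)
    {
        if (chacha20_neon)
            chacha20_neon(dst, src, len);
        else
            chacha20_generic(dst, src, len);
    }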

Andy Lutomirski wondered about that mechanism; he pointed out that the symbol resolution was module-loading-order dependent, which might lead to unexpected results:

Won't this have the counterintuitive property that, if you load the modules in the opposite order, the reference won't be re-resolved and performance will silently regress?

I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.

While Biesheuvel agreed that module-loading order should not affect which implementation gets chosen, the fact that static calls are not yet available, and might not be the right approach even if they were, means that he is proceeding without them:

I have disregarded static calls and weak references for the time being, and I will proceed with an implementation that uses neither. The downside of this is that, e.g., we are forced to load the NEON module on non-NEON capable hardware without calling any of its code, but this is basically the Zinc situation only with the NEON and the generic code living in different modules.

That led to the most recent patch posting as of this writing. It has some tweaks here and there and removes the weak-references-based implementation. As he put it in the changelog: "Defer using weak references or static calls until the dust around this has settled". There are some of the expected, minor-sounding comments on the patches, but overall the sense is that this work is close to being ready for merging. The next step is presumably to post it to the linux-kernel mailing list and, if it passes muster there, get it into linux-next by way of the crypto subsystem tree. Donenfeld seemed satisfied that it is close to being merge-ready, to the point that he listed things that were deferred from the original Zinc library that might be taken up again after the merge. He also suggested that WireGuard itself would be up next:

WireGuard - things are quasi-ready, so when the time comes, I look forward to submitting that to Dave's net tree as a single boring [PATCH 1/1].

It would seem that there are no real barriers to the inclusion of this crypto code, at least so far, and the WireGuard code itself has never really been controversial—quite the reverse in fact. One would guess that "boring" patch posting will still require review and, quite possibly, revision, but the biggest hurdle has always been the crypto code. With luck, that has been cleared at this point—though it also kind of looked that way back in March. Mainline WireGuard seems quite plausible for the 5.6 or 5.7 kernel; many are looking forward to that day.

Comments (1 posted)

FPGAs and free software

By Jake Edge
October 16, 2019

XDC

The problems with field-programmable gate arrays (FPGAs) are not exactly an obvious talk topic for a graphics-related conference like the 2019 X.Org Developers Conference (XDC). Ben Widawsky acknowledged that, but said that he sees parallels in the situation with FPGA support in the free-software world and the situation with graphics hardware support in the past. It is his hope that the tools for developing with FPGAs can make the same journey that graphics drivers have made over the last two decades or so.

Widawsky began by saying that attendees must think he knows some pretty important people to be able to present on a non-graphics topic; "I guess you'd be right", he said with a grin—to laughter. By way of an introduction, he said that he worked on the graphics stack at Intel from 2010 until 2018 when he moved into other work at the company. He is now using some of his free time "looking at FPGAs and trying to make a more open stack for FPGAs".

Architecture

[Ben Widawsky]

He spent some time describing the building blocks that make up FPGAs, starting out with lookup tables (LUTs). He gave an example (as can be seen in the PDF slides or YouTube talk video) of a two-input XNOR gate, which is true (1) if both inputs are the same and false (0) otherwise. He then showed how an XNOR gate can be built with three multiplexers (MUXes) and some static RAM (SRAM) to hold the XNOR truth-table values. That same technique could be used to build any two-input gate of interest.

One problem with the LUT-based mechanism is that it is slower than using a "real" gate as the MUXes are driven by a clock, so each level of MUX (e.g. two levels for the two-input XNOR) adds some delay. The FPGA vendors have some optimizations for the mechanism, he said, so the delays are not as bad as they could be with a simple implementation; a four-input truth table, with four levels of MUX, will not have a four-clock delay, for example. Meanwhile, the LUTs can be combined to create "functions" for larger numbers of inputs.
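
As a software analogy, the LUT idea amounts to a simple table lookup, as in the purely illustrative C below (not code from any FPGA toolchain): the "SRAM" holds the truth table and the inputs select an entry, just as the MUX tree does in hardware.

    #include <stdio.h>

    /* A two-input LUT: four "SRAM" bits hold the truth table and the
     * inputs select one of them, which is how an FPGA emulates any
     * two-input gate. */
    static int lut2(const int sram[4], int a, int b)
    {
        return sram[(a << 1) | b];
    }

    int main(void)
    {
        const int xnor_table[4] = { 1, 0, 0, 1 };   /* 00->1, 01->0, 10->0, 11->1 */

        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++)
                printf("%d XNOR %d = %d\n", a, b, lut2(xnor_table, a, b));
        return 0;
    }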

FPGAs aren't just LUTs, however; there is a lot more to them, he said. There are things like flip-flops, adders, and carry chains that can be used to do arithmetic. All of those elements are collected up into logical array blocks (LABs); that is Intel terminology—other FPGA vendors have their own terms for these elements. Multiple LABs are conceptually laid out in a grid with mesh routing between them. One can start to see the complexity inherent in FPGAs by looking at the process of choosing how the functionality is arranged in the FPGA, he said; if functionality that goes together is placed far apart in signal-routing terms, more delay will be introduced.

There is, of course, more to it, Widawsky said. The LAB "network" is connected to general-purpose I/O (GPIO) or other pins on the chip to facilitate data transfer in both directions. Even though you can implement CPUs and other complex devices simply using the programmable logic available in the LABs, today's FPGAs have "hard" CPUs, typically Arm cores, that can be "wired" into the design. There may also be RAM, DSPs, and other hard devices available in the FPGA.

The title of his talk is "Everything Wrong with FPGAs", he said, so what's wrong with the FPGA architecture he described? To start with, the programmable nature of the devices makes them less efficient than a straight hardware implementation. That is inherent in the nature of FPGAs, though, so it is both a blessing and a curse: it allows doing things more easily and flexibly but at the cost of speed.

The problem that really bothers him, though, is that each FPGA vendor does things so completely differently from the others that it is nearly impossible to move to a new vendor. Beyond that, the standards for hardware description languages (HDLs) are loose, unlike the standards for graphics; there is not the same kind of conformance testing, so different tools will accept their own dialects of the languages. Designers will often get a library that implements a device (e.g. memory controller) they need from their vendor; once they get that integrated into their design, they are locked into using that vendor's FPGAs. It is too expensive to do the redesign necessary to switch vendors.

FPGAs differ wildly, but there is not even a standard way to describe the differences between them—a kind of "FPGA floor plan" description. He believes there is an opportunity to come up with an intermediate representation that would allow FPGA users to move more freely between vendors. He likened it to the NIR intermediate representation that is used by Mesa. "You could think of it as the NIR for FPGAs", he said.

Tools

Moving on to the tools for FPGAs, Widawsky said that they are closer to a user-space driver for FPGAs than to tools like debuggers and the like. In the FPGA world, "vendor tools are the norm"; if you are doing anything in production, he said, you will be using proprietary vendor tools. These tools will provide the entire process needed to get a design into an FPGA: synthesis, timing, placing and routing, and emitting bitstreams. He would go into more detail about those steps a bit later in the talk.

He did want to note something that Xilinx is doing that is not open source, exactly, but is "friendly to open source". The company is allowing third-party tools to do the place and route step. That same mechanism can be used by open-source tools.

There are some third-party proprietary tools targeting the front end of the process (synthesis and timing), with Cadence, Mentor, and Synopsys being the big three. Those functions can be performed in a quasi-FPGA-independent way, which is where open-source tools can also fit in. He pointed to Yosys, which is "an incredible project" that has reverse engineered two Lattice FPGAs such that you can "run pretty complex designs on them with an entirely open toolchain".

As of the 5.2 kernel, there are now "kernel interfaces to let you program these FPGAs". In Widawsky's opinion, "this is where things are becoming really terrible". The interface being pushed by Intel is DFL (device feature list, though he cautioned that some of the acronyms have been changing), which allows an FPGA vendor to "list general properties of the FPGA if you wanted to use it as an accelerator". There are kernel drivers to read this data and to expose it to user space. But, as far as he knows, DFL is only used by Intel.

That means there is a "giant framework" in drivers/fpga that is only there to serve one vendor, he said. DFL gets wrapped with the open programmable acceleration engine (OPAE) interface; in graphics terms, DFL is the direct rendering manager (DRM), while OPAE is libdrm, providing a thin wrapper around DFL/DRM and providing easier access to the underlying interface. Beyond that are roughly 20 different FPGA access drivers—providing read, write, and erase functionality—for specific FPGA models. "The directory is a bit of a nightmare, I think."

He then moved on to the process of turning HDL code into the bitstream needed to program an FPGA. You start with HDL, likely Verilog or SystemVerilog, which gets turned into a netlist in the synthesis step. That, in turn, gets placed and routed, then turned into a bitstream that can be downloaded to the FPGA. Two of those operations are quite complex: synthesis is NP-complete, while "place and route in the real world is NP hard". That means there are quite a few interesting technical challenges for any compiler writers in the room who might be getting tired of shader compilers, he said with a grin.
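
For a concrete feel for those stages, an open flow for a Lattice iCE40 part, using Yosys together with the nextpnr and icestorm projects (assuming they are installed), looks roughly like this; the file names and flags are illustrative and board-dependent:

    # Synthesis: Verilog in, netlist (JSON) out
    yosys -p 'synth_ice40 -top top -json top.json' top.v

    # Place and route, with pin constraints from top.pcf
    nextpnr-ice40 --hx8k --json top.json --pcf top.pcf --asc top.asc

    # Bitstream assembly and upload to the device
    icepack top.asc top.bin
    iceprog top.bin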

He went through a simplified example of synthesis and then described the place-and-route process. Handling the placement and routing of the logic is where the "secret sauce" for FPGA vendors is. It is a complex task and vendors have tuned their devices for specific workloads in order to have the right internal components on the fast path for their customers' designs.

Placing is the process of finding a place on the FPGA to put various parts of the logic in the design. There are constraints on where things can go based on the timing between them, but there is also a desire to minimize the FPGA space the design occupies. There is also a need to determine how to use the hard CPUs and other on-chip devices. Routing is the process of hooking everything up: making the connections between LABs, to and from the I/O pins, and so on.

There is no one optimal solution, so the tools typically allow the user to specify what they want to optimize. They can optimize for the smallest area, the highest performance, which typically means the shortest distances, or the lowest power usage. The information used to make those kinds of decisions (the FPGA floor plan) is not public, so much of the effort by free tools is spent trying to figure that out, he said.

The bitstream assembly is pretty straightforward. The results of the previous phases are encoded into the bitstream in unsurprising ways. There are some special cases, including encrypted bitstreams, which might be controversial with some, and compressed bitstreams, which are important because the bitstreams can be large and take a long time to transfer. There is some fanciness that can happen, he said, such as uploading part of the design that implements the decompression algorithm and sending the rest of the bitstream compressed. Overall, though, it is a "pretty linear translation" of the output of the earlier stages into the bitstream.

Returning to his "what's wrong with that?" theme, he noted that the tools are proprietary and are "drastically different" between hardware vendors. That means it takes a lot of effort to get up and running with a particular vendor's tools—which helps lock people in once that investment is made. The tools also require running some binary-blob license server, which is cumbersome and somewhat worrisome from a security perspective.

Customer support for the tools is provided by the hardware vendor, which can make it difficult to solve any problems you might have, he said. You can try to strace the tools, for example, but it doesn't help much in his experience. A look around the web will also easily find some horror stories about the support provided by the vendors. Customers cannot improve or even debug the tool flow; they are at the mercy of the vendor. Any kind of automation requires using Tcl-based interfaces that are not well documented and are not good interfaces in his opinion.

Beyond all of that, the high barrier to entry means that students typically only learn one vendor's tools—whichever vendor gave free hardware and software to their university. If they start to work for a company that uses a different FPGA vendor, they may not really be up to speed on that toolchain. That sometimes leads companies to buy most of the design in the form of IP blocks from the vendor, which is not inherently a bad thing, Widawsky said, but it does limit the scope of the designs if the designers are mostly relegated to connecting IP blocks together.

Use cases

He put up a graph of the growth of the FPGA market, but cautioned that it was really only the shape of the graph that was of interest since that seems to be roughly consistent while the overall numbers vary depending on the source. The growth was fairly flat from 1984, when Xilinx introduced its first FPGA, until the late 1990s, when it started to pick up some. There is a clear inflection point at 2010, which is roughly when the recent artificial intelligence push got going. While the increase in transistor density has helped vendors create highly capable FPGAs, he believes the main cause of the growth acceleration is that a problem was finally found where FPGAs were really needed. It was, to some extent, a solution looking for a problem prior to that.

The traditional (vertical) use case for an FPGA is as part of a device that gets shipped or for a proof of concept. If you are shipping something in a low-enough volume (he has heard three-million units, but is not sure that number is right), it makes sense to use an FPGA rather than creating an application-specific integrated circuit (ASIC). It also makes sense when the hardware is implementing a standard that is still changing, such as a media codec; a field upgrade can be done to the FPGA to pick up any changes. Another common use case is to validate the design of an ASIC before committing it to silicon; FPGAs are much faster than software simulators.

Another use case is the "FPGA as an accelerator" (or FPGAAAA). The idea is that "there is an FPGA somewhere and it helps you do things fast". There is runtime support available, such as Intel's OpenVINO, OpenStack Cyborg, or Kubernetes device plugins, to allow access to parts of the FPGA by an application in order to accelerate it. The runtimes facilitate getting the accelerator logic into the device and routing the calls needing acceleration to the FPGA; there is, he thinks, some overlap with GPU-based acceleration techniques. In addition, these days cloud providers have started offering raw access to FPGAs.

"FPGA as an AI accelerator" (or FPGAAAAA, he said to laughs) is less common these days. He pointed to Microsoft's Project Brainwave as an example. The company has created an FPGA-based neural processing unit (NPU) that can be accessed via the Azure cloud. One of the problems with this approach is that it can take on the order of a week to do synthesis for an NPU, which makes the overhead rather daunting. He believes there is a large potential for this use case in the future.

But there are a number of problems with all of that, he said. Even though the FPGA market is projected to be ten-billion dollars in 2022, that is still a pretty small market. Because of that, it tends to be expensive to get real hardware to experiment with, so open-source people are not able to crack the market without spending an enormous sum. The vertical FPGA developers have already learned a particular toolchain and are not interested in having different tools, even if they are better, so the interest in open-source solutions is fairly low.

Beyond that, OPAE is a nice open standard; trying to standardize the interface so that FPGAs can be used more is the right idea, he said. Unfortunately, the interface itself depends on proprietary stuff underneath. In graphics terms, there was no requirement that the user-space tools be open source before merging support for a particular FPGA, as Dave Airlie required for the Linux graphics stack. Since there was no requirement, all of the user-space tools are proprietary.

The tool makers within companies like Intel believe that the open-source community is not capable of building something as complex as an FPGA toolchain, Widawsky said. There is a lot of misinformation floating around within companies. For example, many insiders believe that the proprietary compilers sold by Intel and Arm are popular and make lots of money; when you try to explain that, overall, developers prefer to use GCC and LLVM, they think you are either lying or don't know what you are talking about, he said. The "FPGA way" has not changed much since it began in the mid-1980s; nobody has really been able to shake that community up and get it moving in an open-source direction.

Comparisons with GPUs

He then compared FPGAs with devices much better known at XDC: graphics processing units (GPUs). He began with the similarities between them. As with a GPU, an FPGA is a dense set of transistors providing more power than a CPU. There is a small set of vendors that produce FPGAs and there is a lot of proprietary software involved. In fact, all of the forums where questions are answered about FPGAs refer to the proprietary tools and how to use them. Aside from the folks working on Yosys, no one is trying to improve the process or tools; they are all just trying to use them to kick out a product.

The documentation is lacking for what is needed to create tools, such as the floor-plan information that has to be reverse engineered. In his case, he could get the documentation for Intel products, but could not share it, and he was not allowed to reverse engineer his own company's products; so even if there are sympathetic employees at a particular vendor, their hands may be tied.

Nearly all of the code uploaded to FPGAs is a binary blob of one sort or another; customers have no idea how most of it works. But there is a dedicated community that is reverse engineering the floor plans of various FPGAs. There are a few different projects, including Project Trellis that targets Lattice FPGAs. There is also a substantial academic interest in making better tools.

There are some dissimilarities as well. For one, there are no strict standard APIs for FPGAs, so there are no real conformance tests, which is unlike the graphics world. Yosys and nextpnr (for place and route) should be the Mesa of FPGAs; they would define an interface that would work across all of the different FPGA hardware. It is not that there are alternative open-source projects that vendors would rather rally around, Widawsky said; there just aren't enough people pushing for open source in this space at all. Something that would help is to have a piglit-like test suite for the FPGA space: something that can be run quickly to show that the tools can make a design that runs on a particular FPGA.

The kernel interfaces for FPGAs are in their infancy, while those for graphics devices are pretty well established at this point. Intel has taken the lead to try to consolidate the interfaces, but no one else is really following along. Separate from having open-source implementations of various accelerators (e.g. gzip), he would like to see the interface to that code be common across vendors. The code that runs on the devices is a complete black box and the emphasis has been on getting the black box to the device, not on what's in the black box—or at least having a common way to interface to the black boxes.

There is little pull for open tools, he said. He was told that no big customer was asking for open tools, so why would a vendor spend time making them? "That's a fair point", he said, but large companies will find it much easier to create their devices if they can have visibility into the tools. He has heard anecdotally that SpaceX had serious trouble debugging its Xilinx-based FPGA designs. He suggested that large customers of the FPGA vendors start asking for open tools to help avoid those kinds of problems.

There are lessons that can be learned from graphics, he said. He touched on some of them earlier, like enforcing an open user space before accepting patches for a driver and not adding interfaces that only support one vendor's hardware. It is important for the FPGA vendors to recognize that the open-source community can solve difficult engineering problems—it already has in many areas.

Call to action

He ended with a "call to action" suggesting that those interested give Yosys a try—"do something cool"—and help improve it. He said that a handwavy estimate of the processing power available in an FPGA means that current FPGAs could theoretically build an Intel Sandy Bridge processor. "Go do that, find the bugs, make it work with the open tools." The Yosys community is easy to work with, though he didn't mean the talk to be a "rah rah Yosys talk". But he was able to get a few simple patches into the tool; it is something that anyone interested in open FPGA tools should look at.

Another area that needs work is to improve place and route, which in practice means to do more reverse engineering of FPGAs. There are multiple projects out there doing that. Also, if writing standards is of interest, the bitstreams and floor plans could use some work of that sort; maybe standards are simply not possible for those things, but no one has really looked into it. There is also the linux-fpga mailing list for those that want to keep an eye on what's going on in the kernel; it is a low-traffic list, so it is not hard to follow. With that, Widawsky was done with his talk; he took a few questions and comments from attendees, most of which were pointing to additional efforts in this space.

[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to Montréal for XDC.]

Comments (16 posted)

BPF at Facebook (and beyond)

By Jonathan Corbet
October 10, 2019

Kernel Recipes

It is no secret that much of the work on the in-kernel BPF virtual machine and associated user-space support code is being done at Facebook. But less is known about how Facebook is actually using BPF. At Kernel Recipes 2019, BPF developer Alexei Starovoitov described a bit of that work, though even he admitted that he didn't know what most of the BPF programs running there were doing. He also summarized recent developments with BPF and some near-future work.

Kernels at Facebook

[Alexei Starovoitov]

Facebook, he began, has an upstream-first philosophy, taken to an extreme; the company tries not to carry any out-of-tree patches at all. All work done at Facebook is meant to go upstream as soon as it practically can. The company also runs recent kernels, upgrading whenever possible. The company can move to a new kernel in a matter of days; this process could be faster, he said, except that it still takes some time to reboot thousands of servers. As of just before the talk, most of the Facebook fleet was running 4.16, with a few 4.11 machines hanging around and some at 5.2.

He pointed out the lack of long-term-support kernels in the above list. Facebook does not plan to stay with any given kernel for a long time, so the company doesn't care about long-term support. Instead, machines are simply upgraded to whichever kernel is available. Within a given version, though, there can be a fair amount of variation across the fleet; the kernel team evidently backports features into older kernels when the need arises. That can create challenges for applications and, especially, BPF-based applications.

The first rule of kernel development is "don't break user space"; anything that might cause a user-space program to fail becomes part of the kernel ABI. Performance regressions are included in this rule. Performance problems are easy to create, so Facebook needs a team to track them down. Often, it seems, BPF is fingered as the cause of these problems.

Starovoitov asked the audience to guess how many BPF programs were running on their laptops, then to run this command:

    ls -la /proc/*/fd | grep bpf-prog | wc -l

The answer on your editor's system is six, all running from systemd. He was surprised by the answer at Facebook: there are about 40 BPF programs running on each server, with another 100 that are demand loaded. There are many teams within the company writing and deploying these programs; the kernel team doesn't even know about all of these BPF programs. These programs are about evenly split between those attached to kprobes, those attached to tracepoints, and network scheduling-class helpers; about 10% fall into other categories.
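
Another way to poke around, assuming a reasonably recent bpftool is installed, is to ask it to list the loaded programs directly (this requires root):

    sudo bpftool prog show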

He gave a few examples of performance issues that, at least on the surface, were caused by BPF:

  • Facebook runs a packet-capture daemon that makes use of a network scheduling-class BPF program; it occasionally spits out a packet for inspection. On new kernels, running that daemon regressed overall system performance by about 1%. It turns out that this daemon uses another BPF program, attached to a kprobe, for a different purpose. The function that probe attaches to didn't exist in the newer kernel, causing the daemon to conclude that BPF as a whole was broken; it then fell back to an older, slower method for packet capture. Kprobes are not a stable ABI, Starovoitov said, but when kernel developers change a function, kprobe usage can still require somebody to investigate the resulting breakage.
  • The number-one performance-analysis tool at Facebook is a profiling daemon that attaches BPF programs to tracepoints and kprobes in the scheduler and beyond. On new kernels, it caused a 2% performance regression, manifesting as an increase in software-interrupt time. It turns out that, in the 5.2 kernel, setting a kprobe causes the text section to be remapped from 2MB huge pages to normal 4KB pages, with a resulting increase in TLB misses and decrease in performance.
  • There is a security monitoring daemon that sets BPF programs on three kprobes and one kretprobe. It runs at low priority, waking up every few seconds and consuming about 0.01% of the CPU. This daemon was causing huge latency spikes for the high-priority database application. Some tracing work showed that, on occasion, a memcpy() call in the database could stall for as much as 1/4 second while this daemon was reading its /proc/pid/environ file. Much more tracing showed that this daemon was acquiring the mmap_sem lock when reading that /proc file, then being scheduled out for long periods of time, blocking page faults in the main application. The root cause was a basic priority-inversion issue; raising the security daemon's priority prevents this problem.

The takeaway from all of these episodes — and especially the last one — is that the best tool for tracking down BPF-related performance regressions is BPF.

Current and future BPF improvements

Another kind of problem results from how BPF programs are built. A user-space application will contain one or more BPF programs to be loaded into the kernel. These programs are written in C and compiled to the BPF virtual machine instruction set; this compilation happens on the target system. To ensure that the compilation can be done consistently, a version of the LLVM compiler is embedded in the application itself. This makes the applications big, and the compilation process can perturb the main workload on the target system. The compilation can also take a long time, since it is done at a low priority; several minutes to compile a 100-line program is not unusual. The system headers needed to understand kernel data structures may be missing from the target system, creating compilation failures. It is a pain point, he said.

The solution to this problem is to be found in the "compile once, run everywhere" work that reached a milestone with the 5.4 kernel. It uses the BPF type format (BTF) data describing kernel data structures that was created for just this purpose. With BTF provided by the kernel, there is no longer any need to ship kernel headers around; instead, the bpftool utility just extracts the BTF data and creates a "monster header file" on the target system. An LLVM built-in function has been added to preserve the offsets into structures used by BPF programs; those offsets are then "relocated" at load time to match the version of the structure used in the target kernel.
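
In practice, that header-generation step looks something like the following; it assumes a kernel built with BTF support exposed at /sys/kernel/btf/vmlinux and a recent bpftool:

    # Dump the running kernel's type information as a C header; BPF
    # programs can then include "vmlinux.h" rather than shipping kernel
    # headers to the target system.
    bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h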

A number of other interesting projects have made progress in 2019, he said. Support for bounded loops in the verifier was added to 5.3 after two years of work. BPF programs can now manage concurrency with spinlocks, with the verifier proving that these programs will not deadlock. Dead-code elimination has been added, and scalar precision tracking as well.

Starovoitov said that people often complain that the BPF verifier is painful to deal with. But, he said, it is far smarter than the LLVM compiler, and a number of advantages come from that, starting with the ability to prove that a program is safe to load into the kernel. The verifier is also able to perform far better dead-code elimination than LLVM can.

In the future, the verifier is set to get better by making more use of the available BTF data. Every program type, for example, must implement its own boilerplate functions to provide (and check) access to the context object passed to the programs themselves. This code bloats the kernel, he said, and tends to be prone to bugs. With BTF, those functions will no longer be necessary; the verifier can use the BTF data to check programs directly. That will enable the removal of 4,000 lines of code, he said.

He concluded by saying that BPF development is "100% driven by use cases"; the way to shape its future direction is to show the ways in which new features can be useful. Even better, of course, is to hack new extensions and to share them with the community.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]

Comments (7 posted)

Finding race conditions with KCSAN

By Jonathan Corbet
October 14, 2019
Race conditions can be some of the trickiest bugs to find. The resulting problems can be subtle, and reproducing the problem in order to track it down can be difficult or impossible; often code inserted to narrow down a race condition will cause it to stop manifesting entirely. A tool that can find race conditions automatically would thus be a valuable thing for the kernel community to have. In late September, Marco Elver announced a tool called KCSAN (the Kernel Concurrency Sanitizer) that does exactly that — and which has already found a number of real problems.

A common cause of race conditions is unserialized access to shared data leading to inconsistent views of the state of the world. Such problems can be hard to identify with code inspection, since they involve the interaction of multiple sites within the code itself and generally require a rare type of bad timing. But an automated tool might be able to note when this kind of access happens and flag a problem even if the system manages not to misbehave as a result. KCSAN finds potential problems by monitoring access to memory locations and identifying patterns where:

  • multiple threads of execution access the location,
  • those accesses are unordered — not protected by a lock, for example, and,
  • at least one of those accesses is a write.

Needless to say, watching every memory location that the kernel accesses would be a massive task, so KCSAN takes a statistical approach in the hope of stumbling across problems eventually.
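
As a minimal (and purely user-space) illustration of the pattern being described, the following program has two threads accessing the same variable with plain loads and stores, no ordering between them, and one of the accesses is a write; KCSAN looks for exactly this shape of access in kernel code.

    #include <pthread.h>
    #include <stdio.h>

    static int counter;                 /* shared, unprotected */

    static void *writer(void *arg)
    {
        counter = 42;                   /* plain store */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, writer, NULL);
        printf("%d\n", counter);        /* plain load, racing with the store */
        pthread_join(t, NULL);
        return 0;
    }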

How it works

The first step is to compile the kernel with the -fsanitize=thread option, which is supported by both GCC and Clang. This will cause the compiled code to be instrumented to allow the monitoring of its memory accesses. Specifically, each memory access will be augmented by a function call; if the program reads a four-byte quantity at addr, for example, the generated code will first make a call to __tsan_read4(addr). The monitoring code provides these __tsan_readN() and __tsan_writeN() functions, which can then do something useful with the access patterns they see.

In the case of KCSAN, these function calls are simply ignored 1,999 out of 2,000 times; to do otherwise would slow the kernel to a point of complete unusability. On the 2,000th time, though, KCSAN keeps an eye on the address for a period of time, looking for other accesses. While running in the context of the thread where the access was performed, KCSAN will set a "watchpoint", which is done by recording the address, the size of the data access, and whether the access was a write in a small table. This thread will then simply delay for (by default) 10µs.

The above picture is simplified somewhat; there are a couple of exceptions to keep in mind. The first of those is that, before deciding whether to ignore an access, KCSAN looks to see if there is already a watchpoint established for the address in question. If so, and if either the current access or the access that created the watchpoint is a write, then a race condition has been detected and a report will be sent to the system log.

In the absence of a watchpoint, the code will check whether the current access is being performed in an atomic context (using KCSAN's definition, which is a bit different from what the rest of the kernel uses) before deciding whether to ignore the access or not. An atomic access, thus, will not result in the creation of a watchpoint, but if one already exists then the code is accessing the data location in question in both atomic and non-atomic ways, which rarely leads to good things.

Meanwhile, the original thread is delaying after having set the watchpoint. At the end of the delay period, the watchpoint will be deleted and monitoring of that address stops. But before execution continues, the value at the accessed address will be checked; if it has changed since the watchpoint was set, a race condition is once again deemed to have occurred.

Naturally, the above story leaves out some details, but that is the core of the algorithm used. One would expect it to miss a lot of races, since it is only looking at a fraction of the kernel's memory accesses and only watches any given location for a short period of time. But, if run for long enough, KCSAN does indeed appear to be able to find race conditions that have escaped the developers of the code in question.
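
Putting those pieces together, the core of the algorithm can be sketched in C-like pseudocode as follows; every identifier here is hypothetical, and the real implementation differs in many details:

    /* Called from the instrumentation hooks (__tsan_readN()/__tsan_writeN()). */
    void check_access(unsigned long addr, size_t size, bool is_write)
    {
        struct watchpoint *other = find_watchpoint(addr, size);

        if (other) {
            /* Another thread is watching this address. */
            if (is_write || other->is_write)
                report_race(addr);          /* concurrent accesses, one a write */
            return;
        }

        if (is_atomic_context() || !one_in_n(2000))
            return;                         /* ignore most accesses */

        struct watchpoint *wp = set_watchpoint(addr, size, is_write);
        u64 before = read_value(addr, size);

        udelay(10);                         /* wait for a racing access */

        if (read_value(addr, size) != before)
            report_race(addr);              /* the value changed under us */
        remove_watchpoint(wp);
    }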

KCSAN in action

The syzbot system has started adding KCSAN to its fuzzing runs, resulting in some immediate bug reports. The first of those was in the kernel's "taskstats" subsystem; that resulted in a fix from Christian Brauner that has not yet made it into the mainline. After five revisions, though, chances are that it should be close to being ready.

Few parts of the kernel have been more closely scrutinized for concurrency issues than the read-copy-update (RCU) subsystem, which is itself a concurrency-management mechanism. On October 7, KCSAN found a data race in RCU; various patches have been circulated to fix it but, once again, nothing has found its way to the mainline yet.

As might be expected, KCSAN also produces warnings in situations that are not actual race conditions. Silencing those warnings requires annotating the code, often by marking the accesses in question with READ_ONCE() or WRITE_ONCE(). Some developers have already raised concerns that KCSAN reports will lead to a flurry of "fixes" from developers who may not fully understand the code they are working with. Even false positives can have their value, though; while addressing some reports in the TCP code, Eric Dumazet discovered and fixed a real race that was added with the TCP fast open mechanism in 2012.

KCSAN is a new tool that still has some rough edges to it. Some time will be required to smooth those down and for the development community as a whole to figure out how to integrate its reports into the development process. There is clear value, though, in a tool that can find this kind of obscure problem; it should lead to more reliable kernels in the future.

Comments (none posted)

Calibrating your fear of big bad optimizing compilers

October 11, 2019

(Many contributors)

This article was contributed by Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel Lustig, Luc Maranget, Paul E. McKenney, Andrea Parri, Nicholas Piggin, Alan Stern, Akira Yokosawa, and Peter Zijlstra.

As noted earlier, when compiling Linux-kernel code that does a plain C-language load or store, as in "a=b", the C standard grants the compiler the right to assume that the affected variables are neither accessed nor modified by any other thread at the time of that load or store. The compiler is therefore permitted to carry out a surprisingly large number of optimizations, any number of which might ruin your concurrent code's day. Given that current compilers usually do not emit diagnostics warning of potential ruined days, it would be good to have other tools take on this task. One such tool is the Kernel Thread Sanitizer (KTSAN), but its great strength, the ability to analyze huge bodies of code such as the Linux kernel, is also its great weakness, namely the need to use approximate (though still quite good) analysis techniques.

Quick Quiz 1: But we shouldn't be afraid at all for things like on-stack or per-CPU variables, right?
Answer

What is needed is a tool that can do exact analyses of huge bodies of code, but unfortunately the universe is under no obligation to give us what we think we need. We have therefore upgraded the Linux Kernel Memory Model (LKMM) to do exact analyses of small bodies of code, and this upgrade was accepted into the Linux kernel during the v5.3 merge window. The challenge of doing exact analyses of large bodies of code thus remains open, but in the meantime we have another useful tool at our disposal.

The following sections describe how to use this upgrade to LKMM:

  1. Goals and non-goals
  2. A plain example
  3. A less-plain example
  4. Locking
  5. Reference counting
  6. Read-copy update (RCU)
  7. Debug output
  8. Access-marking policies
  9. Limitations
  10. Summary

This is followed by the inevitable answers to the quick quizzes.

Goals and non-goals

TL;DR: The goal of LKMM is to help people understand Linux-kernel concurrency.

A key point from the first article is that we simply do not have a full catalog of all compiler optimizations that currently exist, let alone of all possible compiler options that might one day exist. Furthermore, the kernel must, in many cases, live outside of the bounds of the C standard, and thus cannot simply make direct use of the C11 memory model. The details of the compiler, of the compiler options used in Linux-kernel builds, and all of the architecture-specific code therefore impinge on LKMM—and vice versa.

Because of all of these complications and uncertainties, LKMM cannot possibly be a cut-and-dried judge and juror for all Linux-kernel memory-ordering matters. It should instead be seen as an advisor. Now, in generic, non-performance-critical code, it might be wise to pay extremely close attention to LKMM's advice. On the other hand, developers writing fastpath code might need to take an LKMM warning as a sign of the need to be careful, rather than as an error message to be fixed at all costs.

With that in mind, let's look at some LKMM litmus-test examples involving plain C-language accesses.

A plain example

Consider the following message-passing litmus test using plain C-language accesses and read and write memory barriers:

Litmus Test #1
  1 C C-MP+p-wmb-p+p-rmb-p
  2
  3 {
  4 }
  5
  6 P0(int *x0, int *x1)
  7 {
  8         *x0 = 1;
  9         smp_wmb();
 10         *x1 = 1;
 11 }
 12
 13 P1(int *x0, int *x1)
 14 {
 15         int r1;
 16         int r2;
 17
 18         r1 = *x1;
 19         smp_rmb();
 20         r2 = *x0;
 21 }
 22
 23 exists (1:r1=1 /\ 1:r2=0)

As always, this litmus test can be run from within the kernel's tools/memory-model directory using the following command:

herd7 -conf linux-kernel.cfg /path/to/litmus/tests/C-MP+p-wmb-p+p-rmb-p.litmus

Here, the "/path/to/litmus/tests" is of course replaced by the path to the directory containing your litmus tests. (See the README file in that same directory for installation instructions and an earlier LWN article for more details on litmus tests.) The output of this command is shown below:

Outcome for Litmus Test #1 (linux-kernel model)
 1 Test C-MP+p-wmb-p+p-rmb-p Allowed
 2 States 4
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=0;
 6 1:r1=1; 1:r2=1;
 7 Ok
 8 Witnesses
 9 Positive: 1 Negative: 3
10 Flag data-race
11 Condition exists (1:r1=1 /\ 1:r2=0)
12 Observation C-MP+p-wmb-p+p-rmb-p Sometimes 1 3
13 Time C-MP+p-wmb-p+p-rmb-p 0.00
14 Hash=055863a755bfaf3667f1667e6d660349
Quick Quiz 2: But suppose only one of the accesses to a given variable is a plain C-language access, and that access is the only store. Should this be considered a data race?
Answer

Quick Quiz 3: Why the cop-out? Why not just do the work required to ensure that the list of states and the Observation line are all accurate?
Answer

Line 10 contains the key advice: "Flag data-race". This advice means that the Observation line's verdict is untrustworthy and that the list of states on lines 3-6 is unreliable. The problem is that this litmus test has at least one data race, meaning that there are multiple concurrent accesses to a given variable, at least one of which is a plain C-language access and at least one of which is a store.

The variable x1 is clearly subject to a data race because it can be accessed concurrently by a pair of plain accesses, at least one of which is a write. However, x1 only ever takes on the values zero and one, so the data race on that variable might be tolerable, at least assuming a healthy fear of the big bad optimizing compiler. But what if we want to check for other data races?

The solution is to tell LKMM that you are excluding x1 from data-race flagging. One way to do this is to add READ_ONCE() and WRITE_ONCE() to the litmus test (as opposed to your Linux-kernel code), preferably with a comment explaining the situation:

Litmus Test #2
  1 C C-MP+p-wmb-o+o-rmb-p
  2
  3 {
  4 }
  5
  6 P0(int *x0, int *x1)
  7 {
  8         *x0 = 1;
  9         smp_wmb();
 10         WRITE_ONCE(*x1, 1); // Tolerate data race
 11 }
 12
 13 P1(int *x0, int *x1)
 14 {
 15         int r1;
 16         int r2;
 17
 18         r1 = READ_ONCE(*x1); // Tolerate data race
 19         smp_rmb();
 20         r2 = *x0;
 21 }
 22
 23 exists (1:r1=1 /\ 1:r2=0)

LKMM still flags a data race:

Outcome for Litmus Test #2 (linux-kernel model)
 1 Test C-MP+p-wmb-o+o-rmb-p Allowed
 2 States 3
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Flag data-race
10 Condition exists (1:r1=1 /\ 1:r2=0)
11 Observation C-MP+p-wmb-o+o-rmb-p Never 0 3
12 Time C-MP+p-wmb-o+o-rmb-p 0.00
13 Hash=743f0171133035c53a5a29972b0ba0fd

The reason for this is that the plain C-language accesses to x0 can also execute concurrently, and one of these accesses is a write. We could fix this by also marking the x0 accesses with READ_ONCE() and WRITE_ONCE(), but an alternative is to avoid concurrency on x0 as follows:

Litmus Test #3
  1 C C-MP+p-wmb-o+o+ctrl-rmb-p
  2
  3 {
  4 }
  5
  6 P0(int *x0, int *x1)
  7 {
  8         *x0 = 1;
  9         smp_wmb();
 10         WRITE_ONCE(*x1, 1); // Tolerate data race
 11 }
 12
 13 P1(int *x0, int *x1)
 14 {
 15         int r1;
 16         int r2;
 17
 18         r1 = READ_ONCE(*x1); // Tolerate data race
 19         if (r1) {
 20                 smp_rmb();
 21                 r2 = *x0;
 22         }
 23 }
 24
 25 exists (1:r1=1 /\ 1:r2=0)

The if statement on line 19, when combined with the smp_wmb() on line 9, is intended to guarantee that lines 8 and 21 never execute concurrently. Running the model yields:

Outcome for Litmus Test #3 (linux-kernel model)
 1 Test C-MP+p-wmb-o+o+ctrl-rmb-p Allowed
 2 States 2
 3 1:r1=0; 1:r2=0;
 4 1:r1=1; 1:r2=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r1=1 /\ 1:r2=0)
 9 Observation C-MP+p-wmb-o+o+ctrl-rmb-p Never 0 2
10 Time C-MP+p-wmb-o+o+ctrl-rmb-p 0.00
11 Hash=01fe003cd2759d9284d40c081007c282

Quick Quiz 4: But the outcome "1:r1=0; 1:r2=1;" also disappeared. Why?
Answer
There is no longer a data race flagged, and the cyclic outcome no longer occurs. Therefore, if we are willing to live in sufficient fear of the compiler to keep accesses to x1 sane on the one hand and if we add a conditional check protecting x0 on the other, we can obtain the required outcome.

In this case, because of the smp_wmb() and the fact that there was only a single use of the value read, all that was needed to tell LKMM about a tolerated data race was to use WRITE_ONCE() and READ_ONCE() in the corresponding litmus test. Unfortunately, some situations require a bit more work, as will be hinted at in the next section.

A less-plain example

The compiler has great freedom to optimize plain accesses, and with this great freedom comes great complexity. To see this, consider the following litmus test:

Litmus Test #4
  1 C C-read-multiuse
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8         *a = 1;
  9 }
 10
 11 P1(int *a, int *b, int *c)
 12 {
 13         int r1;
 14
 15         r1 = *a;
 16         *b = r1;
 17         *c = r1;
 18 }
 19
 20 locations [1:r1; a; b; c]
 21 exists(b=1 /\ c=0)

As we should by now expect, this results in a data race:

Outcome for Litmus Test #4 (linux-kernel model)
 1 Test C-read-multiuse Allowed
 2 States 2
 3 1:r1=0; a=1; b=0; c=0;
 4 1:r1=1; a=1; b=1; c=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Flag data-race
 9 Condition exists (b=1 /\ c=0)
10 Observation C-read-multiuse Never 0 2
11 Time C-read-multiuse 0.00
12 Hash=0cab074d9a510f141aae9026ce447828

Creating a data-race-tolerant litmus test is straightforward:

Litmus Test #5
  1 C C-read-multiuse-drt1
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8         WRITE_ONCE(*a, 1); // Tolerate data race
  9 }
 10
 11 P1(int *a, int *b, int *c)
 12 {
 13         int r1;
 14
 15         r1 = READ_ONCE(*a); // Tolerate data race
 16         *b = r1;
 17         *c = r1;
 18 }
 19
 20 locations [1:r1; a; b; c]
 21 exists(b=1 /\ c=0)

And this both tolerates the data race and excludes the undesirable outcome:

Outcome for Litmus Test #5 (linux-kernel model)
 1 Test C-read-multiuse-drt1 Allowed
 2 States 2
 3 1:r1=0; a=1; b=0; c=0;
 4 1:r1=1; a=1; b=1; c=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (b=1 /\ c=0)
 9 Observation C-read-multiuse-drt1 Never 0 2
10 Time C-read-multiuse-drt1 0.00
11 Hash=96b3ae01a3c486885df1aec4d978bad9

Job done, right?

Not quite, courtesy of the aforementioned complexity. Recall that compilers can invent loads. Therefore, a better translation to a data-race-tolerant litmus test would duplicate the load from shared variable a as follows:

Litmus Test #6
  1 C C-read-multiuse-drt2
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8         WRITE_ONCE(*a, 1); // Tolerate data race
  9 }
 10
 11 P1(int *a, int *b, int *c)
 12 {
 13         int r1;
 14         int r2;
 15
 16         r1 = READ_ONCE(*a); // Tolerate data race
 17         *b = r1;
 18         r2 = READ_ONCE(*a); // Tolerate data race
 19         *c = r2;
 20 }
 21
 22 locations [1:r1; a; b; c]
 23 exists(b=1 /\ c=0)

And this still tolerates the data race and excludes the undesirable outcome:

Outcome for Litmus Test #6 (linux-kernel model)
 1 Test C-read-multiuse-drt2 Allowed
 2 States 3
 3 1:r1=0; a=1; b=0; c=0;
 4 1:r1=0; a=1; b=0; c=1;
 5 1:r1=1; a=1; b=1; c=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (b=1 /\ c=0)
10 Observation C-read-multiuse-drt2 Never 0 3
11 Time C-read-multiuse-drt2 0.01
12 Hash=17ff8b2e2c285776994d4488fcdcd3bb

So now the job is done, right?

Still not quite. Recall that compilers can reorder code. And there is nothing telling the compiler that the store to b needs to precede the store to c; additionally, the fact that the actual code uses a plain C-language load from shared variable a allows the compiler to assume (incorrectly, in this case) that shared variable a isn't changing. We therefore need to account for that with another data-race-tolerant litmus test that does the reordering, for example, as follows:

Litmus Test #7
  1 C C-read-multiuse-drt3
  2
  3 {
  4 }
  5
  6 P0(int *a)
  7 {
  8         WRITE_ONCE(*a, 1); // Tolerate data race
  9 }
 10
 11 P1(int *a, int *b, int *c)
 12 {
 13         int r1;
 14         int r2;
 15
 16         r2 = READ_ONCE(*a); // Tolerate data race
 17         *c = r2;
 18         r1 = READ_ONCE(*a); // Tolerate data race
 19         *b = r1;
 20 }
 21
 22 locations [1:r1; a; b; c]
 23 exists(b=1 /\ c=0)

This still tolerates the data race, but allows the undesirable outcome:

Outcome for Litmus Test #7 (linux-kernel model)
 1 Test C-read-multiuse-drt3 Allowed
 2 States 3
 3 1:r1=0; a=1; b=0; c=0;
 4 1:r1=1; a=1; b=1; c=0;
 5 1:r1=1; a=1; b=1; c=1;
 6 Ok
 7 Witnesses
 8 Positive: 1 Negative: 2
 9 Condition exists (b=1 /\ c=0)
10 Observation C-read-multiuse-drt3 Sometimes 1 2
11 Time C-read-multiuse-drt3 0.01
12 Hash=61f32f3a79e57808d348f31f5800ae1d
Quick Quiz 5: But wait! Given that READ_ONCE() provides no ordering, why would Litmus Test #5 avoid the undesirable outcome?
Answer
This example illustrates the need to carefully think through the possible compiler optimizations when using plain C-language loads and stores in situations involving data races. It also illustrates the possible need to use multiple litmus tests to fully analyze the possible outcomes.
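
To make this concrete, here is a sketch (with hypothetical function names, not code taken from the litmus tests themselves) of the kind of transformation a compiler is permitted to make when the load from a is plain, and which Litmus Tests #6 and #7 conservatively model:

    /* P1() of Litmus Test #4, as written. */
    void p1_as_written(int *a, int *b, int *c)
    {
            int r1;

            r1 = *a;  /* Plain load. */
            *b = r1;
            *c = r1;
    }

    /*
     * One legal transformation: because the load from *a is plain, the
     * compiler may assume that *a does not change, reload it for each
     * use (an invented load), and emit the store to *c before the store
     * to *b.  Litmus Test #7 models exactly this possibility.
     */
    void p1_as_transformed(int *a, int *b, int *c)
    {
            *c = *a;  /* Invented load; store to *c now comes first. */
            *b = *a;
    }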

So should use of plain C-language loads and stores for shared variables be completely abolished throughout the kernel?

Absolutely not.

For one thing, it is often the case that a given plain C-language load and store will be data-race free, as illustrated by the next few sections.

Locking

The prevalence and usefulness of locking, particularly on systems with modest numbers of CPUs, is one reason why C and C++ did not add concurrency features for the first few decades of their existence. We should therefore expect fully locked concurrent code to do fine with plain C-language accesses. For example, consider this fully locked store-buffering litmus test:

Litmus Test #8
  1 C C-SB+l-p-p-u+l-p-p-u
  2
  3 {
  4 }
  5
  6 P0(int *x0, int *x1, spinlock_t *s)
  7 {
  8         int r1;
  9
 10         spin_lock(s);
 11         *x0 = 1;
 12         r1 = *x1;
 13         spin_unlock(s);
 14 }
 15
 16 P1(int *x0, int *x1, spinlock_t *s)
 17 {
 18         int r1;
 19
 20         spin_lock(s);
 21         *x1 = 1;
 22         r1 = *x0;
 23         spin_unlock(s);
 24 }
 25
 26 exists (0:r1=0 /\ 1:r1=0)

As expected, LKMM shows that this litmus test produces only the expected serialized outcomes, and does so without data races:

Outcome for Litmus Test #8 (linux-kernel model)
 1 Test C-SB+l-p-p-u+l-p-p-u Allowed
 2 States 2
 3 0:r1=0; 1:r1=1;
 4 0:r1=1; 1:r1=0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (0:r1=0 /\ 1:r1=0)
 9 Observation C-SB+l-p-p-u+l-p-p-u Never 0 2
10 Time C-SB+l-p-p-u+l-p-p-u 0.01
11 Hash=a1b190dd8375d869bc8826836e05f943

But locking is not the only data-race-free synchronization primitive.

Reference counting

Reference counting has, if anything, been in use longer than has locking. It should therefore be no surprise that it is possible to atomically manipulate reference counts in such a way as to permit plain C-language access to shared variables. One approach uses atomic_dec_and_test() so that the task that decrements the reference count to zero owns all the data. The following (fanciful) litmus test illustrates this design pattern:

Litmus Test #9
  1 C C-SB+p-rc-p-p+p-rc-p-p
  2
  3 {
  4         atomic_t rc=2;
  5 }
  6
  7 P0(int *x0, int *x1, atomic_t *rc)
  8 {
  9         int r0;
 10         int r1;
 11
 12         *x0 = 1;
 13         if (atomic_dec_and_test(rc)) {
 14                 r0 = *x0;
 15                 r1 = *x1;
 16         }
 17 }
 18
 19 P1(int *x0, int *x1, atomic_t *rc)
 20 {
 21         int r0;
 22         int r1;
 23
 24         *x1 = 1;
 25         if (atomic_dec_and_test(rc)) {
 26                 r0 = *x0;
 27                 r1 = *x1;
 28         }
 29 }
 30
 31 exists ~((0:r0=1 /\ 0:r1=1 /\ 1:r0=0 /\ 1:r1=0) \/
 32          (0:r0=0 /\ 0:r1=0 /\ 1:r0=1 /\ 1:r1=1))

Initially, each process owns its variable, that is, P0() owns x0 and P1() owns x1. The reference count rc is initialized to the value 2 on line 4, indicating that both processes still own their respective variables. Each process updates its variable (lines 12 and 24), then releases its reference on rc (lines 13 and 25). The "winning" process that decrements rc to zero reads out both values, so its r0 and r1 both end up equal to 1. The "losing" process's local variables remain zero. The exists clause on line 31 verifies this relationship:

Outcome for Litmus Test #9 (linux-kernel model)
 1 Test C-SB+p-rc-p-p+p-rc-p-p Allowed
 2 States 2
 3 0:r0=0; 0:r1=0; 1:r0=1; 1:r1=1;
 4 0:r0=1; 0:r1=1; 1:r0=0; 1:r1=0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (not (0:r0=1 /\ 0:r1=1 /\ 1:r0=0 /\ 1:r1=0 \/ 0:r0=0 /\ 0:r1=0 /\ 1:r0=1 /\ 1:r1=1))
 9 Observation C-SB+p-rc-p-p+p-rc-p-p Never 0 2
10 Time C-SB+p-rc-p-p+p-rc-p-p 0.02
11 Hash=7692409758270a77b577b11ab7cca3e3

Quick Quiz 6: I have seen plain C-language incrementing and decrementing of reference counters. How can that possibly work?
Answer
As expected, there are no data races and the "winner" always safely reads out the data.
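
For reference, the "task that decrements the count to zero owns the data" pattern might look something like the following kernel-style sketch. The structure and function names here are hypothetical rather than taken from actual kernel code:

    #include <linux/atomic.h>
    #include <linux/printk.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct shared_result {      /* Hypothetical structure. */
            int x0;
            int x1;
            atomic_t rc;        /* Initialized to 2 elsewhere. */
    };

    /* Each of the two parties calls this exactly once. */
    static void publish_half(struct shared_result *sr, bool first)
    {
            if (first)
                    sr->x0 = 1;  /* Plain store: this field is still ours. */
            else
                    sr->x1 = 1;

            if (atomic_dec_and_test(&sr->rc)) {
                    /* The count reached zero: we now own the whole structure. */
                    pr_info("results: %d %d\n", sr->x0, sr->x1);  /* Plain loads. */
                    kfree(sr);
            }
    }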

Read-copy update (RCU)

A common RCU use case inserts items into a linked data structure. If these items are initialized prior to insertion and if the non-pointer fields are never changed after insertion, then plain C-language accesses can be used throughout, as illustrated by this litmus test:

Litmus Test #10
  1 C C-MP+p-rap+rl-rd-p-rul
  2
  3 {
  4         int z=42;  (* Initial garbage value *)
  5         int y=2;
  6         int *x=&y; (* x is the list head; initially it points to y *)
  7 }
  8
  9 P0(int **x, int *y, int *z)
 10 {
 11         *z = 1;
 12         rcu_assign_pointer(*x, z); // Now x points to z.
 13 }
 14
 15 P1(int **x, int *y, int *z)
 16 {
 17         int *r1;
 18         int r2;
 19
 20         rcu_read_lock();
 21         r1 = rcu_dereference(*x); // Pick up list head.
 22         r2 = *r1; // Pick up value.
 23         rcu_read_unlock();
 24 }
 25
 26 locations [x; y; z]
 27 exists (1:r1=z /\ 1:r2=42) (* Better not be pre-initialization value!!! *)

Quick Quiz 7: But given that Litmus Test #10 has no synchronize_rcu(), what is the purpose of the rcu_read_lock() on line 20 and the rcu_read_unlock() on line 23?
Answer
(Note that herd requires (*a different comment syntax*) in the initializer portion of the file.) Lines 5 and 6 abuse herd7 to create a rudimentary linked data structure, with x initially referencing y. P0() initializes z on line 11 and links it into the list on line 12. P1() then picks up pointer x and dereferences it within the confines of an RCU read-side critical section in the normal manner.

This avoids data races and also prevents P1() from seeing pre-initialization garbage in z, which LKMM confirms:

Outcome for Litmus Test #10 (linux-kernel model)
 1 Test C-MP+p-rap+rl-rd-p-rul Allowed
 2 States 2
 3 1:r1=y; 1:r2=2; x=z; y=2; z=1;
 4 1:r1=z; 1:r2=1; x=z; y=2; z=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r1=z /\ 1:r2=42)
 9 Observation C-MP+p-rap+rl-rd-p-rul Never 0 2
10 Time C-MP+p-rap+rl-rd-p-rul 0.01
11 Hash=fbe83006932079946732b23c5af9033d
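
In real kernel code, the insertion pattern abstracted by Litmus Test #10 might look something like the sketch below. The structure, global pointer, and function names are hypothetical:

    #include <linux/errno.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {                        /* Hypothetical structure. */
            int a;
    };

    static struct foo __rcu *gp;        /* Hypothetical published pointer. */

    /* Updater: fully initialize, then publish with rcu_assign_pointer(). */
    static int publish_foo(int value)
    {
            struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

            if (!p)
                    return -ENOMEM;
            p->a = value;               /* Plain store: not yet reachable. */
            rcu_assign_pointer(gp, p);  /* Plays the role of line 12 above. */
            return 0;
    }

    /* Reader: plain loads of fields that never change after insertion. */
    static int read_foo(void)
    {
            struct foo *p;
            int val = -1;

            rcu_read_lock();
            p = rcu_dereference(gp);    /* Plays the role of line 21 above. */
            if (p)
                    val = p->a;         /* Plain load. */
            rcu_read_unlock();
            return val;
    }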

It is also necessary to remove data from linked data structures and free them, at least if we are to avoid memory leaks. This process is illustrated with a similar litmus test:

Litmus Test #11
  1 C C-MP+rap-sync-p+rl-rd-rd-rul
  2
  3 {
  4         int z=1;
  5         int y=2;
  6         int *x=&y; (* x is the list head; initially it points to y *)
  7 }
  8
  9 P0(int **x, int *y, int *z)
 10 {
 11         rcu_assign_pointer(*x, z); // Now x points to z.
 12         synchronize_rcu();
 13         *y = 0; // Emulate kfree(y).
 14 }
 15
 16 P1(int **x, int *y, int *z)
 17 {
 18         int *r1;
 19         int r2;
 20
 21         rcu_read_lock();
 22         r1 = rcu_dereference(*x); // Pick up list head.
 23         r2 = *r1; // Pick up value.
 24         rcu_read_unlock();
 25 }
 26
 27 locations [1:r1; x; y; z]
 28 exists (1:r2=0) (* Better not be freed!!! *)

This again avoids data races and also prevents readers from creating a use-after-free bug:

Outcome for Litmus Test #11 (linux-kernel model)
 1 Test C-MP+rap-sync-p+rl-rd-rd-rul Allowed
 2 States 2
 3 1:r1=y; 1:r2=2; x=z; y=0; z=1;
 4 1:r1=z; 1:r2=1; x=z; y=0; z=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r2=0)
 9 Observation C-MP+rap-sync-p+rl-rd-rd-rul Never 0 2
10 Time C-MP+rap-sync-p+rl-rd-rd-rul 0.00
11 Hash=abfbb3196e583a4f3945a3e3846442b0
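
The corresponding removal path, which Litmus Test #11 models by emulating kfree() with a plain store of zero, might be sketched as follows, continuing the hypothetical struct foo and gp from the previous sketch and adding an equally hypothetical foo_lock to serialize updaters:

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(foo_lock);   /* Hypothetical update-side lock. */

    static void retract_foo(void)
    {
            struct foo *p;

            spin_lock(&foo_lock);
            p = rcu_dereference_protected(gp, lockdep_is_held(&foo_lock));
            rcu_assign_pointer(gp, NULL);       /* Unpublish. */
            spin_unlock(&foo_lock);

            synchronize_rcu();  /* Wait for pre-existing readers to finish. */
            kfree(p);           /* Litmus Test #11 emulates this with "*y = 0". */
    }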

It is also possible to use synchronize_rcu() as an extremely heavy memory barrier, gaining an effect similar to Litmus Test #3:

Litmus Test #12
  1 C C-MP+p-sync-o+rl-o-ctrl-p-rul
  2
  3 {
  4 }
  5
  6 P0(int *x0, int *x1)
  7 {
  8         *x0 = 1;
  9         synchronize_rcu();
 10         WRITE_ONCE(*x1, 1);
 11 }
 12
 13 P1(int *x0, int *x1)
 14 {
 15         int r1;
 16         int r2;
 17
 18         rcu_read_lock();
 19         r1 = READ_ONCE(*x1);
 20         if (r1) {
 21                 r2 = *x0;
 22         }
 23         rcu_read_unlock();
 24 }
 25
 26 exists (1:r1=1 /\ 1:r2=0)

LKMM confirms that this avoids both data races and the undesirable outcome flagged by the exists clause:

Outcome for Litmus Test #12 (linux-kernel model)
 1 Test C-MP+p-sync-o+rl-o-ctrl-p-rul Allowed
 2 States 2
 3 1:r1=0; 1:r2=0;
 4 1:r1=1; 1:r2=1;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 2
 8 Condition exists (1:r1=1 /\ 1:r2=0)
 9 Observation C-MP+p-sync-o+rl-o-ctrl-p-rul Never 0 2
10 Time C-MP+p-sync-o+rl-o-ctrl-p-rul 0.00
11 Hash=7672d2fc273055d3dcf0fc68801e113a

In short, these past few sections have shown a few of the great many situations in which plain C-language accesses to shared variables are perfectly safe and legitimate. However, Mr. Murphy guarantees that complications can always arise, and one such complication is presented in the following section.

Debug output

Suppose that you have a nice simple reference-counting scheme like the one shown in Litmus Test #9, but you need to add a printk() or an event trace to help debug some related problem. The straightforward approach might result in the following:

Litmus Test #13
  1 C C-SBr+p-rc-p-p+p-rc-p-p+p
  2
  3 {
  4         atomic_t rc=2;
  5 }
  6
  7 P0(int *x0, int *x1, atomic_t *rc)
  8 {
  9         int r0;
 10         int r1;
 11
 12         *x0 = 1;
 13         if (atomic_dec_and_test(rc)) {
 14                 r0 = *x0;
 15                 r1 = *x1;
 16         }
 17 }
 18
 19 P1(int *x0, int *x1, atomic_t *rc)
 20 {
 21         int r0;
 22         int r1;
 23
 24         *x1 = 1;
 25         if (atomic_dec_and_test(rc)) {
 26                 r0 = *x0;
 27                 r1 = *x1;
 28         }
 29 }
 30
 31 P2(int *x0, int *x1) // Emulate debug output.
 32 {
 33         int r0;
 34         int r1;
 35
 36         r0 = *x0;
 37         r1 = *x1;
 38 }
 39
 40 exists ~(0:r0=0:r1 /\ 1:r0=1:r1 /\ ~(0:r0=1:r0) /\ ~(0:r0=1:r1))

Unfortunately, this straightforward approach results in data races:

Outcome for Litmus Test #13 (linux-kernel model)
 1 Test C-SBr+p-rc-p-p+p-rc-p-p+p Allowed
 2 States 2
 3 0:r0=0; 0:r1=0; 1:r0=1; 1:r1=1;
 4 0:r0=1; 0:r1=1; 1:r0=0; 1:r1=0;
 5 No
 6 Witnesses
 7 Positive: 0 Negative: 8
 8 Flag data-race
 9 Condition exists (not (0:r0=0:r1 /\ 1:r0=1:r1 /\ not (0:r0=1:r0) /\ not (0:r0=1:r1)))
10 Observation C-SBr+p-rc-p-p+p-rc-p-p+p Never 0 8
11 Time C-SBr+p-rc-p-p+p-rc-p-p+p 0.05
12 Hash=f752f7f65493e036e734709bdb9233be

One response to this situation would be to quickly apply WRITE_ONCE() to all the now-conflicting plain C-language stores, namely those on lines 12 and 24. However, in larger code bases, the typos that are likely to ensue might introduce bugs, and frequent applying and removing of WRITE_ONCE() would certainly be a maintenance nightmare.

Another response would be to take a look at the table in the first article in this series and to note that the only common-case store transformation is store fusing, which should not be a problem for otherwise correct concurrent code. This suggests that use of READ_ONCE() is much more important than use of WRITE_ONCE(), a position recently taken and later reiterated by none other than Linus Torvalds.
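
Under that position, the marking applied to Litmus Test #13 would go on the racing loads in the debug-output process rather than on the stores, perhaps as in this sketch:

    P2(int *x0, int *x1) // Emulate debug output.
    {
            int r0;
            int r1;

            r0 = READ_ONCE(*x0);
            r1 = READ_ONCE(*x1);
    }

    // Note that with the stores in P0() and P1() left as plain accesses,
    // LKMM still flags data races between those stores and these marked
    // loads; this policy simply tolerates them on the theory that store
    // fusing is the only common-case store transformation.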

Some might attack this position as a head-in-the-sand denial of the C standard, but as Torvalds pointed out:

If such garbage happens, we need to fix the compiler, the same way we already do with

-fno-strict-aliasing
-fno-delete-null-pointer-checks
-fno-strict-overflow

because all those "optimizations" are just fundamentally unsafe and wrong.

In short, the kernel community will be taking on the responsibility of ensuring that new releases of the compiler won't break their code, and taking appropriate actions to deal with any problematic optimizations that might appear. For an example, consider this one, which has a fix for volatile on the way. And, to be fair, this community has had its share of successes, so that many past problems have been resolved.

But what about those of us who would prefer not to live in quite so much fear of the big bad optimizing compiler? Again quoting Torvalds:

In the end, the most common reason for a WRITE_ONCE() is mostly just "to visually pair up with the non-synchronized read that uses READ_ONCE()".

Quick Quiz 8: Are you sure that all situations involving data races need at least one READ_ONCE()?
Answer
Thus, if you want to use WRITE_ONCE() for a data-racy store, Torvalds is OK with that as long as there is a corresponding READ_ONCE() of that same variable.

Access-marking policies

One approach is to mark all accesses that are subject to data races, both loads and stores. To this end, here is a list of situations allowing plain loads and stores for some accesses to a given variable, while requiring markings (such as READ_ONCE() and WRITE_ONCE()) for other accesses to that same variable:

  1. A shared variable is only modified by a given owning CPU or thread, but is read by other CPUs or threads. All stores must use WRITE_ONCE(). The owning CPU or thread may use plain loads. Everything else must use READ_ONCE() for loads. (A sketch of this situation appears after this list.)
  2. A shared variable is only modified while holding a given lock, but is read by code not holding that lock. All stores must use WRITE_ONCE(). CPUs or threads holding the lock may use plain loads. Everything else must use READ_ONCE() for loads.
  3. A shared variable is only modified while holding a given lock by a given owning CPU or thread, but is read by other CPUs or threads or by code not holding that lock. All stores must use WRITE_ONCE(). The owning CPU or thread may use plain loads, as may any CPU or thread holding the lock. Everything else must use READ_ONCE() for loads.
  4. A shared variable is only accessed by a given CPU or thread and by a signal or interrupt handler running in that CPU's or thread's context. The handler can use plain loads and stores, as can any code that has prevented the handler from being invoked, that is, code that has blocked signals and/or interrupts. All other code must use READ_ONCE() and WRITE_ONCE().
  5. A shared variable is only accessed by a given CPU or thread and by a signal or interrupt handler running in that CPU's or thread's context, and the handler always restores the values of any variables that it has written before return. The handler can use plain loads and stores, as can any code that has prevented the handler from being invoked, that is, code that has blocked signals and/or interrupts. All other code can use plain loads, but must use WRITE_ONCE() to prevent store tearing, store fusing, and invented stores.
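
As promised in the first item above, here is a sketch of a variable owned by a single thread but read by others; the structure and function names are hypothetical:

    #include <linux/compiler.h>

    struct owned_counter {      /* Hypothetical structure. */
            int count;          /* Modified only by the owning thread. */
    };

    /* Runs only in the owning thread. */
    static void owner_increment(struct owned_counter *oc)
    {
            int v = oc->count;              /* The owner may use plain loads. */

            WRITE_ONCE(oc->count, v + 1);   /* All stores must be marked. */
    }

    /* May run in any thread. */
    static int anyone_read(struct owned_counter *oc)
    {
            return READ_ONCE(oc->count);    /* Non-owners must mark their loads. */
    }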

Quick Quiz 9: What needs to happen if an interrupt or signal handler might itself be interrupted?
Answer
In most other cases, loads from and stores to a shared variable must use READ_ONCE() and WRITE_ONCE() or stronger, respectively. But it bears repeating that neither READ_ONCE() nor WRITE_ONCE() provides any ordering guarantees other than within the compiler.

Other developers might choose to omit the WRITE_ONCE() calls (perhaps give or take stores of constants), but otherwise follow the advice shown above. Still other developers might choose more aggressive paths.

LKMM now handles plain loads and stores, but at the end of the day, coding decisions are of course in the hands of the developers and maintainers.

Limitations

Many of the limitations called out in the 2017 LKMM LWN article still apply, but they do bear repeating:

  1. Multiple access sizes are not supported.
  2. Partially overlapping access sizes are not supported.
  3. Nontrivial variables, including arrays and structures, are not supported. However, trivial linked lists are still supported.
  4. Dynamic memory allocation is not supported.
  5. Exceptions and interrupts are not supported.
  6. I/O, including DMA, is not supported.
  7. Self-modifying code is not supported, despite its being used in a surprisingly large number of places in the kernel, including the "alternative" mechanism, the function tracer, the eBPF JIT compiler, and the module loader.
  8. This memory model does not constitute an official statement by the various CPU vendors on their respective architectures, despite some of their employees being on the author list. For example, any of these vendors might report bugs at any time against any version of this memory model. This memory model is therefore not a substitute for a carefully designed and vigorously executed validation regime. In addition, this memory model is under active development and might change at any time. Finally, despite much welcome progress, there are still quite a few CPU families lacking formal memory models.
  9. It is quite possible that this memory model will disagree with CPU architectures or with real hardware. For example, the model might well choose to allow behavior that all CPUs forbid if forbidding that behavior would render the model excessively complex. On the other hand, any situation where the model forbids behavior that some CPU allows constitutes a bug, either in the model or in the CPU. However, there is work underway to automatically check that LKMM's verdict on specific litmus tests is compatible with herd-based formal models for the underlying hardware.
  10. This tool is exponential in nature. Litmus tests that seem quite small compared to the entire Linux kernel might well take geologic time for the herd tool to analyze. That said, this tool can be extremely effective in exhaustively analyzing the code at the core of a synchronization primitive.
  11. The herd tool can only detect problems for which you have coded an assertion. This weakness is common to all formal methods, and is one reason that we expect testing to continue to be important. In the immortal words of Donald Knuth (scroll to the bottom): "Beware of bugs in the above code; I have only proved it correct, not tried it."

However, there have been some areas of progress since 2017:

  1. Limited arithmetic is now available.
  2. A great many read-modify-write atomic operations are now supported, but the kernel community creates new ones at a surprisingly rapid rate. Therefore, it is unlikely that LKMM will be able to claim full coverage of all the kernel's read-modify-write atomic operations, at least not for very long.
  3. SRCU is now supported. However, the modeling of RCU and SRCU still excludes asynchronous grace-period primitives such as call_rcu() and srcu_barrier().
  4. Locking is now supported, but only exclusive locking. Reader-writer locking is still on the to-do list, as is the recently proposed double-lock that can have multiple readers or multiple writers, but not both readers and writers at the same time.
  5. Plain C-language accesses are now supported, and compiler optimizations are modeled by (conservatively) flagging data races.

A number of LKMM's limitations appear to be inherent, but the good news is that significant progress has been made. However, the conservative treatment of data races means that LKMM should not be put forward as the final arbiter of what constitutes correct use of plain C-language accesses. As noted in the first article, in some cases developers and maintainers will continue to need to live with a healthy fear of the big bad optimizing compiler.

Summary

This article has demonstrated LKMM's new ability to handle plain C-language accesses using a pair of extended examples, and then shown how locking, reference counting, and RCU can be used in conjunction with plain accesses in a data-race-free manner. It then looked at complications due to debug output, presented a couple of possible access-marking policies, and finally presented an updated view of LKMM's limitations and progress.

This article showed that, although LKMM takes an expansive view of the concept of data races, it is easy to force it to model less expansive views. One approach is to mark selected accesses with READ_ONCE() or WRITE_ONCE() in the litmus test. For cases where the compiler has more freedom, it may be necessary to supply a number of litmus tests, each representing a different set of possible compiler optimizations.

LKMM is thus useful for modeling plain C-language accesses across a wide range of access-marking policies, and should thus be useful to a wide range of developers and maintainers.

Acknowledgments

As before, we owe thanks to a surprisingly large number of compiler writers and members of the C and C++ standards committees who introduced us to some of the things a big bad optimizing compiler can do, and to SeongJae Park for his help in rendering an earlier draft of the access-marking policies section human-readable. We are also grateful to Kara Todd for her support of this effort.

Answers to quick quizzes

Quick Quiz 1: But we shouldn't be afraid at all for things like on-stack or per-CPU variables, right?

Answer: Although on-stack and per-CPU variables are often guaranteed to be untouched by other CPUs and tasks, the kernel really does allow them to be concurrently accessed in many cases. You do have to go out of your way to make this happen, say by explicitly passing the address of such a variable to another thread, but it's certainly not impossible.

For example, the _wait_rcu_gp() function uses an on-stack __rs_array[] array of rcu_synchronize structures which, in turn, contain rcu_head and completion structures. The address of the rcu_head structure is passed to call_rcu(), which results in concurrent accesses to this structure, and eventually also to the completion structure.

Similar access patterns may be found for per-CPU variables.

Back to Quick Quiz 1.

Quick Quiz 2: But suppose only one of the accesses to a given variable is a plain C-language access, and that access is the only store. Should this be considered a data race?

Answer: Yes, this would definitely be a data race.

The plain C-language access might be subject to store tearing, code reordering, invented stores, and store-to-load transformations. Any of these transformations could ruin your concurrent algorithm's day.

However, as will be discussed in the Debug output section, there is some reason to believe that compilers are less likely to adversely optimize stores than loads. It all depends on how much fear you need in your life.

Back to Quick Quiz 2.

Quick Quiz 3: Why the cop-out? Why not just do the work required to ensure that the list of states and the Observation line are all accurate?

Answer: Because providing perfect accuracy would require perfect foresight, in particular, knowing exactly what optimizations all compilers, past, present, and future, might apply to this code. Worse yet, this list of optimizations will vary from compiler to compiler, and even with different command-line options for the same compiler. Therefore, LKMM's only real alternative is to throw up its hands and warn the user of the data race.

Back to Quick Quiz 3.

Quick Quiz 4: But the outcome "1:r1=0; 1:r2=1;" also disappeared. Why?

Answer: If r1 is zero, the load into r2 never happens, which means that r2 retains its initial value of zero. The outcome "1:r1=0; 1:r2=1;" therefore cannot happen.

Had we instead marked accesses to x0 in Litmus Test #2 with READ_ONCE() and WRITE_ONCE(), all three outcomes would still be possible. (But don't take our word for it, try it out!)

Back to Quick Quiz 4.

Quick Quiz 5: But wait! Given that READ_ONCE() provides no ordering, why would Litmus Test #5 avoid the undesirable outcome?

Answer: First, please note that READ_ONCE() really does provide some ordering, namely, it prevents the compiler from reordering the READ_ONCE() with any other marked access. However, this restriction in no way applies to the CPU.

But CPU-level ordering is not required because, as the name suggests, READ_ONCE(*a) is guaranteed to read *a only once, and thus to store the same value to both b and c. Because the undesirable outcome shown in the exists clause requires different final values for b and c, READ_ONCE() suffices despite its lack of ordering properties.

Back to Quick Quiz 5.

Quick Quiz 6: I have seen plain C-language incrementing and decrementing of reference counters. How can that possibly work?

Answer: It can work if the reference counter in question is protected by a lock. But yes, if there is no lock, then incrementing and decrementing of reference counters must be done via the handy and convenient atomic operations that the kernel provides for this purpose.
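
For example, a lock-protected counter along the following lines (with hypothetical names) is perfectly safe with plain accesses:

    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_SPINLOCK(obj_lock);   /* Hypothetical lock... */
    static int obj_refs;                /* ...protecting this counter. */

    static void obj_get(void)
    {
            spin_lock(&obj_lock);
            obj_refs++;                 /* Plain read-modify-write under the lock. */
            spin_unlock(&obj_lock);
    }

    static bool obj_put(void)
    {
            bool last;

            spin_lock(&obj_lock);
            last = (--obj_refs == 0);
            spin_unlock(&obj_lock);
            return last;                /* Caller cleans up on the last reference. */
    }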

Back to Quick Quiz 6.

Quick Quiz 7: But given that Litmus Test #10 has no synchronize_rcu(), what is the purpose of the rcu_read_lock() on line 20 and the rcu_read_unlock() on line 23?

Answer: In theory, they could be omitted because there is in fact no synchronize_rcu() for them to interact with. They nevertheless serve as valuable documentation. In addition, their presence could save considerable time and frustration should someone later add an updater to this litmus test.

Back to Quick Quiz 7.

Quick Quiz 8: Are you sure that all situations involving data races need at least one READ_ONCE()?

Answer: First, developers who are concerned about store tearing, fused stores, invented stores, or store-to-load transformations will consider a situation involving concurrent stores to the same variable to be a data race if at least one of them is a plain C-language store, regardless of READ_ONCE().

But suppose that the variable is both loaded from and stored to, and that there are concurrent stores which might reasonably be marked with WRITE_ONCE() by sufficiently fearful developers. Are we sure that at least one of the loads from that variable will need to use READ_ONCE()?

We should be reasonably sure, but of course not absolutely sure. For example, in theory, all of the stores to a given variable could be executed while read-holding a given reader-writer lock and all loads could be executed while write-holding that same lock. In this case, only the stores are involved in data races. In practice, what would be the use case for such a thing?

Back to Quick Quiz 8.

Quick Quiz 9: What needs to happen if an interrupt or signal handler might itself be interrupted?

Answer: Then that interrupt handler must follow the same rules that are followed by other interrupted code. Only those handlers that cannot be themselves interrupted or that access no variables shared with an interrupting handler may safely use plain accesses, and even then only if those variables cannot be concurrently accessed by some other CPU or thread.

Back to Quick Quiz 9.

Comments (11 posted)

Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds