
LWN.net Weekly Edition for September 10, 2020

Welcome to the LWN.net Weekly Edition for September 10, 2020

This edition contains the following feature content:

  • Notes from an online free-software conference: what it took to run the 2020 Linux Plumbers Conference online.
  • Preparing for the realtime future: stable-tree maintenance and CI testing once the realtime patches land upstream.
  • Profile-guided optimization for the kernel: applying profile-guided and link-time optimization to Linux.
  • Conventions for extensible system calls: designing system calls that can grow new features without new variants.
  • Lua in the kernel?: scripting netfilter and XDP with NFLua and XDPLua.
  • MagicMirror: a versatile home information hub: turning a household mirror into an information display.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (3 posted)

Notes from an online free-software conference

By Jonathan Corbet
September 4, 2020

LPC
The 2020 Linux Plumbers Conference (LPC) was meant to be held in Halifax, Nova Scotia, Canada at the end of August. As it happens, your editor was on the organizing committee for that event and thus got a close view of what happens when one's hopes for discussing memory-management changes on the Canadian eastern seaboard become one of the many casualties of an ongoing pandemic. Transforming LPC into a successful online experience was a lot of work, but the results more than justified the effort. Read on for some notes and thoughts from the experience of making LPC happen in 2020.

The first thing the organizers of a conference that is threatened in this way must do is to face the problem early and start thinking about how to respond. Your editor has talked with organizers from a few other events who didn't (or couldn't) do that; the result is that they were not fully prepared and the attendee experience suffered. In the case of LPC, planning started in March, fully six months ahead of the event — well before the decision to cancel the in-person event had been made. We needed every bit of that time.

Which platform?

An online event requires an online platform to host it. The Linux Foundation, which supports LPC in a number of ways, offered a handful of possibilities, all of which were proprietary and expensive. One cannot blame the Linux Foundation for this; the events group there was under great pressure with numerous large events going up in flames. In such a situation, one has to grasp at whatever straws present themselves. We, though, had a bit more time and a strong desire to avoid forcing our attendees onto a proprietary platform, even if the alternative required us to build and support a platform ourselves.

Research done in those early days concluded that there were two well-established, free-software systems to choose from: Jitsi and BigBlueButton. Either could have been made to work for this purpose. In the end, we chose BigBlueButton for a number of reasons, including better-integrated presentation tools, a more flexible moderation system, and a more capable front-end system (though, as will be seen, we didn't use that part).

BigBlueButton worked out well for LPC, but it must be said that this system is not perfect. It's a mixture of highly complex components from different projects glued together under a common interface; its configuration spans literally hundreds of XML files (and some in other formats). It only runs on the ancient Ubuntu 16.04 distribution. Many features are hard to discover, and some are outright footguns: for moderators, the options to exit a meeting (leaving it running) and to end the meeting (thus kicking everybody else out, disposing of the chat session, and more) are adjacent to each other on the menu and look almost identical. Most worryingly, BigBlueButton has a number of built-in scalability limitations.

The FAQ says that no BigBlueButton session should have more than 100 users — a limitation that is certain to get the attention of a conference that normally draws around 600 people. A lot of work was done to try to find out what the real limitations of the platform were; these included automated testing and running a couple of "town hall" events ahead of the conference. In the end, we concluded that BigBlueButton would do the job if we took care not to stress it too hard.

"Taking care" meant a number of things, starting with the elimination of plenary sessions. Such sessions have never been a big part of the event; LPC is focused on smaller groups where serious discussions happen and real work gets done, so this decision did not require major changes. An opening welcome message was recorded by 2020 LPC chair Laura Abbott and made available on YouTube. Not putting 500 or more people into a single "room" side-stepped many of the scalability concerns.

Beyond that, attendees were asked to keep their webcams turned off except when they were actively participating in the discussion; one of the quickest ways to find the scalability limits on both the server and client sides is to turn a lot of cameras on. Use of the "listen-only" mode was encouraged to reduce server load; this was probably unnecessary and also runs counter to the "everybody participates" goal of the event. Live streams were set up through YouTube as a way of further reducing load and providing a backup channel if everything went up in flames. The YouTube streams turned out to be useful for a number of people who were unable to register for the conference or just found it easier to watch that way.

The smoothness with which the actual conference ran made it clear that we need not have worried quite so much. LPC sessions routinely held well over 100 people, with the kernel development in Rust session drawing over 170 participants. The servers were never seriously stressed. Things only got bumpy during the "show off your beer" session held after the official close of the event, when 20 or more attendees had their cameras on at the same time; BigBlueButton just does not work well under that sort of load.

Creating the conference experience

BigBlueButton nicely handles the task of hosting videoconferences, but does not solve the whole problem of creating an online conference event. It is meant to run behind a front-end system that provides a landing page for incoming guests, controls access to rooms, and more. It ships with a system called Greenlight that handles these tasks, but Greenlight is not really meant for hosting conferences. It cannot, for example, present attendees with a set of available rooms and tell them what is happening therein. Thus, we ended up writing our own front-end system to take care of these tasks; it fetched the schedule from the main LPC web site and produced a version that allowed attendees easy access to the appropriate rooms. The front end also provided some management screens allowing the organizers to get a picture of what was going on. For the curious, the resulting code has been released at git://git.lwn.net/lpcfe.git.

Anybody who has been to both an in-person conference and a videoconference knows that the two are not equivalent. The organizing committee worried that we have all had more than our share of videoconferences in the past few months, and that enthusiasm for a conference that looked like yet another one would be limited. We knew that we would never be able to replicate the experience of an in-person event, but we wanted to get as close as we could.

Part of that was working on a set of practices to make the sessions as productive and interactive as possible. The "cameras off except when discussing" protocol worked well in that regard. It allows attendees to focus on the people who are actually talking, and popping up a video window is a clear signal that somebody has something to add. The chat and polling systems built into BigBlueButton also helped to increase interactivity and were quite heavily used.

In-person events often create conversations that continue in the hallway, a separate breakout room, or in a pub afterward. We were unable to provide the pub experience, but could try to provide some of the rest. Several "hack rooms" were set aside for any group that wanted to use them for side discussions; those, too, were heavily used. We also encouraged the scheduling of birds-of-a-feather sessions at the last minute for topics needing further discussion; those sessions saw a lot of real work done during the conference.

In an attempt to create a "hallway" for other discussions, a separate chat forum was set up using Rocket.Chat. This, too, proved to be popular, and it also served as a support channel for attendees during the event.

An online event must also address the problem that attendees are spread across time zones around the planet. LPC ran some polls to try to find an optimal time window, but ended up running four-hour days starting at 14:00 UTC — a slot that works reasonably well for the Americas and Europe, but which is rather less comfortable for attendees in much of Asia and Oceania. The Android microconference ran an extended BOF session outside of the normal schedule as a way of accommodating some of those folks; the conference as a whole probably could have benefited from more sessions at other times.

The decision to go with four-hour days was the only rational choice, though, even though it meant turning LPC into a five-day event. Videoconferences are tiring, and few people have the patience to sit through a full day even if the time window is favorable.

All told, nobody would ever confuse an online LPC for an in-person gathering, but we were able to preserve many of the aspects that have made LPC a productive gathering over the years. Attendees seemed happy with their experience. A number of them asked for some sort of video-presence mode to be available in the future, even after in-person gatherings become possible again.

Other points of interest

Some conferences have responded to the pandemic by simply canceling for this year; these include the Linux Storage, Filesystem, and Memory-Management Summit, Kernel Recipes, and the GNU Tools Cauldron. The Cauldron, though, subsequently accepted an invitation from LPC to run a track there. The result was an intermixing of GNU toolchain developers and their users on a scale that has rarely been seen; one can only hope that it will be possible to run other events together in the future.

One need not attend many in-person events to have the dubious pleasure of watching a presenter desperately trying to get their laptop working with the projector while the session's scheduled time ticks away. Putting all of those presenters onto a new and unfamiliar platform threatened to raise that experience to a whole new level. In an attempt to avoid that, committee members put together a set of documents on how to use BigBlueButton in both the presenter and moderator roles. Then, a set of training sessions was run to drive home the basics of operating the platform and running sessions. The fact that the sessions themselves ran quite smoothly is a testament to the effectiveness of this documentation and training.

We have all had painful videoconference experiences by now where some attendees simply cannot be seen or heard. In an attempt to make LPC a little better, the committee chose, for the traditional speaker gift, a kit consisting of a headset, camera, and lights, along with a few other goodies. The kits were appreciated and heavily used, with (one hopes) a corresponding improvement in audio and video quality.

The LPC committee struggles every year with the question of how many tickets should be made available. LPC is a working conference; letting it get too big makes it harder for people to find each other and runs counter to the overall goals of the event. The question takes a bit of a different form for an online event, though; the decision was made to allow more than the usual number of attendees for a number of reasons. In the end, nearly 950 people registered, making this by far the largest LPC event ever — and we still ended up turning people away.

Some people were surprised by this, clearly thinking that an online event could allow entry to an unlimited number of people. The registration cap existed for two reasons: concerns about having too many people in the "room" and the above-mentioned scalability worries. In the end, we could have allowed more people in. More people in the "room" do not inhibit the discussion, and the problems of pressing through a crowded hallway to reach the coffee or find a collaborator are not present. The servers handled the load just fine. But we had no way to know ahead of time how things would work out.

Of the nearly 950 people who registered, just over 800 attended at least one session during the event. The growing body of wisdom about online events suggests that no-show rates of up to 50% can be expected, so this is a relatively high turnout. Even for an online event, where travel is not necessary and the registration fee is minimal, people who register for LPC actually want to attend.

For the curious, the LPC online infrastructure ended up consisting of 17 virtual machines. Six of those were BigBlueButton servers; they had 32 dedicated CPUs each, which turned out to be ridiculously oversized. Half as many CPUs would still have been more than the system could use. For each BigBlueButton server there was a machine dedicated to creating the live YouTube stream for the principal room in that server. One machine held the Rocket.Chat server and one ran the front end; there was one more running an LDAP directory server providing login information to both. There was a coturn server for NAT traversal, and a monitoring host running Zabbix.

All told, the experience of running LPC as a virtual event was exhausting, but also rewarding. Despite our worries, we were able to pull off a successful event using 100% free software. Hopefully, by the time LPC 2021 comes around, it will be possible to meet in person again; should that not be the case, we will be in a much better position to run things virtually. One way or another, we'll get our community together to work out how to make free software even better.

Finally, your editor wrote this article and is fully responsible for any mistakes in it, but the work described above — along with the many other tasks required to run an event like LPC (developing the program, managing registrations and speaker passes, working with sponsors, etc.) — was done by the organizing committee as a whole. That committee was Laura Abbott, Elena Zannoni, Kate Stewart, James Bottomley, Christian Brauner, Jonathan Corbet, Guy Lunardi, Paul McKenney, Ted Ts'o, Steve Rostedt, and David Woodhouse. Thanks are due to the Linux Foundation for its support as well. The biggest thanks, of course, are owed to the attendees of LPC, without whom there would be no interesting conversations at all.

Comments (30 posted)

Preparing for the realtime future

By Jake Edge
September 9, 2020

LPC

Unlike many of the previous gatherings of the Linux realtime developers, their microconference at the virtual 2020 Linux Plumbers Conference had a different feel about it. Instead of being about when and how to get the feature into the mainline, the microconference had two sessions that looked at what happens after the realtime patches are upstream. That has not quite happened yet, but is likely for the 5.10 kernel, so the developers were looking to the future of the stable realtime trees and, relatedly, plans for continuous-integration (CI) testing for realtime kernels.

Stable trees

Since the realtime patch set will be fully upstream "any release now", a plan needs to be made for what will happen with the realtime stable (RT-stable) kernels, Mark Brown said to start his session. Currently, there are RT-stable kernels maintained for each of the long-term support (LTS) stable kernel versions; the realtime patch set is backported to those kernels. But once the patches are in the mainline, there will no longer be a separate realtime patch set to backport.

[Mark Brown]

He wondered if people should simply be told to use the mainline stable kernels if they want the realtime feature. If so, any realtime performance regressions that occur in the stable trees will need to be addressed and those fixes will need to be accepted by the stable maintainers. Realtime developers will need to help with any conflicts that arise in backporting fixes to the stable kernels as well.

Testing is another area that will need to be handled; in particular, realtime performance needs to be tested as part of the stable release process. Right now, Greg Kroah-Hartman largely outsources testing of specific use cases and workloads on stable kernels to those who are interested in ensuring those things continue to function well. Testing of realtime performance will need to be part of that.

Steven Rostedt was volunteered for the testing job by Clark Williams; Rostedt did not exactly disagree, noting that he had done that kind of thing in the past. Automating the realtime testing is something that needs to be done, he said. Ideally, each new stable kernel would be downloaded automatically, built, and run through a series of realtime-specific tests. Brown wryly noted that the next session in the microconference was on CI testing. He also said that it would make more sense to test the stable candidates, rather than the released kernels, so that any problems could be found before they get into the hands of users.

At that point Kroah-Hartman popped up to say that the realtime kernel is not unique in any way; "you're special just like anybody else". He will take regression fixes into the tree as needed and can provide various ways to trigger the building and testing of the kernels for realtime. Rostedt agreed that realtime is not special in any way from the perspective of the stable maintainers; but the realtime developers need to work out how to automate their testing.

Brown said that currently it is up to the RT-stable maintainers to apply the patches to a stable tree and manually test the resulting kernel. Kroah-Hartman suggested adding the realtime testing to the KernelCI infrastructure, so that it will be automatically built and tested whenever a stable candidate is released. Currently, the realtime patches are not merged into the stable tree right away, Rostedt said, because the stable changes often conflict with the realtime patches, but that should not be a problem once it is all upstream.

Getting into KernelCI is "very easy", Kroah-Hartman said, but Brown noted that the kind of testing that needs to be done for realtime is different than for other parts of the kernel. The realtime tests have performance criteria rather than functional criteria, Williams said. But Kroah-Hartman said that KernelCI has both functional and performance testing now, so there should be no real barrier to adding the realtime tests. Brown agreed, but said that someone needs to get the tests into a form that fits into the infrastructure.

As an example, Rostedt said that he runs a test that builds the kernel over and over again on multiple cores, while also running hackbench multiple times. All of that runs over a weekend, while he runs cyclictest with realtime tasks to record their latencies; he does not expect to find any latencies greater than 50µs. That kind of test would simply need to be packaged up and automated so that it can be run by bots of various sorts.
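For readers unfamiliar with what such a test actually measures, the sketch below is a minimal cyclictest-style wakeup-latency probe; it is only an illustration of the concept, not the harness Rostedt described, which uses cyclictest itself with hackbench and kernel builds providing the load.

    /* A minimal cyclictest-style wakeup-latency probe, shown only to
       illustrate what such a test measures; the real setup described
       above uses cyclictest itself, with hackbench and kernel builds
       running in parallel as load. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define INTERVAL_NS (1000 * 1000)   /* wake up every millisecond */
    #define BOUND_NS    (50 * 1000)     /* the 50µs budget mentioned above */

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 };
        struct timespec next, now;
        long long worst = 0;

        /* Run as a realtime task; requires root or CAP_SYS_NICE. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp))
            perror("sched_setscheduler");

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 100000; i++) {
            next.tv_nsec += INTERVAL_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            /* Sleep until the absolute deadline, then see how late we woke. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);
            long long lat = (now.tv_sec - next.tv_sec) * 1000000000LL
                          + (now.tv_nsec - next.tv_nsec);
            if (lat > worst)
                worst = lat;
        }
        printf("worst wakeup latency: %lld ns (budget %d ns)\n", worst, BOUND_NS);
        return worst > BOUND_NS;
    }

The real harness, of course, relies on cyclictest's more careful measurement across all CPUs while the compile jobs and hackbench runs provide the stress.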

Another question is whether realtime should have its own separate staging tree to try out new features, such as a new futex() interface, Rostedt said. Would it make sense to turn the current RT-stable tree into a "testing playground" for new features, he asked. If those features were deemed useful for the mainline, they could be backported to the stable kernels as well. But Williams wondered if it was time to "come back into the fold and not stay out in the cold"; he sees the value in an "RT-next" for development purposes, but does not think it would work well to support these features in earlier kernel series. While it did not come up in the discussion, those kinds of changes might also run afoul of the stable kernel rules about only fixing actual bugs.

Rostedt more or less agreed with Williams but noted that there is a kind of "catch-22" for API design, in that you cannot get a good API without users testing it, but that it is hard to get users to test without having a good API. Williams agreed that there is a problem there, but did not think backporting from RT-next would really help solve it—it is likely to just bring headaches for the realtime developers. Testers could build and use RT-next itself, he said.

The main thing that needs to happen after realtime is in the mainline is to make sure there is a team paying attention to it going forward, Rostedt said. That team would ensure that realtime does not get broken in the stable kernels. Williams asked if there would be designated handlers for realtime bugs, but Rostedt thought that, once again, there is nothing special about realtime once it gets upstream. People will report bugs in the usual fashion, and the stable maintainers will direct the bugs to the realtime developers as needed.

Now is a good time to get the automated testing in place, Sasha Levin said; it is more difficult to do that after the feature is in the mainline. Most of the RT-stable patches will apply automatically on the stable candidates at this point, Williams said, so those can be used to start working up the automated testing strategy. A plan soon formed to use Daniel Wagner's scripts for the 4.4-rt tree as a starting point to try to automatically merge the stable release candidates and the realtime patches; if that succeeds, then testing could be kicked off to see if there are any realtime-specific problems in the resulting kernel. Once realtime is in the mainline, the merging step would simply be dropped.

Continuous integration

As the first session wound down, it segued nicely into a look at CI for realtime in the mainline led by Bastian Germann. There is some automated testing in place for realtime, he noted, though it was apparently not well known: the CI-RT system. It is a Jenkins-based CI system that is tailored to the needs of testing the realtime kernel. There is one known lab running it, at Linutronix (Germann's employer), on hardware donated by members of the Linux Foundation Real-Time Linux project.

Realtime developers can configure tests in CI-RT via a Git repository. The results of the tests are reported on the CI-RT site and also by email to the developer who is running them. The kernels are built on a build server, then booted on the target hardware, which serves as the first level of test. After that, the system runs tests somewhat similar to what Rostedt had described earlier. It uses cyclictest on both idle and stressed systems; the stress is created by hackbench coupled with other processes, such as a recursive grep that will generate a lot of interrupts, he said. The cyclictest results are then recorded for the systems.

Once realtime gets into the mainline, the CI-RT system can be used as is, he said, just by reconfiguring the Git source being used. Beyond the mainline itself, there are some other trees that should get tested, including some that came out in the previous session, Germann said. The current release candidate for the mainline and linux-next should be tested; the stable kernels should be tested as well, including their release candidates as was discussed. The test frequency and duration will need to be established for each tree; for example, he suggested that linux-next could be tested for eight hours every night.

No other CI systems currently run realtime tests, he said, though Brown wants to get them working in KernelCI. Germann said that more labs should be testing the realtime kernel once it gets merged. That will cover more hardware as well as raise the awareness of realtime among kernel developers. In order for that to happen, the realtime project needs to support other CI systems; KernelCI support is in the works, but he asked if there are other test or CI systems that should have support for realtime tests.

After something of a digression into how to handle signing Git tags in an automated fashion, which was deemed undesirable, Nikolai Kondrashov suggested that CI-RT send its reports to KernelCI. He and others are working on collecting and unifying test results in a common database.

Germann asked about the kinds of data that could be collected; ideally, CI-RT would want to present more than just a "pass" or "fail" and would include the latency measurements that were used to make that determination. Currently, the schema only provides a way to report the status of the test, Kondrashov said, but there is a way to attach additional data. The project is trying to work with the developers and operators of the various testing systems to determine what additional information should be added to the JSON schema. Veronika Kabatova mentioned that the Red Hat Continuous Kernel Integration (CKI) project would be willing to start running realtime tests, which would come with integration into the KernelCI unified reporting for free.

Mel Gorman said that SUSE also runs a Jenkins-based CI system that uses some of the realtime tests as part of its performance testing. He had some suggested configurations for his MMTests that could be used to help with realtime testing. Those could be combined with hackbench or kernel compilation runs and cyclictest to determine if the realtime latency requirements are being met. It might make sense to integrate the realtime tests into some other existing testing client framework (such as MMTests), rather than trying to make multiple versions of those tests targeted at each different CI system, he said.

The various CI efforts tend to congregate in the #kernelci channel on freenode or in the automated-testing@lists.yoctoproject.org mailing list. Attendees plan to work with those groups to determine the right path forward in order to get more CI testing for the realtime kernel. Once the realtime patches are finally merged, the CI-RT system should provide a good starting point for CI testing moving forward.

As noted, these sessions were rather differently focused than most of those in the past. The final merging of the realtime patch set will make a big difference in how the project interacts with the rest of the kernel and the overall kernel ecosystem. It is important to get out ahead of the game with plans for stable-tree maintenance, along with ideas on how to make sure that the feature stays functional in the fast-moving mainline. The microconference would seem to have helped with both.

Comments (17 posted)

Profile-guided optimization for the kernel

By Jonathan Corbet
September 3, 2020

LPC
One of the many unfortunate consequences of the Covid-19 pandemic was the cancellation of the 2020 GNU Tools Cauldron. That loss turned out to be a gain for the Linux Plumbers Conference, which was able to add a GNU Tools track to host many of the discussions that would have otherwise occurred at Cauldron. In that track, Ian Bearman presented his group's work using profile-guided optimization with the Linux kernel. This technique, which he often referred to as "pogo", is not straightforward to apply to the kernel, but the benefits would appear to justify the effort.

Bearman is the leader of Microsoft's GNU/Linux development-tools team, which is charged with supporting those tools for the rest of the company. The team's responsibilities include ensuring the correctness, performance, and security of those tools (and the programs generated by them). Once upon a time, the idea of Microsoft having a GNU tools team would have raised eyebrows. Now, he said, about half of the instances in the Microsoft cloud are running Linux, making Linux a big deal for the company; it is thus not surprising that the company's cloud group is his team's biggest customer.

There was recently, he said, an internal customer working on a new Linux-based service that asked his team for performance help. After some brainstorming, the group concluded that this would be a good opportunity to use profile-guided optimization; the customer would have control of the whole machine running the service and was willing to build a custom kernel, making it possible to chase performance gains at any level of the system. But there was a small problem in that the customer was unable to provide any code to allow workload-specific testing.

Optimization techniques

From there, Bearman detoured into a "primer" on a pair of advanced optimization techniques. Profile-guided optimization is a technique where the compiler can be told how to optimize a program based on observations of its run-time performance. In any given program, some parts will be executed far more often than others; some parts, indeed, may not be run at all. Using profile data, the compiler can separate off the rarely used code and optimize its compilation for space. Hot code, instead, can be fully optimized and allowed to take more space in the process. Hot code and data can be laid out next to each other in the address space. The result is better performance overall, greater locality, better use of the translation lookaside buffer (TLB), and less disk I/O for paging.

Link-time optimization is a different technique. Normally, the compiler only sees one file at a time, and is thus only able to optimize code within that one file. The linker then assembles the results of multiple compilation steps into the final program. Link-time optimization works by allowing the compiler to process the entire program at once, delaying the optimization and code-generation steps until all of the pieces are available. The result can be significant performance improvements.

The two techniques can work together, he said, with impressive results. Some work to optimize a SPEC benchmark yielded a 5% performance gain with link-time optimization, and a 20% gain when profile-guided optimization was added as well. That is fine for a standalone application, but Bearman wanted to apply these techniques to the kernel, where few have dared to tread. Some digging around turned up two papers published by Pengfei Yuan (and collaborators) in 2014 and 2015; the latter claimed an average speedup of 8%. So the technique seemed worth a try.
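As a concrete (if trivial) illustration of the workflow described in this primer, the sketch below shows how profile-guided optimization is applied to an ordinary user-space program with GCC; the kernel case discussed next requires considerably more plumbing, and the file name and flag choices here are only illustrative.

    /* hot.c: a trivial program to demonstrate the GCC PGO workflow
       sketched above (user space only; the kernel case needs more work):

           gcc -O2 -fprofile-generate hot.c -o hot   # instrumented build
           ./hot                                     # run a representative workload
           gcc -O2 -fprofile-use hot.c -o hot        # rebuild using the .gcda profile

       Adding -flto to both compile commands combines this with
       link-time optimization, as described in the talk. */
    #include <stdio.h>

    static long work(long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)    /* the hot loop the profile will find */
            sum += i % 7;
        return sum;
    }

    int main(void)
    {
        printf("%ld\n", work(100000000L));
        return 0;
    }

For the kernel, as described below, the profile data comes from gcov rather than from an instrumented user-space binary.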

Optimizing the kernel

The work was done on an Ubuntu 19.10 system, using the toolchain shipped with the distribution. Link-time optimization was not entirely straightforward to set up; in the end, some assistance from Andi Kleen, who has been working on kernel link-time optimization for years, was necessary. Getting profile-guided optimization working was relatively easy, instead, just requiring some trial-and-error work.

The team proceeded by instrumenting the kernel, then running it with various workloads of interest. The kernel supports profiling with gcov; that provided much of the information that was needed. He cautioned that anybody wanting to repeat this work should take care to turn off the profiling options and rebuild the kernel after the data has been gathered, or the result will not be as optimized as one might like — words that sounded like the voice of experience. Getting the profile data into the compiler was a bit challenging; GCC expects it to be in a specific location with a complicated name. One file crashed the compiler, and various "other glitches" were encountered as well.

Much of the testing was done with the Redis database on a 5.3 kernel. Just building the kernel with the -O3 option turned out to make performance worse. The kernel that had been optimized with link-time and profile-guided optimization, though, outperformed the standard kernel by 2-4% on all but one test (that last test regressed performance by about 0.5%). This, he said, was an impressive performance gain, especially since Redis doesn't actually spend all that much time in the kernel.

Bearman's conclusion from this work is that use of these techniques with the kernel is worth the trouble. Windows relies heavily on profile-guided optimization with its kernel, he said, and gets 5-20% performance improvements in return. Linux could perhaps get results of this magnitude as well. There is a "cyclic dependency" that is inhibiting the use of these tools with the Linux kernel; profile-guided optimization is not being heavily used, so people don't see the value in it. That results in compiler developers not putting in the effort to make it work better, so it remains unused. If more developers were to put in the effort to apply profile-guided optimization, perhaps that cycle could be reversed to the benefit of the entire community.

More information can be found in the slides from this presentation [PDF].

Comments (3 posted)

Conventions for extensible system calls

By Jonathan Corbet
September 8, 2020

LPC
The kernel does not have just one system call to rename a file; instead, there are three of them: rename(), renameat(), and renameat2(). Each was added when the previous one proved unable to support a new feature. A similar story has played out with a number of system calls: a feature is needed that doesn't fit into the existing interfaces, so a new one is created — again. At the 2020 Linux Plumbers Conference, Christian Brauner and Aleksa Sarai ran a pair of sessions focused on the creation of future-proof system calls that can be extended when the need for new features arises.

Brauner started by noting that the problem of system-call extensibility has been discussed repeatedly on the mailing lists. The same arguments tend to come up for each new system call. Usually, developers try to follow one of two patterns: a full-blown multiplexer that handles multiple functions behind a single system call, or creating a range of new, single-purpose system calls. We have burned ourselves and user space with both, he said. There are no good guidelines to follow; it would be better to establish some conventions and come to an agreement on how future kernel APIs should be designed.

The requirements for system calls should be stronger, and they should be well documented. There should be a minimal level of extensibility built into every new call, so that there is never again a need to create a renameat2(). The baseline, he said, is a flags argument; that convention is arguably observed for new system calls today. This led to a brief side discussion on why the type of the flags parameter should be unsigned int; in short, signed types can be sign extended, possibly leading to the setting of a lot of unintended flags.
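The sign-extension concern is easy to demonstrate in user space; the snippet below merely illustrates the C integer-conversion behavior being discussed, and is not taken from any particular system call.

    /* Why a signed flags parameter is dangerous: a negative 32-bit value
       that is sign-extended into a 64-bit register sets every high bit,
       which a kernel treating those bits as flags would interpret as a
       request for every flag it knows about. */
    #include <stdio.h>

    int main(void)
    {
        int flags = -1;          /* e.g. an error value passed by mistake */

        printf("sign-extended:  %#lx\n", (unsigned long)(long)flags);
        printf("zero-extended:  %#lx\n", (unsigned long)(unsigned int)flags);
        return 0;
    }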

Sarai took over to discuss the various ways that exist now to deal with system-call extensions. One of those is to add a new system call, which works, but it puts a big burden on user-space code, which must change to make use of this call. That includes checking to see whether the new call is supported at all on the current system and falling back to some other solution in its absence. The other extreme, he said, is multiplexers, which have significant problems of their own.

Yet another approach is to set a flag (on calls that support flags, obviously) and make the system call take a variable number of arguments; among other things, variadic calls are difficult for C libraries to support properly. It is necessary to pass all possible arguments, leading to passing garbage on the stack into the kernel. System calls could be designed with fixed-size structs as arguments, with the idea that these structs would contain enough padding to handle any future needs. The problem with this approach, he said, was that it requires an ability to predict the future, which turns out to be difficult to do.

Finally, he said, the problem could be solved by getting and using a time machine. This solution, too, suffers from practical difficulties.

Extensible structs

Brauner and Sarai are pushing a different solution that they call "extensible structs"; it is the approach used in the design of the openat2() system call. This mechanism works by marshaling parameters to a system call into a single C structure; a pointer to that structure and the size of the structure are passed as the parameters to the system call. That size parameter acts as a sort of version number. When the time comes to extend the system call in a way that requires passing new data to the kernel, new fields are added to the end of the structure, increasing the size. Those new fields are always designed so that a value of zero implies the behavior that existed before those fields were added.

When the system call is invoked from user space, the kernel compares the structure size passed in to its own idea of how big the structure should be. If the size from user space is smaller, that indicates that the caller expects to use an older version of the system call; the fields that user space did not provide are filled with zeroes and the call proceeds as usual, with no visible change in behavior. If, instead, the kernel's size is smaller than the structure passed from user space, then the kernel is the older side. In that case, all of the excess fields (from the kernel's point of view) are checked; if they are all zero, the call can proceed. Otherwise, the call fails with an E2BIG error, since user space is requesting functionality that the kernel does not know how to provide.
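The kernel side of this handshake can be sketched roughly as follows; struct foo_args and do_foo() are hypothetical, but copy_struct_from_user() is the helper the kernel provides to implement exactly this pattern for calls like openat2() and clone3().

    /* Rough sketch of an extensible-struct system-call entry point; the
       struct and function are made up for illustration. */
    #include <linux/types.h>
    #include <linux/uaccess.h>

    struct foo_args {                    /* hypothetical argument struct */
        __aligned_u64 flags;
        __aligned_u64 mode;
        /* new fields may only ever be appended here */
    };

    long do_foo(const void __user *uargs, size_t usize)
    {
        struct foo_args args = {};       /* zero-filled: absent fields read as 0 */
        int err;

        /* Copies min(usize, sizeof(args)) bytes, zero-fills the rest, and
           fails with -E2BIG if user space passed extra non-zero bytes. */
        err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
        if (err)
            return err;

        /* ... act on args.flags and args.mode as usual ... */
        return 0;
    }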

H. Peter Anvin questioned the use of the E2BIG return code. Sarai responded that its assigned meaning is "argument list too long"; it is a bit weird, he said, but "it makes sense if you squint".

[Aleksa Sarai, Arnd Bergmann, Christian Brauner, and Florian Weimer]

Brauner emphasized that the rules should prohibit any further multiplexer system calls. This is almost universally agreed on now. bpf() is not an extensible system call; it is a multiplexer that happens to use extensible structs. So bpf() is not an example of how system calls should be designed in the future. The GNU C Library developers have also made it clear that they would rather not see any more multiplexer calls, since they are hard to deal with at that level. Arnd Bergmann said that a number of developers try to block multiplexer system calls, but it is hard to see them all; that is why this rule should be in the documentation, Brauner replied.

Anvin asked why the structure size is passed as a separate argument rather than being embedded in the structure itself. Brauner replied that it feels more "C-like" to pass the size in the argument list, but that it doesn't matter that much in the end. The community should pick one convention or the other, though. Anvin said that the separate size can create problems if the struct is being passed through different user-space layers that may have different ideas of what its proper size is; that could result in passing bad data to the kernel.

Kees Cook, instead, said that confusion in user space is preferable to confusion in the kernel. There could be problems if the kernel reads the size from within the structure, then uses that size to read the structure itself into kernel space. That size might have changed in between the two reads, possibly opening up vulnerabilities within the kernel. Thus, he said, the size is better passed as a separate element. When asked which convention would work better for C libraries, Florian Weimer responded that he didn't have a strong opinion either way.
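Cook's worry is the classic "double fetch" pattern, roughly sketched below with made-up names; the point is that a size read from user memory can change between the time it is validated and the time it is used.

    /* Hypothetical (and deliberately buggy) kernel code embedding the size
       in the structure itself.  Another user-space thread can rewrite the
       size field between the two copy_from_user() calls, so any later code
       that trusts the size embedded in buf may run past what was validated. */
    #include <linux/errno.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    struct foo_hdr {
        u32 size;                     /* total size, including variable part */
        u32 flags;
    };

    long bad_do_foo(const void __user *uarg)
    {
        struct foo_hdr hdr;
        char buf[256];

        if (copy_from_user(&hdr, uarg, sizeof(hdr)))   /* first fetch */
            return -EFAULT;
        if (hdr.size > sizeof(buf))                    /* validate the size */
            return -E2BIG;
        if (copy_from_user(buf, uarg, hdr.size))       /* second fetch */
            return -EFAULT;
        /* buf now starts with a foo_hdr whose size field may no longer be
           the value that was checked above. */
        return 0;
    }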

Brauner said that, in the past, Al Viro has called extensible structs a "crap insertion vector", saying that they could be used to sneak new features into the kernel without review. That view doesn't hold up, Brauner said; as an example, the feature that prompted this reaction was indeed caught in review. There is not a problem there that is worse than with any other system call, he said.

This part of the discussion closed with Brauner saying that the new conventions need to be added to the documentation. The current documentation on adding system calls is not up to date; he tried to update it in the past, but that effort "went largely unnoticed". He has been trying to build more consensus around the extensible-structs idea; meanwhile, he has landed two new system calls using it. He will be working on a new version of the documentation patch that is intended to describe the current best practices as he sees them.

Probing feature support

Extensible structs might be a solution to the problem of adding features to system calls, but they do not, in themselves, address the other part of the problem: helping user space to figure out which features are actually supported on a given system. As Brauner noted, programs normally have to adopt a sort of trial-and-error approach, where they try to exercise each feature in question and see if it actually works. This process is painful at best, and it can become expensive, especially in libraries and short-running programs. It should be possible to do better. The conversation on just how to do better got started in this session, then continued in a birds-of-a-feather session later.

Sarai mentioned one proposal that has been circulating: add a "no-op" flag to the system call. The kernel would respond by returning a copy of the extensible struct with all valid flag bits set and non-flag fields filled with the highest supported value. A unique error code would be returned if the queried operation is not supported at all. Bergmann quickly pointed out that this flag would turn the system call into a sort of multiplexer with multiple functions, but noted that he could see the upside of doing things that way.

The alternative would be to create a new system call dedicated to checking which features are supported by other system calls. This would not be entirely straightforward to implement, since it requires adding a new infrastructure within the kernel for defining system-call features. Brauner noted that there are two specific types of extensibility that would have to be handled: adding a new flag, and increasing the size of the struct.

Bergmann suggested that a minimal solution might be to add a counter in the VDSO area exported by the kernel to user space; every time a feature is added, the counter would be incremented. Cook answered that it was a little too minimal; user space would benefit more from the ability to inquire about the availability of specific features. Weimer said that this would make writing portable software more difficult, since it would be necessary to test all possible permutations of features, but Cook responded that this problem already exists, and the ability to query features would just improve visibility. Mark Rutland suggested starting with a single system call defining exactly which information would be exposed; Brauner said that clone3() might be a good starting point.

There appeared to be a consensus that a separate system call is the right way to solve this problem, so the discussion turned to what this system call would look like. Brauner started with a suggestion that this new call would take the number of the system call of interest, and would return the current set of valid flags and struct size. Sarai pointed out that openat2() has two flags arguments, complicating the situation. Weimer, instead, said he would like to be able to query whether vfork() is available, but there's no flags argument there at all.

Mark Rutland suggested an API that would look something like:

    int sys_features(int syscall_no, u32 *map, size_t mapsize);

The kernel would treat map as a bitmap to be filled in describing the available features for the requested system call. Each system call would have a set of constants for each feature that may or may not be present, each corresponding to one bit that would be set in map if the feature is available. This proposal seemed to gain some support during the discussion.
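To make the proposal concrete, user-space code might probe for a feature along these lines; note that sys_features() does not exist today, and the bitmap size and bit numbering here are purely hypothetical.

    /* Hypothetical probe built on the proposed sys_features() interface;
       nothing here is an existing kernel API. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    int sys_features(int syscall_no, uint32_t *map, size_t mapsize); /* proposed */

    static int syscall_has_feature(int syscall_no, unsigned int bit)
    {
        uint32_t map[8];                       /* room for 256 feature bits */

        memset(map, 0, sizeof(map));
        if (sys_features(syscall_no, map, sizeof(map)) < 0)
            return 0;                          /* probing unsupported: assume absent */
        if (bit >= 8 * sizeof(map))
            return 0;
        return (map[bit / 32] >> (bit % 32)) & 1;
    }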

Brauner said that there is also a need for a tool to automate the process of adding a system call to the kernel — preferably one not written in Perl. It could perform a number of checks, including whether the struct size is described correctly and the sys_features() information is correct. Bergmann suggested extending the DECLARE_SYSCALL() macro in the kernel to take the bitmap of supported features as an additional argument.

The conversation wandered on for a little longer, but the form of the outcome was already clear. For the extensible structs portion, the main action item will be to update the documentation to reflect the new consensus (if there is truly a consensus) on how extensibility should be handled. For the feature-query system call, the task will be to write some code showing how it would actually work. Once the concept hits the mailing lists it may end up changing significantly. One can only hope that the end result gets things right, so we won't need a sys_features2() in the future.

Comments (33 posted)

Lua in the kernel?

By Jake Edge
September 9, 2020

Netdev

BPF is, of course, the language used for network (and other) customization in the Linux kernel, but some people have been using the Lua language for the networking side of that equation. Two developers from Ring-0 Networks, Lourival Vieira Neto and Victor Nogueira, came to the virtual Netdev 0x14 to present that work. It consists of a framework to allow the injection of Lua scripts into the running kernel as well as two projects aimed at routers, one of which is deployed on 20 million devices.

Neto introduced the talk by saying that it was also based on work from Ana Lúcia de Moura and Roberto Ierusalimschy of the Pontifical Catholic University of Rio de Janeiro (PUC-Rio), which is the home organization of the Lua language. They have been working on kernel scripting since 2008, Neto said, developing the Lunatik framework for Linux. It allows kernel developers to make their subsystems scriptable with Lua and also allows users to load and run their Lua scripts in the kernel.

Lua and sandboxes

Lua was chosen because it is a small, fast language, he said. It is also widely used as the scripting language in networking tools such as Wireshark, Nmap, and Snort. The talk focused on scripting two networking subsystems in Linux, netfilter using NFLua and the express data path (XDP) subsystem with XDPLua.

[Lourival Vieira Neto]

It is important that any scripting in the kernel not cause it to malfunction. Scripts should not be able to crash the system, run indefinitely, or corrupt other parts of the system. To ensure that, Lunatik uses the Lua virtual machine (VM) facilities for sandboxing the scripts so that they run in a safe execution environment, he said.

Lua scripts cannot address memory directly; they can only access it through Lua data types, such as strings and tables. All of the Lua types are allocated by the VM and garbage collected when they are no longer being used. But that is not enough to keep scripts from causing harm, since they could allocate so many objects that the rest of the system is starved of memory. A custom memory allocator that caps the amount of memory available to Lua scripts is used to avoid this problem.

Lua provides "fully isolated execution states", Neto said. Those states are initially created with only the language operators available in them; the developer of the subsystem can then determine which libraries get loaded for additional capabilities given to scripts. Those might be Lua standard libraries or specialized libraries, such as Luadata and LuaRCU; the former provides safe access to data external to the Lua VM, while the latter is a mechanism for sharing data between execution states. Both NFLua and XDPLua use Luadata to access packet data, for example.

Lua provides a single-threaded execution environment without any primitives, such as mutexes, for synchronization. That means the scripts cannot explicitly block the kernel, but they could still run indefinitely. Lua has a facility to interrupt a script after it has run a certain number of instructions, which is used by both NFLua and XDPLua. Multitasking is allowed by Lunatik via multiple execution states in the kernel.
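Both mechanisms come straight from the stock Lua C API; the user-space sketch below (which is not Lunatik, NFLua, or XDPLua code) shows a capping allocator and an instruction-count hook of the sort just described.

    /* User-space illustration of the sandboxing described above, using the
       plain Lua C API.  Build with something like: cc sandbox.c -llua -lm */
    #include <stdio.h>
    #include <stdlib.h>
    #include <lua.h>
    #include <lauxlib.h>

    #define MEM_LIMIT   (1 << 20)        /* cap this state at 1MB */
    #define INSN_LIMIT  100000           /* abort after 100k VM instructions */

    static size_t used;                  /* memory accounted to this state */

    /* Allocator that refuses requests once the cap is reached; Lua then
       raises a memory error inside the script instead of harming the host. */
    static void *capped_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
    {
        size_t old = ptr ? osize : 0;    /* osize is a type tag when ptr is NULL */
        (void)ud;
        if (nsize == 0) {
            used -= old;
            free(ptr);
            return NULL;
        }
        if (used - old + nsize > MEM_LIMIT)
            return NULL;
        void *p = realloc(ptr, nsize);
        if (p != NULL)
            used += nsize - old;
        return p;
    }

    /* Hook called every INSN_LIMIT instructions; raising an error here
       stops a runaway script. */
    static void insn_hook(lua_State *L, lua_Debug *ar)
    {
        (void)ar;
        luaL_error(L, "instruction limit exceeded");
    }

    int main(void)
    {
        lua_State *L = lua_newstate(capped_alloc, NULL);

        lua_sethook(L, insn_hook, LUA_MASKCOUNT, INSN_LIMIT);
        if (luaL_dostring(L, "while true do end") != LUA_OK)
            printf("script stopped: %s\n", lua_tostring(L, -1));
        lua_close(L);
        return 0;
    }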

Only network administrators with the CAP_NET_ADMIN capability can load scripts and access the execution states. Netlink sockets are used to transfer data between the kernel and user space; the capability is checked on each access, he said.

NFLua

NFLua is a netfilter extension that targets advanced layer 7 (application layer) filtering using Lua. Iptables rules can be applied at layer 3 (network) and layer 4 (transport) to send packets to NFLua; scripts can then be called to inspect the upper layers. Lua is already widely used by network operators for tasks including security and network monitoring, so it is a good fit for this kind of filtering.

NFLua is implemented as a loadable kernel module that contains the Lunatik framework, the Lua interpreter, and whatever libraries are being made available to execution states. Once it is loaded, the nfluactl command can be used to create a Lua state and to load Lua code into it.

He gave an example of a simple filter based on the User-Agent sent with an HTTP request. An iptables rule is used to direct packets to an execution state and a function in that state by name. Packets matching the rule (being sent to port 80) get passed to NFLua, which calls the named function with the packet data. The function looks up the User-Agent from the HTTP request in a table to determine whether to block it or not. The Lua function return value indicates whether netfilter should terminate the connection or allow it to proceed.

XDPLua

At that point, Nogueira took over; he described XDPLua as an extension for XDP that allows using Lua in the data path. It represents the natural evolution of NFLua to process packets before they are handled by the network stack. It creates one Lua execution state per CPU, so it can take advantage of parallelism on modern systems. One of the goals of the project was to add "expressiveness and dynamism" on the data path, so that programmers could create more complex applications to be loaded into the kernel.

XDP uses BPF, so XDPLua has added wrappers for the Lua C API as BPF helpers, allowing BPF programs to call Lua scripts. The XDPLua developers want BPF and Lua to cooperate, "so we can have the best of both of them", Nogueira said. They wanted the performance of BPF while maintaining the expressiveness of Lua.

[Victor Nogueira]

He quickly went through the same example as Neto. The Lua program is loaded into XDPLua, while a BPF program gets loaded into XDP. When a packet arrives at XDP, the BPF program can call the Lua function to determine whether to block or allow the processing of the request; if it is allowed, then the packet will be passed on to the networking stack.

Another example that he showed was processing cookie values, which are being used to distinguish bots from legitimate traffic, before the packet ever reaches the web server. On the first request from a particular client, the web server replies with a cookie value and some JavaScript to attach the cookie value to further requests. Since bots typically won't run the JavaScript, they will not have the proper cookie value.

When a new cookie is generated, the web server will pass its value and the source address to the Lua code, which stores it in a table. The code to actually handle the value is straightforward, simply extracting the cookie value and checking to see that it matches what is in the table. If it does not, the request is dropped before it ever reaches the web server. In addition, the XDP program will add the IP address to its block list so that no further requests will even need to consult the Lua program.

He also outlined an access-control example using server name indication (SNI) in TLS connection requests to restrict which domains can be connected to. This could be used to disallow local users of a network from accessing a forbidden site. Using a simple block list and function in Lua, along with a BPF program to recognize the TLS client hello message and call the Lua function, the SNI data can be checked from XDP.

Benchmarks

In order to gather some numbers, the access-control example was implemented for NFLua, XDPLua, and XDP (in BPF). The BPF version was difficult to write and turned out to be cumbersome to work with, he said, while the same Lua script is shared between NFLua and XDPLua. trafgen was then used to send TLS client hello packets, with SNI values that were in the block list, as quickly as possible. Two things were measured: how many connections per second are dropped on the server (the drop rate) and the CPU usage. It was a fully virtualized environment: both client and server ran on 8-core 3GHz CPUs with 32GB of RAM each, connected by a 10Gbps virtio network interface.

NFLua could drop roughly 0.5 million packets per second, while both XDPLua and XDP/BPF could handle around three times that rate (1.5Mpps). In addition, XDPLua and XDP/BPF both used roughly 0.1% of the available CPU, while NFLua used 50%. Nogueira said that NFLua only gets the packets once they have gone through the network stack and does not take advantage of the multiple cores, which may help explain the 500x difference in CPU usage. It is important to note that having XDP call out to Lua did not have a significant impact in terms of CPU usage, he said.

Neto returned to the video stream to wrap things up before the speakers took questions from attendees; roughly half of the 45-minute slot was devoted to Q&A. He noted that NFLua is used in 20 million home routers and that it is being used by network operators for security and monitoring tasks. The lessons learned from NFLua were incorporated into XDPLua, which is designed from the outset to work cooperatively with BPF, so that developers get the ease of use of Lua combined with the performance of BPF. XDPLua is currently used in Ring-0 Networks firewall products that are deployed as part of the infrastructure at internet point of presence (POP) companies on 10Gbps networks.

One problem area that they have faced is that XDP does not support extensions as loadable kernel modules. Netfilter supports that functionality, which has been beneficial for developing the Lua-based filtering mechanisms. Maintaining out-of-tree bindings in order to support XDPLua has been somewhat difficult.

Instead of using an in-kernel verifier, as BPF does, the Lua environments take a sandboxing approach to protect the kernel. The BPF verifier can be hard to work with, as they found when developing the XDP/BPF version of the access-control benchmark, Neto said. With that, they turned it over to questions.

Questions

Tom Herbert, who was shepherding the Netdev track, noted from the outset that it would be an uphill struggle to try to get this work merged into the mainline. The BPF verifier is part of what allows the kernel developers to be comfortable with XDP, so adding Lua to the kernel will require a similar effort to convince them that Lua is also safe. For example, the kernel cannot crash because Lua has accessed memory inappropriately; what is being done to prevent that? Neto reiterated that Lua does not access kernel memory directly—it has no pointer type. It can allocate memory, but that can be (and is) limited. Furthermore, the number of instructions can be limited, so that infinite loops are not possible.

Neto said that there are various places you can enforce the safety assurances: at compile time, load time, or run time. BPF does load time checks with the verifier, while Lua sandboxes its programs with its VM at run time. There could, of course, be a bug in the VM implementation, but that is also true with the BPF verifier.

Another question that will likely be asked, Herbert said, is why a Lua-to-BPF compiler could not be created; there are already compilers for C and P4, why not do that for Lua? You could perhaps turn Lua syntax into BPF, Neto said, but you cannot write a Lua VM that runs on BPF, so you wouldn't get all of the features that Lua can provide. The verifier purposely limits the BPF that can be run, so you can't write general-purpose code. You might be able to have some elements of Lua, but not the "full package" if you are targeting BPF.

Shrijeet Mukherjee said that having two VMs in the kernel was likely to be problematic; he suggested minimizing the Lua VM component in the kernel and to move as much as possible into user space. The BPF VM has momentum and acceptance; from a community perspective, adding another will be difficult. Neto said that getting the Lua work upstream is not necessarily the path being pursued; if XDP could provide a mechanism to allow dynamic extensions, like netfilter has, that could work as well. Herbert said that will be a hard sell; XDP started with the idea that it would be pluggable, but it is now simply a BPF hook.

Netdev organizer Jamal Hadi Salim said that there is a need for both scripting and compiled code for networking tasks. But there are political and technical hurdles to getting another programming environment added to the kernel. The security concerns are important, but he believes that Lua could meet the requirements, just differently than is done with BPF.

Mukherjee suggested that there might be a way to split things up, such that the in-kernel packet handling was done in XDP, while the policy handling was done with Lua in user space; the two could communicate through a shared BPF map. Packet handling is really in the kernel's domain, he said, but the policy aspects may not be. But, as Neto pointed out, that will add latency. They have tried that approach in the past, but the performance was such that they moved on to NFLua and then to XDPLua.

But Mukherjee wondered if caching the policy decisions in the kernel could avoid much of the added latency of consulting user space. The "basic stuff" could be handled in the kernel, while the "really complicated" pieces are handled in user space, with the results of those decisions somehow cached in the kernel. He was not sure that was a reasonable approach, but there may be a middle ground to be found that would still allow much of what Lua is providing without putting it into the kernel.

An attendee asked about the maturity of XDPLua. Neto said that it is running in production, but it is also still under development; there is no patch ready for upstream submission at this point, and some cleanup work needs to be done before that can happen. Another attendee pointed out that the system used for the benchmarks was overpowered, from a CPU standpoint, for the 10Gbps link speed, so the CPU-usage difference between XDP/BPF and XDPLua was not fully apparent. Neto agreed that more testing, including using slower virtual CPUs, needs to be done.

They are using the standard Lua, rather than the LuaJIT fork, Neto said, in answer to another question. Investigation of a "typed Lua" for compilation is something on the roadmap. That is the approach that the main Lua project is taking to compete with LuaJIT on performance. The Lunatik developers have avoided using LuaJIT directly because it is based on an older version of the language, but they are interested in pursuing the performance gains that could come with compiled and optimized Lua.

The entrenchment of BPF and its VM makes it rather hard to see how Lua could actually be added into the kernel itself. Getting hooks for other pluggable programming environments added to XDP might be a more plausible approach, though Herbert did not seem especially optimistic about that either, even if he (and others) found the Lua approach interesting and potentially useful. But, "it is a moonshot", Herbert said. Whether the XDPLua developers can overcome whatever resistance there will be remains to be seen, but it seems clear that there are at least some who are chafing at the restrictions of the BPF programming environment.

Comments (73 posted)

MagicMirror: a versatile home information hub

By John Coggeshall
September 7, 2020

Back in 2014, a Raspberry Pi enthusiast by the name of Michael Teeuw shared his build of a "magic mirror" with the world in a six-part series. The system consisted of a Raspberry Pi and monitor running a web browser in kiosk mode, with a web server that provided a dashboard interface — all stored in a custom-built case with a one-way mirror. Since his post, others around the world have built these devices for their home (including myself), forming both a community and an interesting open-source project. The recent release of MagicMirror2 (MM2) version 2.12.0 gives us an opportunity to learn more about where the project started and where it is today.

The MM2 project provides the software to convert what would otherwise be a normal household mirror into a valuable source of information. This information could take the form of drive times, train schedules, daily news, server loads, sports scores, or even the feed from the doorbell when someone is at the door. With the right know-how, the surface can even become interactive through the use of hand gestures or as a touchscreen.

Within two years of Teeuw's posts, the GitHub repository for the project had been forked over 500 times. By October 2016, The MagPi (the official Raspberry Pi Foundation magazine) had declared Teeuw's project a first-place winner in its "50 Greatest Raspberry Pi Projects" [PDF] issue.

In December 2016, Teeuw announced MagicMirror2 under the MIT license. MM2 was designed to replace the original code base; in the announcement, Teeuw explained why:

In the past two years, many community members worked on expanding the MagicMirror system allowing them to customize it to their needs. And while I admire this effort, I felt it needed some rethinking in order to grow beyond what was possible with the current version.

In another post Teeuw stated that MM2 was funded by a "successful entrepreneur", who contracted him to build a custom mirror as the centerpiece of their new home's living room.

MM2 is built using Electron and operates on a web-server-and-browser model. The project can be divided broadly into two segments: the core framework and the modules. The core framework provides WebSocket communication between the browser-based user interface (UI) and the backend services; modules then build on this framework to provide the functionality of the UI. MM2 ships with a handful of default modules; hundreds of third-party modules written by contributors are also available. The core MM2 project has 243 contributors to date; the most recent release came in July 2020. That release fixed a number of bugs, added the ability to configure MM2's log verbosity, and cleaned up several places in the code to make the project easier to maintain.
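
For a rough idea of how that split works in practice, here is a minimal sketch of a module's backend half, its node_helper.js file, which uses the framework's socket notifications to talk to the browser-based UI; the module and notification names are invented for illustration:

    /* node_helper.js for a hypothetical "MMM-example-feed" module */
    const NodeHelper = require("node_helper");

    module.exports = NodeHelper.create({
        // Called when the browser-side part of the module sends a
        // socket notification over the framework's WebSocket channel.
        socketNotificationReceived: function (notification, payload) {
            if (notification === "EXAMPLE_FETCH") {
                // Do the backend work here (read files, poll a local API, ...)
                const result = { updatedAt: Date.now() };
                // Push the result back to the browser-based UI.
                this.sendSocketNotification("EXAMPLE_RESULT", result);
            }
        }
    });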

Since Electron is cross-platform, in theory MM2 can run in most common environments; the project itself, however, exclusively supports Raspberry Pi devices (excluding the Zero series). The installation documentation provides several options for getting started, including cloning the repository, using a Docker image, or using a pre-configured Raspberry Pi OS (formerly Raspbian) image.

Building the interface for MM2 is straightforward for many (if not most) configurations. Some modules are more complicated than others, especially ones that interact with hardware or communicate with each other; more fiddling might be required in those circumstances. Everything is controlled by a JavaScript configuration file. Here is a screenshot of the MM2 interface, taken from a mirror I built for my home (rendered in a web browser):

[My MM2 Interface]

This interface is implemented with five MM2 modules: a community-contributed MQTT client for the dinner menu, along with the built-in clock, current weather, weather forecast, and news-feed modules. Each component has its own configuration in addition to the MM2 core options that are always available. Modules are defined in the modules section of the global MM2 configuration as an array of JSON objects.

Modules that provide a visual element can be positioned in the UI using the position option; MM2 divides the UI into 13 such positions. For example, bottom_bar places the headlines centered at the bottom of the screen, as shown in the screenshot. Multiple modules can be placed in the same position; they are rendered in the order they appear in the configuration.
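
As a simplified example of what such a configuration might look like, the following config/config.js sketch places the built-in clock module in the upper-left corner and puts headlines along the bottom of the screen; the feed URL is only a placeholder:

    /* config/config.js -- a minimal sketch, not a complete configuration */
    let config = {
        address: "localhost",
        port: 8080,

        modules: [
            {
                module: "clock",
                position: "top_left"
            },
            {
                module: "newsfeed",
                position: "bottom_bar",
                config: {
                    feeds: [
                        { title: "Example feed", url: "https://example.org/rss" }
                    ]
                }
            }
        ]
    };

    /* Exported so the server side of MM2 can load the same file. */
    if (typeof module !== "undefined") { module.exports = config; }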

The MM2 module framework provides each module with the ability to listen for and respond to notifications from other modules. For example, the newsfeed module has several notifications it listens for that can be used to control the feed in the interface (such as moving to the next article headline). This allows interaction with other modules like the community-contributed MMM-Gestures, which provides a mechanism to interface MM2 with infrared sensors that let the user "wave" their hand in front of the mirror to interact with it. Another module that I have personally found useful is MMM-homeassistant-sensors, which enables me to pull data from my Home Assistant hub into the MM2 UI.
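
A sketch of what the receiving end of such a notification could look like is shown below; the module and the GESTURE_DETECTED notification are invented for illustration, while ARTICLE_NEXT is meant to be the newsfeed module's advance-to-the-next-headline notification (the module's documentation has the exact names):

    /* MMM-example-gestures.js -- a hypothetical module, for illustration only */
    Module.register("MMM-example-gestures", {
        // Called whenever another module (or the core) broadcasts a notification.
        notificationReceived: function (notification, payload, sender) {
            if (notification === "GESTURE_DETECTED" && payload === "swipe_left") {
                // Ask the newsfeed module to advance to the next headline.
                this.sendNotification("ARTICLE_NEXT");
            }
        }
    });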

For users like myself, who would like to provide their dashboard display in multiple locations and across multiple devices, MM2 provides a convenient way to manage that configuration. I run one instance of MM2 in server-only mode, with two Raspberry-Pi-powered displays that connect to it in client-only mode. This allows me to manage a single configuration and provide a consistent interface across both devices.

One area where the project doesn't provide much is security: it offers no vulnerability-disclosure policy or security-focused documentation. A review of the project issue tracker did not turn up any reported security issues; it did show that the contributors keep the project's dependencies up to date (including security fixes), indicating at least an interest in security concerns. Running MM2 behind a NAT firewall is likely a good idea, with proper care taken to secure the underlying OS of the Raspberry Pi. For security within the LAN, MM2 allows users to configure lists of IP addresses that are authorized to access the server. It is also worth keeping in mind that, since the majority of the functionality comes from third-party modules not vetted by the project, care should be taken to ensure that those modules are safe before using them.
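
That list is set in the same config/config.js file; a fragment along these lines (the addresses are placeholders) would restrict access to the local machine and a single LAN subnet:

    /* Fragment of config/config.js -- addresses are placeholders */
    let config = {
        address: "0.0.0.0",     // listen on all interfaces
        port: 8080,
        // Only these clients may connect; the CIDR form covers a LAN subnet.
        ipWhitelist: ["127.0.0.1", "::1", "::ffff:192.168.1.0/120"],

        modules: [ /* ... */ ]
    };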

To get involved with the MM2 community, the first stop would be the project forums or perhaps the Discord channel; for contributors, the project also outlines the best ways to help out. Alternatively, if the goal is to add new functionality in the form of a module, the module-development documentation should have everything needed to do so. In my home, the MM2 devices have become an important part of daily life. Any readers interested in having a low-cost "magic mirror" of their own are encouraged to take a look at the project.

Comments (6 posted)

Page editor: Jonathan Corbet

Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds