
Leading items

Welcome to the LWN.net Weekly Edition for August 8, 2024

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


The complexity of BUSL transformation

By Joe Brockmeier
August 5, 2024

The Business Source License (BUSL) is a source-available license that "converts" to an open-source license after a period of time. In theory, this means that a few years after a version of a product is released under the BUSL, it becomes open source and is fair game for Linux distributions to package along with regular open-source projects. In practice, the license throws a few curveballs that require special consideration and caution, as the Fedora Project recently discussed.

The concept of proprietary-to-open has been around for quite some time. For example, Aladdin Enterprises developed Ghostscript under a similar scheme from 2000 to 2006. The company's proprietary version, Aladdin Ghostscript, was released under the source‑available (and misnamed) Aladdin Free Public License first and then later as Ghostscript under the GPLv2. That, however, was a two‑step process where the source release was performed under a different name and a clear change of license.

BUSL was, however, the first license to include a transformation clause rather than the licensor performing a separate release under the open-source license. It was introduced in the article "Introducing 'Business Source': The Future of Corporate Open Source Licensing?" by Michael (Monty) Widenius and Linus Nyman in 2013, and then applied in 2016 to MariaDB's database-proxy product, MaxScale.

The idea behind the license, according to the article, was to "harness the benefits of open source while generating sufficient income for the program's continued development". BUSL permits modification and redistribution of the code, but restricts "production use" of the software as long as the terms of the BUSL apply. After a period of time, generally four years, BUSL specifies that the code becomes available under the GPLv2 or later, or a compatible license.

Since the release of MaxScale under the license, a handful of other companies have adopted BUSL for some or all of their products as well. During its larval stage, BUSL-licensed software is a non-starter for inclusion in most Linux distributions. Now that some of the software is emerging from its non-free chrysalis, the question is whether and how it might be included as open-source software.

Transformation

Recently, Michal Schorm sought guidance about how to treat software once it had transformed to use a free license. Specifically, Schorm asked about the MaxScale 21.06 release on GitHub with a change date of June 3, 2024. He wanted to know how Fedora packagers should represent the license in the binary package, and if there were other special considerations for formerly BUSL software.

Neal Gompa wondered if the Software Package Data Exchange (SPDX) format, which is what Fedora now uses in its packaging to identify licenses, should add a way to specify that a license had converted. "Non-free licenses converting to FOSS licenses have been a thing for a few years now. Now we're starting to see stuff 'convert', we probably want a way to express that?"

Richard Fontana, a member of Red Hat's legal team who advises Fedora on legal issues, said that the question of how to categorize software that had transitioned to a free license had not come up before. He said that it was "'fine' in theory, but somewhat risky" to package the software for Fedora:

I imagine that in some cases it won't be clear whether a particular version mixes BUSL (at various stages of the process towards the "change date") and post-BUSL licenses.

If it was clear that the change date had been reached, then it might require some further action to document that beyond the license tag in the RPM, Fontana said. But, he said, "we can cross the bridge when we come to it -- or have we come to it?" Daniel P. Berrangé said "Yes & no". The branch shared by Schorm had files whose change dates had not yet passed, which was the risky scenario that Fontana worried about. "I presume this is because that branch will be getting [maintenance] bugfixes periodically and they'll be setting a new Change Date each time." But, if MariaDB had distributed a source tarball instead, it would be from a fixed point in time and easier to determine if the code was older than the change date.

Schorm followed up to say that he had been contacted by MariaDB, which had indicated that it would like to see MaxScale adopted by Linux distributions since its oldest major version had just reached its transformation date.

With that information, Berrangé said that the company could make things simple by updating the branch for the old version and replacing the license and file headers with GPL instead of BUSL. The company should then release an updated tarball with appropriate bug fixes to make the licensing entirely unambiguous, "and thus [trivially] accepted by any distro maintainers". If it is clear that software has passed its change date, Fontana said, then its BUSL heritage is not an obstacle to packaging in Fedora. Packagers should make it clear that the operative license is the one allowed for Fedora and use that license in the RPM spec file.

Vít Ondruch asked about security fixes. If a security vulnerability is discovered and fixed in a later version of the software, should that patch be reimplemented? Fontana said that might be an option, but Berrangé cautioned that it could be tricky if "the 'reimplemented' patch ends up being effectively (or actually) identical to the official patch".

Up in the air

The Fedora conversation trailed off without any concrete plan of action for handling security updates or other maintenance for ex‑BUSL software. It seems almost certain that packagers would be responsible for implementing their own bug and security fixes for any given release, effectively becoming a fork of the upstream. Then, when another version reaches its change date, the packagers would need to reconcile the homegrown fork with the next upstream release.

All of that, of course, could be avoided if companies chose open-source licenses in the first place. Failing that, they could take a cue from Aladdin's playbook and actively maintain open-source releases under a different name that don't have the same complexities of BUSL‑licensed software.


Divvi Up: privacy-respecting telemetry aggregation

By Daroc Alden
August 2, 2024

There is ongoing discussion about the ethics and effectiveness of telemetry following some recent LWN articles that touched on Thunderbird's use of opt-out telemetry and planned metrics in Fedora. The Internet Security Research Group (ISRG), the nonprofit behind Let's Encrypt, has a potential solution to the problem of how to collect and aggregate telemetry without violating users' privacy. The scheme is based on a draft protocol being standardized with the Internet Engineering Task Force (IETF), and has an open-source implementation available.

The ISRG's proposed solution is called Divvi Up. It's based on an existing research system from Stanford University called Prio. Unlike previous attempts to mitigate the privacy risks of telemetry with techniques like differential privacy, Prio ensures that as long as at least one participating server is honest, the aggregation servers learn "nearly nothing" about the users. In this case, "nearly nothing" is a cryptographic term of art which means that malicious servers can only learn a small and precisely bounded amount of information, depending on the choice of aggregation function. For simple sums and averages, malicious servers learn no additional information. Once the statistics have been aggregated by the servers, they can be made available publicly with no way to see or infer individual reports.

Specifically, Prio is designed to maintain three security properties: anonymity, privacy, and robustness. Anonymity means that the servers should not be able to determine which client submitted which individual piece of data. Of course, the server can observe that a client submitted some data, but not determine what the data was. Privacy is the slightly stronger guarantee that a server learns "nearly nothing". Finally, robustness means that a malicious client can't influence the final aggregated statistics any more than it could just by submitting an incorrect but valid measurement. For example, if the aggregation servers were counting the number of positive answers to a yes/no question, a malicious client could lie about whether its answer was yes or no, but it couldn't submit a result that makes the count entirely invalid, such as subtracting from the total.

The cryptography

Optional cryptographic detail

Prio's secret-sharing scheme is simple. As a parameter of the protocol, select a large finite field (such as the integers modulo some large prime number). To split a secret S into N shares, choose N-1 random elements of the field, then set the final share to S minus the sum of those elements, so that all N shares add up to S. Since the first N-1 elements are chosen independently and uniformly at random, any proper subset of the shares reveals nothing about S. Shares created in this way can be added together to get a share of the sum.

Prio uses a combination of several cryptographic techniques. The most important is secret sharing, a kind of encryption that allows a secret value to be split into N separate parts such that it's only possible to decrypt the value if one has all N parts; missing even a single part means the original value is unrecoverable. Secret sharing alone isn't enough to handle private telemetry, however.

Prio's chosen secret sharing scheme is homomorphic: it is possible to do math on two encrypted values and obtain the encryption of the result. For example, suppose that two clients hold secret values A and B, respectively. Each client can split its value into multiple pieces and send each piece to a different server. Each server can take the pieces it has, and combine them to learn a piece of A + B, without learning A + B itself. Finally, the servers can send each other their pieces of the sum, and put them together to learn A + B. But as long as at least one server keeps its shares of the original A and B values private, the servers can only learn the sum, and not the original values.
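The additive sharing and share-wise summation described here can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not Prio's actual code: the field (integers modulo 2^31 - 1), the toy random-number generator, and all function names are hypothetical choices made for the example, and a real client would need a cryptographically secure generator and a much larger field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy field for this sketch: integers modulo the prime 2^31 - 1. */
#define FIELD_P 2147483647ULL

/* Small deterministic generator so the sketch is reproducible.
 * It is NOT cryptographically secure. */
static uint64_t toy_rand(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (*state >> 33) % FIELD_P;
}

/* Split `secret` into n shares that add up to it modulo FIELD_P:
 * n-1 random elements, plus a final share that fixes up the sum. */
static void split_secret(uint64_t secret, uint64_t *shares, size_t n,
                         uint64_t *rng)
{
    uint64_t sum = 0;

    assert(n >= 2 && secret < FIELD_P);
    for (size_t i = 0; i + 1 < n; i++) {
        shares[i] = toy_rand(rng);
        sum = (sum + shares[i]) % FIELD_P;
    }
    shares[n - 1] = (secret + FIELD_P - sum) % FIELD_P;
}

/* Recombine all n shares into the value they represent. */
static uint64_t combine_shares(const uint64_t *shares, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = (sum + shares[i]) % FIELD_P;
    return sum;
}

/* Two clients hold a and b; each sends one share to each of two servers.
 * Every server adds the shares it received, and only those per-server
 * sums are combined at the end. */
static uint64_t aggregate_two_clients(uint64_t a, uint64_t b)
{
    uint64_t rng = 42;
    uint64_t share_a[2], share_b[2], server_sum[2];

    split_secret(a, share_a, 2, &rng);
    split_secret(b, share_b, 2, &rng);
    for (size_t s = 0; s < 2; s++)
        server_sum[s] = (share_a[s] + share_b[s]) % FIELD_P;
    return combine_shares(server_sum, 2);
}
```

Calling aggregate_two_clients(5, 7) yields 12, even though neither server ever held 5 or 7 on its own; each saw only a random-looking field element per client.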

In Divvi Up's proposed concrete implementation of this scheme, any project wishing to use Divvi Up's solution would host one aggregation server, and the Divvi Up project would host another. Any user would only have to trust that either the original project or Divvi Up won't share their data. And the underlying math does work with more than two servers, so projects wishing to be especially careful could potentially add more servers run by other trusted entities.

Prio's proofs of validity work by defining a fixed arithmetic circuit (like a boolean circuit, but with mathematical operators as gates instead of boolean operators as gates) that calculates whether a given value is valid. For linear operations, the aggregators can use their share of the input to calculate a share of the output of an intermediate gate. For multiplication operations, Prio uses a secure multiparty computation scheme invented by Donald Beaver to calculate shares of the output of intermediate gates. Then, once the aggregators have shares of the output, they can combine them and check whether the answer was "valid" or "invalid".
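As a rough illustration of evaluating such a circuit on shares, the sketch below checks that a secret-shared value x is either 0 or 1 by computing shares of x * (x - 1) with a Beaver multiplication triple. It is a simplified sketch under several assumptions: two servers, a toy field modulo 2^31 - 1, a hard-coded triple, and a circuit output that is revealed directly. How the triple is produced and verified, and the rest of Prio's proof machinery, are outside its scope; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD_P 2147483647ULL   /* toy prime field, 2^31 - 1 */

static uint64_t add_mod(uint64_t a, uint64_t b) { return (a + b) % FIELD_P; }
static uint64_t sub_mod(uint64_t a, uint64_t b) { return (a + FIELD_P - b) % FIELD_P; }
static uint64_t mul_mod(uint64_t a, uint64_t b) { return (a * b) % FIELD_P; }

/*
 * x0 and x1 are the two servers' shares of x (x = x0 + x1 mod FIELD_P).
 * Returns 1 if x * (x - 1) == 0, that is, if x is 0 or 1.
 */
static int shared_bit_is_valid(uint64_t x0, uint64_t x1)
{
    /* Shares of a fixed triple a = 11, b = 23, c = a * b = 253.
     * These values are arbitrary and chosen only for reproducibility. */
    const uint64_t a0 = 4, a1 = 7, b0 = 20, b1 = 3, c0 = 250, c1 = 3;

    assert(x0 < FIELD_P && x1 < FIELD_P);

    /* Linear gate: shares of v = x - 1. Only server 0 subtracts the
     * constant, so the shares still add up correctly. */
    uint64_t u0 = x0, u1 = x1;
    uint64_t v0 = sub_mod(x0, 1), v1 = x1;

    /* Each server masks its inputs with its triple shares; the masked
     * values d = u - a and e = v - b are then published. */
    uint64_t d = add_mod(sub_mod(u0, a0), sub_mod(u1, a1));
    uint64_t e = add_mod(sub_mod(v0, b0), sub_mod(v1, b1));

    /* Multiplication gate: u * v = c + d*b + e*a + d*e.
     * Server 0 also adds the public d*e term. */
    uint64_t z0 = add_mod(add_mod(c0, mul_mod(d, b0)),
                          add_mod(mul_mod(e, a0), mul_mod(d, e)));
    uint64_t z1 = add_mod(add_mod(c1, mul_mod(d, b1)), mul_mod(e, a1));

    /* Combine the output shares and test the circuit's result. */
    return add_mod(z0, z1) == 0;
}
```

Shares (1000, FIELD_P - 999) encode the value 1 and pass the check, while shares (1, 1) encode 2 and fail it. The values d and e that the servers exchange are masked by the triple, so they do not reveal x.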

The scheme described above has two main limitations, however: clients can submit bogus values, and the servers cannot compute anything more complicated than the total or average value of a statistic. To remedy the former, Prio introduces a new technique that allows the client to write up a machine-checkable proof that its input value is valid (for example, that the encrypted value it is submitting is within an expected range) and then split the proof itself into separate shares. The servers can then check that the proof holds, and therefore that the input is valid, without being able to learn anything more about the input. Unlike traditional zero-knowledge proofs, the fact that at least one server is assumed to be trustworthy lets the proof-validation mechanism be much more efficient. Notably, it doesn't require expensive public-key cryptography, just a fixed amount of arithmetic modulo a large prime number.

Finally, the Prio paper describes how to extend these techniques to calculate affine functions of the input. This allows servers to calculate not just sums and averages, but also standard deviations or even full least-squares regressions to the data. If implemented correctly, the protocol will let projects get nearly every kind of useful measurement, while providing much stronger guarantees for user privacy than other techniques.
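To see why plain sums are enough for a statistic like the variance, consider the sketch below, which derives a mean and a variance from nothing more than the aggregated count, sum of x, and sum of x squared. It assumes that each client would submit both x and x squared for private summation; the secret sharing and validity proofs are left out entirely, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The only values the aggregators would need to publish. */
struct moments {
    uint64_t count;
    uint64_t sum;      /* sum of x   */
    uint64_t sum_sq;   /* sum of x^2 */
};

/* Stand-in for the private aggregation step: plain summation. */
static struct moments aggregate_moments(const uint32_t *x, size_t n)
{
    struct moments m = { 0, 0, 0 };

    for (size_t i = 0; i < n; i++) {
        m.count++;
        m.sum += x[i];
        m.sum_sq += (uint64_t)x[i] * x[i];
    }
    return m;
}

static double mean_of(const struct moments *m)
{
    assert(m->count > 0);
    return (double)m->sum / (double)m->count;
}

/* Population variance: E[x^2] - (E[x])^2. */
static double variance_of(const struct moments *m)
{
    double mu = mean_of(m);

    return (double)m->sum_sq / (double)m->count - mu * mu;
}
```

For the measurements 2, 4, 4, 4, 5, 5, 7, 9 the sums are 40 and 232, giving a mean of 5 and a variance of 4; the standard deviation is then the square root of the variance.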

The standardization

The ideas presented in the original Prio research paper were picked up by the IETF's Privacy Preserving Measurement working group, which has been working on a standard called the Distributed Aggregation Protocol (DAP) using them since 2021. Another draft standard — Verifiable Distributed Aggregation Functions (VDAFs) — describes extensions to DAP supporting different specific implementations of aggregation. Between them, the standards specify the important real-world details not present in the original paper, such as the format for encoding telemetry data, and how protocol participants ought to send messages to one another.

In DAP, messages are sent between clients and servers using HTTPS. Errors are reported as normal HTTP status codes. To reduce the burden on clients, messages for each participating server can be submitted in a single HTTPS request to any server — the separate shares that should not be combined are protected by another layer of public-key encryption so that they can only be decoded by their recipient server, and the servers gossip among themselves to distribute the shares.

DAP currently supports two different choices of VDAF: Prio, as discussed above, and Poplar, an aggregation function designed to work with textual data instead of numeric data. Specifically, Poplar permits counting how many submitted strings match a given prefix, which could be useful for sampling data like user-agents or distribution names.

The protocol also permits using other privacy-preserving techniques in conjunction with its aggregation mechanisms. For example, even though differential privacy provides much weaker guarantees than Prio, there's no reason that a telemetry system could not use both; the standard includes specific recommendations for how to do that. DAP also supports submitting telemetry to the aggregation servers using Oblivious HTTP, a technique that uses public key encryption and a proxy run by a different organization to make it hard to tell where a request was submitted from.

The standard also deliberately leaves open the possibility of authenticating clients. While a properly-constructed privacy-preserving aggregation function like Prio or Poplar ensures that clients cannot submit invalid data, the whole scheme is still vulnerable to Sybil attacks, where someone invents large numbers of fictitious users. Techniques like OAuth or Privacy Pass can be used to ensure that telemetry is only accepted from known clients when that makes sense for an application.

Existing work

In 2018, Mozilla experimented with using Prio for telemetry collection. At the time, there was no standard, and no other organizations looking to deploy it, so Mozilla ran both aggregation servers itself. This obviously does not provide a meaningful privacy benefit, but it did validate the concept, and show that it could be useful in a real-world telemetry collection system. With Divvi Up offering to run independent aggregation servers, the protocol being standardized, and MPL-2.0 licensed libraries for clients and servers available, privacy-preserving telemetry aggregation is now much more feasible.

On August 1, Divvi Up released an introduction on how to get started with its libraries and services. The announcement includes details on "a new divviup command line tool which can both integrate with the DAP endpoints and the Divvi Up control plane to configure accounts, create tasks, upload telemetric data, and run aggregations more succinctly." As befits a project that requires independent servers, the announcement also explains the structure of the server implementation, and links to information on setting up an independent DAP server.

People object to telemetry for many reasons, including privacy concerns, wasted network bandwidth, and being a poor substitute for conversations with users. Whether new protocols like DAP will tip the tradeoff offered by telemetry will undoubtedly depend on how each person balances those different concerns — but in an environment where some have called for open-source software to use telemetry to avoid "fighting with one hand behind its back", anything that helps reduce the privacy problems inherent in the technology is undoubtedly a good thing.


Maximal min() and max()

By Jonathan Corbet
August 1, 2024
Like many projects written in C, the kernel makes extensive use of the C preprocessor; indeed, the kernel's use is rather more extensive than most. The preprocessor famously has a number of sharp edges associated with it. One might not normally think of increased compilation time as one of them, though. It turns out that some changes to a couple of conceptually simple preprocessor macros — min() and max() — led to some truly pathological, but hidden, behavior where those macros were used.

min() and max() for the kernel

Your editor's well-worn, first-edition copy of The C Programming Language introduces the preprocessor with this example:

    #define max(A, B) ((A) > (B) ? (A) : (B))

The hazards that come with a macro like this, such as the double evaluation of the arguments, were pointed out in the text. Still, that did not prevent kernel developers from making use of it; as covered here in 2001, there were over 150 definitions of min() and max() matching the above pattern in the 2.4.8 kernel.
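The double-evaluation hazard is easy to demonstrate. The following sketch uses the macro exactly as quoted above; the counter and helper functions are hypothetical names added only to make the effect visible.

```c
#include <assert.h>

#define max(A, B) ((A) > (B) ? (A) : (B))

static int calls;   /* how many times next_value() has run */

static int next_value(void)
{
    calls++;
    return 10;
}

/* Returns the number of times the first argument was evaluated. */
static int evaluations_in_max(void)
{
    calls = 0;
    /* Expands to ((next_value()) > (3) ? (next_value()) : (3)) */
    int m = max(next_value(), 3);

    assert(m == 10);
    return calls;
}
```

Here next_value() runs twice: once for the comparison and once more to produce the result. With an argument like i++ or a function that has side effects, that second evaluation changes program behavior.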

At that time, Linus Torvalds decided that a consolidation made sense; he added a single set of those macros meant to be used throughout the kernel. He also changed the interface, though, adding a type parameter describing how the comparison is to be performed — signed or unsigned integer, for example. The goal was to increase correctness, but the immediate effect was to break compilation throughout the kernel; the result was a classic linux-kernel flame war of the type that, fortunately, tends not to happen anymore.

Despite the complaints, the changes stuck, but only briefly. When the 2.4.10 release came about in September 2001, it included a change described as: "make the three-argument (that everybody hates) 'min()' be 'min_t()', and introduce a type-anal 'min()' that complains about arguments of different types". The max() and min() macros were back to their old form, but the definition had changed; they now looked like:

    #define min(x,y) ({ \
	const typeof(x) _x = (x);	\
	const typeof(y) _y = (y);	\
	(void) (&_x == &_y);		\
	_x < _y ? _x : _y; })
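A small sketch shows what this form buys. It is adapted from the definition above, with __typeof__ substituted for typeof so that it also builds in strict ISO modes; it still relies on the GNU C statement-expression extension supported by GCC and Clang. The counter and helper functions are hypothetical.

```c
#include <assert.h>

/* Each argument is copied into a local exactly once. The pointer
 * comparison has no run-time effect; it exists so that the compiler
 * warns when the two arguments have different types. */
#define min(x,y) ({ \
	const __typeof__(x) _x = (x);	\
	const __typeof__(y) _y = (y);	\
	(void) (&_x == &_y);		\
	_x < _y ? _x : _y; })

static int calls;   /* how many times next_value() has run */

static int next_value(void)
{
    calls++;
    return 10;
}

/* Returns the number of times the first argument was evaluated. */
static int evaluations_in_min(void)
{
    calls = 0;
    int m = min(next_value(), 3);

    assert(m == 3);
    return calls;
}
```

The argument is now evaluated only once. Mixing types, as in min(1, 2UL), still compiles, but GCC reports a "comparison of distinct pointer types lacks a cast" warning, which is how the macro complains about mismatched arguments.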

Unsurprisingly, the complexity of these macros only grew from there as developers added more features for flexibility and type safety. Numerous variants have also been added for special cases. Recently, this series from David Laight, merged for the 6.7 kernel, made min() and max() work properly in numerous cases where the two arguments have different types. All seemed well after that, and nobody felt a compelling urge to change these macros for at least three development cycles.

Maximal expansion

But, then, Arnd Bergmann observed that the time required to compile the kernel had grown considerably in recent releases, and that the preprocessor had a lot to do with it; one file took a full 15 seconds just to get through the preprocessor stage. The problem came down to a single line of code in arch/x86/xen/setup.c:

    extra_pages = min3(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
    		       extra_pages, max_pages - max_pfn);

To see how this came about, it is worth looking at the 6.10 definitions of the min() and max() macros and their variants, all of which come from include/linux/minmax.h. To start with, min3() returns the minimum of three values; its implementation is straightforward enough:

    #define min3(x, y, z) min((typeof(x))min(x, y), z)

That uses our old friend min(); indeed, it nests one min() call inside another. In 6.10, min() looks like this:

    #define min(x, y) __careful_cmp(min, x, y)

The __careful_cmp() macro tries hard to perform a type-safe comparison while evaluating the arguments only once; it also endeavors to expand to a constant expression if its arguments are constant expressions. That leads to a certain amount of complexity, implemented this way (best read from bottom to top):

    #define __cmp_op_min <
    #define __cmp_op_max >

    #define __cmp(op, x, y)	((x) __cmp_op_##op (y) ? (x) : (y))

    #define __cmp_once_unique(op, type, x, y, ux, uy) \
	({ type ux = (x); type uy = (y); __cmp(op, ux, uy); })

    #define __cmp_once(op, type, x, y) \
	__cmp_once_unique(op, type, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))

    #define __careful_cmp_once(op, x, y) ({			\
	static_assert(__types_ok(x, y),			\
		#op "(" #x ", " #y ") signedness error, fix types or consider u" #op "() before " #op "_t()"); \
	__cmp_once(op, __auto_type, x, y); })

    #define __careful_cmp(op, x, y)					\
	__builtin_choose_expr(__is_constexpr((x) - (y)),	\
		__cmp(op, x, y), __careful_cmp_once(op, x, y))
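The selection between the two branches can be tried out in isolation. The sketch below is a simplified stand-in, not the kernel's code: is_constexpr() is modeled on the kernel's __is_constexpr() helper, the other macro names are hypothetical, and the type checking and unique-identifier generation are omitted. It depends on GNU C extensions (statement expressions, __builtin_choose_expr(), and sizeof(void) being 1), so it assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates to 1 only when x is an integer constant expression. In that
 * case the first pointer is a null pointer constant, so the conditional
 * has type int * and the sizes match; otherwise it has type void *. */
#define is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

/* Plain comparison: may evaluate its arguments twice. */
#define simple_min(x, y) ((x) < (y) ? (x) : (y))

/* Single-evaluation comparison for non-constant arguments. */
#define once_min(x, y) \
	({ __typeof__(x) _a = (x); __typeof__(y) _b = (y); _a < _b ? _a : _b; })

/* Pick one of the two forms at compile time. */
#define chosen_min(x, y) \
	__builtin_choose_expr(is_constexpr((x) - (y)), \
		simple_min(x, y), once_min(x, y))

static int literal_is_constexpr(void)
{
    return is_constexpr(3 + 4);
}

static int variable_is_constexpr(int n)
{
    return is_constexpr(n);
}

static int min_of(int a, int b)
{
    return chosen_min(a, b);    /* takes the once_min() branch */
}

static size_t buffer_size(void)
{
    char buffer[chosen_min(16, 32)];    /* takes the simple_min() branch */

    return sizeof(buffer);
}
```

Note that both branches are still fully expanded by the preprocessor before the compiler picks one, which is part of why each argument ends up appearing so many times in the output.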

Depending on the expressions passed in, this means that min3() can end up generating a fair amount of code. Even if one expects a large expansion, though, the actual amount may lead to significant eyebrow elevation: the single line of code shown above expands to 47MB of preprocessor output. Bergmann explained this result this way:

It nests min() multiple levels deep with the use of min3(), and each one expands its argument 20 times now (up from 6 back in linux-6.6). This gets 8000 expansions for each of the arguments, plus a lot of extra bits with each expansion. PFN_DOWN(MAXMEM) contributes a bit to the initial size as well.
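The arithmetic behind that figure is geometric growth: three levels of nesting with 20 copies of the argument per level gives 20 × 20 × 20 = 8000 copies, compared to 6 × 6 × 6 = 216 with the 6.6 macros. The effect can be reproduced on a small scale with a hypothetical macro, DUP3(), that uses its argument three times; stringifying the result lets the program count the copies that the preprocessor produced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for min(): uses its argument three times. */
#define DUP3(x) ((x) + (x) + (x))

/* Two-level stringification so the argument is expanded first. */
#define STR(x) #x
#define XSTR(x) STR(x)

static size_t count_char(const char *s, char c)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (*s == c)
            n++;
    return n;
}

/* Number of copies of the innermost argument after nesting DUP3()
 * `depth` levels deep (supported for depths 1 to 3). */
static size_t copies_after_nesting(int depth)
{
    switch (depth) {
    case 1: return count_char(XSTR(DUP3(a)), 'a');
    case 2: return count_char(XSTR(DUP3(DUP3(a))), 'a');
    case 3: return count_char(XSTR(DUP3(DUP3(DUP3(a)))), 'a');
    default: return 0;
    }
}
```

The counts are 3, 9, and 27: each extra level multiplies the output by the macro's fan-out, so a fan-out of 20 turns a three-deep nest into thousands of copies of an argument that may itself be a sizable expression.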

Kernel developers, as a rule, care deeply about efficiency; that is especially true when it comes to the time required to do a kernel build. So it is unsurprising that this problem attracted some attention once it came to light.

Minimizing the problem

Lorenzo Stoakes brought the issue to the linux-kernel mailing list, showing how the 6.7 changes had made compilation time worse. Laight posted a patch series one day later that attempted to mitigate the problem. That series improved compilation time, though not enough to completely make up for the build-time regressions seen. It also ended up provoking some warnings from the test bots, and some of the changes to the macros made some developers (including Bergmann) nervous; those macros have reached a level of subtlety that makes people reluctant to change them. Torvalds, too, was uncomfortable with some of the changes, but he also wondered if they were the right approach to take in the first place:

I do get the feeling that the problem came from us being much too clever with our min/max macros, and now this series is doubling down instead of saying "it wasn't really worth it".

He later suggested simply reverting the 6.7 changes even though the previous code was "stupid and limited and caused us to have to be more careful about types than was strictly necessary" but, as Stoakes pointed out, a lot of code in the kernel has since come to depend on the new functionality that those changes added. Reverting them now would not be a straightforward task.

So Torvalds decided to take a bit of a different approach after observing that many of the worst expansion cases were, in the end, relatively simple constant expressions. Rather than try to fix the existing complex macros, he just added a couple more with a familiar look to them:

    /*
     * Use these carefully: no type checking, and uses the arguments
     * multiple times. Use for obvious constants only.
     */
    #define CONST_MIN(a,b) ((a)<(b)?(a):(b))
    #define CONST_MAX(a,b) ((a)>(b)?(a):(b))

By the time these macros landed in the mainline they had naturally gained just a little complexity (and new names):

    #define MIN_T(type,a,b) __cmp(min,(type)(a),(type)(b))
    #define MAX_T(type,a,b) __cmp(max,(type)(a),(type)(b))

He converted a number of the worst expansion cases to use the new macros just prior to the 6.11-rc1 release, then merged a patch taking away the ability for min() and max() to work as part of a constant expression. That simplified the code somewhat at the cost of making the macros unsuitable for use in places where constants are needed, but the new macros can be used instead in such situations.
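The reason the simple macros remain usable where constants are needed is that they expand to an ordinary conditional expression. The sketch below mirrors the definitions shown above, with the helper macros renamed (cmp and cmp_op_* are hypothetical names) to stay clear of identifiers reserved for the C implementation; the array and enum are likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define cmp_op_min <
#define cmp_op_max >
#define cmp(op, x, y)	((x) cmp_op_##op (y) ? (x) : (y))

#define MIN_T(type,a,b) cmp(min,(type)(a),(type)(b))
#define MAX_T(type,a,b) cmp(max,(type)(a),(type)(b))

/* A file-scope array size must be a constant expression. */
char scratch[MIN_T(size_t, 64, 128)];

/* So must an enumeration constant. */
enum { LARGER = MAX_T(int, 3, 7) };

static size_t scratch_size(void)
{
    return sizeof(scratch);
}
```

A statement-expression macro could not be used in either declaration, since GCC only allows that construct inside a function. The price, as the comment on the original CONST_MIN() says, is that the arguments are evaluated more than once and the only type handling is the explicit cast.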

These changes will not entirely resolve the problem in cases where the expressions are not constant, so chances are that more tweaks to the regular min() and max() macros are in store. Meanwhile, though, we have had a convincing demonstration of the sorts of pitfalls that can accompany this sort of extensive use of the C preprocessor. It can accomplish some magical-seeming effects, but spells of this nature often have subtle and unpleasant side effects.


CRIB: checkpoint/restore in BPF

By Jonathan Corbet
August 7, 2024
The desire for the ability to checkpoint a process — to record its state in a form that can be restarted at a future time — on Linux is almost as old as Linux itself. See, for example, this announcement of a checkpoint project that appeared in LWN in 1998. While working solutions exist, they can be somewhat fragile and difficult to use; it is not surprising that some people are interested in finding a better alternative. A current effort goes by the name CRIB, for Checkpoint/Restore in (naturally) BPF. It is far from clear that CRIB will replace the existing solutions, but it is an interesting look at a different way of solving the problem.

A checkpoint/restore solution must overcome two challenges, neither of which is easy. On the checkpoint side, it is necessary to obtain a complete description of a process (or set of processes), with no important details overlooked; that requires collecting a lot of information that the kernel was not designed to export. On the restore side, that information must be used to recreate the checkpointed process(es), possibly on a different system, in such a way that those processes cannot tell the difference — once again, using interfaces that were not designed for this purpose.

Early efforts to implement checkpoint/restore functionality focused on the kernel. There was a patch set that first started getting serious attention in 2008 that added two new system calls (checkpoint() and restart()) to do all of the work. The former would write the entire state of a process to a given file, while the latter would restore the process from a file. This work added quite a bit of complexity to the kernel and never really got to the point where it could reliably checkpoint and restore processes. Kernel developers were concerned about the challenges of maintaining a feature that was widely intrusive even in its incomplete state, and about whether it would ever reach the needed level of completeness and reliability. More than two years later, this work was still being discussed, but there was a clear appetite for alternatives.

Moving to user space

In 2011, Pavel Emelyanov showed up with the first checkpoint/restore in user space (CRIU) patches that moved that work out of the kernel. This approach, which had its origins in the OpenVZ project, attracted a lot of interest, but it was still a long path toward a working solution. Kernel interfaces were not developed with the idea of providing enough information to completely describe a process, or to recreate a process that matches the original in every detail at a later time. It is worth thinking about all of the information that is needed, including:

  • All of the threads running within the process, where they are executing, their priority, and their signal-handling state.
  • A complete memory map of the process: which mappings exist at which addresses and the protections that apply to each.
  • A list of the process's open files, including the actual files that have been opened, whether each was opened for read, write, or append access, the current file position, and the file-descriptor number.
  • Every open network connection, who the peer is, which protocol is in use, and any in-transit data.
  • The configuration of the namespaces in which the process is running.
  • Active file notifications, terminal configurations, active timers, and no end of other details.

On the other side, any solution must be able to restore all of this data, creating a running process with no surprising changes in its configuration or environment.

Much of this information was already available, perhaps inefficiently, from the kernel via the system-call interface and /proc. But, in many cases, additional support was needed to get to a working solution; the developers behind the effort that eventually turned into CRIU spent years on that project. They added features like TCP connection hijacking and connection repair, more information in /proc, the kcmp() system call, time namespaces (so that a sudden time jump does not cause a restarted process to misbehave), and many others over time. In 2024, CRIU is, as a result of this long effort, a working solution with an active user and developer base.

CRIU is not without its shortcomings, though, many of which derive from the fact that CRIU must rely on a wide range of kernel interfaces that, for the most part, were not designed to support the checkpoint and restore operations. Checkpointing a process requires opening a large set of files in /proc and /sys, which slows things down considerably, and each of those files has its own special format that must be parsed. CRIU is easily broken as the kernel evolves. Newer features, such as io_uring, are difficult to support — if they can be supported at all. Similar challenges exist on the restore side of the problem.
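To give a feel for that parsing work, here is a minimal sketch that reads just one of those files, /proc/self/maps, and extracts the address range and permissions of each mapping. It is an illustration only and not taken from CRIU; the function name is hypothetical, it requires a Linux system with /proc mounted, and it ignores the offset, device, inode, and path fields that a real checkpoint tool would also need.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count the memory mappings of the calling process, and how many of
 * them are executable. Returns -1 if /proc/self/maps cannot be opened.
 * Each line looks like:
 *   start-end perms offset dev inode [path]
 */
static long count_own_mappings(long *executable)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[4096];
    long total = 0;

    *executable = 0;
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        char perms[5];

        /* Skip anything that does not parse as a mapping line. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        assert(start < end);
        total++;
        if (perms[2] == 'x')    /* perms is a string like "r-xp" */
            (*executable)++;
    }
    fclose(f);
    return total;
}
```

Every other file that a checkpoint needs, such as the per-descriptor entries under /proc/PID/fdinfo, has a different text layout and needs its own parser along these lines.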

Moving back into the kernel with BPF

Juntong Deng thinks that the solution to these problems lies in BPF; the current CRIB patches have been posted as a proof of concept. The core idea is that a user-space checkpoint application would load a BPF program to obtain the necessary information directly from the kernel. That program would marshal and format the data, then feed it to user space by way of a fast ring-buffer interface. The user-space piece would end up with an interface that provides functionality similar to the checkpoint() system call that was proposed years ago, without the complexity that CRIU must manage.

Of course, that complexity does not go away entirely; it is instead pushed into the BPF part of the system. A suitably privileged BPF program has read access to much of the kernel, so it could obtain a great deal of the needed information without any special support. It can simply read the kernel data structures directly. For the more complex cases where digging through kernel structures is not practical, special-purpose kfuncs can be provided for the BPF program to use. For example, this patch adds a kfunc called bpf_file_from_task_fd(), which will return the struct file pointer corresponding to a process's file descriptor. The function will also take a reference to that file so that it will not vanish while the BPF program is reading it. Many of the other added kfuncs are focused on obtaining data from network sockets, which tend to have a complex internal state.

Any BPF program used to checkpoint a process in this way is going to be strongly tied to a specific version of the kernel. In theory, BPF interfaces are not a part of the kernel's stable ABI, so the prospect of breaking checkpoint programs should not hinder ongoing kernel development. Kernel changes can break CRIU as well, of course, but CRIU depends on user-space interfaces that are not supposed to break; that suggests that a BPF-based checkpoint function might require more maintenance to keep up with the kernel. In exchange, though, it would have deeper and better access to the state of the processes it is checkpointing and should be quite a bit faster.

The restore side of the problem might prove to be a bit more difficult. While a BPF program can be given the ability to freely read data from the kernel's address space, the same is not true for writing data. There is a set of macros that goes by the name of BPF_CORE_READ() that BPF programs use to read data. Deng, in the patch cover letter, suggested the addition of an equivalent BPF_CORE_WRITE() set, saying:

I am not sure what the current attitude of the kernel community towards BPF_CORE_WRITE is, personally I think it is well worth adding, as we need a portable way to change the value in the kernel.

Followers of BPF development will not have been surprised when Alexei Starovoitov made the kernel community's position clear:

I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets. It will make bpf just as unsafe as any kernel module if bpf progs can start writing into arbitrary kernel data structures. So it's a show stopper.

Deng responded that, without the ability to write arbitrary kernel data, the restore functionality cannot be easily implemented, so development on CRIB would focus on the checkpoint side for now. Kumar Kartikeya Dwivedi disliked that idea, saying that it would be better to have the form of a restore solution in mind, even if it cannot be implemented immediately.

Various other details of the series were discussed; they seem much more amenable to an agreed-upon solution. The ability to restore arbitrary kernel data structures would appear to be a real sticking point, though; it is not clear which direction an acceptable solution will take. The restore process could be carried out mostly in user space, as is done with CRIU now, or another set of kfuncs could be added to facilitate the restoration of a process's state. While this work has been scheduled for discussion at the upcoming Linux Plumbers Conference, the amount of progress that can be made in a 15-minute slot will be limited. The best guide to where this project is headed will be found in future postings of the patch series.


Handling filesystem interruptibility

By Jake Edge
August 5, 2024

LSFMM+BPF

David Howells wanted to discuss changing the way filesystem code handles the ability to interrupt or kill operations, in order to fix some longstanding problems with network (and other) filesystems, in a session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. As noted in his session proposal, some filesystems may expect not to be interruptible, but call code that takes locks and mutexes in an interruptible (or killable) way, which effectively changes the state of the task incorrectly. He would like to find a solution for that problem.

The interruptibility here refers to signal handling. An interruptible process will respond to any signals that are not masked or ignored. Killable is a variant of interruptible that will only respond to fatal signals.

There are multiple places with locks and such that could be taken using the *_interruptible() and *_killable() variants, but those override the higher-level non-interruptible setting. Some kind of mass change is not really practical to address the problem, Howells said, so it will need to be done incrementally. He proposed a multi-year effort to switch to explicit begin and end functions to bracket non-interruptible regions, in a way that is analogous to how hardware interrupts are disabled. Code could disable interruptibility, which would be tracked with a counter, then reenable it when the critical section is finished.
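The counter-based bracketing that Howells described can be modeled in a few lines of Python. This is only a sketch of the nesting behavior, not kernel code; the names noninterruptible() and may_interrupt() are hypothetical, and a thread-local variable stands in for per-task state.

```python
import threading
from contextlib import contextmanager

# Hypothetical per-task state; a thread-local stands in for a field in the task.
_state = threading.local()


def _depth():
    return getattr(_state, "depth", 0)


@contextmanager
def noninterruptible():
    """Bracket a region in which interruptible waits must not be interrupted."""
    _state.depth = _depth() + 1      # "begin": one more level of disabling
    try:
        yield
    finally:
        _state.depth -= 1            # "end": reenabled only when the count hits zero


def may_interrupt():
    """What an interruptible lock variant could consult before giving up."""
    return _depth() == 0


log = []
log.append(may_interrupt())          # outside any region
with noninterruptible():
    with noninterruptible():         # regions nest, as interrupt disabling does
        log.append(may_interrupt())
    log.append(may_interrupt())      # still inside the outer region
log.append(may_interrupt())          # count is back to zero
# log is now [True, False, False, True]
```

The point of the counter is that a callee such as a network send routine does not need to know who disabled interruptibility, only whether the count is currently nonzero.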

For example, an overlayfs filesystem might include a network filesystem as one of its layers. The overlayfs might not take interruptible locks, but the network filesystem might do so, which results in operations that get interrupted in a way that overlayfs does not expect. Ted Ts'o thought that a change like what was described might be useful in some contexts, but did not think that "it would be something we would want to use all over the place". The interruptible status of a particular mutex, for example, is local to the code that takes it. Unlike the effort to switch to GFP_NOFS, where the eventual plan is to convert everything to use it, this change would only be needed for specific calls.

Both Kent Overstreet and Dave Chinner asked for more concrete examples of the problem being solved and how the code would need to change to accommodate Howells's proposal. The biggest problem he has encountered, Howells said, is that sendmsg() is interruptible, but that an NFS filesystem might be mounted as non-interruptible. "NFS thinks it is not interruptible, but it is because it is using the network interfaces that are." He noted that a conversion would eventually mean that many of the interruptible (and killable) variants of lock and mutex functions could be removed.

Chinner and others objected to that, saying that there will still be a need for those variants. There were also various objections because many of the calls to mutex_lock_interruptible() are not checked for an error return, though multiple people were talking at once, which made the discussion somewhat hard to follow. Al Viro was also concerned about deadlocks resulting from the changes proposed.

Viro said that handling signals (such as from someone using control-C) is the responsibility of the caller of the network function; an NFS mount with -o hard does not want or expect its operations to be interrupted, though, Howells said. However, calling mutex_lock_interruptible() is only applying the interruptibility to that specific call, Wedson Almeida Filho and Ts'o said, not to the whole region between it and the unlock call. Ts'o said that without a specific patch changing a particular code path where there is a problem, it will be difficult for attendees to determine whether it makes sense or not; meanwhile, he reiterated that he did not see a justification for a widespread change.

Instead of having a call to bracket the regions of non-interruptibility, Viro asked, why not just disable signals for the region? But Howells said that SIGKILL cannot be masked, though Christian Brauner pointed out that the kernel can mask that signal even though user space cannot.

Jan Kara agreed with the overall approach, saying that there is a real problem for callers who do not expect to get interrupted. But Brauner was concerned about how someone looking at sendmsg(), which is clearly interruptible, would be able to recognize that in some contexts it can be called in such a way that it is not interruptible. Howells acknowledged that could be a problem.

Chinner suggested having a variant of sendmsg() that is not interruptible, but Howells said that there are multiple calls like sendmsg() that are affected. "The documentation of the uninterruptible state is completely decoupled from where we need to apply that state", Chinner said. It would require large comments wherever these functions are being called, describing how that can happen and what code paths are affected. "Otherwise it is unmaintainable."

Viro said that it sounds to him like what Howells wants is to be able to suspend signal delivery in the network code at times. The TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE states are for sleeping processes, Viro continued, which get changed when the task gets woken up, but the state that is really desired "smells like 'I want signal delivery suspended'" until the end of the code region.

Ts'o agreed, noting that the change could be done without adding new infrastructure and a task flag. Howells said that manipulating the signal mask would affect other threads in the process, though, which Ts'o acknowledged as a problem. Viro said that an alternative might be to simply skip the thread in question when doing signal delivery; "basically it is a 'don't bother me'".

The session ran out of time as that was being discussed, but the picture that emerged is that patches are needed to focus the discussion. As of yet, there is no video for this session in the 2024 LSFMM+BPF playlist at YouTube.


Tracing the source of filesystem errors

By Jake Edge
August 7, 2024

LSFMM+BPF

There are lots of places in the kernel where an EINVAL can be returned to user space, but it is often unclear what the actual underlying problem is because the errno error codes are too generic. That is the problem that Miklos Szeredi wanted to discuss in a filesystem session that he led remotely at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. He would like to help those who are trying to debug problems trace where in the kernel a particular error code is being generated.

Filesystem mounting is an example of where this problem can occur, Szeredi said; there are lots of places where EINVAL is returned, so it does not really tell anyone anything. If he is debugging a kernel filesystem and receives an error, he wants to know where in the code that occurred. The strace tool is useful for debugging, so ideally whatever is done to help show where errors are coming from would integrate with it.

He does not think it would be difficult to add something along those lines, though it would be best if any solution did not require root privileges. He mentioned the existing solutions, which include messages in dmesg, but those are "not ideal for debugging". For filesystems that use the new mount API, there is a file descriptor returned from fsopen() that can be used to read error messages.

Another possibility is using ftrace, which can be used to trace the source of certain errors, but the tracefs interface is difficult to use, he said. The trace-cmd tool is a more user-friendly interface, but it does not yet support the funcgraph-retval option of the function_graph tracer, which is the way to get the information needed. It would also be nice if trace-cmd had an option to filter on negative return values, which should be easy to add. But it needs root privileges and has a global scope, which makes it difficult to integrate with strace.

He explored what an strace-friendly solution might look like. He suggested that an error descriptor could be added to struct task_struct so that when errors occurred, the code could use current->err_desc to store a string with information about where the error was generated. It should not be performance sensitive, he said, because the error path should not be followed frequently. He wondered if any messages that were added would need to become part of the kernel ABI, and thus be unchangeable, or if they could contain, say, a source file name and line number, which will obviously change.

As an experiment, he tried redefining EINVAL and the other error codes as macros that would create a tracepoint with the source file and line number, but ran into multiple problems doing so. For example, those values are used in switch statements, conditional expressions, and in pre-setting a variable with an error code; in each case, a macro replacement will cause compilation or other problems.

The alternative is to use different macros for the different kinds of uses of the error codes, so perhaps ERR_TRACE() in a return statement to place the tracepoint. More examples of the macros can be seen in his slides as displayed in the YouTube video of the session. The problems with that kind of change are that it would have to be done manually, would add complexity, and would cause a lot of code churn.
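The idea of wrapping an error code at the point of return can be illustrated in Python, where the caller's file and line are available at run time. This is a loose analogy rather than the macros from the slides; err_trace(), parse_mount_option(), and last_error_location() are hypothetical names, and a thread-local stands in for something like current->err_desc.

```python
import errno
import inspect
import threading

# Stand-in for per-task storage such as current->err_desc (an assumption).
_task = threading.local()


def err_trace(code):
    """Record the caller's file and line, then return the negative error code."""
    caller = inspect.stack()[1]
    _task.err_desc = f"{caller.filename}:{caller.lineno}"
    return -code


def last_error_location():
    return getattr(_task, "err_desc", None)


def parse_mount_option(option):
    """Hypothetical parser with two distinct places that return EINVAL."""
    if "=" not in option:
        return err_trace(errno.EINVAL)
    key, value = option.split("=", 1)
    if not value:
        return err_trace(errno.EINVAL)
    return 0


parse_mount_option("noequals")
first = last_error_location()
parse_mount_option("key=")
second = last_error_location()
# Both calls returned -EINVAL, but first and second name different source lines.
```

As in the kernel proposal, only return statements are wrapped; uses of the bare constant in comparisons or as initial values are left alone, which is why the conversion cannot be a blanket redefinition of EINVAL.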

But Amir Goldstein thought that the code churn would be "localized to the person of interest" because it would only be done to a subsystem by its maintainer if the maintainer is interested in getting the extra information. Kent Overstreet said that the refactoring could be done using Coccinelle, rather than manually. He also thought there might be some overlap with the infrastructure recently merged for the memory-allocation profiling work that he has been doing; he uses a large number of error codes in bcachefs that effectively encode the source file and line number in them, though they are mapped to regular errno values before being returned to user space.
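The scheme Overstreet described, with many private codes that collapse to ordinary errno values at the user-space boundary, might look roughly like this. The code names and the starting value of the private range are invented for illustration and are not taken from bcachefs.

```python
import errno
from enum import IntEnum

PRIVATE_BASE = 2048  # hypothetical start of a private range above the real errno values


class PrivateErr(IntEnum):
    # Each code is meant to be used at one site, so the code identifies the location.
    EINVAL_BAD_SUPERBLOCK_MAGIC = PRIVATE_BASE
    EINVAL_UNKNOWN_MOUNT_OPTION = PRIVATE_BASE + 1
    ENOSPC_JOURNAL_FULL = PRIVATE_BASE + 2


# Mapping applied just before an error leaves for user space.
_TO_ERRNO = {
    PrivateErr.EINVAL_BAD_SUPERBLOCK_MAGIC: errno.EINVAL,
    PrivateErr.EINVAL_UNKNOWN_MOUNT_OPTION: errno.EINVAL,
    PrivateErr.ENOSPC_JOURNAL_FULL: errno.ENOSPC,
}


def to_user_errno(code):
    """Collapse a private code to its errno class; pass standard codes through."""
    return _TO_ERRNO.get(code, code)
```

Internally, a log message or tracepoint can print the private name, while user space continues to see only the familiar EINVAL or ENOSPC.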

Aleksa Sarai was not entirely sure how useful it would be to simply get the single data point of where the error code was set. In his debugging experience, seeing the entire call stack, as you can with ftrace, has normally been needed.

Ted Ts'o said that it was not clear what the use cases were for this feature; was it meant to be for end users in some future RHEL kernel, for filesystem developers, or for some other use case entirely? There are different tradeoffs depending on the use cases, he said. As a user-space developer, Omar Sandoval said he would like to see an easy way to get a string that indicates where the EINVAL was generated, without having to deal with tracepoints or ftrace at all.

Szeredi said that he sees the feature as being targeted at developers who are debugging these problems, possibly remotely. That is the use case that he has personally encountered most frequently. He can see that it might also be useful for returning information to applications.

Christian Brauner said that he was probably responsible for adding 50 or more EINVAL returns to the kernel over the years; each time he does so, he wonders if he should also add a pr_info() call with some extra information. Traditionally, other kernel developers complain when an extra call with more information gets added, he said, but if he was working on his own project in user space, he would add them every time. They are seen as dmesg noise by some, however.

Goldstein wondered how the kernel could provide the C library (libc) with access to the string containing the source file and line number. User-space programs access errno and use strerror() to get more information, so somehow the kernel needs to provide any extra information via a mechanism that libc can access. Szeredi had proposed putting a string in struct ptrace_syscall_info, but libc does not have access to that.
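To see what user space has to work with today, the following sketch performs two unrelated invalid operations on a temporary file and collects what errno and strerror() report. It assumes a Linux system, where both calls are expected to fail with EINVAL; the function name is made up for the example.

```python
import errno
import os
import tempfile


def failing_calls():
    """Run two unrelated invalid operations; collect (errno, message) for each."""
    results = []
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        try:
            os.lseek(fd, -1, os.SEEK_SET)    # negative absolute offset
        except OSError as e:
            results.append((e.errno, os.strerror(e.errno)))
        try:
            os.ftruncate(fd, -1)             # negative length
        except OSError as e:
            results.append((e.errno, os.strerror(e.errno)))
    return results


results = failing_calls()
for code, message in results:
    print(errno.errorcode[code], message)
```

The two failures have nothing to do with each other, yet the caller receives the same number and the same message for both, which is the gap that any extra information from the kernel would have to fill.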

Overstreet said that the new error codes for bcachefs have "been enormously useful"; he tries not to reuse the codes at all so they effectively indicate the code location. David Howells said that the libc mechanism could be a new system call to retrieve the additional information; or user space could register a string buffer per thread with the kernel where that information could be stored. Ts'o returned to the use-case question, though; some users are only going to be interested in a high-level summary message, while developers may want a series of low-level failure messages. Those two use cases have different requirements and he was concerned that the discussion was getting complicated because it was trying to solve both at once.

Speaking of complications, Howells noted that there can also be errors coming from a remote source, for example from a network filesystem. That seemed to bring the discussion to a close, though it is rather unclear what, if anything, had been decided. The session was toward the end of the third and final day of the summit, so attendees may well have been worn out at that point.


CircuitPython: Python for microcontrollers, simplified

August 6, 2024

This article was contributed by Sam Sloniker

CircuitPython is an open-source implementation of the Python programming language for microcontroller boards. The project, which is sponsored by Adafruit Industries, is designed with new programmers in mind, but it also has many features that may be of interest to more-experienced developers. The recent 9.1.0 release adds a few minor features, but it follows just a few months after CircuitPython 9.0.0, which brings some more significant changes, including improved graphics and USB support.

CircuitPython is a fork of MicroPython (previously covered on LWN) with several changes designed to make the language easier for beginners to use. CircuitPython has a universal hardware API and is more compatible with the CPython standard library, for example. CircuitPython has some limitations compared to MicroPython, particularly with respect to concurrency, but is otherwise just as powerful.

CircuitPython 9.0.0, which was released in March, includes new graphics modules, such as jpegio, which decodes JPEG images, and bitmapfilter, which provides simple image-manipulation tools. That release also improves USB compatibility with Android devices, allowing users of these devices to more easily edit code on CircuitPython boards. Another new feature, involving both graphics and USB, is the usb_video module, which emulates a USB webcam and sends images back to the host computer.

History and goals

The first stable version, CircuitPython 1.0.0, was released in 2017, though the project was announced with a beta release earlier in the year. Since then, new versions have been periodically released; beyond just updates to CircuitPython, changes incorporated from MicroPython have generally been included as well.

Information on Adafruit's original reasons for forking MicroPython, rather than simply using it, is scarce. In an email reply to my inquiry, Phillip Torrone, Adafruit's managing director, passed along some information from the development team. The original purpose for creating a fork was to add support for SAMD21 microcontrollers. However, beyond just adding this new port, other features were released, including a new hardware API.

At the time the CircuitPython project was started, MicroPython's hardware API was not consistent across ports. To improve code portability, Adafruit's team decided to create a new, unified API. The MicroPython project, however, opted not to use the new API "because it prevents special chip-specific behavior from being exposed in the common API" and "can be slightly slower [...] due to an extra C function call".

This difference in APIs makes sense when looking at the main goals of each project. MicroPython describes the language as "a lean and efficient implementation of the Python 3 programming language that includes a small subset of the Python standard library and is optimised to run on microcontrollers and in constrained environments". In contrast, Adafruit's "What is CircuitPython?" page says: "CircuitPython is a programming language designed to simplify experimenting and learning to program on low-cost microcontroller boards." Adafruit recommends CircuitPython for users who "want to get up and running quickly" or "are new to programming".

The Adafruit team explained that, despite the differences between the two projects, they are still similar internally, and Adafruit does contribute changes to MicroPython when it makes sense to do so; Torrone also mentioned that the company financially supports the MicroPython project. Overall, since the projects' goals would be difficult to reconcile in one codebase, it seems that having two separate but related projects is a reasonable solution for both user communities.

Main features

CircuitPython supports most of the core language features of CPython, although some features are missing due to the limited resources found on microcontroller boards compared to the much more powerful computers on which CPython typically runs. Many of those missing features will be the same as those that a comparison between CPython and MicroPython reports. In addition, as CircuitPython's main README explains: "Modules with a CPython counterpart, such as time, os and random, are strict subsets of their CPython version. Therefore, code from CircuitPython is runnable on CPython but not necessarily the reverse."

This is a change from MicroPython, which prefixes module names with u (e.g. CPython's os becomes uos) and often adds incompatible features. This practice increases the difference from standard Python implementations, making it more difficult to move from one to the other. In contrast, CircuitPython's design of using one-way compatible modules for functionality that overlaps with CPython, and separate modules for additional features, helps users to learn which functionality is standard and what has been added by CircuitPython.

Of course, the main reason to run code on a microcontroller board is to interact with hardware. CircuitPython provides a unified API for all boards. For example, on any supported board with a built-in LED, the following code (taken from a tutorial) will blink the LED on and off for a half-second each:

    import board
    import digitalio
    import time

    led = digitalio.DigitalInOut(board.LED)
    led.direction = digitalio.Direction.OUTPUT

    while True:
        led.value = True
        time.sleep(0.5)
        led.value = False
        time.sleep(0.5)

A previous article on LWN, covering a talk given by software engineer Nina Zakharenko at PyCon 2019, gives examples of some more things CircuitPython can do, including controlling Adafruit's NeoPixel line of addressable RGB LEDs.

CircuitPython's APIs are extensively documented; additionally, the Adafruit web site provides a wide variety of tutorials and guides showing how to use various devices and build many practical projects with CircuitPython. (MicroPython's API is also well-documented, and appears to be more unified now than it was in 2016, when Adafruit decided to create a new API for CircuitPython, although there are still some differences between ports.)

CircuitPython has almost all of the language features that MicroPython does, although often with different APIs. The only major limitation is CircuitPython's support for concurrency; "interrupts and threading are disabled", as explained by the project's documentation. On certain boards, though, asynchronous programming is supported with the async and await keywords, and some modules do provide concurrency for limited tasks, "such as audio file playback".
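As a rough illustration of cooperative multitasking with async and await, the sketch below toggles two LEDs at different rates. It is written to run on CPython with a stand-in FakeLed class (a hypothetical name) so that no hardware is needed; on a board, the assumption is that an asyncio module is available and that the LEDs would be digitalio.DigitalInOut objects like the one in the blink example above.

```python
import asyncio


class FakeLed:
    """Stand-in for a digital output pin so the sketch runs without a board."""

    def __init__(self):
        self.value = False
        self.toggles = 0


async def blink(led, interval, count):
    """Toggle the LED count times, yielding to other tasks between toggles."""
    for _ in range(count):
        led.value = not led.value
        led.toggles += 1
        await asyncio.sleep(interval)    # other tasks run while this one waits


async def main():
    fast, slow = FakeLed(), FakeLed()
    # Both coroutines make progress in turn within a single thread.
    await asyncio.gather(blink(fast, 0.01, 6), blink(slow, 0.03, 2))
    return fast, slow


fast, slow = asyncio.run(main())
```

Since there are no interrupts or threads involved, a task that never awaits will starve the others; each loop needs to yield regularly.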

In addition to the standard library, there are many other libraries that can be used with CircuitPython; as with the standard library, each of these (most of which are device drivers) has its own documentation. All of them can be downloaded in a bundle from the CircuitPython web site.

Editing code

On most boards, editing code files is quite simple and requires no special software; after installing CircuitPython, the user can simply plug the board into a computer running any operating system that supports USB mass storage. This is the same protocol used by USB flash drives, so virtually all systems support it. The board will appear as a mass storage device called CIRCUITPY, and the user can place code on this device in a file called code.py. Saving a new version of that file will cause the board to restart and run it; it will also run whenever the board is power-cycled.

Having a serial console is also important for debugging and read–eval–print loop (REPL) access. Adafruit recommends that beginners use the Mu editor, which includes an integrated serial console. I prefer to use Neovim for file editing and the pyserial-miniterm utility provided by pySerial for serial access; any pairing of editor and console should work fine.

On a few boards, however, this standard workflow is not available; in addition, some users may want to develop code using a wireless connection, particularly for mobile devices (although it is possible to edit files over USB for Android devices, starting with CircuitPython 9.0.0, and for iOS devices, starting with iOS 13). For these and other cases where USB is unsuitable, two alternatives are available: the Bluetooth Low Energy (BLE) and Web workflows.

Both of these are supported by the online CircuitPython Code Editor; the difference is in the connection protocol. The BLE workflow, of course, connects using Bluetooth, while the Web workflow connects to the board using HTTP requests. The Code Editor also supports USB; the Web workflow should work in all modern browsers, while BLE and USB are only supported in Chrome-based browsers. I did not actually test the Code Editor, however.

Supported hardware

CircuitPython is available for a wide range of microcontroller boards, including the popular and inexpensive Raspberry Pi Pico and Pico W (a similar board with WiFi connectivity), and some Arduino boards, such as the Arduino Zero and several boards in the "Nano" family like the Arduino Nano 33 BLE. The classic Arduino Uno is not supported, however; its limited resources make a port of CircuitPython infeasible.

Additionally, many of CircuitPython's libraries have been ported to run on CPython on Linux through the Blinka project; this enables code written for microcontroller boards using CircuitPython to be used with few or no modifications on single-board computers (SBCs) such as the Linux-based Raspberry Pi boards.

Development

CircuitPython development takes place in a GitHub repository, forked from the MicroPython repository. It appears that most commits are from Adafruit employees and contractors, but pull requests are also merged from outside users.

CircuitPython's documentation has several pages explaining how to contribute to the project. The main "Contributing" page has useful information on several topics, including licensing (CircuitPython, like MicroPython, is released under the MIT license), code guidelines (the project largely follows MicroPython's guidelines to allow for upstreaming, but also has its own guidelines), and suggestions of ways to contribute.

Users interested in low-level work, particularly porting CircuitPython to new boards, will find helpful information on the Porting page. The page describes three core pieces that make up CircuitPython. The first is the Python virtual machine (VM), which is mostly copied from MicroPython. The VM is designed in such a way that there is typically little work needed to port it to a new processor.

The next component is referred to as the "supervisor". It initializes the hardware and filesystem, starts the user's code, then "monitors and facilitates" the Python virtual machine. A large portion of the work involved in porting CircuitPython pertains to the supervisor, because many of its tasks are closely related to the hardware.

Finally, there is the Common Hardware Abstraction Layer (HAL), which interfaces between the hardware and the standardized CircuitPython APIs. There is a separate page on porting this layer, which allows developers using CircuitPython to write code that will run on any supported board.

Conclusion

Overall, CircuitPython is a great choice for users who are new to programming. However, for projects that do not require concurrency or the higher performance of a compiled language, there is no reason for more experienced developers to avoid it; the simple API and easy development workflows benefit anyone who wants to get a hardware project working quickly. Complicated projects may be best served by using MicroPython directly, or by writing bare-metal code in a compiled language, perhaps using a development environment like the C++-based Arduino platform. However, for simpler projects, it definitely makes sense to consider CircuitPython.


Page editor: Jonathan Corbet


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds