LWN.net Weekly Edition for October 27, 2022
Welcome to the LWN.net Weekly Edition for October 27, 2022
This edition contains the following feature content:
- An ordered set for Python?: unlike other fundamental Python data types, sets do not implement ordering. The community is considering changing that through the addition of an OrderedSet type to the standard library.
- The Ghost publishing system: a free system for newsletter-oriented self-publishers.
- Would you like signs with those chars?: a fundamental change to the C language model used by the kernel.
- More flexible memory access for BPF programs: infrastructure to allow BPF programs to manage memory regions whose size is not known at load time.
- Accessing QEMU storage features without a VM: the qemu-storage-daemon makes a number of interesting things possible.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
An ordered set for Python?
Python has lots of different options for mutable data structures, both directly in the language and in the standard library. Lists, dictionaries (or "dicts"), and sets are the foundation, but two of those maintain an order based on how the elements are added, while sets do not. A recent discussion on the Python Discourse forum raised the idea of adding an ordered variant of sets; while it does not look like there is a big push to add the feature, the discussion did show some of what is generally needed to get new things into the language—and could well lead to its inclusion.
By their very nature, Python lists have always been ordered; they can also be indexed like arrays. On the other hand, Python dicts started off as unordered, so that adding two entries to a dict could result in either order when, say, iterating over the keys. Dicts would normally maintain the same order if no additions or deletions were made to them, but it was not guaranteed by the language. That all changed when a new implementation of dicts for Python 3.6 maintained the insertion order as a side-effect of a more memory-efficient algorithm. In Python 3.7, ordered dicts were adopted as part of the Python language, so all implementations have to support that feature.
There is also the longstanding collections.OrderedDict implementation in the standard library; it is optimized for reordering efficiency, which makes it a good choice for least-recently-used (LRU) caches, for example. The standard dict is optimized for mapping operations and insertion speed, so there are (still) good reasons to have both available. But, since the existence of OrderedDict pre-dated the switch to ordered dicts in the language, to some it seems like it might provide a precedent for an OrderedSet data structure in the standard library as well.
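OrderedDict's reordering operations are what make it a good fit for LRU bookkeeping. As a minimal illustration (this class is the article's own sketch, not anything from the standard library):

```python
from collections import OrderedDict

# A tiny LRU cache sketch built on OrderedDict's O(1) reordering;
# move_to_end() and popitem(last=False) are what make this cheap.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        value = self.entries[key]
        self.entries.move_to_end(key)   # mark as most recently used
        return value

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

Doing the same with a plain dict would require deleting and reinserting keys to reorder them; the specialized reordering methods are the reason both types still exist.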
But sets in both math and Python are just containers for some items—objects—with no duplicates. The operations available for sets are what would generally be expected: membership (using "in"), union, intersection, difference, subset, and so on. The order of the elements in a set is effectively random.
    >>> { 'foo', 'bar', 'baz' }
    {'baz', 'foo', 'bar'}
    >>> { 'foo', 'bar', 'baz', 0.2 }
    {'baz', 'foo', 0.2, 'bar'}
    >>> { 'foo', 'bar', 'baz', 0.2, None }
    {0.2, 'baz', 'foo', 'bar', None}
    >>> s = { 'foo', 'bar', 'baz', 0.2 }
    >>> s.add(None)
    >>> s
    {0.2, 'foo', 'baz', 'bar', None}

As can be seen, even adding the same elements in the same order can result in differences in the representation of the set.
OrderedSet
Some developers have use cases for sets that maintain the order in which their elements were added, however. Two Python Package Index (PyPI) modules, ordered_set and orderedset, came up in the discussion as evidence that the feature is useful. They both provide an OrderedSet class that acts like something of a cross between a set and a list; no duplicates are allowed, but, since the order is maintained, indexing can be done. Both started from an OrderedSet recipe by Raymond Hettinger, but modified it with different tradeoffs (one to a simpler Python implementation and the other using Cython for speed).
The idea of adding OrderedSet to the language was raised (this time) back in December by "Pyprohly". They noted that the lack of one can be worked around using a dict, but that it is ugly to have to do so. The message may have escaped much notice due to the holidays, but the topic was revived by Justin Gerber on October 21. He noted the PyPI packages, but thought it was unfortunate that a project would have to add an external dependency simply for an ordered set. The workarounds are suboptimal as well:
Say my ordered set contains 'apples' and 'oranges'. I can do

    ordered_set = dict()
    ordered_set['apples'] = None
    ordered_set['oranges'] = None

ordered_set will then act in some nice ways like an ordered set. For example we can do things like for element in ordered_set... and it will behave as expected. One major shortcoming is when we want to printout the ordered set: print(ordered_set) will reveal the ugly dictionary structure. This ordered_set variable also lacks typical set syntax like ordered_set.add('bananas'). I doubt set operations like unions or intersections are easily available either.
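Gerber's dict workaround can, of course, be wrapped up in a small class that hides the dictionary and restores set-like syntax. The sketch below is this article's illustration of the idea, not a proposed standard-library API:

```python
# Minimal ordered-set sketch over a dict, which preserves insertion
# order as of Python 3.7.  Illustrative only, not a proposed stdlib API.
class OrderedSet:
    def __init__(self, iterable=()):
        self._d = dict.fromkeys(iterable)

    def add(self, element):
        self._d[element] = None

    def discard(self, element):
        self._d.pop(element, None)

    def __contains__(self, element):
        return element in self._d

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __or__(self, other):          # union, keeping first-seen order
        result = OrderedSet(self)
        for element in other:
            result.add(element)
        return result

    def __repr__(self):
        return "OrderedSet(%r)" % list(self._d)
```

Iteration then yields elements in insertion order, and print() no longer exposes the underlying dictionary structure; the obvious downside is that every project needing this ends up writing (or depending on) something like it.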
He asked what kind of rationale would be required in order to add
OrderedSet to collections. But, "Laurie O" wondered
why adding external dependencies was such a problem; there are many
benefits and only one drawback, dependency resolution conflicts, that they
could think of. Steven D'Aprano noted
that external dependencies have more disadvantages than indicated; there
are organizations that cannot easily add them to their projects for legal
or procedural reasons, as well as
people who lack the reliable (and unfiltered) internet
access needed. "Whether due to economics or politics or some other
reason, the ability to just run pip install ... is not a universal privilege."
In another message, D'Aprano mentioned some of the use cases he sees for ordered sets, but he did recognize that there is a cost to adding the feature. Pyprohly also described their use case. Most of the uses rely on the predictable behavior for iteration or removing (via pop()) elements from the set that would come with an ordered version.
Gerber agreed
with D'Aprano's assessment of the downsides of external dependencies; he
had to submit a request to add a PyPI dependency for his project, for one
thing, but he also ran into some problems building and installing the
Cython-based package. "Suffice it to say I am having to pay a time cost
to include OrderedSet in my project. And the question remains, why
must I pay a cost for OrderedSet but not
OrderedDict?" He also outlined his particular use case, where
it would be convenient to get things out of the set in the same order they
were put in.
Feature addition
The process for turning an idea, like the one in the discussion, into an
actual feature in the language was
unclear to Gerber; he is new to Python development and wanted to
understand what "the cultural norm/bar is for including something in the
standard library and what sorts of considerations are typically made".
D'Aprano described
the path for a change of this sort:
- Gather as much community feedback as you can, at a minimum here on Discuss or the Python-Ideas mailing list. (You can also discuss it on places like Reddit, or any other forum you like.)
- If the community seems to get behind the idea, or at least not be strongly opposed, the next step is to ask for a sponsor from the core developers.
- If at least one core dev is willing to act as sponsor, you (or somebody) should write a PEP proposing the feature.
- The PEP then gets sent back here for additional rounds of feedback.
- If there is sufficient community interest in the PEP, the author then asks the Steering Council to accept the PEP.
- If they accept it, then somebody (often the PEP author) will implement it and add it to the std lib.
He did caution that a lack of a volunteer implementer could derail the process even after the council approves it. Gerber said that he would use PEP 372 ("Adding an ordered dictionary to collections") as a model for what would be needed in a PEP for the ordered-set feature. He also reported that the maintainer of the Cython-based OrderedSet PyPI package was in favor of adding the feature to the language, though obviously not the Cython version.
Pyprohly weighed
in again with a summary of the use cases and some additional reasons
why the feature makes sense, including that Java has a similar construct
available. Gerber reported
that the maintainer of the other PyPI package was also in favor of the
feature though it "should probably be different than the implementation in that package and that it should derive from new default dict orderedness".
Paul Moore pointed out
that there were now two third-party implementations of the feature that
did not want their code added to the standard library but
"who are in favour of someone else writing a stdlib
implementation
". That's completely reasonable, of course, but still
leaves open the question of who will do it:
This feels to me like a borderline case at the moment. No-one is particularly against the idea, but no-one is offering to do the work. I suspect that this will go nowhere unless/until someone writes an implementation. Maybe at this point the next step should be to create a PR [pull request] adding OrderedSet to collections. That might get some core dev attention and maybe a sponsor for a PEP.
That's where things stand at this point, but it has only been two days or so since Moore's response. Given Gerber's interest in seeing this feature get added, and that he would like to get involved with Python development, it seems likely that he will be looking to follow up on that plan. The lack of an ordered set is perhaps a bit of a language wart, and the feature should not be all that difficult to add (and maintain); perhaps Python 3.12, which is due next October, will come with collections.OrderedSet right next to its much older OrderedDict cousin.
The Ghost publishing system
Part of the early appeal of the World Wide Web was the promise that anybody could create a site and publish interesting content to the world. A few decades later, that promise seems to have been transformed into the ability to provide content for a small number of proprietary platforms run by huge corporations. But, arguably, the dream of widespread independent publishing is enjoying a resurgence. The Ghost publishing platform is built around the goal of making publishing technology — and the ability to make money from it — available with free software.

Ghost is an MIT-licensed application written primarily in JavaScript; it has been under development since 2013. The project is owned by the non-profit Ghost Foundation, which appears to be mainly funded by an associated Ghost(Pro) hosting business. The entire platform appears to be free software; this is not an open-core offering.
At a first glance, Ghost looks like yet another blogging platform, providing the ability to create, edit, and publish articles. Tied closely to that platform, though, is the ability to send articles via email newsletters; Ghost seems to be firmly positioned as a free alternative to operations like SubStack. Support for paid subscriptions is also built in, with the ability to define multiple subscriber levels. Much of the documentation and in-system help provided by Ghost is aimed at helping users create and monetize their content with the platform.
Getting started with Ghost
Your editor decided to give it a try, following the provided instructions on an Ubuntu 22.04 platform; the result can be seen (through the end of October 2022 or so) at ghost.lwn.net. (Note that not a lot of effort has gone into fine-tuning that site's appearance, so the result is arguably even uglier than the regular LWN site). The instructions are clear enough and almost worked; the MySQL user account was not set up properly and had to be fixed by hand. Everything else, including the (seemingly) inevitable curl|bash step and the procurement and installation of a Let's Encrypt SSL certificate, simply worked.
The documentation falls down slightly at this point, though, in that it doesn't provide a next step. That step is to go to the site's dashboard page (under /ghost on the newly installed site), but nothing appears to provide the link to that page. Once that page is found, it becomes possible to play with themes, configure the look of the site, manage user accounts, and get into many of the other relevant details.
Posting an article starts in a fairly straightforward web-based editor. It
works well enough, especially once one learns that it is necessary to
highlight a range of text before the small set of formatting options
becomes visible. The interface is mouse-heavy in general; there are a few
keyboard shortcuts, but they fall far short of a proper text editor.
Perhaps more interesting is the ability to insert "cards" to bring other
types of media into an article. They can be as simple as Markdown or HTML
text, but also extend to images, bookmarks, audio content, video content,
or things like a "call to action" or "product recommendation". Widgets
from many of the popular online services can be embedded as well.
Articles can be previewed and published in a straightforward manner; it is also possible to schedule the posting of an article for some later time. Posts can be fully public or reserved for subscribers. It is also possible to fine-tune how the article will appear when posted to various social-media sites. In general, Ghost has clearly been developed with an eye toward social-media propagation, with the ability to add all of the usual links to encourage readers to spread the word.
Ghost allows readers to post comments on articles, but that functionality is disabled by default. Comments seemingly can only be plain text, with no ability for any sort of fancier formatting.
Going deeper
Going beyond the simple posting of public articles requires more setup, including the creation of accounts with other providers. For example, Ghost is designed to send email newsletters, but it will not do so directly; instead, it requires the use of a paid Mailgun account. The documentation warns that sending large amounts of email directly will only result in the site being blacklisted, so Ghost doesn't even try. LWN's experience is that the situation is not quite that grim — but running an independent email system on today's net is certainly a challenge. In any case, there is no ability to use any other bulk-email service, but that appears to be a result of nobody having gotten around to implementing the integration rather than some sort of special deal.
Stripe, instead, is presented as the "exclusive" payment provider. The integration looks simple enough to configure, though your editor, lacking a Stripe account, was unable to try it out. Given the "exclusive" wording, it would be interesting to see what would happen if somebody submitted a pull request adding support for another provider. There is no support for PayPal in the base install, but that support can be added later.
The site dashboard provides information on how the site's content is doing,
including how many subscribers there are and how many of those have
"engaged" with the site recently. Information on individual subscribers
includes their activity on the site and, for better or for worse, whether
and when they read their email newsletters. Tracking images are used to
obtain that information. There does not appear to be a way for site owners
to control the acquisition or retention of this data, and users are not
asked to consent to its collection. It could, as a
result, be difficult at best to run a Ghost site in a way that is compliant
with regulations like the GDPR.
User management on Ghost is a bit strange in general, in that there are two entirely different types of accounts. Ordinary users have whatever access to content their subscriber level allows, but no other access. There are no passwords associated with these accounts; logging in is done by providing an email address, then clicking on the link sent to that address. "Staff" accounts, instead, have the (configurable) ability to create posts, publish them, modify other users' posts, or exercise administrative access to the site. They must be created by an existing administrative account, and access is controlled by a password. An attempt to log into a staff account via the normal login form will fail; one seemingly has to go to a privileged page then provide an email address and password. There is no way to turn a normal user account into a staff account.
Internally, Ghost is split between the content-management system and the web front end; it is possible to use the former without the latter. There is an API that can be used to access the back end... actually, there are two of them. The API that is easily found from the front page provides read-only access to data in the system. More serious work requires the admin API instead, which is rather harder to find. Among other things, the APIs are useful to implement a vast number of integrations with proprietary services across the net.
Community and closing
Ghost is certainly an active project, producing a release every few days. An Internet-facing application like this needs to prioritize security; the Ghost Security page says many of the right things in that regard. The last release that mentions a security issue is 4.15.1 from September 2021. The idea that an application of this nature has had no security problems in over a year seems like a bit of a stretch, but one never knows. Updating a Ghost installation appears to be a relatively easy process.
The 4.20.0 release happened on October 22, 2021 — almost exactly one year ago. Since then, the project has added 16,130 non-merge changesets from 261 developers, which is certainly a respectable level of activity. Over half of those patches come from a relatively small number of Ghost employees, as would be expected, but there is also clearly a long tail of contributors from outside the project. Ghost development appears to be centered on its public GitHub site, with company employees submitting pull requests there like everybody else.
In other words, Ghost may be a project that is dominated by a single company, but it appears to have a healthy community and to be developed in an open manner. This doesn't look like software that was tossed over the wall and forgotten about. It is worth noting that the project's contributor license agreement allows the Ghost Foundation to release contributions under a proprietary license — though the project's MIT license poses few obstacles to that in any case.
The writing of LWN's site code began in early 2002, with the subscription support added in a development frenzy by your highly motivated editor toward the end of that year. Twenty years ago, there were not really any alternatives to creating something from scratch; LWN, it seems, was ahead of its time. If we were starting today, the situation would be completely different; there are a number of options out there for people who are crazy enough to try to make a living writing on the Internet. It is not certain that Ghost would be chosen to host a site like LWN, but it would undoubtedly be on the short list of contenders.
Would you like signs with those chars?
Among the many quirks that make the C language so charming is the set of behaviors that it does not define; these include whether a char variable is a signed quantity or not. The distinction often does not make a difference, but there are exceptions. Kernel code, which runs on many different architectures, is where exceptions can certainly be found. A recent attempt to eliminate the uncertain signedness of char variables did not get far — at least not in the direction it originally attempted to go.

As a general rule, C integer types are signed unless specified otherwise; short, int, and long all work that way. But char, which is usually a single byte on current machines, is different; it can be signed or not, depending on whatever is most convenient to implement on any given architecture. On x86 systems, a char variable is signed unless declared as unsigned char. On Arm systems, though, char variables are unsigned (unless explicitly declared signed) instead.
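The practical difference is easy to demonstrate by interpreting the same byte both ways; this quick sketch uses Python's struct module, where the 'b' and 'B' format codes correspond to C's signed char and unsigned char:

```python
import struct

# The same byte, 0xff, interpreted as C's signed char ('b') and as
# unsigned char ('B').
raw = bytes([0xff])
(as_signed,) = struct.unpack('b', raw)
(as_unsigned,) = struct.unpack('B', raw)
print(as_signed, as_unsigned)   # -1 255
```

A bare char holding 0xff thus compares as less than zero on x86 but never on Arm — exactly the sort of difference that bites code storing, say, an error indicator in a char and testing it with "c < 0".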
The fact that a char variable may or may not be signed is an easy
thing for a developer to forget, especially if that developer's work is
focused on a single architecture. Thus, x86 developers can get into the
habit of thinking of char as always being signed and, as a result,
write code that will misbehave on some other systems. Jason Donenfeld
recently encountered this sort of bug and, after fixing it, posted a
patch meant to address this problem kernel-wide. In an attempt to
"just eliminate this particular variety of heisensigned bugs
entirely
", it added the
-fsigned-char flag to the compiler command line, forcing the bare
char type to be signed across all architectures.
This change turned out to not be popular. Segher Boessenkool pointed
out that it constitutes an ABI change, and could hurt performance on
systems that naturally want char to be unsigned. Linus Torvalds
agreed,
saying that: "We should just accept the standard wording, and be aware
that 'char' has indeterminate signedness". He disagreed, however, with
Boessenkool's suggestion to remove the -Wno-pointer-sign option
used now (thus enabling -Wpointer-sign warnings). That change
would enable a warning that results from the mixing of
pointers to signed and unsigned char types; Torvalds complained
that it fails to warn when using char variables, but produces
a lot of false positive warnings with correct code.
Later in the discussion, though, Torvalds wondered
whether it might be a good idea to nail down the signedness of
char variables after all — but to force them to be unsigned
by default rather than signed. That, he said, shouldn't generate worse
code on any of the commonly used architectures. "And I do think that
having odd architecture differences is generally a bad idea, and making the
language rules stricter to avoid differences is a good thing".
Amusingly, he noted that, with this option, code like:
const unsigned char *c = "Subscribe to LWN";
will still, with the -Wpointer-sign option, generate a warning,
since a string constant pointer is still considered to be a bare
char *
type, which is then treated as being different from an explicit
unsigned char * type. "You *really* can't win this thing. The game is rigged like some geeky carnival game".
Donenfeld saw
merit in the idea, even though he thinks that the potential to break
some code exists. He sent out a new
patch adding -funsigned-char to the compiler command line
to effect this change. He suggested that it could perhaps be merged immediately,
given that there is time to fix any fallout before the 6.1 release, but
Torvalds declined
that opportunity: "if we were still in the merge window, I'd probably apply this,
but as things stand, I think it should go into linux-next and cook
there for the next merge window". He added that any problems that
result from the change are likely to be subtle and to be in driver code
that isn't widely used across architectures. The core kernel code,
instead, has always had to work across architectures, so he does not
believe that problems will show up there.
So Donenfeld's patch is sitting in linux-next instead, waiting for the 6.2 merge window in December. That gives the community until late February to find any problems that might be caused by forcing bare char variables to be unsigned across all architectures supported by Linux. That is a fair amount of time, but it is also certainly not too soon to begin testing this change in as many different environments as possible. It is, after all, a fundamental change to the language in which the kernel is written; a lack of resulting surprises would, itself, be surprising.
One way to identify potential problems is to find the places where the generated code changes when char is forced to be unsigned. Torvalds has already made some efforts in that direction, and Kees Cook has used a system designed for checking reproducible builds to find a lot of changes. Many of those changes will turn out to be harmless, but the only way to know for sure is to actually look at them. Meanwhile, the posting of one fix by Alexey Dobriyan has caused Torvalds to request that the char fixes be collected into a single tree. As those fixes accumulate, the result should be a sign of just how much disruption this change is actually going to cause.
More flexible memory access for BPF programs
All memory accesses in a BPF program are statically checked for safety using the verifier, which analyzes the program in its entirety before allowing it to run. While this allows BPF programs to safely run in kernel space, it restricts how that program is able to use pointers. Until recently, one such constraint was that the size of a memory region referenced by a pointer in a BPF program must be statically known when a BPF program is loaded. A recent patch set by Joanne Koong enhances BPF to support loading programs with pointers to dynamically sized memory regions.
Verifying kernel pointers in BPF programs
In order to safely load a BPF program, the verifier must validate that no memory access in the program will ever crash the kernel. This is a complex task, as "memory" can refer to a variety of different contexts in a program. For example, some pointers may reference the BPF program's stack, whereas other pointers, such as kptrs, may reference a structure that was passed from the main kernel via a kfunc. Both of these types of pointers have different scenarios in which an access would or would not be safe, and thus require separate logic in the verifier to ensure that any accesses to them are safe. For the stack pointer, the verifier needs to ensure that the offset of any read is within the program's active stack region. For kptrs returned from a kfunc, the verifier must ensure that the offset of any read is within the bounds of the structure as specified by the structure's BPF Type Format (BTF) information (write accesses are much more strictly controlled).
Yet, while the bounds of these two different memory regions may differ, they both require that all reads to them take place at static offsets in order for the verifier to be able to ensure that the access is safe. This restriction, of course, precludes any use cases requiring a pointer to a dynamically sized data region. For example, the BPF ring-buffer map type allows BPF programs to write entries into a ring buffer for consumption by user space. If all memory references need to be statically known at load time, the BPF program would only be able to write entries whose sizes were statically known when the program was loaded. It would be useful, however, to be able to write entries whose sizes can be specified dynamically at run time.
dynptrs – Referencing dynamically sized memory
Koong's patch set adds support for accessing dynamically sized regions of memory in BPF programs with a new feature called dynptrs. The main idea behind dynptrs is to associate a pointer to a dynamically sized data region with metadata that is used by the verifier and some BPF helper functions to ensure that accesses to the region are valid. Koong's patch set creates this association in a newly defined type called struct bpf_dynptr. This structure is opaque to BPF programs; within the kernel it is represented by:
    /* the implementation of the opaque uapi struct bpf_dynptr */
    struct bpf_dynptr_kern {
            void *data;
            u32 size;
            u32 offset;
    } __aligned(8);
The size of the dynamic region is stored in a 32-bit, unsigned integer, with the upper eight bits being reserved for metadata about the dynptr itself. The highest-order bit specifies whether the dynptr is read-only, and the next seven highest-order bits describe the type of memory that is referenced by the dynptr. This leaves 24 bits for the size, implying that a dynptr can point to a region no larger than 16MB. The patch set adds support for two types of dynptrs: BPF_DYNPTR_TYPE_LOCAL, which points to memory that is local to the program such as a map value, and BPF_DYNPTR_TYPE_RINGBUF, which points to data in a BPF_MAP_TYPE_RINGBUF map.
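The bit layout described above can be modeled in a few lines; the constant names here are this article's invention, following the description (a 24-bit size, seven type bits, and a read-only bit) rather than the kernel's actual macros:

```python
# Model of the bpf_dynptr_kern size field: the low 24 bits hold the
# region size, bits 24-30 encode the dynptr type, and bit 31 is the
# read-only flag.  Names are illustrative, not the kernel's.
SIZE_MASK = (1 << 24) - 1          # low 24 bits: size, max 16MB - 1
TYPE_SHIFT = 24
RDONLY_BIT = 1 << 31

def pack_size(size, dynptr_type, read_only=False):
    assert 0 <= size <= SIZE_MASK
    assert 0 <= dynptr_type < (1 << 7)
    word = size | (dynptr_type << TYPE_SHIFT)
    if read_only:
        word |= RDONLY_BIT
    return word

def unpack_size(word):
    return (word & SIZE_MASK,
            (word >> TYPE_SHIFT) & 0x7f,
            bool(word & RDONLY_BIT))
```

Packing the metadata into the size field keeps struct bpf_dynptr_kern at sixteen bytes, at the cost of capping a dynptr's region at just under 16MB.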
Dynptrs are created and accessed using a series of helper functions. A dynptr may be read using the bpf_dynptr_read() helper, or written using bpf_dynptr_write() for writeable dynptrs. bpf_dynptr_read() will copy memory from the dynptr data region into a buffer specified by the calling program, whereas bpf_dynptr_write() will copy data from a program buffer into the dynptr data region. Before performing the copy, the helper functions verify that the proposed length and offsets refer to a valid part of the dynptr memory region. If the user requires direct access to the memory region contained in the dynptr, they can use the bpf_dynptr_data() helper though, in this case, the size of the memory area being requested must be static so that the verifier can ensure that any accesses to it are valid.
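The check that the read and write helpers perform before copying amounts to a range comparison against the dynptr's recorded size; a sketch of that logic (not the kernel's code, which must also guard against integer overflow in C):

```python
# Sketch of the range check that bpf_dynptr_read()/bpf_dynptr_write()
# must make before copying: the access has to lie entirely within the
# dynptr's data region.
def dynptr_access_ok(region_size, offset, length):
    if offset < 0 or length < 0:
        return False
    return offset + length <= region_size
```

Because this check happens at run time inside the helper, the offset and length no longer need to be constants known to the verifier, which is the whole point of the dynptr abstraction.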
Local memory dynptrs
BPF_DYNPTR_TYPE_LOCAL, or local dynptr support, is added by the second patch of the series. "Local memory" in BPF can refer to several different types of memory used by a program, including, for example, map values, map keys, and stack memory. Koong's patch set allows local dynptrs to be created via a new helper function called bpf_dynptr_from_mem(). Despite the existence of a wide variety of local memory types, the initial patch set only adds support for creating local dynptrs to a map value. This restriction is presumably because the verifier already provides a guarantee to helper functions that receive a pointer to a map value that it will be properly initialized and sized, thus allowing the initial implementation of dynptrs to be as simple as possible.
Other local memory types could be supported in the future as well, though each of these memory types would require additional logic in the verifier for validating the input arguments to bpf_dynptr_from_mem(). While there was no indication in the patch series about when (or whether) other types of local memory will be added, it seems prudent to add support for them so as to provide a more consistent experience in using the API. In the initial implementation, a user will have no way of knowing that a local dynptr only supports map values until their program is rejected by the verifier.
Dynamically sized ring-buffer entries using dynptrs
As mentioned above, the static-sizing constraint forces the size of ring-buffer entries published by the kernel in BPF_MAP_TYPE_RINGBUF maps to be statically known when the program is loaded. To address this problem, Koong included a patch that defines a new BPF_DYNPTR_TYPE_RINGBUF type of dynptr. The patch includes the bpf_ringbuf_reserve_dynptr() helper function for reserving a dynamically sized ring-buffer entry, as well as bpf_ringbuf_submit_dynptr() and bpf_ringbuf_discard_dynptr() for posting the entries to the ring buffer or discarding them respectively. These APIs closely match the existing APIs for reserving and posting statically sized ring-buffer entries.
Dynptrs are also used by the new BPF_MAP_TYPE_USER_RINGBUF map type that I wrote, which was recently merged into bpf-next. This map type, which allows user space to publish ring-buffer entries to BPF programs, provides a bpf_user_ringbuf_drain() helper function that allows a BPF program to consume entries from the ring buffer, and invoke some specified callback on each of those entries. This callback receives a dynptr to the ring-buffer entry as its first argument. In order to read the entries, the BPF program can simply use bpf_dynptr_read() or bpf_dynptr_data(), as described above.
Holding off on a kmalloc() type dynptr
One thing to note is that none of the above supported dynptr types refer to memory that was allocated via kmalloc(). This might seem like an obvious use case at first glance, and it was in fact proposed in an earlier version of the patch set via a BPF_DYNPTR_TYPE_MALLOC dynptr type. The type was eventually dropped, however, following discussions that revealed some subtle, yet fundamental, issues that would need to be addressed before it could be supported.
For example, in response to the patch set, Daniel Borkmann raised the question of which memory control group (memcg) should be charged for the allocated memory. This point is relevant; allocations in the kernel that are done on behalf of a user-space process need to be charged to the memcg containing the allocating process. But identifying that process is not always straightforward. The memcg of the process that loaded the program would seem to fit that profile, but as Alexei Starovoitov pointed out, that process (and its memcg) do not necessarily persist after the program has been loaded.
Another question, posed by Starovoitov, is whether memory allocated by a BPF program should be charged to a memcg at all. Most kmalloc() calls in the kernel are not charged in this way and, as was reinforced at the 2022 Linux Kernel Maintainers Summit, BPF programs are instances of kernel programs, not user programs. Borkmann responded that, perhaps, the solution was to allow users to specify the memcg to charge explicitly, rather than implicitly relying on the loading task's memcg as BPF currently does. This would involve the user obtaining a file descriptor to a memcg and passing it to the kernel when a program is loaded. If no descriptor is passed, the default behavior would be to not charge the memory to any memcg. This suggestion was well received by both Starovoitov and Andrii Nakryiko, though the conversation tapered off without a firm conclusion, and Koong eventually sent a follow-on patch that replaced BPF_DYNPTR_TYPE_MALLOC with BPF_DYNPTR_TYPE_LOCAL.
The ability to dynamically allocate from BPF programs is an interesting prospect, so it seems likely that the feature will be revisited once the solution for memory accounting has been clarified.
Ongoing work with dynptrs
Work is currently ongoing that adds new dynptr types to support further BPF use cases in the networking stack. In one recent patch set, Koong proposes adding two new types of dynptrs, one whose underlying memory region contains a socket buffer, and the other whose memory region contains an eXpress Data Path (XDP) buffer. The benefits of these dynptr types are the same for both types of buffers, with the main one being that it allows BPF programs to use more ergonomic APIs for reading and mutating memory in the buffers.
Consider, for example, a user who wants to parse a type-length-value (TLV) header in a TCP packet contained in a struct xdp_md buffer. This structure contains data and data_end fields that represent the start and end of the packet's data region, respectively. A TLV header encodes a type, followed by the length of the header value, and then the value itself, which occupies that specified length. Because the length field can vary at run time between different packets and header types, iterating over the headers in a packet requires non-static pointer offsets. Without dynptrs, a user would have to code explicit checks for every single read of the packet header to ensure that it fits within the data and data_end fields of the xdp_md buffer. With dynptrs, getting a pointer to the next TLV is simply a matter of calling bpf_dynptr_data() with an offset calculated from the prior TLV's length field, and then checking that the pointer received from the helper is non-NULL.
While this doesn't enable entirely new use cases, it does address a significant usability concern in BPF networking programs that is a frequent source of complaints. Additionally, it makes the generated BPF program code more robust to changes in Clang and LLVM, which can sometimes cause the verifier to reject a previously safe program.
So far, the patches haven't received any strong pushback, and it seems unlikely that they will. At this time, yet another patch set has also been submitted upstream adding even more dynptr helper functions. Those functions may be the subject of another article in the future.
Accessing QEMU storage features without a VM
The QEMU emulator has a sizable set of storage features, including disk-image file formats like qcow2, snapshots, incremental backup, and storage migration, which are available to virtual machines. This software-defined storage functionality that is available inside QEMU has not been easily accessible outside of it, however. Kevin Wolf and Stefano Garzarella presented at KVM Forum 2022 on the new qemu-storage-daemon program and the libblkio library that make QEMU's storage functionality available even when the goal is not to run a virtual machine (VM).
Like the Linux kernel, QEMU has a block layer that supports disk I/O, which it performs on behalf of the VM and supports additional features like throttling while doing so. The virtual disks that VMs see are backed by disk images. Typically they are files or block devices, but they can also be network storage. Numerous disk-image file formats exist for VMs and QEMU supports them, with its native qcow2 format being one of the most widely used. The QEMU block layer also includes long-running background operations called blockjobs for migrating, mirroring, and merging disk images.
The QEMU process model
Wolf began by describing QEMU's process model, where each VM is a separate QEMU process, complete with its own block layer, which can be seen in the diagram below. A JSON-RPC-like management interface called QEMU Monitor Protocol (QMP) provides a multitude of commands for manipulating disk images while the QEMU process is running. QMP commands allow storage migration, incremental backups, and so on. One catch is that the QEMU process must be running, which makes it difficult to use the QMP commands while the VM is shut down.
Another limitation of the one-VM-per-QEMU-process model is that disk images can only be shared read-only between VMs to avoid the data corruption that occurs when multiple QEMU processes update a shared disk image without coordination. This problem is relevant when several VMs are created from the same template "backing files". Those backing files must remain unmodified as long as two or more VMs are sharing them.
For these reasons, disk-image functionality in QEMU has been largely limited to active VMs and a few specific tools (qemu-img and qemu-nbd) until now.
qemu-storage-daemon
The new qemu-storage-daemon program makes disk-image functionality available outside the confines of the one-VM-per-QEMU-process model. qemu-storage-daemon runs as a separate process without any VM at all and offers the same QMP commands for manipulating disk images as previously found only in QEMU. qemu-storage-daemon can act as a server to export disk images for clients including, but not limited to, QEMU VMs.
Wolf described two ways of thinking about qemu-storage-daemon. It can be seen as an advanced qemu-nbd that supports QMP monitor commands and additional export types. Alternatively, it can be seen as QEMU without the ability to run a VM. Both the command line and the available QMP commands closely resemble those of QEMU.
The following qemu-storage-daemon command serves a Network Block Device (NBD) export of the raw image file test.raw so that the disk image can be read over the network:
    $ qemu-storage-daemon \
        --nbd-server addr.type=inet,addr.host=0,addr.port=10809 \
        --blockdev file,filename=test.raw,node-name=disk \
        --export nbd,id=exp0,node-name=disk
Several use cases for qemu-storage-daemon were presented. Separating storage from the actual running of VMs makes sense when the two are managed independently. A storage-management tool should not need access to a VM's QMP interface and a VM-management tool should not need access to qemu-storage-daemon's QMP interface. Furthermore, separating storage makes it possible to apply tighter sandboxing to both the QEMU VM and qemu-storage-daemon so that a security compromise in one of these programs is less likely to affect other parts of the system.
Running qemu-storage-daemon as the sole process on the system that accesses a disk image unlocks use cases that were impossible with the one-VM-per-QEMU-process model. As seen in the diagram above, QEMU processes can be connected to qemu-storage-daemon so that VMs access the disk image through the daemon instead of directly from the QEMU process. It then becomes possible to modify backing files used from multiple VMs by using qemu-storage-daemon so that there is effectively only one process in the system with write access to the shared backing files.
Users who were unable to perform certain disk-image operations while the VM was shut down can now launch qemu-storage-daemon to cover that situation. While the VM is running, it accesses the disk image through qemu-storage-daemon so that there is no conflict between the running VM and activity taking place inside qemu-storage-daemon. When the VM is shut down, qemu-storage-daemon can still service requests to manipulate the disk image. This makes long-running operations like committing backing files safe across VM shutdown, which is useful because the commit operation might be performed by the cloud provider while the VM shutdown is performed independently by an end user. Should the VM be started again before the commit operation finishes, it still accesses its disk through qemu-storage-daemon and the commit operation continues to make progress.
The one-VM-per-QEMU-process model also has limitations when polling is enabled to increase performance. On a machine with many QEMU processes, each one performs its own polling and this consumes CPU time. qemu-storage-daemon can consolidate I/O processing into a single process that polls for multiple VMs, leaving more CPUs available for running VMs.
Another use case arises when QEMU's user-space NVMe PCI driver is used to squeeze the most performance out of a device. The NVMe device can only be accessed by one process, so normally only one running VM can have it open. If qemu-storage-daemon runs the user-space NVMe PCI driver instead of QEMU, then multiple VMs can connect to it and a single NVMe device can be sliced up into smaller virtual devices for the VMs.
Perhaps the most interesting use case is that qemu-storage-daemon makes QEMU's storage functionality available to other applications besides just QEMU. Backup applications, forensics tools, and other programs can use qemu-storage-daemon to access disk images, manipulate them, and take snapshots. Initially released in QEMU 5.0.0, qemu-storage-daemon can be found in the qemu-system-common package in Debian-based distributions and the qemu-img package in Fedora-based distributions.
Commands for taking snapshots, adding/removing exports at run time, managing dirty bitmaps for incremental backups, and more can be sent over a Unix domain socket using the QMP protocol. QMP is a control channel and not suitable for actually accessing the contents of disk images or dirty bitmaps. Instead, qemu-storage-daemon offers several ways to connect to disk images through its export types.
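As an illustration of the control channel, qemu-storage-daemon can be given a QMP monitor on a Unix domain socket at startup (the socket path, file names, and node names here are hypothetical):

```shell
$ qemu-storage-daemon \
    --blockdev file,filename=test.qcow2,node-name=file0 \
    --blockdev qcow2,file=file0,node-name=disk \
    --chardev socket,path=/tmp/qsd.sock,server=on,wait=off,id=char0 \
    --monitor chardev=char0
```

A client connecting to /tmp/qsd.sock then speaks ordinary QMP; for example, after the initial {"execute": "qmp_capabilities"} handshake, it could issue {"execute": "query-named-block-nodes"} to inspect the configured block nodes.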
Block export types
The Network Block Device (NBD) protocol has a long history in Linux as a fairly simple way to access block devices over the network. Given that QEMU already contains an NBD server and qemu-nbd tool, it's no surprise that qemu-storage-daemon can export disk images via NBD. Programs can connect directly and there is also a Linux kernel driver that attaches NBD exports as block devices.
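For example, given the NBD export from the command line shown earlier (assuming the NBD export name defaults to the node name, disk, when no explicit name is given), a client could inspect the image remotely with qemu-img:

```shell
$ qemu-img info nbd://localhost:10809/disk
```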
A Linux Filesystem in Userspace (FUSE) export type is also available in qemu-storage-daemon. The mounted FUSE filesystem looks like a regular file, but the underlying storage is actually a qcow2 file. qemu-storage-daemon handles the qcow2 file-format specifics so that it appears like a raw file that programs like fdisk, dd, and others know how to access. At this point, the implementation is synchronous and therefore it does not perform as well as other export types. Wolf mentioned that the FUSE export type offers an easy way to present a disk image as a raw file to programs that can only access regular files.
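A FUSE export is configured much like an NBD one; this sketch (file names hypothetical) mounts a qcow2 image over an empty regular file so that ordinary tools see raw data:

```shell
$ touch /tmp/image-raw
$ qemu-storage-daemon \
    --blockdev file,filename=test.qcow2,node-name=file0 \
    --blockdev qcow2,file=file0,node-name=disk \
    --export fuse,id=exp0,node-name=disk,mountpoint=/tmp/image-raw,writable=on
$ fdisk -l /tmp/image-raw
```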
The vhost-user-blk export type is a Unix domain socket protocol that QEMU supports. Unlike NBD, it does not work over the network, but it takes advantage of shared memory, so qemu-storage-daemon can transfer data directly between the disk image and guest RAM. This makes vhost-user-blk the natural choice for connecting QEMU VMs to qemu-storage-daemon, as it is the most efficient export type. Other applications can also use this export type through the new libblkio library that was introduced later in the talk.
The vDPA Device in Userspace (VDUSE) export type processes I/O requests from the relatively new vDPA driver framework in the kernel. When the virtio_vdpa kernel module is loaded, the export appears as a virtio_blk device that can be used like any other Linux block device. qemu-storage-daemon acts as the user-space server for the vdpa-blk device, similar to the way it can act as a FUSE filesystem server. Alternatively, when the vhost_vdpa kernel module is loaded on the host, the export appears as a vhost device that can be added to QEMU VMs as a virtio-blk device. The VDUSE export type therefore serves the dual purpose of exposing storage both to the host and to VMs. There is some overlap in functionality with the other export types here; those who need VDUSE will know they need it, while others are likely to stick with the more traditional export types.
libblkio
While qemu-storage-daemon provides the server, the libblkio library offers a client API for efficiently accessing disk images. Since implementing vhost-user-blk and other protocols for accessing qemu-storage-daemon exports is involved, it's handy to have a library that provides this functionality and saves applications from having to duplicate it.
The libblkio 1.0 release includes drivers that use Linux io_uring for file I/O, NVMe io_uring command passthrough (primarily useful for NVMe benchmarking), and virtio-blk (vhost-user and vhost-vdpa) for connecting to qemu-storage-daemon and accessing vdpa-blk devices. This selection of drivers allows the library to be used both for connecting to qemu-storage-daemon and for directly accessing files or NVMe devices.
A full overview of libblkio was left for another KVM Forum talk; YouTube video and slides are available for those wishing to learn more. The main message, however, was that applications wishing to use qemu-storage-daemon can use libblkio to connect via vhost-user-blk. Packages of the library are not yet as widely available as qemu-storage-daemon, but that situation should improve over time.
Conclusion
QEMU's process model has made certain configurations hard to achieve, but qemu-storage-daemon offers a dedicated process for storage functionality that augments the traditional QEMU process model, which reduces those problems greatly. Furthermore, qemu-storage-daemon exposes QEMU's array of storage features to any program wishing to use them, even where VMs are not involved. libblkio offers the client side of the qemu-storage-daemon picture and allows programs to connect to storage. Like qemu-storage-daemon, libblkio is used by QEMU but is designed for general use by other programs unrelated to QEMU.
Where qemu-storage-daemon and libblkio will be used besides QEMU remains to be seen, but extracting functionality from QEMU and making it available for external consumption has opened the door to new developments in this area.
A YouTube video of the presentation, as well as the slides, are available for those looking for further information.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Netfilter workshop; Ubuntu 22.10; UKI; Python 3.11; RIP Wolfgang Denk; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.