Leading items
Welcome to the LWN.net Weekly Edition for May 20, 2021
This edition contains the following feature content:
- A bunch of releases from the Pallets projects: Flask is only the beginning of this set of modules for web applications.
- Calling kernel functions from BPF: the line between BPF and the rest of the kernel grows thinner.
- Sticky groups in the shadows: making negative group permissions really work properly in user namespaces.
- Exported-symbol changes in 5.13: what the changes in the module interface show about this development cycle.
- The misc control group: a new controller for simple, countable resources.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A bunch of releases from the Pallets projects
May 11 marked a new major release for the Python-based Flask web microframework project, but Flask 2.0 was only part of the story. While the framework may be the most visible piece, it is one of a small handful of cooperating libraries that provide solutions for various web-development tasks; all are incorporated into the Pallets projects organization. For the first time, all six libraries that make up Pallets were released at the same time and each had a new major version number. In part, that new major version indicated that Python 2 support was being left behind, but there is plenty more that went into the coordinated release.
Pallets
While Flask is pretty well-known and has even been written about here before, the Pallets umbrella organization has flown a bit under the radar, at least for me. The Jinja2 template engine, a Pallets component that is used by Flask, is also fairly high-profile, but the other pieces of the puzzle are less so. The only other Pallets library I had heard of was the Werkzeug library for supporting Web Server Gateway Interface (WSGI) applications. It is used to connect Flask applications to web servers.
There are three more libraries on the pallet, but those are smaller and more specialized: MarkupSafe, which provides a text object that escapes characters interpreted by HTML; ItsDangerous, which provides helpers to cryptographically sign data that will be moved between trusted and untrusted environments; and the Command Line Interface Creation Kit, or Click, which is used for "creating beautiful command line interfaces in a composable way with as little code as necessary".
The coordinated release was announced on the Pallets blog; it is based on two years of work, though there have been other fairly substantial releases in that time span (e.g. Flask 1.1 in July 2019, Jinja 2.11 in January 2020, Werkzeug 1.0 in February 2020). Beyond Flask 2.0, which was mentioned above, the release also included Werkzeug 2.0, Jinja 3.0, Click 8.0, ItsDangerous 2.0, and MarkupSafe 2.0.
All of the projects now only support Python 3.6 and above, which was something that Pallets had announced back at the end of 2019. "Removing the compatibility code makes the code faster, as well as easier to maintain and contribute to." Another cross-release feature is type annotations, which have been added throughout the libraries. Beyond that, various tools have been used to enforce a consistent style on the entire code base.
Changes
Flask now supports asynchronous views and error handlers, so those functions can be defined with async def and Flask will run them in a separate thread. It is not entirely clear how much additional support for async will be added to Flask, as the Quart project already provides an asynchronous web framework with the Flask API. Quart developer Philip Jones is one of the Pallets maintainers; he wrote a blog post about the subject and filed the GitHub issue on async for Flask back in 2019.
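As a rough illustration of the mechanism, a synchronous framework can run an "async def" view by executing the coroutine to completion on an event loop and handing back the result. The sketch below uses only the standard library and is not Flask's actual implementation (which relies on the asgiref package); the helper and view names are invented:

```python
# A minimal sketch of running an "async def" view from synchronous
# code: execute the coroutine on its own event loop in a worker thread
# and wait for the result. (Illustrative stand-in, not Flask's code.)
import asyncio
import threading

def run_async_view(coro_func, *args, **kwargs):
    """Run an async view function to completion from synchronous code."""
    result = {}

    def worker():
        # Each call gets a fresh event loop in a dedicated thread.
        result["value"] = asyncio.run(coro_func(*args, **kwargs))

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]

# A hypothetical async "view":
async def hello(name):
    await asyncio.sleep(0)          # stand-in for real async I/O
    return f"Hello, {name}!"

print(run_async_view(hello, "world"))   # -> Hello, world!
```

The important point is that an async view still occupies a worker for its full duration, which is why Quart remains the better choice for applications that are async through and through.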
Blueprints can now be nested in Flask 2.0, which affords more flexibility in organizing a web application. The development server (i.e. "flask run") has better error handling and no longer defers errors that are detected when it starts up. The "flask shell", which provides a Python read-eval-print loop (REPL) within the context of the Flask application, now has tab completion when Readline is available.
The highlights for Werkzeug 2.0 include several features that make it more flexible for supporting async. Local variables are managed with the ContextVar type so that they can be shared between coroutines and not just threads. The Werkzeug Request and Response classes have been refactored to remove the BaseRequest/Response parent classes and move all of the mixins into those (now) base classes. In addition, a new API is being created that removes the WSGI- or I/O-dependent code from those classes. "This will allow us to better support sync and async use cases in the future." In particular, it will allow Quart and other Asynchronous Server Gateway Interface (ASGI) frameworks to use Werkzeug.
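The ContextVar behavior that makes this possible can be seen with the standard library alone: each asyncio task runs with its own copy of the context, so a value set in one coroutine is invisible to another running concurrently in the same thread. This is an illustration of the principle, not Werkzeug's actual code; the variable name is invented:

```python
# Each asyncio task gets its own copy of the context, so per-request
# state stored in a ContextVar stays isolated between coroutines even
# though they share a thread.
import asyncio
from contextvars import ContextVar

current_request: ContextVar[str] = ContextVar("current_request")

async def handle(name):
    current_request.set(name)       # per-task value
    await asyncio.sleep(0)          # yield to the other task
    return current_request.get()    # still sees its own value

async def main():
    return await asyncio.gather(handle("req-1"), handle("req-2"))

print(asyncio.run(main()))          # -> ['req-1', 'req-2']
```

A plain module-level variable (or threading.local, when everything runs in one thread) would be overwritten by whichever coroutine set it last.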
In addition, many of the datetime objects returned from Werkzeug are now time-zone-aware. The parsing of multipart/form-data, which is used for file uploads, has been improved significantly, leading to up to 15x better performance. The URL routing now understands WebSocket schemes (ws:// and wss://); the library does not have support for WebSocket directly, but this feature will allow projects to use the Werkzeug routing.
Support for async environments and rendering no longer requires patching Jinja with the release of version 3.0. In addition, the NativeEnvironment, which allows templates to produce native Python types rather than only strings, has also been fixed to support async environments. Blocks in templates can now be marked as required, which means they must be defined somewhere in the template hierarchy. Translation contexts (via pgettext and npgettext) can now be used to determine translation strings in the i18n extension.
Click 8.0 brings a bunch of improvements to argument handling, which makes sense for a toolkit for building command-line tools. The tab-completion feature has been completely rewritten in order to allow each "command, group, parameter, and type to provide custom completion"; completion has other improvements as well. Support has been added for colors specified by the 256-color palette or using RGB values, as has the ability to do italics, strike-through, and other text effects. Help-text formatting has been improved and messages for users can now be translated.
ItsDangerous, which provides an easy-to-use mechanism to generate signatures for things like cookie values, has added support for key rotation in version 2.0. A list of keys can be passed, from oldest to newest; the newest will be used for signing, while all of the keys will be tried when verifying the signature. It has also made its datetime objects time-zone-aware. MarkupSafe 2.0 added a whole slew of binary wheels for various different combinations of Python version, operating system, and architecture—33 in all.
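The rotation scheme is easy to picture with a standard-library sketch. This uses plain HMAC rather than ItsDangerous's own serializers, and the key values are invented; the point is only the oldest-to-newest convention: sign with the last key in the list, but accept signatures made with any of them.

```python
# Key rotation in miniature: sign with the newest key, verify against
# every known key. (Plain hmac, not the ItsDangerous API itself.)
import hashlib
import hmac

KEYS = [b"older-key", b"old-key", b"newest-key"]  # oldest ... newest

def sign(data: bytes) -> bytes:
    return hmac.new(KEYS[-1], data, hashlib.sha256).digest()

def verify(data: bytes, sig: bytes) -> bool:
    return any(
        hmac.compare_digest(hmac.new(k, data, hashlib.sha256).digest(), sig)
        for k in KEYS
    )

# A signature produced before rotation, with an old key, still verifies:
old_sig = hmac.new(b"old-key", b"value", hashlib.sha256).digest()
print(verify(b"value", old_sig))    # -> True
```

Once all outstanding signatures have expired, the old keys can simply be dropped from the list.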
This overview just touches on some of the changes in these libraries that caught my eye. Looking at the announcement in more detail, as well as the highly detailed changelogs, will fill in lots of other details on improvements that have been made.
History
Beyond providing a nice set of tools that can be used in the development of web applications, the Pallets projects have another thing in common: they all were started by Armin Ronacher, who has been a prolific contributor in the Python community. He introduced the Pallets projects in a blog post exactly six years after he released an April Fools joke that turned into Flask:
On the first of April 2010, I released a joke microframework called denied which made fun of the fact that all microframeworks at the time decided to forgo with dependencies and bundle up everything they need in a single Python file. What I did was embed all of Jinja2 and Werkzeug in a base64 encoded zip file within the framework's only Python file. The response to it was interesting in a few ways because on the one hand quite a few people did not really understand that it was an April fools joke to begin with and on the other, there was a discussion about why there were no microframeworks that actually did use dependencies and encouraged it.
One month later there was a new project by the name of "Flask" which actually gave this concept a real shot. It launched with the tagline "a microframework for Python based on Werkzeug, Jinja 2 and good intentions" and six years later it's the most starred Python framework on GitHub.
Ronacher is still part of Pallets today, along with the three people he started it with and others, like Jones, who have joined since. Those seven listed people are, naturally, backed up by lots of others: "The total list of people involved is much larger as they consist of countless of contributions of many individuals over the years." As with pretty much every project (or set of projects) out there, Pallets would be happy to have more contributors should any of the libraries pique the interest of any readers.
Calling kernel functions from BPF
The kernel's BPF virtual machine allows programs loaded from user space to be safely run in the kernel's context. That functionality would be of limited use, however, without the ability for those programs to interact with the rest of the kernel. The interface between BPF and the kernel has been kept narrow for a number of good reasons, including safety and keeping the kernel in control of the system. The 5.13 kernel, though, contains a feature that could, over time, widen that interface considerably: the ability to directly call kernel functions from BPF programs.
The immediate driver for this functionality is the implementation of TCP congestion-control algorithms in BPF, a capability that was added to the 5.6 kernel release by Martin KaFai Lau. Actual congestion-control implementations in BPF turned out to reimplement a number of functions that already exist in the kernel, which seems less than fully optimal; it would be better to just use the existing functions in the kernel if possible. The new function-calling mechanism — also implemented by Lau — makes that possible.
Making functions available to BPF
On the BPF side, using a kernel function is now just a matter of declaring it extern and calling it like any other C function. Within the kernel, instead, a bit more work has to be done. BPF programs are meant to only have access to a specific set of allowed functions, and that set is only available to the intended BPF program type; code inside the kernel must thus make those functions available in the right context. So, for example, this commit makes tcp_slow_start() available to BPF — but only for congestion-control programs.
"Exporting" functions to BPF programs is done by adding a new function to the bpf_verifier_ops structure associated with the program type:
bool (*check_kfunc_call)(u32 kfunc_btf_id);
This function will be called by the BPF verifier when it encounters an external call; kfunc_btf_id is the BPF type format (BTF) ID assigned to the function that the BPF program wants to call. The function should return true if the call should be allowed. If tcp_slow_start() were the only function to be made available in this way, that function could be written as:
static bool bpf_tcp_ca_check_kfunc_call(u32 id)
{
    return id == BTF_ID(func, tcp_slow_start);
}
If there are many functions to export, there are easier ways than a long list of if statements to do the checking; see the above-linked commit for an example.
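The idea reduces to a set-membership test. A Python sketch of what a check function for the congestion-control program type might look like (the BTF ID numbers here are made up for illustration; the kernel builds the real set at compile time with BTF ID macros):

```python
# The verifier asks whether the BTF ID of the called function is in
# the set registered for this program type. IDs below are invented.
TCP_CA_KFUNC_IDS = frozenset({
    1001,   # e.g. tcp_slow_start
    1002,   # e.g. tcp_cong_avoid_ai
    1003,   # e.g. tcp_reno_cong_avoid
})

def check_kfunc_call(kfunc_btf_id: int) -> bool:
    return kfunc_btf_id in TCP_CA_KFUNC_IDS

print(check_kfunc_call(1001), check_kfunc_call(9999))  # -> True False
```

A set lookup also scales better than a chain of comparisons as the list of exported functions grows.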
Beyond checking that the function is meant to be available, the BPF verifier carries out a number of other checks. For example, the arguments passed to the function and their types must be correct, or the program will be rejected. The call is only allowed if the verifier can convince itself that it is safe, though the verifier obviously cannot really know what is going on inside the called function or the ways in which things could go wrong.
Some questions
So far, congestion-control programs are the only program type to make use of this feature, but it is not hard to imagine that others will come in the future. There are a number of interesting questions that are raised by this capability and how it might be used going forward.
The first of those might be: how does this capability differ from the BPF helper mechanism that has been part of BPF for years? The changelog does not address that question, so your editor has to guess. BPF helper functions must be written explicitly for use from BPF programs, must be declared specially, and require a bpf_func_proto structure to be filled in and made available to the verifier; see the setup for bpf_map_lookup_elem() for an example. Making an existing kernel function available as a BPF helper means writing a wrapper function, then going through this whole dance.
To make a kernel function callable, instead, is just a matter of defining a "check" function that allows the call to happen, and the BPF subsystem does the rest. One could argue that helpers should have been implemented that way in the first place, but there is a lot of necessary infrastructure that only showed up years after the helper mechanism was developed. Without BTF, this would not be possible; the BPF Linux security module (formerly KRSI) also brought some of the necessary support. Had that infrastructure existed at the beginning, it's possible that there would never have been a need to add BPF helpers.
That said, BPF helpers have the advantage of existing solely for use by BPF programs; kernel functions are there to be called by the rest of the kernel. There is no stable ABI within the kernel, so it would not be surprising to see the interface to BPF-exported kernel functions change more often than the interface to BPF helpers. The commit adding the function-calling capability makes a clear statement that there are no ABI guarantees:
The white listed functions are not bounded to a fixed ABI contract. Those functions have already been used by the existing kernel tcp-cc. If any of them has changed, both in-tree and out-of-tree kernel tcp-cc implementations have to be changed.
It will be interesting to see what happens if an internal kernel change breaks a high-profile BPF program and users start to complain. It is generally understood that functionality provided to BPF is not part of the kernel ABI, but that policy has never been explicitly blessed by Linus Torvalds or seriously tested.
BPF helpers are also designed to be safely called from the BPF context — from outside of the kernel itself, in other words. Regular kernel functions are not written with a possibly hostile caller in mind. The BPF subsystem as a whole goes to great lengths to ensure that a BPF program cannot crash or compromise the system, but that subsystem cannot know what happens inside some kernel function and cannot guarantee that the arguments to any given function call make sense. If the wrong functions are made available to BPF, an erroneous or hostile program could use them to make an unpleasant mess.
Finally, this mechanism looks a bit like a backdoor way to export kernel symbols outside of the kernel itself. The exporting of symbols to modules requires an EXPORT_SYMBOL() declaration next to the relevant code and often attracts a fair amount of attention and debate over whether kernel internals should be exposed in that way. Exporting of functions to BPF programs is a lower-profile activity that can happen far away from the definition of the functions involved. In an extreme case, there does not appear to be anything to prevent somebody from registering a checking function like this:
static bool export_the_world(u32 kfunc_btf_id)
{
    return true;
}
The result of adding this function would be to make almost any kernel function callable from a BPF program of the right type. That is unlikely to be seen as a good outcome. In theory such a function would be caught in review, but it is worth asking how many people have reviewed the test code for function calls from BPF that has been added (as part of this patch series) to the (entirely unrelated) traffic-control classifier program type; this (harmless) code will be present in all systems with traffic control enabled. It does not seem that it would be hard to add a severe bug, intentionally or otherwise, by exporting the wrong function to BPF programs.
Some of these concerns could perhaps be mitigated by registering a list of allowed kernel functions with the BPF core rather than supplying a function that makes its own decisions. That is not how this feature was implemented, though.
Be that as it may, the BPF function-calling mechanism has been merged and will be included in the 5.13 release. Presumably there will be enough vigilance to keep kernel functions from being inappropriately exported in the mainline kernel in future releases. Properly managed, this feature could be used to make a great deal of functionality available to BPF programs, significantly growing the set of useful things that can be done with BPF. It will be interesting to see where this feature goes from here.
Sticky groups in the shadows
Group membership is normally used to grant access to some resource; examples might include using groups to control access to a shared directory, a printer, or the ability to use tools like sudo. It is possible, though, to use group membership to deny access to a resource instead, and some administrators make use of that feature. But groups only work as a negative credential if the user cannot shed them at will. Occasionally, some way to escape a group has turned up, resulting in vulnerabilities on systems where they are used to block access; despite fixes in the past, it turns out that there is still a potential problem with groups and user namespaces; this patch set from Giuseppe Scrivano seeks to mitigate it through the creation of "shadow" groups.
There are two ways to prevent access to a file based on group membership. One of those is to simply set the group owner of the file to the group that is to be denied, then set the permissions to disallow group access. Members of the chosen group will be denied access to the file, even if the world permissions would otherwise allow that access. The alternative is to use access control lists to explicitly deny access to the intended group or groups. Once again, any process in any of the designated groups will not be allowed access.
By way of a refresher, it's worth remembering that Linux has two separate concepts of group membership. The "primary group" or "effective group ID" is the group that will be attached to new files in the absence of other constraints. This was once the only group associated with a process in Unix systems, and is set with setgid(). The "supplementary" groups are a newer addition that allow a process to belong to multiple groups simultaneously; the list of supplementary groups can be changed with setgroups(). Negative access-control decisions are usually (but not necessarily) based on supplementary group membership.
The "negative groups" access-control technique will prove porous, though, if processes are allowed to shed group membership at will. Both setgid() and setgroups() are privileged operations so, in normal circumstances, group membership is not under the control of the process involved. At least, that is true until user namespaces enter the picture.
Back in 2014, a problem with user namespaces and groups came to light. Any user can create a user namespace and run as root within that namespace; as a result, actions that are blocked outside of the namespace (setgroups(), for example) become possible inside. User namespaces thus made it easy for a user to evade being hampered by membership in the wrong group; all that was needed was to create a namespace, then call setgroups() to drop membership of that group inside the namespace. User namespaces were designed to not confer any extra privilege outside of the namespace, but the removal of a credential was not originally seen as raising privilege.
The solution adopted at the time was to add a control file (called setgroups) to each process's /proc directory. Writing "deny" to that file will disable setgroups() for all processes within the user namespace containing the target process — and for all descendant user namespaces as well — while writing "allow" will enable setgroups(). This action must be performed before setting the group-ID map for the namespace; otherwise writing the group-ID map will enable setgroups(). This policy was chosen for ease of verification; there is exactly one place where setgroups() can be enabled for a namespace. If setgroups() is explicitly disabled, it will remain that way for the life of the namespace; there is no way to enable it again.
This fix solved the problem, but at a cost: setgroups() exists for a reason, and there are legitimate workloads that would like to make use of it. Denying setgroups() will keep processes from escaping unwanted groups within a namespace, but it also prevents any other change of supplementary groups.
To remedy this issue, Scrivano's patch set adds another possible value, "shadow", for the setgroups file. Writing "shadow" has the same effect as writing "allow", in that the setgroups() system call will be allowed inside the namespace, but there is a difference. When the setgroups file is written, the kernel will make a copy of the target process's supplementary groups at that time and store it with the namespace. Whenever setgroups() is called, this list of "shadow" groups will be appended to the groups provided by the caller, essentially requesting continued membership in all of those groups.
In other words, the "shadow" mode makes the initial set of supplementary groups sticky. Suitably privileged processes within the namespace can use setgroups() to add new groups; they can also remove groups that were not part of the initial set. But the supplementary groups that the namespace was created with will always be there, even though the process can no longer see them — getgroups() will not list the groups that have not been explicitly requested with a setgroups() call.
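The resulting semantics can be modeled in a few lines of Python. This is a simplified model with invented group IDs, not kernel code: the effective group set is the union of whatever the process requested with the sticky shadow set, while the visible set is only what was requested.

```python
# A toy model of "shadow" groups: the kernel unions the sticky set
# into every setgroups() request, but getgroups() reports only what
# the process explicitly asked for.
shadow_groups = {100, 200}          # captured when "shadow" was written

def effective_groups(requested):
    """Groups used for access-control checks: requested plus sticky set."""
    return set(requested) | shadow_groups

def visible_groups(requested):
    """What getgroups() would report: only the requested groups."""
    return set(requested)

req = {300}
print(sorted(effective_groups(req)))   # -> [100, 200, 300]
print(sorted(visible_groups(req)))     # -> [300]
```

So a process that belongs to a "negative" group 100 can call setgroups() freely, but can never actually escape that group.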
This particular patch has been around for a while; in the past, its complexity has not seemed to be justified by the benefits it brings. A recent user request for this feature has brought this work back to light, though. Whether it will clear the bar this time remains to be seen, but it seems likely that there will always be users who want to have both the ability to use negative group protections and to change group memberships within user namespaces.
Exported-symbol changes in 5.13
There have been many disagreements over the years in the kernel community concerning the exporting of internal kernel symbols to loadable modules. Exporting a symbol often exposes implementation decisions to outside code, makes it possible to use (or abuse) kernel functionality in unintended ways, and makes future changes harder. That said, there is no authority overseeing the exporting of symbols and no process for approving exports; discussions only tend to arise when somebody notices a change that they don't like. But it is not particularly hard to detect changes in symbol exports from one kernel version to the next, and doing so can give some insights into the kinds of changes that are happening under the hood.
The kernel has many thousands of functions and data structures; most of those are private to a given source file, while others are made available to the kernel as a whole. Loadable modules are special, though; they only have access to symbols that have been explicitly exported to them with EXPORT_SYMBOL() (or one of a few variants); many symbols that are available to code built into the kernel image are unavailable to loadable modules. The intent of this limitation is to keep the interface to modules relatively narrow and manageable.
It is far from clear that this objective has been achieved, though. The 5.12 kernel exported 31,695 symbols to modules, which does not create an impression of a narrow interface. That number grew to 31,822 in 5.13-rc1. That is an increase of 127 symbols, but the actual story is a bit more complicated than that; 244 exported symbols were removed over this time, while 371 were added. The curious can see the full sets of added and removed symbols on this page.
Some changes qualify more as a renaming than a removal or an addition. For example, pmbus_do_probe() is no longer exported in 5.13, at least in that form; it is now listed (using a notation your editor made up on the spot) as PMBUS::pmbus_do_probe(). In other words, this symbol has been moved out of the global namespace into a subsystem-specific one. Namespacing for exported kernel symbols was added in 2018, but uptake has been relatively slow. The 5.13 kernel adds one new namespace (PMBUS) and that subsystem's exported symbols are moving into it. There are now 18 namespaces for symbols in the kernel:
CRYPTO_INTERNAL, FIRMWARE_LOADER_PRIVATE, LTC2497, MCB, NVME_TARGET_PASSTHRU, PMBUS, SND_INTEL_SOUNDWIRE_ACPI, SND_SOC_SOF_INTEL_HDA_COMMON, SND_SOC_SOF_MERRIFIELD, SND_SOC_SOF_HDA_AUDIO_CODEC, SND_SOC_SOF_HDA_AUDIO_CODEC_I915, SND_SOC_SOF_INTEL_HIFI_EP_IPC, SND_SOC_SOF_INTEL_HIFI_EP_IPC, SND_SOC_SOF_ACPI_DEV, SND_SOC_SOF_PCI_DEV, SND_SOC_SOF_XTENSA, SOUNDWIRE_INTEL_INIT, and TEST_FIRMWARE.
The sound subsystem has clearly been the most enthusiastic user of symbol namespaces thus far.
Many other changes in exported symbols are the result of code refactoring within the kernel. Some optimizations in the bit-finding library caused functions like find_first_bit() to be turned into inline functions in header files, which need not be exported. But they fall back to functions like _find_first_bit(), which now do need to be exported. The generic-sounding vmem_map symbol was specific to the ia64 architecture; it went away when ia64 dropped support for the VMEMMAP memory model. Various wimax_ symbols disappeared along with the unloved WiMAX drivers that exported them. Functions like rt_mutex_destroy() were deleted because they were unused.
Many of the new symbols correspond to new features; alloc_pages() came with batch page allocation, for example. Others are a bit less clear; what, for example, is dotdot_name? The commit that added this export explains it as "useful constants: struct qstr for '..'", which may be seen by some as less than fully enlightening. It provides a shortcut for filesystem code wanting to refer to directories named ".." without going to the trouble of wrapping it in the "quick string" structure used to pass strings around in the virtual filesystem layer. Several filesystems make use of it in 5.13.
As a general rule, kernel symbols should not be exported unless there is a user of them in the mainline kernel. That rule is generally respected, but there are exceptions. As an example, zynqmp_pm_pinctrl_get_function() was exported in 5.13-rc1, but has no in-kernel users. The other zynqmp_ (all related to functionality on Xilinx Zynq systems-on-chip) symbols that have been exported are not widely used and would be good candidates for hiding within their own namespace. Another exported-but-unused symbol is __cfi_slowpath_diag(), which is part of the Clang control-flow integrity implementation that was merged in this cycle. The reason for the exporting of this symbol is not entirely clear. __cpu_dying_mask was also introduced and exported in 5.13 with no in-kernel users. There are many others as well; "export it just in case" seems to be a fairly common reflex for kernel developers.
The 5.13 kernel saw the addition of eleven devm_ exports, plus two with the internal __devm_ prefix. Not all of these are used either, but they do represent the type of symbol that one would expect to be exported to modules. These "managed device" functions are intended to make device drivers easier to write and safer by taking care of the freeing of allocated resources when a device is shut down. There are over 300 of these functions exported to modules now, and the list looks likely to continue to grow.
The direct rendering manager (DRM) graphics subsystem added 17 drm_ exports this time around. DRM is clearly one of the most complex driver APIs in the kernel, with no less than 850 exported symbols in 5.13. One begins to understand why the developers of this subsystem have prioritized documentation; this API would be unapproachable without it. That, of course, is a reflection of the problem space; graphics processors are complex devices.
Given that it requires nearly 32,000 exported symbols for a "limited" module interface, the kernel as a whole is also a complex environment. That complexity is reflected in the increasing size of the interface it offers to user space, but also in the growing interface it presents to loadable modules. This interface has increased significantly in size over the years, often without a lot of review. The good news is that, as an internal kernel interface, the set of exported symbols can be changed at any time. So perhaps this list might shrink someday, but that will not happen in the 5.13 cycle.
The misc control group
Control groups (cgroups) are meant to limit access to a shared resource among processes in the system. One such resource is the values used to specify an encrypted-memory region for a virtual machine, such as the address-space identifiers (ASIDs) used by the AMD Secure Encrypted Virtualization (SEV) feature. Vipin Sharma set out to add a control group for these ASIDs back in September; based on the feedback, though, he expanded the idea into a controller to track and limit any countable resource. The patch set became the controller for the misc control group and has been merged for Linux 5.13.
The underlying idea is to allow administrators (or cloud orchestration systems) to enforce limits on the number of these IDs that can be consumed by the processes in a control group. In a cloud setting, those processes could correspond to virtual machines being run under KVM. The initial posting for ASIDs was met with a suggestion from Sean Christopherson to expand the reach of the controller to govern more types of encryption IDs beyond just those used by AMD SEV. Intel has an analogous Trust Domain Extensions (TDX) feature that uses key IDs, which are also a resource that may need limiting. The s390 architecture has its secure execution IDs (SEIDs), as well; those are far less scarce than the others, but could still benefit from a controller to limit the consumption of them.
All of that led Sharma to make the controller more generic so that it could be used by TDX IDs and SEIDs as well. By January, the "encryption ID controller" patch set had reached version 4, but maintainer Tejun Heo was concerned about enshrining hardware-specific control knobs in the control-groups subsystem:
I'm very reluctant to ack vendor specific interfaces for a few reasons but most importantly because they usually indicate abstraction and/or the underlying feature not being sufficiently developed and they tend to become baggages after a while.
He and Sharma talked past each other a bit in the discussion, but eventually Heo said that, because the landscape for encryption IDs is still immature, he would prefer a different approach. Instead of tying the controller to encryption IDs, it would be for miscellaneous (misc) items that can be tracked by number up to a maximum for the control group (or system as a whole): "So, behavior-wise, not that different from the proposed code. Just made generic into a misc controller."
Sharma agreed with that plan and posted an RFC patch for a misc controller in mid-February. There have been three subsequent versions posted, but the form of the controller has stayed basically the same throughout. The patches also add SEV ASIDs and the related (but distinct) SEV Encrypted State (SEV-ES) IDs as two quantities to be controlled. In a kernel built with CONFIG_CGROUP_MISC on a suitably equipped AMD CPU, the root control group will have a misc.capacity file that shows the number of available IDs in each category:
$ cat misc.capacity
sev 50
sev_es 10
More generally, a system that has two resources managed by the controller with the names "res_a" and "res_b" will display them in the files in the control-group hierarchy. The misc.capacity root file is read-only and reflects the amount of the resources for the whole system; two other files appear in the non-root control groups, which can be used to limit the resources and to monitor their use:
$ cat misc.current
res_a 3
res_b 0
$ cat misc.max
res_a 10
res_b 4
As might be guessed, misc.current reports the current usage by the group, while misc.max holds the maximum allowed for the group. Unlike the other two, misc.max is a read-write file, so setting the maximum can be done as follows:
# echo res_a 1 > misc.max
# echo res_b max > misc.max
The first sets the maximum for res_a to one, while the second sets res_b to the maximum allowed for the group (which could be less than the system maximum due to limits in one of its parent control groups).
The patch adding the two types of SEV ASIDs shows the steps needed to add other resources, such as TDX IDs or SEIDs, to the controller. An entry gets added to the misc_res_types enum and a corresponding name is added to the misc_res_names array. Before the resource can be used, the initialization must set the system-wide capacity using:
int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity);
When one of the resources is needed or released, the charge and uncharge API is used:
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
unsigned long amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg,
unsigned long amount);
One thing to note is that, in keeping with the ideas behind version 2 of control groups, migrating a process to a different control group does not change the accounting. The control group that contained the process when the resource was acquired will continue to be charged for it until it is freed. The caller needs to track the control group that was charged so that the uncharge can be done for the proper group. Having the charge follow the process as it migrates came up in review; Jacob Pan asked about adding charge migration because he was looking at using the misc controller to limit ASIDs used for I/O via DMA (IOASIDs). Heo was clear that adding charge migration to the misc controller was unlikely:
Please note that cgroup2 by and large don't really like or support charge migration or even migrations themselves. We tried that w/ memcg on cgroup1 and it turned out horrible. The expected usage model as [described] in the doc is using migration to seed a cgroup (or even better, use the new clone call to start in the target cgroup) and then stay there until exit. All existing controllers assume this usage model and I'm likely to nack deviation unless there are some super strong justifications.
As it turns out, there may be no real use cases for migrating processes after they have acquired IOASIDs, so Pan plans to use the misc controller, at least for now.
In truth, the misc controller is "a bit of cop-out", as Heo put it. He does not believe that these hardware features are necessarily going to be around "forever", so he is loath to tie the control-groups subsystem to them for the long term:
My take is that the underlying hardware feature isn't mature enough to have reasonable abstraction built on top of them. Given time, maybe future iterations will get there or maybe it's a passing fad and people will mostly forget about these.
But, cop-out or no, the misc controller is now open for business. It would seem there are several other candidates for being added to it; others may well arise in the coming months. For simple resources that just need to be tracked and limited based on their count, the misc controller seems like it will do the job.
Page editor: Jonathan Corbet