
LWN.net Weekly Edition for July 4, 2019

Welcome to the LWN.net Weekly Edition for July 4, 2019

This edition contains the following feature content:

  • Debian and code names: a discussion on moving away from code names, at least in sources.list.
  • OpenPGP certificate flooding: certificate spam wreaks havoc on the SKS key-server network.
  • Providing wider access to bpf(): a proposed /dev/bpf device for delegating BPF privileges.
  • The io.weight I/O-bandwidth controller: a new controller for apportioning block-I/O bandwidth.
  • TurboSched: the return of small-task packing: helping the scheduler preserve turbo-mode headroom.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Debian and code names

By Jake Edge
July 3, 2019

Debian typically uses code names to refer to its releases, starting with the Toy Story character names used (mostly) instead of numbers. The "Buster" release is due on July 6 and you will rarely hear it referred to as "Debian 10". There are some other code names used for repository (or suite) names in the Debian infrastructure; "stable", "testing", "unstable", "oldstable", and sometimes even "oldoldstable" are all used as part of the sources for the APT packaging tool. But code names of any sort are hard to keep track of; a discussion on the debian-devel mailing list looks at moving away from, at least, some of the repository code names.

The issue was raised by Ansgar Burchardt, who wondered if it made sense to move away from the stable, unstable, and testing suite names in the sources.list file used by APT. Those labels, except for unstable, change the release they are pointing at when a new release gets made. Currently, stable points to "Stretch" (Debian 9), while testing points to Buster. Soon, stable will point to Buster, and testing will point at "Bullseye", which will become Debian 11.

He asked about using the release code names directly, instead, so that pointing a system at Stretch would continue to get packages from that release. But he also thought it would be nice to completely route around the code names, which "confuse people". He suggested lines that looked like the following in sources.list:

    deb http://deb.debian.org/debian debian11 main
    deb http://security.debian.org/debian-security debian11-security main

He noted that Ubuntu does not use suite names, but the code-name problem still rears its head: "[...] having to map 'Ubuntu 18.04' to 'bionic' instead of just writing the version in sources.list is annoying".
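To make the contrast concrete, here are three ways the same Debian release could be named in sources.list; the "debian9" form is only an illustration of the proposed style, not something APT understands today:

    deb http://deb.debian.org/debian stable main    # follows releases; becomes Debian 10 at release time
    deb http://deb.debian.org/debian stretch main   # pinned to Debian 9 by code name
    deb http://deb.debian.org/debian debian9 main   # proposed style: pinned by version number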

Andrei Popescu pointed out some of the perils of using stable and testing around the time of a release. Many users have (mostly incorrectly) trained themselves to do an apt-get dist-upgrade instead of a regular apt-get upgrade to pick up new package versions, but that can go badly awry if stable has just changed. Those who stick with testing, he said, are really looking for a rolling release, so perhaps testing should be renamed to "rolling"—"or 'roling' to emphasize that it's incomplete :p".

The code names are particularly hard for those only tangentially connected to Debian, Simon McVittie said. He noted that there will be three releases with code names that start with "B" before too long; after Buster and Bullseye comes Bookworm. Originally, part of the reason for code names was that it was not clear whether the next release would be considered a point release or not: "we didn't know whether etch would be released as Debian 3.2 or Debian 4.0". Now each release is a major version bump, so it would help outsiders to use those more:

With more emphasis on the version numbers, my non-Debian colleagues would still have to learn (or look up) which release is the current stable, but given that information they would immediately also know which release was the previous one (subtract 1) and which release is under development (add 1).

But Adam Borowski is adamant that the code names are easier to work with. He complained that he had to look up whether Debian 11 was Buster or Bullseye, but others have the opposite problem. Going back in time is difficult as well, as Wouter Verhelst related:

Potato was followed by sarge, but I think there was something in between (although I'm not sure). There's an etch somewhere, and a lenny.

But what were the orderings again? I honestly don't remember.

Philip Hands also has trouble remembering code names, but he (and others) suggested: "Can we not just have both?" Certainly the testing suite name is considered useful by many, so it will presumably remain in any plan that might emerge. And Ian Jackson thinks that the other suite names should stick around for reference purposes but that using "debian11" in sources.list should be the default going forward:

For now, the 'debian11' can be an alias. At some future point this should become the canonical suite name, replacing the codename. But I think this should not be done retrospectively to old suites, because there is software outside the archive that wants to name things by a single canonical name, and changing that name for an existing suite will cause trouble.

Michael Stone summed up the thinking of many in the thread with regard to names, rather than versions, in sources.list:

Having "stable" in sources.list is broken, because one day stuff goes from working to not working, which requires manual intervention, at which point someone could have just changed the name. Having codenames in sources.list is broken, because even people who have been developers for two decades can't remember which release is which without looking it up.

Several lamented that it was painful to actually do that lookup at times. Wikipedia has a good reference page, but something closer to home would be useful. McVittie pointed to the distro-info-data package, which has those mappings in a CSV file and provides bindings for various languages; it is used by the lsb-release package to report distribution version information. That helps, but the consensus still seems to be that providing a version-number-based system (in addition to the existing suite names) makes sense.
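To get a feel for what distro-info-data provides, here is a minimal sketch that parses the kind of CSV mapping it ships (the column layout follows /usr/share/distro-info/debian.csv; the inline rows are a hand-written excerpt for illustration, not the authoritative file):

```python
import csv
import io

# Hand-written excerpt in the distro-info-data column layout:
# version,codename,series,created,release,eol
DEBIAN_CSV = """\
version,codename,series,created,release,eol
8,Jessie,jessie,2013-05-04,2015-04-25,2018-06-17
9,Stretch,stretch,2015-04-25,2017-06-17,2020-07-06
10,Buster,buster,2017-06-17,2019-07-06,2022-09-10
"""

def codename_for(version):
    """Map a Debian version number to its release code name."""
    for row in csv.DictReader(io.StringIO(DEBIAN_CSV)):
        if row["version"] == str(version):
            return row["codename"]
    return None

print(codename_for(10))  # Buster
```

This is exactly the kind of lookup that tools like lsb-release perform; having the mapping in one machine-readable place is what makes the package useful.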

In fact, Jackson suggested a call for rough consensus as was done by Debian project leader Sam Hartman for the dh discussion in June. Hartman seconded that call, suggesting that Burchardt do so "as the developer who started the discussion". So far, that has not occurred, however.

Perhaps inevitably, the idea of moving to year-based version numbers (or "release identifiers") was raised; Tomas Pospisek said that the code names are getting confusing and that "sequential release numbers are devoid of any semantics except for their monotonically increasing character". Year-based version numbering is, of course, what Ubuntu uses, but there is a big difference between the two: Ubuntu sets its release date in advance, while Debian release dates are rather more fluid. The pros and cons of that approach were discussed, with the idea of alphabetically increasing code names thrown in for good measure, but few beyond Pospisek seemed wildly enthused by the idea. As Paul Wise put it:

At this point in the thread it is very clear that which identifier one prefers is very individual and dependent on use-cases. So we should add support for more individuals and use-cases in order to accommodate everyone's preferences. Killing off use-cases by switching between identifier styles isn't the right way to go.

That's where things stand at this point. In general, code names are most useful within a community, rather than as a marketing or other tool to communicate with outsiders. Even when those names follow a clear pattern (e.g. Ubuntu, Android) they still seem to cause some confusion. Numbers have the advantage of predictability, though without some kind of real-world mapping (e.g. date-based) that advantage is somewhat limited. But defaulting to a world where sources.list does not point to "stable", which suddenly changes at release time, seems like a goal worth pursuing. Outsiders and insiders alike can perhaps agree on that part.

Comments (41 posted)

OpenPGP certificate flooding

By Jake Edge
July 2, 2019

A problem with the way that OpenPGP public-key certificates are handled by key servers and applications is wreaking some havoc, but not just for those who own the certificates (and keys)—anyone who has those keys on their keyring and does regular updates will be affected. It is effectively a denial of service attack, but one that propagates differently than most others. The mechanism of this "certificate flooding" is one that is normally used to add attestations to the key owner's identity (also known as "signing the key"), but because of the way most key servers work, it can be used to fill a certificate with "spam"—with far-reaching effects.

The problems have been known for many years, but they were graphically illustrated by attacks on the keys of two well-known members of the OpenPGP community, Daniel Kahn Gillmor ("dkg") and Robert J. Hansen ("rjh"), in late June. Gillmor first reported the attack on his blog. It turned out that someone had added multiple bogus certifications (or attestations) to his public key in the SKS key server pool; an additional 55,000 certifications were added, bloating his key to 17MB in size. Hansen's key got spammed even worse, with nearly 150,000 certifications—the maximum number that the OpenPGP protocol will support.

The idea behind these certifications is to support the "web of trust". If user Alice believes that a particular key for user Bob is valid (because, for example, they sat down over beers and verified that), Alice can so attest by adding a certification to Bob's key. Now if other users who trust Alice come across Bob's key, they can be reasonably sure that the key is Bob's because Alice (cryptographically) said so. That is the essence of the web of trust, though in practice, it is often not really used to do that kind of verification outside of highly technical communities. In addition, anyone can add a certification, whether they know the identity of the key holder or not.
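The structural weakness is easy to model: a key's certification list is, on the SKS network, append-only, and anyone may append to it. A toy Python sketch (not real OpenPGP; the per-certification size here is invented):

```python
# Toy model of an append-only certification list: anyone can attach
# a certification to anyone's key, and nothing is ever removed.
CERT_SIZE = 120  # invented per-certification size in bytes

class ToyKey:
    def __init__(self, owner):
        self.owner = owner
        self.certifications = []  # append-only, as on the SKS network

    def certify(self, signer):
        # Note: no check that `signer` actually knows the owner.
        self.certifications.append(signer)

    def size(self):
        return len(self.certifications) * CERT_SIZE

key = ToyKey("dkg")
for i in range(55000):          # the scale of the June 2019 attack
    key.certify(f"spammer-{i}")
print(len(key.certifications))  # 55000
```

Nothing in the model distinguishes a genuine attestation from spam; that is precisely the problem the attacks exploited.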

The problem with the keys for Gillmor and Hansen that are stored in the SKS key-server network is that GNU Privacy Guard (GnuPG or GPG), the most widely used OpenPGP implementation, cannot even import the certificates into its newer keybox (.kbx) format due to the number of certifications they have. But using the older format (.gpg) leads to performance problems for any operation that uses the key. That causes failures in the Enigmail Thunderbird add-on, makes Git take minutes to PGP-sign a tag, makes authentication to Monkeysphere use an enormous amount of CPU, and probably breaks other things as well, Gillmor said.

All of that is a pain for him, but it is worse than that. Anyone who has his key on their keyring will run into problems if they practice good "key hygiene" by periodically updating the keys on their keyring from the SKS servers. That will pick up key revocations, new subkeys, and the like, so their keyring will be up to date. But if they have Gillmor or Hansen on their keyrings and use the older format, that will lead to strange, hard-to-diagnose problems. If they use the newer format, they won't get updates for those keys, which may lead to other problems.

As Hansen details in his post on the subject, the problem can be traced to a decision made in the early 1990s about how key servers should function. There was concern that repressive governments would try to "coerce" key-server operators into substituting keys for ones controlled by the government. In order to combat that, key servers would not delete any information, so certificates and any information about them (e.g. certifications) would live forever. Multiple key servers were run in widely disparate locations and information would be synchronized between them regularly. If a key server did remove or modify a certificate, that would be recognized and the previous state would be restored.

That well-intentioned choice was fine for the time, but that time has passed. Unfortunately, according to Hansen, the SKS software is effectively unmaintained, in part because it was written in an idiosyncratic dialect of OCaml, but also because it was created as a proof of concept for a Ph.D. thesis:

[...] for software that's deployed in the field it makes maintenance quite difficult. Not only do we need to be bright enough to understand an algorithm that's literally someone's Ph.D thesis, but we need expertise in obscure programming languages and strange programming customs.

He noted that the idea of the immutability of certificate information is wired deeply into the guts of SKS. Changing that would be a much bigger job than simply fixing a bug or adding a small feature. He recommended that high-risk users not use the SKS key server network any longer. Both he and Gillmor pointed out that the new, experimental keys.openpgp.org key server is resistant to the certificate flooding attack. The keys.openpgp.org FAQ explains:

A "third party signature" is a signature on a key that was made by some other key. Most commonly, those are the signatures produced when "signing someone's key", which are the basis for the "Web of Trust". For a number of reasons, those signatures are not currently distributed via keys.openpgp.org.

The killer reason is spam. Third party signatures allow attaching arbitrary data to anyone's key, and nothing stops a malicious user from attaching so many megabytes of bloat to a key that it becomes practically unusable. Even worse, they could attach offensive or illegal content.

The keys.openpgp.org server is also set up to separate the identity and non-identity information of certificates. It will not distribute identity information (i.e. "user IDs" that include a name and email address) unless the owner verifies the email address. Meanwhile, the non-identity information (the key material and metadata) will be stored and distributed freely, though without the certifications. GnuPG and other OpenPGP software can refresh the key material regularly from the server for keys they already have on their keyrings, even if the owner has not consented to the distribution of the key's user ID information.
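That distribution policy can be sketched as a filter over a key's packets (the packet categories here are a simplification for illustration, not the real OpenPGP wire format):

```python
# Simplified model of the keys.openpgp.org distribution policy:
# - drop third-party certifications entirely (the flooding vector)
# - drop user IDs unless the email address has been verified
# - always keep key material and self-signatures

def distributable(packets, verified_emails):
    kept = []
    for p in packets:
        kind = p["kind"]
        if kind == "third-party-cert":
            continue  # never distributed: blocks certificate flooding
        if kind == "user-id" and p["email"] not in verified_emails:
            continue  # identity info requires owner consent
        kept.append(p)  # key material, subkeys, self-signatures, etc.
    return kept

key = [
    {"kind": "public-key"},
    {"kind": "user-id", "email": "alice@example.org"},
    {"kind": "self-sig"},
    {"kind": "third-party-cert"},
    {"kind": "third-party-cert"},
]
print([p["kind"] for p in distributable(key, {"alice@example.org"})])
# ['public-key', 'user-id', 'self-sig']
```

The key material always survives the filter, which is what allows regular key refreshes to keep working even for unverified keys.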

The News entry for keys.openpgp.org mentions some of the reasons behind starting a new key server. It noted that the SKS key server suffered from a number of problems with abuse, performance, and privacy (including GDPR compliance). Beyond that, SKS is not really being developed any longer, as Hansen also pointed out. So keys.openpgp.org is based on a completely new key server, Hagrid, written in Rust using the Sequoia OpenPGP library. The plan is to eventually add more servers to create a pool, but it will not be an open federation model like SKS due to privacy and reliability concerns.

Gillmor's first post has some pointers on other ways to handle certificate flooding, including a link to an internet draft he created back in April for "Abuse-Resistant OpenPGP Keystores". His followup blog post reiterates many of those options as part of an effort to look at the impact of certificate flooding on the community. Hansen was rather more blunt in his post and in a followup that described some of the fallout he sees from the attack. He is particularly angry that advice he gave to Venezuelan activists and others to check his signature on documents may now lead them to effectively break their GnuPG installation.

Both Gillmor and Hansen are obviously distressed, personally, over the fact that their keys are affected by all of this. OpenPGP is already hard enough to use that adding unexpected hurdles impacts other people in ugly ways. As Gillmor put it:

But from several conversations i've had over the last 24 hours, i know personally at least a half-dozen different people who i personally know have lost hours of work, being stymied by the failing tools, some of that time spent confused and anxious and frustrated. Some of them thought they might have lost access to their encrypted e-mail messages entirely. Others were struggling to wrestle a suddenly non-responsive machine back into order. These are all good people doing other interesting work that I want to succeed, and I can't give them those hours back, or relieve them of that stress retroactively.

Other flooded keys have been found; in an LWN comment, Gillmor said that Tor project keys have been spammed. More keys undoubtedly will be flooded and the nature of the SKS key servers makes them a permanent part of the key store. Illegal content could perhaps be attached as certifications; those would also be "impossible" to remove. It is a little hard to see the SKS network surviving this incident in its current form—that may be for the best, in truth.

Daniel Lange has some suggestions on how to easily clean up these spammed keys, though it takes a huge amount of CPU time (he reports 1h45m for Hansen's key). That is rather distressing in its own right; as Filippo Valsorda put it:

Someone added a few thousand entries to a list that lets anyone append to it.

GnuPG, software supposed to defeat state actors, suddenly takes minutes to process entries.

How big is that list you ask? 17 MiB. Not GiB, 17 MiB. Like a large picture.

Clearly GnuPG needs a fix for that; Gillmor filed a bug to that effect. He also filed bugs in other components that failed, which he links from his post.

But apparently the problems with the SKS key server have been known for a long time. It is a bit reminiscent of the state of OpenSSL development pre-Heartbleed; SKS was largely unmaintained, chugging along in the background but not really having much attention paid to it. Those "in the know" were aware of its flaws, but did not have the resources to fix them. As with Heartbleed, it makes one wonder what other projects are out there, seemingly humming along, when, in reality, that hum may be the quiet ticking of a time bomb.

Comments (24 posted)

Providing wider access to bpf()

By Jonathan Corbet
June 27, 2019
The bpf() system call allows user space to load a BPF program into the kernel for execution, manipulate BPF maps, and carry out a number of other BPF-related functions. BPF programs are verified and sandboxed, but they are still running in a privileged context and, depending on the type of program loaded, are capable of creating various types of mayhem. As a result, most BPF operations, including the loading of almost all types of BPF program, are restricted to processes with the CAP_SYS_ADMIN capability — those running as root, as a general rule. BPF programs are useful in many contexts, though, so there has long been interest in making access to bpf() more widely available. One step in that direction has been posted by Song Liu; it works by adding a novel security-policy mechanism to the kernel.

This approach is easy enough to describe. A new special device, /dev/bpf, is added, with the core idea that any process that has permission to open this file will be allowed "to access most of sys_bpf() features" — though what comprises "most" is never really spelled out. A non-root process that wants to perform a BPF operation, such as creating a map or loading a program, will start by opening this file. It then must perform an ioctl() call (BPF_DEV_IOCTL_GET_PERM) to actually enable its ability to call bpf(). That ability can be turned off again with the BPF_DEV_IOCTL_PUT_PERM ioctl() command.

Internally to the kernel, this mechanism works by adding a new field (bpf_flags) to the task_struct structure. When BPF access is enabled, a bit is set in that field. If this patch goes forward, that detail is likely to change since, as Daniel Borkmann pointed out, adding an unsigned long to that structure for a single bit of information is unlikely to be popular; some other location for that bit will be found.

The next step is the addition of a little function to determine whether the current process is capable of performing BPF operations:

    static inline bool bpf_capable(int cap)
    {
        return test_bit(TASK_BPF_FLAG_PERMITTED, &current->bpf_flags) ||
               capable(cap);
    }

Calls to bpf_capable() then replace the various capable(CAP_SYS_ADMIN) (or sometimes CAP_NET_ADMIN) calls that currently protect access to BPF functionality. While the cover letter says that access is provided to "most of" the available BPF features, the patch appears to change every capable() call in the kernel/bpf directory.
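The whole flow, from opening /dev/bpf through the bpf_capable() check, can be modeled in a few lines of user-space logic (a Python sketch of the semantics, not kernel code; the ioctl names come from the patch):

```python
# Model of the proposed /dev/bpf permission flow: a process that can
# open the device may toggle a per-task "BPF permitted" bit via ioctl().

class Task:
    def __init__(self, can_open_dev_bpf, has_cap_sys_admin=False):
        self.can_open_dev_bpf = can_open_dev_bpf
        self.has_cap_sys_admin = has_cap_sys_admin
        self.bpf_permitted = False  # the bit in task_struct->bpf_flags

    def ioctl_get_perm(self):    # BPF_DEV_IOCTL_GET_PERM
        if not self.can_open_dev_bpf:
            raise PermissionError("cannot open /dev/bpf")
        self.bpf_permitted = True

    def ioctl_put_perm(self):    # BPF_DEV_IOCTL_PUT_PERM
        self.bpf_permitted = False

    def bpf_capable(self):
        # Mirrors the kernel helper: the per-task bit OR the capability.
        return self.bpf_permitted or self.has_cap_sys_admin

t = Task(can_open_dev_bpf=True)
print(t.bpf_capable())   # False: the bit is not yet set
t.ioctl_get_perm()
print(t.bpf_capable())   # True
```

Note how the model makes the "capability bit outside the capability mechanism" nature of the design visible: the bit lives with the task, not in its capability sets.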

The end result of all this work is that a system administrator could, for example, create a new group called bpf; that group would be the group owner of the /dev/bpf file. The permissions on /dev/bpf would be set to allow group read access (write access is not required to make the ioctl() calls); thereafter, any process with membership in the bpf group would be able to use the bpf() system call.

It's worth noting that most interesting things that can be done with BPF involve subsystems beyond the BPF virtual machine itself. Attaching a BPF program to a tracepoint requires the cooperation of the tracing code, for example, and using BPF programs in networking necessarily involves the networking subsystem. There are usually permission checks in those subsystems as well; tracepoint access requires the ability to call perf_event_open(), for example, which may be restricted depending on the system's configuration. This patch does not change those checks, with one exception: the restrictions on what can be done with BPF socket-filter programs are removed if the BPF capability has been turned on.

In summary, what this patch is doing is creating a new capability bit that exists outside of the normal Linux capability mechanism, and which can be turned on or off by any process with read access to /dev/bpf. This new capability is recognized within the BPF subsystem, and in one place in the networking code; it seems highly likely that its use could expand to other parts of the kernel as well. This is a bit of a twist on the usual kernel security model.

There are reasons why one might not want to just add another capability bit instead (CAP_SYS_BPF, say). Existing capability-aware programs would not know what the new bit means and may well mishandle it, for example. But it is not clear that creating what is essentially a capability bit in a separate guise improves on that situation.

It seems likely that, at some point, somebody will want to be able to enable BPF functionality with finer-grained control. The good news is that the low-level machinery to do that is already there in the form of a set of Linux security module (LSM) hooks. Given the increasing use of LSMs to give administrators control over security policies in the kernel, it's perhaps surprising that an LSM-based approach was apparently not considered for this case. That could perhaps change as this patch set moves beyond the BPF community and is reviewed more widely.

Comments (11 posted)

The io.weight I/O-bandwidth controller

By Jonathan Corbet
June 28, 2019
Part of the kernel's job is to arbitrate access to the available hardware resources and ensure that every process gets its fair share, with "its fair share" being defined by policies specified by the administrator. One resource that must be managed this way is I/O bandwidth to storage devices; if due care is not taken, an I/O-hungry process can easily saturate a device, starving out others. The kernel has had a few I/O-bandwidth controllers over the years, but the results have never been entirely satisfactory. But there is a new controller on the block that might just get the job done.

There are a number of challenges facing an I/O-bandwidth controller. Some processes may need a guarantee that they will get at least a minimum amount of the available bandwidth to a given device. More commonly in recent times, though, the focus has shifted to latency: a process should be able to count on completing an I/O request within a bounded period of time. The controller should be able to provide those guarantees while still driving the underlying device at something close to its maximum rate. And, of course, hardware varies widely, so the controller must be able to adapt its operation to each specific device.

The earliest I/O-bandwidth controller allows the administrator to set maximum bandwidth limits for each control group. That controller, though, will throttle I/O even if the device is otherwise idle, causing the loss of I/O bandwidth. The more recent io.latency controller is focused on I/O latency, but as Tejun Heo, the author of the new controller, notes in the patch series, this controller really only protects the lowest-latency group, penalizing all others if need be to meet that group's requirements. He set out to create a mechanism that would allow more control over how I/O bandwidth is allocated to groups.

io.weight

The new controller works by assigning a "weight" value to each control group. Consider, for example, a simple hierarchy with two groups, A and B, at the same level. If group A is given a weight of 100 for a specific block device and group B has a weight of 300, then B will be allowed to use 75% of the available bandwidth. Absolute weights do not matter; each group's actual portion of the available bandwidth will be determined by its weight relative to the sum of all weights at that level in the hierarchy.
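The weight arithmetic is just a normalization over sibling weights; a quick sketch:

```python
def shares(weights):
    """Each group's share is its weight over the sum of sibling weights."""
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()}

print(shares({"A": 100, "B": 300}))   # {'A': 0.25, 'B': 0.75}
# Only ratios matter: {"A": 1, "B": 3} produces the same split.
```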

That leaves open the question of just how the controller determines how much of the device's capacity each group is using. Simply counting I/O operations or total bandwidth turns out to be inadequate, since some requests can be quite a bit more expensive than others. So the new controller uses a "cost model" that tries to better estimate how much of a device's time will be required to satisfy any given request. This model is relatively simple. First, it determines whether a request is sequential or random; in the former case, the operation will complete much more quickly (especially on rotating drives) than in the latter. The operation is given a fixed cost based on this determination, plus an incremental cost for each page to be transferred. The resulting total cost is an estimate of how long it will take for the request to be executed.

By default, the controller will observe the actual behavior of each device to work out what the cost parameters should be. The administrator can override this behavior by writing some commands to the io.weight.cost_model file in the root-level control group. For each drive, the maximum throughput, along with the maximum number of sequential and random operations that can be performed per second, can be specified. Different costs can be used for read and write operations if appropriate.

The default cost model apparently works pretty well. But, should somebody encounter a situation where that model falls apart, there is, inevitably, a hook to run a BPF program that can calculate the cost in whatever way makes sense.

vtime

The controller works by establishing a virtual clock (called vtime) for each device; that clock normally advances at the usual rate of one second per second. Each control group also has a vtime clock that determines when it can submit another I/O operation. Once the cost of an operation has been determined, it is added to the group's vtime; the operation can only be sent to the device once that device's vtime is ahead of the group's vtime. The weights assigned to each group are implemented by scaling the cost of each operation proportionally to that group's share of the total bandwidth. If group A, above, has 25% of the available bandwidth, the cost of its operations will be multiplied by four. In a sense, control groups live in a relativistic universe where lower-weight groups have slower-moving clocks.
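The two pieces described above, the cost model and the vtime accounting, fit together in a small sketch (all constants here are invented for illustration; the real controller measures or is told the per-device parameters):

```python
SEQ_BASE, RAND_BASE = 1.0, 8.0   # invented fixed costs per operation
PER_PAGE = 0.5                   # invented incremental cost per page

def cost(sequential, pages):
    """Estimated device time for one request: fixed cost plus per-page cost."""
    base = SEQ_BASE if sequential else RAND_BASE
    return base + PER_PAGE * pages

class Group:
    def __init__(self, share):
        self.share = share       # fraction of the device, e.g. 0.25
        self.vtime = 0.0

    def charge(self, sequential, pages):
        # A group owning 25% of the device pays 4x the absolute cost,
        # so its clock runs four times faster per unit of real work.
        self.vtime += cost(sequential, pages) / self.share

    def may_dispatch(self, device_vtime):
        # A request goes out only once the device clock catches up.
        return device_vtime >= self.vtime

g = Group(share=0.25)
g.charge(sequential=True, pages=4)        # cost 3.0, scaled to 12.0
print(g.vtime)                            # 12.0
print(g.may_dispatch(device_vtime=10.0))  # False: must wait
```

The device-side feedback described below then amounts to running the device clock slower or faster than real time.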

To avoid situations where a device sits idle when there are operations pending, the controller will take note when any given group is not using the full bandwidth available to it and temporarily lower that group's weight to match its actual usage, in effect "lending" the unused bandwidth to other groups that are performing I/O. There is a mechanism to allow a group to quickly grab back its lent bandwidth should it start to need it.

There is one little remaining problem: the vtime mechanism is designed to issue requests at the speed that the device can handle them. But the cost model is unlikely to be perfect, and the performance of any given device can vary over time. If the cost model is off, the controller could dispatch too many requests (increasing latencies) or not enough requests (leaving some bandwidth unused). That, naturally, is a situation worth avoiding.

Should the controller notice that request-completion times are increasing, it takes that as a signal that too many requests are being sent. That situation is addressed by slowing down the vtime associated with the overloaded device, so that requests will be dispatched at a slower rate. Similarly, if the device is not completely busy, its vtime will be advanced more quickly so that more requests will go out.

The controller will try to tune this scaling automatically, but that may not be adequate for some situations. Write operations, in particular, can be queued within the device itself and completed in an order chosen by the device, meaning that the controller loses some control over the latency of any given request. In cases where that is a problem, it may be desirable to slow down request dispatch more aggressively to reduce the latency of request completion, even at the possible cost of leaving some bandwidth unused. There is another control knob in the root group, called io.weight.qos, that can be used to specify what the desired latency ranges are and how much the device's vtime can be adjusted to achieve those ranges.

See the comments at the top of this patch for more details on the various control knobs and how they work.

Heo notes that the controller does a reasonable job of enforcing each group's weight using the default parameters — for read requests, at least. When there are a lot of writes involved, some playing with the parameters may be needed to get the best results. Tools and documentation to help administrators working to tune this controller are promised. Meanwhile, there has not been a huge amount of feedback on this controller since it was posted on June 13. Expecting it for 5.3 seems optimistic, but it may well be ready for a merge window shortly thereafter.

Comments (8 posted)

TurboSched: the return of small-task packing

By Jonathan Corbet
July 1, 2019
CPU scheduling is a difficult task in the best of times; it is not trivial to pick the next process to run while maintaining fairness, minimizing energy use, and using the available CPUs to their fullest potential. The advent of increasingly complex system architectures is not making things easier; scheduling on asymmetric systems (such as the big.LITTLE architecture) is a case in point. The "turbo" mode provided by some recent processors is another. The TurboSched patch set from Parth Shah is an attempt to improve the scheduler's ability to get the best performance from such processors.

Those of us who have been in this field for far too long will, when seeing "turbo mode", think back to the "turbo button" that appeared on personal computers in the 1980s. Pushing it would clock the processor beyond its original breathtaking 4.77MHz rate to something even faster — a rate that certain applications were unprepared for, which is why the "go slower" mode was provided at all. Modern turbo mode is a different thing, though, and it's not just a matter of a missing front-panel button. In short, it allows a processor to be overclocked above its rated maximum frequency for a period of time when the load on the rest of the system allows it.

Turbo mode can thus increase the CPU cycles available to a given process, but there is a reason why the CPU's rated maximum frequency is lower than what turbo mode provides. The high-speed mode can only be sustained as long as the CPU temperature does not get too high and, crucially (for the scheduler), the overall power load on the system must not be too high. That, in turn, implies that some CPUs must be powered down; if all CPUs are running, there will not be enough power available for any of those CPUs to go into the turbo mode. This mode, thus, is only usable for certain types of workloads and will not be usable (or beneficial) for many others.

A workload that would work well in turbo mode is one where the system as a whole is not fully utilized (so that some CPUs can be shut down), and where a relatively small number of processes can benefit from the higher CPU speeds. But that benefit will only be realized if the turbo mode can actually be used. The CPU scheduler in current kernels balances a great many requirements, but "make sure that some CPUs can go into turbo mode" has not been expressed as a need to be balanced in the past. It's thus unsurprising that the scheduler's operation is not optimal for systems with turbo mode and workloads that want to take advantage of that mode.

One problem in particular is that the scheduler is designed to keep the system as responsive as possible and to make the fullest use of the available CPUs. That goal is reflected in how processes are placed on CPUs throughout the system. If a sleeping process wakes up and needs to execute, the scheduler will try to place that process on an idle CPU, thus allowing it to execute immediately rather than waiting in a run queue. That is the right thing to do much of the time, but it is not ideal if your objective is to keep some CPUs powered down so that the others can run in turbo mode. In such cases, it might be better to put a newly awakened process onto a CPU that is already busy and let sleeping CPUs lie.

Getting the scheduler to pack more processes into running CPUs is the objective of the TurboSched patch set. But such packing needs to be done carefully; otherwise, scheduling latency could increase significantly and system utilization could be reduced. To avoid such problems, TurboSched limits this packing behavior to "jitter" processes — those that run sporadically for limited periods of time and which do not have significant response-time requirements. These processes are often doing some sort of housekeeping work and do not suffer from having to share a CPU with other work.

A question that immediately comes to mind is: how does the scheduler decide which processes fit into this "jitter" category? The answer is that it doesn't; such processes need to be explicitly marked by user space. Specifically, TurboSched is built on top of the (still unmerged) scheduler utilization clamping patch set, which allows an administrator to impose limits on how much load any given process appears to put on the system. By putting an upper limit on the apparent load, the administrator can keep a given process from forcing a CPU's frequency to increase, even if that process will happily run 100% of the time. Processes marked this way already have a reduced claim to system CPU resources; TurboSched extends this interpretation and concludes that a sleeping CPU should not be powered up for processes whose maximum utilization is clamped.

The logic as implemented in the patch set actually goes a little beyond that, in that jitter processes will not be placed onto a CPU that is running at less than 12.5% of its capacity. The reasoning is that an underutilized CPU might well go idle soon; putting a new process there could prevent that from happening, which would be an undesirable result. Of course, it would also not be good to overload the running CPUs with jitter tasks, so there is an upper limit to how much jitter load can be placed on any given CPU.

This approach may seem familiar; it is reminiscent of the small-task packing algorithms that have been discussed since (at least) 2012. Small-task packing has never made its way into the mainline, so one might wonder why this variation would be different. The biggest difference this time is the explicit marking of jitter tasks, which will effectively make TurboSched a no-op on the bulk of the systems out there. In the absence of clamped tasks, the scheduler will run as it does now, so there should be no performance regressions for any existing workloads.

Meanwhile, the benefit for some workloads can be up to a 13% performance increase, according to some impressive ASCII graphics in the patch posting. This increase won't happen with all workloads, but on dedicated systems with well-understood and tuned workloads with the right mix of processes, TurboSched should make things run better. That, along with the relatively noninvasive nature of the patch set, suggests that it might just clear the high bar for scheduler changes.

Comments (8 posted)

Fedora mulls its "python" version

By Jake Edge
July 3, 2019

There is no doubt that the transition from Python 2 to Python 3 has been a difficult one, but Linux distributions have been particularly hard hit. For many people, that transition is largely over; Python 2 will be retired at the end of this year, at least by the core development team. But distributions will have to support Python 2 for quite a while after that. As part of any transition, the version that gets run from the python binary (or symbolic link) is something that needs to be worked out. Fedora is currently discussing what to do about that for Fedora 31.

Fedora program manager Ben Cotton posted a proposal to make python invoke Python 3 in Fedora 31 to the Fedora devel mailing list. The proposal, titled "Python means Python 3", is also on the Fedora wiki. The idea is that wherever "python" is used it will refer to version 3, including when it is installed by DNF (i.e. dnf install python) or when Python packages are installed, so installing "python-requests" will install the Python 3 version of the Requests library. In addition, a wide array of associated tools (e.g. pip, pylint, idle, and flask) will also use the Python 3 versions.

The "Requests" link above does point to a potential problem area, however. It shows that Requests 3 is not fully finished, with an expected release sometime "before PyCon 2020" (mid-April 2020), which is well after the expected October 2019 release of Fedora 31. The distribution already has a python3-requests package, though, so that will be picked up as python-requests in Fedora 31 if this proposal is adopted. There may be other packages out there where Python 3 support is not complete but, at this point, most of the major libraries have converted. [Update: As noted in the comments, this paragraph is based on complete confusion about the version numbers of Requests.]

The proposal is owned by Miro Hrončok and Petr Viktorin, who work on the Python team at Red Hat. Viktorin presented at the 2016 and 2018 Python Language Summits on the problems inherent in the transition to Python 3 for distributions. Both have also been involved in reworking the recommendations that the Python core development team provides to distributions. Those are contained in PEP 394 ("The 'python' Command on Unix-Like Systems"). The rework consists of loosening those guidelines to allow distributions to choose Python 3 for their python; it has garnered the support of the Python steering council.

One of the concerns about the proposal is the specter of another similar transition for Python down the road. As "stan" put it: "Wouldn't this mean that when python4 is released, there is a replay of the python2 -> python3 transition experience?" But Hrončok pointed out that the packages are not being renamed, just that unadorned "python" will resolve to "python3"; shebang ("#!") lines for Python code shipped in packages will need to directly invoke python3, not simply python.

That led Cotton to wonder what the advantage of having the unadorned "python" was; why couldn't there simply be python3 (and python2 if it is installed)? Hrončok noted that users attempting to use Python on Fedora systems might find the following sequence to be suboptimal:

    $ python
    bash: python: command not found

Oh right!

    $ sudo dnf install python
    No match for argument: python
    Error: Unable to find a match

Oh well?

He continued: "Fedora is equipped with fully operational, ready to use and develop on Python interpreter, why not let people find it?" He also pointed out that the upstream Python team was not pushing for this change, but simply allowing for the possibility. Fedora could still decide to not have a python and it would not run afoul of the recommendations. But he strongly suggested that, in any case, python no longer point to Python 2 (even though that would not run afoul of the recommendations either).

The idea of having no python was not particularly popular. Administrators could make that happen on their systems, if they wished, or point python to Python 2 for that matter. But, as Gerald Henriksen put it:

[...] someone sitting down to use a tutorial to learn Python is going to expect to type python and get something they can use, particularly going forward when Python3 really just [becomes] Python as Python2 fades from memory.

In answer to a question from Adam Williamson, Hrončok said that the reasoning for making the change for Fedora 31 was mostly due to the dates involved, as the proposal describes: "The name 'Python' will not refer to software that will be unmaintained upstream for most of Fedora 31's lifetime and retired from Fedora 32." The latter piece refers to the plan to stop shipping Python 2 as python2 in Fedora 32. A legacy python27 package will be created (and may be allowed as a dependency for a small number of packages).

Zbigniew Jędrzejewski-Szmek asked about perhaps postponing the proposed change to Fedora 32 as well:

I think that having this middle state where python2 is available but python points to python3 for exactly one release will be more confusing that switching directly to the final state where python2 is gone and python simply means python3.

Stephen John Smoogen thought that having a release with no python at all would be useful to suss out packages that have still not fixed their shebang lines and such. But Michael Catanzaro said that not having a /usr/bin/python would wreak havoc for software that needs to run on both Linux and macOS, such as WebKit.

If Fedora has /usr/bin/python, then at least we have a *chance* to make the scripts we care about work in both python2 and python3 (our current plan). Whereas without /usr/bin/python, we're really out of options. So I very much support /usr/bin/python -> /usr/bin/python3. It will cause some problems for us and we'll have a difficult transition period, but at least there will exist the possibility of transitioning.

Others will also want to be able to simply use #!/usr/bin/python in their scripts. The distribution can pick a particular version to target in its scripts, but users may have different agendas. Mátyás Selmeci said:

I work on multiple platforms so I make my utility scripts work on both Python 2 and 3 and only use the standard library. It would be super annoying if I had to have multiple versions just because of the shebang line.

There was a fair amount of discussion but pointing python to Python 3 seems like a relatively uncontroversial move. There really wasn't any impassioned argument against it, which is perhaps a little surprising. But Fedora has been moving steadily down the path to retiring Python 2 more or less in lockstep with the Python core developers. One wonders when it will get to the point that legacy Python 2 users will have to get their interpreter from outside of Fedora entirely—maybe by 2025?

Comments (36 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Package hardening; SKS attack; Mageia 7; FreeDOS turns 25; Fuchsia OS; Quotes; ...
  • Announcements: Newsletters; conferences; security updates; kernel patches; ...
Next page: Brief items>>

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds