Leading items
Welcome to the LWN.net Weekly Edition for October 24, 2019
This edition contains the following feature content:
- BPF and the realtime patch set: the incoming realtime code and BPF are currently incompatible with each other; why is that and how can the problem be fixed?
- Changing the Python release cadence: two options for getting new Python features to users more quickly.
- "git request-pull" and confusing diffstats: why pull requests can contain surprising diffstats.
- Really fixing getrandom(): the early-boot entropy problem has apparently been solved in 5.4, but is the result truly random?
- Implementing alignment guarantees for kmalloc(): a change to internal memory allocations in the kernel gets through despite opposition.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
BPF and the realtime patch set
Back in July, Linus Torvalds merged a patch in the 5.3 merge window that added the PREEMPT_RT option to the kernel build-time configuration. That was meant as a signal that the realtime patch set was moving from its longtime status as out-of-tree code to a fully supported kernel feature. As the code behind the configuration option makes its way into the mainline, some friction can be expected; we are seeing a bit of that now with respect to the BPF subsystem.
The thread started with a patch posted by Sebastian Andrzej Siewior to the BPF mailing list. The patch mentioned three problems with BPF running when realtime is enabled and added Kconfig directives to only allow BPF to be configured into kernels that did not have PREEMPT_RT:
- it allocates and frees memory in atomic context
- it uses up_read_non_owner()
- BPF_PROG_RUN() expects to be invoked in non-preemptible context
Siewior said that he had tried to address the memory allocation problems "but I have no idea how to address the other two issues". In that thread, he also gave an overview of what is needed to "play nicely" with the realtime patch set.
Daniel Borkmann replied that the simple approach Siewior took would not actually disable all of BPF, as there are other BPF-using subsystems that would not be affected by the change. Siewior asked for feedback on one possible way to solve that, but David Miller made it clear that he does not think this approach makes sense: "Turning off BPF just because PREEMPT_RT is enabled is a non-starter; it is absolutely essential functionality for a Linux system at this point."
However, as Siewior said, there are fundamental incompatibilities between the implementation of BPF and the needs of the realtime patch set. Thomas Gleixner provided more detail on the problem areas in the hopes of finding other ways to deal with them:
- #1) BPF disables preemption unconditionally with no way to do a proper RT substitution like most other infrastructure in the kernel provides via spinlocks or other locking primitives.
- #2) BPF does allocations in atomic contexts, which is a dubious decision even for non RT. That's related to #1
- #3) BPF uses the up_read_non_owner() hackery which was only invented to deal with already existing horrors and not meant to be proliferated.
Miller replied that BPF is needed by systemd and the IR drivers already; "We're moving to the point where even LSM modules will be implemented in bpf." In his earlier message, he said that turning off BPF would disable any packet sniffing, so that tcpdump and Wireshark would not function. To a certain extent, he oversold the need for BPF, as Gleixner pointed out; Gleixner was running Debian testing with Siewior's patch applied and was not encountering any systemd or other difficulties. Furthermore, packet sniffing still works even without BPF, though it then requires a copy to user space for each packet, Gleixner said, so that is not really an argument for requiring BPF either.
Beyond that, though, he was really looking for feedback on "how to tackle these issues on a technical level". Some of that did start to come about in a sub-thread with BPF maintainer Alexei Starovoitov, who wondered about disabling preemption and noted that he is a "complete noob in RT".
Gleixner explained the situation at some length.
Essentially, the realtime kernel cannot disable preemption or interrupts for arbitrarily long periods of time. The realtime patches substitute realtime-aware locks for the spinlocks and rwlocks that do disable preemption or interrupts in the non-realtime kernel. Those realtime-aware locks can sleep, however, so they cannot be used from within code sections that have explicitly disabled preemption or interrupts.
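To make the conflict concrete, here is a minimal sketch of the pattern at issue; it is not code from the thread, and the lock and function names are invented:

    #include <linux/preempt.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(map_lock);
    static int shared_counter;

    static void update_shared_counter(void)
    {
        preempt_disable();          /* stay on this CPU, as BPF execution does */
        spin_lock(&map_lock);       /* under PREEMPT_RT this becomes a sleeping lock */
        shared_counter++;
        spin_unlock(&map_lock);
        preempt_enable();
    }

On a non-realtime kernel this pattern is fine, but with the realtime patches spin_lock() may block, and blocking with preemption disabled is exactly the "sleeping in atomic context" problem that the realtime developers must avoid.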
As Starovoitov explained, BPF disables preemption "because of per-cpu maps and per-cpu data structures that are shared between bpf program execution and kernel execution". But, he said, BPF does not call into code that might sleep, so there should be no problems on that score. But that is only when looking at the BPF code from a non-realtime perspective, Gleixner said; because of the lock substitution, code that does not look like it could sleep actually can sleep, since the realtime locks (e.g. sleeping spinlocks) do so. That's what makes using preempt_disable() (and local_irq_disable()) problematic in the realtime context. He said that the local_lock() mechanism in the realtime tree might be a way forward to better handle the explicit preemption disabling in BPF.
But, he said, there is still the outstanding problem of BPF making calls to up_read_non_owner(), which allows a read-write semaphore (rwsem) to be unlocked by a process that is not the owner of the lock. That breaks the realtime requirement that the locker is the same as the unlocker in order to deal with priority inheritance correctly.
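For illustration only (the names below are invented and this is not the actual BPF code), the pattern in question looks roughly like this: a reader lock acquired in one context and released later from another, so the task dropping the rwsem is not its owner:

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(map_sem);

    /* Fast path: cannot afford to block, so only try to take the lock. */
    static bool grab_map_reader(void)
    {
        return down_read_trylock(&map_sem);
    }

    /* Deferred cleanup, possibly running in a different task's context. */
    static void release_map_reader(void)
    {
        up_read_non_owner(&map_sem);
    }

Since priority inheritance requires knowing which task holds the lock, an unlock coming from a different task breaks that accounting.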
Starovoitov also said that BPF does not have unbounded runtime within the preemption-disabled sections, since it has a bound on the number of instructions that can be in a BPF program. But the limit on the number of instructions was recently raised from 4096 to one million, which will result in unacceptable preemption-disabled windows, as Gleixner noted.
Even the earlier limit of 4096 would result in 2µs of preemption-disabled time, which may be problematic, Clark Williams said; "[...] there are some customer cases on the horizon where 2us would be a significant fraction of their max latency".
The local_lock() scheme seemed viable to Starovoitov, but he thought the overall approach taken by Siewior's patch was backward:
If RT wants to get merged it should be disabled when BPF is on and not the other way around.
But Gleixner did not see things that way at all; he noted that he was planning to investigate local locks as a possible way forward and that there was no insistence on anything. In addition, turning off realtime when BPF was enabled was always an option; "[...] I could have done that right away without even talking to you. That'd have been dishonest and sneaky." He also lamented that the discussion had degraded to that point.
For his part, Starovoitov said that he simply thinks disabling an existing in-kernel feature in order to ease the path for the realtime patches is likely to "backfire on RT". He suggested that getting the code upstream without riling other subsystem developers was a better way forward.
That's where things were left at the time of this writing. It is not clear that the practical effect of which subsystem disables the other makes any real difference to users. For now, users of BPF will not be able to use the realtime patches and vice versa. Fixing the underlying problems that prevent the two from coexisting certainly seems more important than squabbling over who disables whom; there will be users who want both in their systems, after all. For those who run mainline kernels, though, that definitely cannot happen until realtime gets upstream; once it does, a piecemeal approach to resolving the incompatibilities between BPF and realtime can commence.
Changing the Python release cadence
There has been discussion about the release cadence of Python for a couple of years now. The 18-month cycle between major releases of the language is seen by some core developers as causing too much delay in getting new features into the hands of users. Now there are two competing proposals for ways to shorten that cycle, either to one year or by creating a rolling-release model. In general, the steering council has seemed inclined toward making some kind of release-cycle change—one of those Python Enhancement Proposals (PEPs) may well form the basis of Python's release cadence moving forward.
Some history
The idea had undoubtedly been discussed before, but the current push began with a 2018 Python Language Summit presentation on changing to a yearly major release that was given by Łukasz Langa—the release manager for Python 3.8 and 3.9. Roughly a year later, he proposed doubling the release frequency in the first posting of PEP 596 ("Python 3.9 Release Schedule"). That would mean a major Python release every nine months. Around the time that PEP 596 was being discussed, Nick Coghlan created PEP 598 ("Introducing minor feature releases"), which was meant to speed up feature releases; it was an early progenitor of the rolling-release model.
Nine months, as proposed in PEP 596, was too fast a cycle for many, but the steering council did indicate support for a change of the release cycle; a one-year cycle was seemingly a better fit for most commenters. The council thought that the 3.9 schedule PEP was the wrong place to propose this kind of change, however; it suggested that a new PEP be created to propose the cadence change. To that end, Langa created PEP 602 ("Annual Release Cycle for Python"), returning to his original one-year cadence that was better received than the shorter cycle.
In the discussion of the PEP, Steve Dower suggested an even faster pace of continuous alpha releases. Coghlan was intrigued by Dower's proposal, so they teamed up to create PEP 605 ("A rolling feature release stream for CPython"). When it was posted for discussion, Coghlan noted that he had withdrawn his earlier proposal in favor of the new one.
PEP 605 itself is something of a wall of text, but it is also accompanied by a lengthy design discussion posting. As part of the discussion on both PEPs, though, it became clear that there was some confusion about the background and, in particular, what problems they were both trying to solve. That, naturally, led to yet another PEP, which was authored by Langa, Dower, and Coghlan: PEP 607 ("Reducing CPython's Feature Delivery Latency"). Looking at that PEP would seem to be a good starting point for disentangling the various pieces here.
Rationale
PEP 607 lays out a number of reasons for "delivering smaller collections of features to Python’s users more frequently"; it also describes the reasoning for making major releases at roughly the same time of year (every year for PEP 602 and every other year for PEP 605). The size of the batches of features that are delivered is one of the areas that the PEPs are trying to address. Right now, roughly 18 months' worth of changes are delivered at once, which increases the chances that there will be problems for users, while also complicating the process of tracking down where any problem came from, because there are more interacting features to wade through. PEP 602 simply reduces the 18-month window to a 12-month window, while PEP 605 will regularly deliver new features every two months for "a subset of Python’s user base that opts in to running a rolling stream of beta releases".
Then there is the problem of latency between the development of features and their delivery to users. Given that missing a particular major release means that a feature will languish for another 18 months, developers may push changes before they are truly ready. Similarly, features that could be in the hands of users are simply hanging around in the repository unused. Once again, the PEP 602 solution is simply shortening the cycle; PEP 605 would get new features into the hands of its rolling-release users within a couple of months.
Regularity in the release schedule is helpful to coordinate with other projects (e.g. Linux distributions) and events in the Python world (e.g. PyCon US, core developer sprints), but it is also helpful to regular users—and developers. Rather than consulting a calendar or PEP, the dates of interest will be predictable: major releases in October every year for PEP 602 or August of every other year for PEP 605. Other dates (feature freezes and the like) can be derived from that baseline.
It is also generally the case that major changes in the CPython interpreter or the standard library are not actually tested by users until after a major release. That means the feedback on new features is limited until after they have been solidified, requiring a multi-year deprecation cycle to correct any problems. Both of the PEPs are targeting ways to get feedback earlier in the cycle. PEP 602 would start the alpha period for a new release immediately after the major release (i.e. October of each year), while PEP 605 has a wider scope.
There are, of course, risks when changing the release cadence, some of which both PEPs have considered. Users and distributors may either skip some major releases or may update every time the language does. PEP 602 may cause some of the "skippers" to skip two releases instead of just one, but a prolonged deprecation cycle will hopefully ensure that they do not miss any deprecation warnings by doing so. For non-skippers, the support periods will remain roughly the same as they are today under PEP 602 (roughly 18 months of full support plus 42 months of security-fix-only support).
PEP 605 takes a different approach, obviously. Skippers may simply end up moving to the non-skipper bucket since there will be 24 months between major releases. Non-skippers should find that each release will have prolonged full support (roughly 24 months) with a shorter period of bug-fix-only support (36 months) added on. In addition, the idea is that the two-month beta releases will be more widely used, thus tested, which will lead to "more stable final releases with wider ecosystem support at launch".
Meanwhile, those who follow the CPython master branch should see no real impact, though PEP 605 is trying to entice them into switching to a better tested rolling beta stream. Third-party libraries are obviously a major component of the Python ecosystem; these include things like NumPy, SciPy, Django, and all manner of libraries available in the Python Package Index (PyPI). For those, the major pain point is the number of different Python versions that need to be simultaneously supported. This is an area where it would seem that PEP 605 has a clear advantage in that there will be fewer major releases made. The overall support period for a given release is the same in each PEP (60 months), so there will be roughly twice as many active major releases under PEP 602, though it will only be a minor increase from the current situation.
Which is not to say that PEP 605 is without flaws, though. It is a big change from what the Python ecosystem is used to, while PEP 602 is a fairly straightforward shortening of the existing state of affairs. The main complaint in the PEP 602 discussion thread (aside from those that spawned PEP 605) was about the October date, which is poorly situated for Fedora. For the most part, commenters were either in favor or neutral. In fact, most of the comments were about the still-under-discussion ideas that eventually resulted in PEP 605.
Rolling ... or roiling?
In the PEP 605 discussion, things were more mixed. Langa, unsurprisingly, was opposed to the approach. Paul Moore was skeptical that the larger projects in the ecosystem would participate in the rolling releases, leading simply to a slower release cycle. The term "beta" was discussed as well, with some wondering if it had baggage that would worry potential adopters. But Coghlan said that there may be some confusion about what he and Dower are aiming for:
[...]
Globally, my guess is that the latter group would number somewhere in the thousands (somewhere between 0.1% and 1% of our user numbers), which should be sufficient for more robust pre-release feedback. It wouldn’t be the millions that use the stable releases, but that’s a large part of the point: encouraging a larger group of self-selected early adopters to contribute to improving API designs before we lock them down in a stable release for the rest of our multiple-orders-of-magnitude larger overall user community.
PEP 605 proposes that the two-month rolling releases be broken up into alpha releases, which could contain changes to the CPython ABI, and beta releases that would maintain ABI compatibility. That would mean that C-language extensions might need to be reworked and rebuilt for the alpha releases, but should not need to be for the beta releases. The "alpha" and "beta" terms are being used rather differently than most expect. It also leads to a rather unexpected pattern of release numbers, both of which may be contributing to the confusion. For example, the first 3.9 rolling release would presumably be an alpha, thus 3.9.0a1; after that, there would likely be betas for the next few bimonthly releases: 3.9.0b2, 3.9.0b3, ... If there were an ABI breaking change for the fourth release, though, it would be 3.9.0a4.
That unconventional pattern of release versions was considered potentially confusing by several in the discussion. Fedora Python maintainer Miro Hrončok noted that the distribution would have to hack things to make the versions sort correctly. Moore agreed.
The discussion continues as of this writing. At some point, the council will need to decide between the two—or to stay with the status quo. As council member Brett Cannon said in early October, there will be a need to make a decision fairly soon in order to nail down the 3.9 schedule, which effectively started running on October 14 with the release of Python 3.8. There is also an upcoming election for a new steering council, which will "basically freeze up PEP decision making until mid-December". It should be noted that Coghlan is also a member of the council, but seems likely to recuse himself from a decision on a PEP that he has co-authored.
Though the rolling model seems like a huge departure, it really only affects a self-selected subset of the Python community. The 24-month major release cycle will certainly have more widespread effects, but is not a huge stretch. In fact, while Python has been released on an 18-month cycle for some time, it is documented to be an 18-24-month cycle, so PEP 605 does not really make any changes to that—just to "recent" practice. One might guess that the council will lean toward the least "radical sounding" proposal, thus PEP 602, but one never knows. Langa is the Python 3.9 release manager, which may also be a factor, but certainly PEP 605 and the status quo should not be counted out. It will be interesting to see how it all plays out.
"git request-pull" and confusing diffstats
When a kernel subsystem maintainer has a set of commits to send up the chain toward the mainline, the git request-pull command is usually the right tool for the job. But various maintainers have noticed over the years that this command can sometimes generate confusing results when confronted with anything but the simplest of histories. A brief conversation on the linux-kernel mailing list delved into why this situation comes about and what maintainers can do in response.
While git request-pull is intended to be a general-purpose command, it's no secret that its output is designed for the needs of one specific consumer; it contains almost everything Linus Torvalds needs to know when considering a request to pull a tree into the mainline kernel. This information includes the commit upon which the tree is based, the timestamp for the most recent commit in the tree, the tree to pull the commits from, the maintainer's description of the changes, a list of commits (in the git shortlog format), and a diffstat showing which files will be touched. See a typical pull request for an example of how all of those elements are worked into a single email.
That example is generated from a relatively straightforward development history that looks something like this:
Generating both the commit log and the diffstat for this history is relatively straightforward, and the pull request looks exactly as one would expect.
Recently, Will Deacon ran into a more complex situation, though. His tree was initially based on 5.4-rc1, but then required a merge of 5.4-rc2 to obtain the dependencies for a fix. The history ended up looking something like this:
When one runs git request-pull for a tree like this, the commit-log portion will look exactly as expected — it contains the commits in the local tree. The diffstat, though, is likely to reflect a large number of unrelated changes, making the pull request look like a scary beast indeed. In essence, that diffstat will reflect not just the local changes, but also everything that was pulled into the local tree when 5.4-rc2 was merged. That makes it hard, at best, to see what the pull request will actually do.
Deacon was surprised that the commit log was correct while the diffstat was wrong. Torvalds explained it this way:
A log is an operation on a _set_ of commits (that's the whole point - you don't list the beginning and the end, you list all the commits in between), while a diff is fundamentally an operation on two end-points and shows the code difference between those two points.
And the summary versions of those operations (shortlog and diffstat) are no different.
He went on to say that, when there are only two endpoints and the history is simple, it is not hard for a tool like Git to calculate the difference between them. Throwing another merge into the mix complicates the situation, though, by adding another endpoint. The end result is the useless diffstat included in the pull request.
Deacon resolved the issue by merging the current mainline with the tree to be pulled. The diffstat could then be generated manually by choosing the mainline and the merged tree as the two endpoints; the only differences that will be visible will be those that are not in the current mainline — the changes to be pulled from Deacon's tree, in other words. The clean diffstat was then manually patched into the pull request. The merge itself can then be thrown away; it should not be part of the commit stream sent upstream. As Torvalds explained, performing the merge reduced the diffstat problem back to two simple endpoints, so the result became as one would expect.
Should maintainers perform this sort of dance before sending pull requests upstream? It seems that Torvalds appreciates the effort; he described the result as a "good quality" pull request. He also noted, though, that he often gets pull requests with confusing diffstats and doesn't really have a hard time dealing with them. Still, maintainers who want to be sure that their pull requests are as pleasing to the recipient as possible may want to go to the extra effort.
The best solution, of course, would be to fix git request-pull to do the right thing in this sort of situation. Depending on how complex the merge history is, though, "the right thing" may not always be entirely obvious. It might also, like the merge described above, require changing the repository, which the request-pull command does not currently do. But, as Ingo Molnar noted, it should at least be possible for git request-pull to detect this situation and issue a warning when it arises. Then, at least, developers would not be surprised by a bogus diffstat — something that can easily happen immediately after having sent it upstream.
Really fixing getrandom()
The final days of the 5.3 kernel development cycle included an extensive discussion of the getrandom() API and the reversion of an ext4 improvement that was indirectly causing boot hangs due to a lack of entropy. Blocking filesystem improvements because they are too effective is clearly not a good long-term development strategy for the kernel, so there was a consensus that some sort of better solution had to be found. What was lacking was an idea of what that solution should be. It is thus surprising that the problem appears to have been dealt with in 5.4 with little in the way of dissent or disagreement.
The root of the problem in 5.3 was the blocking behavior of getrandom(), which will prevent a caller from proceeding until enough entropy has been collected to initialize the random-number generator. This behavior was subjected to a fair amount of criticism, and few felt the need to defend it. But changing getrandom() is not easy; any change that might cause it to return predictable "random" numbers risks creating vulnerabilities in any number of security-sensitive applications. So, while various changes to the API were proposed, actually bringing about those changes looks like a multi-year project at best.
There is another way to ensure that getrandom() doesn't block during the bootstrap process: collect enough entropy to ensure that the random-number generator is fully initialized by the time that somebody needs it. That can be done in a number of ways. If the CPU has a hardware random-number generator, that can be used; some people distrust this solution, but mixing entropy from the hardware with other sources is considered to be safe by most, even if the hardware generator is somehow suspect. But not all CPUs have hardware random-number generators, so this solution is not universal in any case. Other potential solutions, such as having the bootloader or early user space initialize the random pool, can work in some situations but are not universal either.
Another possible solution, though, is jitter entropy: collecting entropy from the inherently nondeterministic nature of current hardware. The timing of even a simple sequence of instructions can vary considerably as the result of multiple layers of cache, speculative execution, and other tasks running on the system. Various studies into jitter entropy over the years suggest that it is real; it might not be truly nondeterministic, but it is unpredictable, and data generated using jitter entropy can pass the various randomness test suites.
Shortly after the 5.3 release, Thomas Gleixner suggested using a simple jitter-entropy mechanism to initialize the entropy pool. Linus Torvalds described this solution as "not very reliable", but he clearly thought that the core idea had some merit. He proposed a variant of his own that, after some brief discussion, was committed into the mainline at the end of the 5.4 merge window.
Torvalds's patch adds a new function, try_to_generate_entropy(), which is called if somebody is requesting random data and the entropy pool is not yet fully initialized. It is simple enough to discuss in detail:
    static void try_to_generate_entropy(void)
    {
        struct {
            unsigned long now;
            struct timer_list timer;
        } stack;
This function starts by declaring a timer on the stack, which is a relatively rare occurrence in the kernel. Putting it onto the stack was a deliberate choice, though; if the timer function runs on a different CPU, it will cause cache contention (and unpredictable timing) for accesses to this function's stack space.
        stack.now = random_get_entropy();

        /* Slow counter - or none. Don't even bother */
        if (stack.now == random_get_entropy())
            return;

        timer_setup_on_stack(&stack.timer, entropy_timer, 0);
On most architectures, random_get_entropy() just reads the timestamp counter (TSC) directly. The TSC increments for every clock cycle, so two subsequent calls should always return different values. Just how different they will be, though, is where the unpredictability comes in. The code above is simply verifying that the system on which the kernel is running does indeed have a high-frequency TSC; without that, this algorithm will not work. Most current hardware does have a TSC, fortunately. Assuming the TSC check passes, the timer is prepared for use.
        while (!crng_ready()) {
            if (!timer_pending(&stack.timer))
                mod_timer(&stack.timer, jiffies+1);
            mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
            schedule();
            stack.now = random_get_entropy();
        }
This loop is the core of the jitter-entropy algorithm. The timer declared above is armed to expire in the near future (the next "jiffy", which could be anywhere between 0ms and 10ms away); that expiration will happen by way of an interrupt and may occur on a different CPU. The expiration of the timer adds complexity to the instruction stream (and, hopefully, more randomness as well).
Each time through the loop, the TSC is queried and its value is mixed into the entropy pool. If the timer has expired, it is re-armed to run again. Then the scheduler is invoked, adding more complex code and an unpredictable amount of execution before that call returns. The loop itself runs until the entropy pool is deemed to be initialized. Given that a lot can happen even in one jiffy, this loop can be expected to run quite a few times and mix many TSC readings into the entropy pool.
        del_timer_sync(&stack.timer);
        destroy_timer_on_stack(&stack.timer);
        mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
    }
The post-loop cleanup gets rid of the timer and mixes in the last bit of entropy.
The timer function itself looks like this:
    static void entropy_timer(struct timer_list *t)
    {
        credit_entropy_bits(&input_pool, 1);
    }
In other words, every time the timer expires, the entropy pool is deemed to have gained one bit of entropy. It takes 128 bits of entropy for the pool to be considered ready, so the jitter loop may have to run for up to 128 jiffies — potentially just over one second on a system with a 100HZ tick frequency — if the system in question has no other sources of entropy. That could result in a perceivable pause during the boot process, but it is far better than blocking outright.
When Torvalds decided to merge this code, he suggested that it might not be the definitive solution to the problem.
He was thus following the time-honored practice of submitting a patch in the hope that it would inspire somebody to create a better one. Thus far, though, there has been little in the way of commentary on this change — somewhat surprising, given the diversity of opinions expressed earlier in the discussion. Unless somebody comes along with a better idea or shows that the entropy produced by this algorithm is somehow predictable, this change seems likely to stand. Hopefully it achieves the goal of preventing getrandom() from blocking while, at the same time, ensuring that random numbers from the kernel are always truly unpredictable.
Implementing alignment guarantees for kmalloc()
kmalloc() is a frequently used primitive for the allocation of small objects in the kernel. During the 2019 Linux Storage, Filesystem, and Memory Management Summit, Vlastimil Babka led a session about the unexpected alignment problems developers face when using this function. A few months later, he came back with the second version of a patch set implementing a natural alignment guarantee for kmalloc(). Given the strong opposition it initially faced, it seemed that the change would not be accepted. However, it ended up in Linus Torvalds's tree. Let's explore what happened.
The issue Babka wanted to fix is the fact that kmalloc() sometimes returns objects that are not naturally aligned (that is, aligned to the object size if that size is a power of two). Most of the time, though, kmalloc() does return naturally aligned objects and some drivers and subsystems have come to depend on that property. The exceptions are when SLUB debugging is enabled or when the SLOB allocator is used. kmalloc() is essentially a shell around the SLAB, SLUB or SLOB allocator, depending on the kernel configuration; interested readers may wish to read an article on the reasons SLUB was introduced and look at a LinuxCon 2014 slide set [PDF] on the three allocators. Unexpectedly returning an unaligned object can cause data corruption and other errors. In response to that problem, Babka proposed to guarantee natural alignment for allocated objects with power-of-two size, so that all alignment expectations are fulfilled.
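As a hedged illustration of the expectation involved (this is not code from the patch set, and the function is invented), a driver might allocate a power-of-two-sized buffer and assume that it never straddles a boundary of that same size:

    #include <linux/kernel.h>
    #include <linux/slab.h>

    static void *alloc_sector_buffer(void)
    {
        void *buf = kmalloc(512, GFP_KERNEL);

        if (!buf)
            return NULL;
        /* Holds in practice with SLAB and SLUB, but could fail with SLOB
         * or with SLUB debugging enabled before the alignment guarantee. */
        WARN_ON(!IS_ALIGNED((unsigned long)buf, 512));
        return buf;
    }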
For and against kmalloc() alignment
In the patch set discussion, Christopher Lameter (the creator of the SLUB allocator) disagreed with the idea of adding natural alignment and noted that kmalloc() has its own alignment limit (KMALLOC_MINALIGN) for a reason: to allow optimized memory layout without wasting memory. The SLOB allocator is an example; it is designed for small embedded systems and to incur minimal overhead. The patch from Babka would change that expected behavior. Also, any future allocators would have to take those new constraints into account and that would prevent them from implementing certain optimizations in their memory layout.
Matthew Wilcox was in favor of Babka's proposal, as there are many subsystems that already depend on the implied alignment behavior. He mentioned examples like the persistent-memory (pmem) and RAM-disk drivers. The XFS filesystem, without an alignment guarantee, would need slab caches for each object size between 512 bytes and PAGE_SIZE, and it may need even more of them depending on what kmalloc() does guarantee.
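The kind of per-size cache Wilcox alluded to would look roughly like the following sketch (the cache and function names are illustrative, not taken from XFS): a dedicated slab cache created with an explicit alignment, one for each buffer size, rather than relying on kmalloc():

    #include <linux/slab.h>

    static struct kmem_cache *buf_cache_512;

    static int __init create_aligned_buf_cache(void)
    {
        /* size == align == 512 makes the natural alignment explicit */
        buf_cache_512 = kmem_cache_create("buf-512", 512, 512, 0, NULL);
        return buf_cache_512 ? 0 : -ENOMEM;
    }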
Dave Chinner agreed with providing alignment for small objects and argued for also aligning large objects (bigger than a page) to page boundaries; that need was seen when using pmem with KASAN. He suggested, though, using a GFP flag to tell the allocator to return a naturally aligned object, and to fail if it cannot. That would avoid the need for higher-level subsystems to create additional caches. Babka and other developers preferred to deal with the issue without a separate flag.
A heated debate followed about the severity of the issue. Lameter disagreed that the misalignment cases are frequent, or even seen in practice: the affected drivers are enabled in distribution test systems that run with debug options, so any bad alignment should have shown up in that testing, according to him. Christoph Hellwig noted that the breakage often happens under special conditions, like buffers that cross a page boundary.
From a private NAK to the mainline
Following the debate, Babka asked for formal approval or disapproval of the patch set.
David Sterba commented that he has had to apply workarounds for misalignment cases and would be happy to remove them when the generic code is fixed. Darrick J. Wong seconded Sterba's opinion and expressed his strong preference for open discussion.
Lameter followed up, stating that the options for detecting misalignment have been available for years and are ready to use. Wilcox disagreed: the issues only show up when debugging options are enabled, which is precisely when all of the other features are still expected to work fine.
Andrew Morton moved the discussion back to the technical subject and asked for verification of the patch's correctness. Lameter confirmed that it is technically fine, while still disagreeing with the intent. That was followed by a number of acknowledgments (Acked-by:) from kernel developers showing their support for Babka's solution.
That series of approvals ended the public discussion; Babka did not resend the patch set or submit a third version. The situation seemed blocked, as the patch set had the support of multiple developers, but not that of the maintainer of the SLUB allocator, which is heavily affected by the change. However, the patch set was included in Morton's tree and was merged into the mainline on October 7.
Summary
This discussion shows an example of the kernel community working on a change that affects behavior that has been present for a long time. It is no surprise that not all developers agreed with the solution — in this case, though, the one disagreeing was the maintainer of one of the affected subsystems. The final result shows that such changes can make it into the mainline when there is wide support from kmalloc() users and other memory-management developers.
Page editor: Jonathan Corbet