
Leading items

Welcome to the LWN.net Weekly Edition for November 18, 2021

This edition contains the following feature content:

  • Rollercoaster: group messaging for mix networks: a scheme for more efficient group communication over mix networks like Loopix.
  • Trojan Source and Python: an informational PEP documents Unicode-related pitfalls for Python developers and tool authors.
  • Exposing Trojan Source exploits in Emacs: how should an editor flag suspicious bidirectional overrides?
  • Some upcoming memory-management patches: freeing page-table pages, kvmalloc() flags, uncached memory clearing, and NUMA "home nodes".
  • 5.16 Merge window, part 2: the changes merged in the second half of the 5.16 merge window.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Rollercoaster: group messaging for mix networks

By Jake Edge
November 17, 2021

Even encrypted data sent on the internet leaves some footprints—metadata about where packets originate, where they are bound, and when they are sent. Mix networks are meant to hide that metadata by routing packets through various intermediate nodes to try to thwart the traffic analysis used by nation-state-level adversaries to identify "opponents" of various kinds. Tor is perhaps the best-known mix network, but there are others that make different tradeoffs to increase the security of their users. Rollercoaster is a recently announced mechanism that extends the functionality of mix networks in order to more efficiently communicate among groups.

Tor uses multiple relay nodes, each of which only knows its predecessor and the node to pass the message on to. It relies on the difficulty of tracking messages through that path, but a sophisticated and well-placed adversary can do various kinds of traffic analysis to potentially match up traffic between two endpoints, thus drawing conclusions about the participants in the communication. To minimize latency, Tor nodes forward packets as quickly as they can, which may help eavesdroppers correlate the traffic.

The Rollercoaster developers, Daniel Hugenroth, Martin Kleppmann, and Alastair R. Beresford from the University of Cambridge, used the Loopix mix network to validate their work. Loopix is different from Tor in that it sacrifices latency in order to make traffic analysis even more difficult. The client endpoints in such a mix network send fixed-sized packets at a fixed rate; if there is no outbound traffic, a cover packet is sent that is indistinguishable from normal traffic. The packets are sent to the relay nodes, which independently delay each packet before passing it on to the next relay. All of that makes it much more difficult to correlate the traffic and identify communicating endpoints.

According to the Rollercoaster announcement, mix networks "work well for pairwise communication", but do not scale well when there are multiple participants who need to get updates from each other. A single message or change needs to be sent to each of the other participants in various types of online group activity:

Such group communication encompasses both traditional chat groups (e.g. WhatsApp groups or IRC) and collaborative editing (e.g. Google Docs, calendar sync, todo lists) where updates need to be disseminated to all other participants who are viewing or editing the content. There are many scenarios where anonymity requirements meet group communication, such as coordination between activists, diplomatic correspondence between embassies, and organisation of political campaigns.

While the traffic shaping does an excellent job of disguising the traffic, it becomes a major bottleneck when trying to send realtime updates to multiple participants. The rate limiting means that packets queue up waiting to be sequentially sent to each participant. Meanwhile the client cannot send any other updates until that queue is flushed.

Rollercoaster uses a technique similar to the age-old phone tree: other participants forward each message, rather than requiring the original sender to individually contact each one. That reduces the latency of getting the message to everyone:

Like a chain reaction, the distribution of the message gains momentum as the number of recipients grows. In an ideal execution of this scheme, the number of users who have received a message doubles with every round, leading to substantially more efficient message delivery across the group.

Since most of the participants are likely not actively sending traffic at any given point in time, the Rollercoaster messages can use bandwidth that is effectively being wasted on the cover packets. Beyond that, "Rollercoaster does not require any changes to the existing Mix network protocol and can benefit from the existing user base and anonymity set".

As with phone trees, where members may not be reachable (or may fail to do their part), Rollercoaster needs to be able to deal with clients that are offline or acting maliciously. The Rollercoaster algorithm chooses a branching factor that determines how many other members a given member needs to send to; which members those are can be calculated at each member node as well. But if one of the group members fails to do its part, a potentially large sub-tree of members will not receive the update.
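
The shape of that distribution tree is easy to see with a little arithmetic. The following sketch is purely illustrative (the function name is made up and the mix network's per-hop delays are ignored), but it shows how quickly a message can cover a group compared to sequential unicast:

    # Toy model of Rollercoaster-style fan-out: every member that already has
    # the message forwards up to `branching` copies in each round.  With a
    # branching factor of one, the number of recipients roughly doubles each
    # round; a single sender doing sequential unicast would need n - 1 sends.
    def fanout_rounds(n, branching=1):
        have = 1                # the original sender already has the message
        rounds = 0
        while have < n:
            rounds += 1
            have = min(n, have + have * branching)
        return rounds

    for size in (8, 32, 128):
        print(size, "members:", fanout_rounds(size), "rounds, versus",
              size - 1, "sequential sends")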

The basic Rollercoaster scheme can be extended to require acknowledgment (ACK) packets from each group member so that failures can be detected and rectified. The original sender coordinates this process and if ACKs are not received before a timeout expires, members that have already received the message can be reassigned to the forwarding role. The timeouts are calculated to take into account the queueing delays and expected mix-node delays; the timer is started once the underlying network actually sends the packet from the originating node.

The original sender has the most incentive to ensure that its messages reach everyone in the group, but if it goes offline for a lengthy period during the process of sending, only some of the members will get the update. In that case, another variant of the algorithm can be used to reconcile the situation. Members periodically choose a random member to send a hash of the messages it has received; if the hashes do not match, a reconciliation protocol can be started that will eventually result in consistency among the members.
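
The periodic check itself is simple to picture. Here is a minimal sketch (not the protocol from the paper; the message identifiers and the choice of SHA-256 are assumptions for the example) of how two members could compare their state with a single short message:

    import hashlib

    def state_digest(message_ids):
        """Hash a member's set of received message IDs in a canonical order,
        so that two members holding the same set produce the same digest."""
        digest = hashlib.sha256()
        for mid in sorted(message_ids):
            digest.update(mid.encode("utf-8"))
            digest.update(b"\x00")        # separator between IDs
        return digest.hexdigest()

    alice = {"msg-1", "msg-2", "msg-3"}
    bob = {"msg-1", "msg-3"}
    # A mismatch tells the members that at least one update is missing and
    # that a reconciliation exchange is needed.
    print(state_digest(alice) == state_digest(bob))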

In order to evaluate Rollercoaster, the researchers created an open-source Loopix simulator that allowed testing the algorithm in a reproducible fashion. Since the Loopix paper had already demonstrated the "real-world performance" of the mix network, a simulator was sufficient for evaluating Rollercoaster. Simulation allowed the researchers to eliminate external factors, such as network congestion due to unrelated traffic or CPU usage by other processes, while also running faster than realtime, so that more results could be gathered in less time and with fewer resources. In addition, the paper's results can be reproduced by using the same seed for the random-number generator.

The results reported in the paper show some substantial latency improvements, though they may fall short of the expectations for realtime communication. For a group size of 128, Rollercoaster reduces the average latency from the 34.9 seconds seen with Loopix sequential unicast to 7.0 seconds—if all members are online. But these kinds of networks are not targeting regular users; they target users who are extremely security-conscious, so latency is likely not the primary concern.

The fault-tolerant variant was evaluated using a model of node availability derived from data gathered from volunteers using an Android app by the Device Analyzer project. It too showed marked improvement. Other variants were also evaluated using the simulator. Future plans include actually running collaborative applications using Rollercoaster and providing facilities to change the members of the group.

While the overall concept may seem somewhat obvious, the bulk of the paper is taken up with analysis of the algorithm and its variants. The paper was presented at the 30th USENIX Security Symposium back in August; the slides and video of the presentation can be found at that site. For those who want to dig in even further, a technical report adds security proofs and other information in more appendices to the paper.

While Rollercoaster (and Loopix) may not be for everyone, there is value in this work even beyond helping to secure those being targeted by powerful adversaries. The simulator, analysis, and proofs will likely be useful for other researchers, perhaps even in somewhat unrelated areas. Eventual consistency is an area of interest for distributed databases, for example. Meanwhile, improving the ability for the vulnerable to communicate more securely is certainly a laudable goal.

Comments (10 posted)

Trojan Source and Python

By Jake Edge
November 16, 2021

The Trojan Source vulnerabilities have been rippling through various development communities since their disclosure on November 1. The oddities that can arise when handling Unicode, and bidirectional Unicode in particular, in a programming language have led Rust, for example, to check for the problematic code points in strings and comments and, by default, refuse to compile if they are present. Python has chosen a different path, but work is underway to help inform programmers of the kinds of pitfalls that Trojan Source has highlighted.

On the day of the Trojan Source disclosure, Petr Viktorin posted a draft of an informational Python Enhancement Proposal (PEP) to the python-dev mailing list. He noted that the Python security response team had reviewed the report and "decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language". He agreed with that decision, in part because there are plenty of other kinds of "gotchas" in Python (and other languages), where readers can be misled—purposely or not.

But there is a need to document these kinds of problems, both for Python developers and for the developers of tools to be used with the language, thus the informational PEP. After some adjustments based on the discussion on the mailing list, Viktorin created PEP 672 ("Unicode-related Security Considerations for Python"). It covers the Trojan Source vulnerabilities and other potentially misleading code from a Python perspective, but, as its "informational" status would imply, it is not a list of ways to mitigate the problem. "This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind."

ASCII

It starts by looking at the ASCII subset of Unicode, which has its own, generally well-known, problem spots. Characters like "0" and "O" or "l" and "1" can look the same depending on the font; in addition, "rn" may be hard to distinguish from "m". Fonts designed for programming typically make it easier to see those differences, but human perception can sometimes still be outwitted:

However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name accessibi1ity_options can still look indistinguishable from accessibility_options, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in responsbility_chain_delegate.

Beyond that, the ASCII control codes play a role. For example, NUL (0x0) is treated by CPython as an end-of-line character, but editors may display things differently. Even if the editor highlights the unknown character, putting a NUL at the end of a comment line might be easily misunderstood, as the following example shows:

[...] displaying this example:
# Don't call this function:
fire_the_missiles()
as a harmless comment like:
# Don't call this function:⬛fire_the_missiles()

Backspace, carriage return (without line feed), and escape (ESC) can be used for various visual tricks, particularly when code is output to a terminal.

Python allows more than just ASCII in its programs, however; Unicode is legal for identifiers (e.g. function, variable, and class names) as described in PEP 3131 ("Supporting Non-ASCII Identifiers"). But, as PEP 672 notes: "Only 'letters and numbers' are allowed, so while γάτα is a valid Python identifier, 🐱 is not." In addition, non-printing control characters (e.g. the bidirectional overrides used in one of the Trojan Source vulnerabilities) are not allowed in identifiers.
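
Python's str.isidentifier() method reflects those rules, which makes it easy to check what the language will accept; a couple of quick examples (mine, not the PEP's):

    # Letters and digits from non-ASCII scripts are allowed in identifiers,
    # but symbols, emoji, and control characters are not.
    print("γάτα".isidentifier())       # True: Greek letters are letters
    print("🐱".isidentifier())         # False: emoji are not
    print("\u202e".isidentifier())     # False: bidi overrides are not allowed either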

Homoglyphs

But the other Trojan Source vulnerability relates to "homoglyphs" (or "confusables" as the PEP calls them). Characters in various alphabets can be similar or the same as those in other languages: "For example, the uppercase versions of the Latin b, Greek β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В, respectively." That can lead to identifiers that look the same, but are actually different; there are other oddities as well:

Additionally, some letters can look like non-letters:

  • The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ is a Python identifier, not a string.
  • The East Asian word for ten looks like a plus sign, so 十= 10 is a complete Python statement. (The “十” is a word: “ten” rather than “10”.)

Though there are symbols that look like letters in another language, symbols are not allowed in Python identifiers, obviating the reverse problem. Another surprising aspect might be in the conversion of strings to numbers in functions such as int() and float(), or even in str.format():

Some scripts include digits that look similar to ASCII ones, but have a different value. For example:
>>> int('৪୨')
42
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
five

The second example uses the indexing feature of str.format() to pick the Nth value from its arguments; in that case, the value of the number is five, even though it looks vaguely like a zero. Then there are the confusions that can arise from bidirectional text.
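
The standard unicodedata module offers a quick way to see what such a string really contains; this is just an illustration for the int() example above, not something from the PEP:

    import unicodedata

    def describe(s):
        """Print the code point, Unicode name, and decimal value of each character."""
        for ch in s:
            print("U+%04X  %s  decimal value: %s" % (
                ord(ch),
                unicodedata.name(ch, "<unnamed>"),
                unicodedata.decimal(ch, None)))

    # int() accepts any Unicode decimal digits, which is why int('৪୨') == 42.
    describe("৪୨")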

Bidirectional text

Identifiers written in right-to-left scripts fundamentally change the way code is displayed, which may be puzzling to those who are used to left-to-right ordering; the Unicode bidirectional algorithm determines how such code is rendered, while CPython still interprets the characters in the order they appear in the file. For example:

In the statement ערך = 23, the variable ערך is set to the integer 23.

The example above might be clear enough from context for someone reading it who is used to reading left-to-right text, but another of the PEP's examples takes things further:

In the statement قيمة - (ערך ** 2), the value of ערך is squared and then subtracted from قيمة. The opening parenthesis is displayed as ).

Another extended example gets to the heart of the Trojan Source bidirectional vulnerability. It starts by showing the difference a single right-to-left character makes in a line of code, then looks at the effects of the invisible Unicode code points that change or override directionality within a line.

Consider the following code, which assigns a 100-character string to the variable s:
s = "X" * 100 #    "X" is assigned

When the X is replaced by the Hebrew letter א, the line becomes:

s = "א" * 100 #    "א" is assigned

This command still assigns a 100-character string to s, but when displayed as general text following the Bidirectional Algorithm (e.g. in a browser), it appears as s = "א" followed by a comment.

[...] Continuing with the s = "X" example above, in the next example the X is replaced by the Latin x followed or preceded by a right-to-left mark (U+200F). This assigns a 200-character string to s (100 copies of x interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as s = "x" followed by an ASCII-only comment:

s = "x‏" * 100 #    "‏x" is assigned

Readers who normally use left-to-right text may find it interesting to paste some of that code into Python or to try working with it in an editor; the behavior is not intuitive, at least for me. The uniname utility may be useful for peering inside to see the code points.

There are other Unicode code points that affect directionality, but the effects of all of them terminate at the end of a paragraph, which is usually interpreted as the end of line by various tools, including Python. Using those normally invisible code points can have wide-ranging effects as seen in Trojan Source and noted in the PEP:

These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python's comments always extend to the end of a line), but it doesn't render them harmless.
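
A tool that wants to flag those code points does not need much machinery. Here is a minimal sketch of such a scan (the character set is the usual list of Unicode bidirectional formatting characters; the reporting format is invented for the example):

    # Unicode bidirectional formatting characters that can reorder displayed text.
    BIDI_CONTROLS = {
        "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
        "\u202d": "LRO", "\u202e": "RLO",
        "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
        "\u200e": "LRM", "\u200f": "RLM",
    }

    def find_bidi_controls(source):
        """Yield (line, column, mnemonic) for every bidi control character found."""
        for lineno, text in enumerate(source.splitlines(), start=1):
            for col, ch in enumerate(text, start=1):
                if ch in BIDI_CONTROLS:
                    yield lineno, col, BIDI_CONTROLS[ch]

    code = 's = "x\u200f" * 100  # looks like an ordinary comment'
    for lineno, col, name in find_bidi_controls(code):
        print("line %d, column %d: %s" % (lineno, col, name))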

Normalization

Another topic covered in the PEP is the normalization of Unicode code points for identifiers. In Unicode, there are often several different ways to generate the same "character"; using Unicode equivalence, it is possible to normalize a sequence of code points to produce a canonical representation. There are four standard normalization forms (NFC, NFD, NFKC, and NFKD), however; Python uses NFKC to normalize all identifiers, but not strings, of course.

There are some interesting consequences stemming from that, which can also be confusing. For example, there are multiple variants of the letter "n" in Unicode, several in mathematical contexts, all of which are normalized to the same value, leading to oddities like:

>>> xⁿ = 8
>>> xn
8

In a followup message, Paul McGuire posted a particularly graphic demonstration of how a simple program can be transformed into something almost unreadable via normalization.

Treating strings differently means that functions like getattr() will behave differently than a lookup done directly in the code. An example from the PEP (ab)uses the equivalence of the ligature "fi" with the string "fi" to demonstrate that:
>>> class Test:
...     def finalize(self):
...         print('OK')
...
>>> Test().finalize()
OK
>>> Test().finalize()
OK
>>> getattr(Test(), 'finalize')
Traceback (most recent call last):
  ...
AttributeError: 'Test' object has no attribute 'finalize'

Similarly, using the import statement to refer to a module in the code will normalize the identifier, but using importlib.import_module() with a string does not. Beyond that, various operating systems and filesystems also do normalization; "On some systems, finalization.py, finalization.py and FINALIZATION.py are three distinct filenames; on others, some or all of these name the same file."
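
The normalization itself is available from the standard unicodedata module, which makes it easy to check whether two spellings will collide as identifiers; a short illustration (not from the PEP):

    import unicodedata

    def same_identifier(a, b):
        """True if Python would treat these two spellings as the same identifier."""
        return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

    print(same_identifier("xⁿ", "xn"))              # True: superscript n folds to n
    print(same_identifier("ﬁnalize", "finalize"))   # True: the fi ligature folds to "fi"
    print(unicodedata.normalize("NFKC", "ﬁnalize")) # prints: finalize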

Reaction

The reaction to the PEP has been quite positive overall, as might be expected. There were some questions about whether it should be a PEP or part of the standard documentation. For now, Viktorin is content to keep it as a PEP, but thinks it may make sense to integrate it into the documentation at some point; "I went with an informational PEP because it's quicker to publish", he said.

The conversation also turned toward changes that could be made to Python to help avoid some of the problems and ambiguities that arise. Several suggestions that might seem reasonable at first blush are too heavy-handed, likely resulting in too many false positives or even effectively banning some commonly used scripts (e.g. Cyrillic), as Serhiy Storchaka pointed out. For the most part, it was agreed that these kinds of problems should be detected by linters and other tools that can be configured based on the project and its code base. There may be some appetite for disallowing explicit ASCII control codes in strings and comments, however, as Storchaka suggested:

All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth to ban them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human readable, there are no [reasons] to include control characters in them. There is a precedence of emitting warnings for some superficial escapes in strings.

As can be seen here and in the PEP, Viktorin provided a whole cornucopia of things for Python developers of various stripes to consider. While the exercise was motivated by the Trojan Source vulnerabilities, the "problems" are more widespread. There is a fine line between supporting various writing systems used by projects worldwide and discovering oddities—malicious or otherwise—in a particular code base. Developers of tools targeting the Python ecosystem will find much of interest in the PEP.

Comments (59 posted)

Exposing Trojan Source exploits in Emacs

By Jonathan Corbet
November 11, 2021

While the "Trojan Source" vulnerabilities have, thus far, generated far more publicity than examples of actual exploits, addressing the problem still seems like a good thing to do. There are several places where defenses could be added; text editors, being where developers look at a lot of code, are one obvious example. The discussion of how to enhance Emacs in this regard has made it clear, though, that there are multiple opinions about how an editor should flag potential attacks.

For those just tuning in, one of the Trojan Source vulnerabilities takes advantage of the control codes built into Unicode for the handling of bidirectional text. While this article is written in a left-to-right language, many languages read in the opposite direction, and Unicode-displaying applications must be prepared to deal with that. Sometimes, those applications need some help to know the direction to use when rendering a particular piece of text. Unicode provides control codes to reverse the current direction for this purpose; unfortunately, clever use of those codes can cause program text to appear differently in an editor (or browser or other viewing application) than it appears to the compiler. That can be used to sneak malicious code past even an attentive reviewer.

One part of the problem is applications that show code containing overrides in a way that is correct (from a Unicode-text point of view), but which is incorrect in terms of what will actually be compiled. So an obvious solution is to change how applications display such text. It is thus not surprising that a conversation sprang up on the Emacs development list to figure out what the Emacs editor should do.

Emacs maintainer Eli Zaretskii was quick to point out that this problem was not as new as some people seem to think. A variant of it had been discussed on that list back in 2014; at that time, the concern was malicious URLs but the basic technique was the same. Zaretskii explained that, in response, he had added some defenses to Emacs:

As result of these discussions, I implemented a special function, bidi-find-overridden-directionality, which is part of Emacs since version 25.1, released 5 years ago. (Don't rush to invoke that function with the code samples mentioned above: it won't catch them.) My expectation, and the reason why I bothered to write that function, was that given the interest and the long discussion, the function will immediately be used in some URL-related code in Emacs. That didn't happen, and the function is collecting dust in Emacs ever since.

He would be, he said, less than fully enthusiastic about launching into another defensive effort without some sort of assurance that this work would actually find some users.

Others, meanwhile, were thinking about ways to make it clear that there is funny business going on in code containing directional overrides. Daniel Brooks posted an approach using the existing Emacs whitespace-mode, with some extra configuration to mark directional overrides as special types of white space. Gregory Heytings posted a patch with a similar intent that worked by adding a new display table. Stefan Kangas suggested having the Emacs byte compiler raise errors whenever the problematic control codes appear in Emacs Lisp code unless a special flag is set.

Zaretskii was not particularly impressed with any of these approaches. Simply marking the control characters, he said, is just creating "visual noise" that will make reading text harder, and is addressing the wrong problem: "The mere presence of these characters is NOT the root cause. These characters are legitimate and helpful when used as intended". He referenced TUTORIAL.he, the Emacs tutorial translated to Hebrew, which uses overrides for Emacs commands. This version of the file in a GitHub copy of the Emacs repository now helpfully marks the lines containing those overrides (as another Trojan Source defense). Zaretskii's point is that adding warnings to this kind of usage, which is not malicious, is a distraction that trains users to ignore the warnings wherever they appear.

So what should Emacs do? Zaretskii continued:

The challenge, therefore, is not to make these characters stand out wherever they happen, because that would flag also their legitimate uses for no good reason. The challenge is to flag only those suspicious or malicious uses of these characters. And that cannot be done by just changing the visual appearance of those characters, because their legitimate uses are by far more frequent than their malicious uses. To flag only the suspicious cases, the code which does that needs to examine the details of the text whose directionality was overridden and detect those cases where such overriding is suspicious. For example, when a character with a strong left-to-right directionality has its directionality overridden to behave like right-to-left character, that is highly suspicious, because it makes no sense to do that in 99.99% of valid use cases.
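
That heuristic is straightforward to express; what follows is a rough Python rendering of the idea (it is not Zaretskii's Lisp code and makes no attempt to handle nested overrides), flagging strongly left-to-right characters that sit inside a right-to-left override and vice versa:

    import unicodedata

    LRO, RLO, PDF = "\u202d", "\u202e", "\u202c"

    def suspicious_overrides(text):
        """Yield characters whose strong directionality is overridden the
        "wrong" way: a strong LTR character inside an RLO region, or a
        strong RTL character inside an LRO region."""
        override = None                 # currently active explicit override
        for pos, ch in enumerate(text):
            if ch in (LRO, RLO):
                override = ch
            elif ch == PDF:
                override = None
            elif override is not None:
                direction = unicodedata.bidirectional(ch)
                if ((override == RLO and direction == "L") or
                        (override == LRO and direction in ("R", "AL"))):
                    yield pos, ch

    sample = "abc \u202edef\u202c ghi"
    print([ch for pos, ch in suspicious_overrides(sample)])   # ['d', 'e', 'f']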

Zaretskii quickly committed a patch to the Emacs repository implementing this heuristic. Your editor decided to give it a try, starting with this example of malicious code posted by Brooks:

    (defun main ()
      (let ((is_admin nil))
        ‮⁦ ; begin admins only⁩⁦(when is_admin
          (print "You are an admin."))‮⁦ ; end admins only⁩(
    )

This code contains overrides that cause the when test to be commented out; the effect of the overrides can be seen in the browser by slowly highlighting over the code with the mouse. This code, displayed in a normal Emacs buffer with font-lock turned on, looks like this:

[malicious code in Emacs]

It is worth noting that the suspicious nature of the code is already reasonably clear from the syntax coloring; the when test is colored as a comment. Zaretskii's patch is meant to make this problem stand out even more: when the new highlight-confusing-reorderings command is run on it, the code now looks like this:

[malicious code in Emacs]

That should certainly be enough to cause even a casual reader to wonder what is going on with that code. As intended, this new command does not highlight the overrides used in TUTORIAL.he — except that, amusingly, it found two places where the overrides were used incorrectly.

Heytings didn't like this solution: "When security is at stake, I very much prefer too many false positives to missing one danger". He also pointed out a case that Zaretskii's code failed to catch, citing it as proof that highlighting only the malicious uses is not feasible. (That case did not survive its encounter with the email archives; it can be seen on this page.) Zaretskii responded that users who don't care about false positives can highlight all of the relevant control characters in Emacs now; he also applied a fix to detect the case that Heytings found. At that point, Heytings made it clear that he thought his point had been missed and gave up on the discussion.

So now the discussion would appear to be over; Emacs has a mechanism to make suspicious use of Unicode directional overrides easy to see. It may be a while before users benefit from that work, though. It is not at all clear that this change will be backported to current Emacs releases, so it may only be found in development builds for some time. There is also nothing that uses it by default; the highlighting will only happen if the user explicitly asks for it. To make this functionality more available, developers will need to incorporate it into the major modes used with various programming languages.

This fix, assuming it is shown to work over time, is only directly relevant to the small subset of developers who live their lives within Emacs. The approach taken, though, could prove to be useful beyond Emacs. Just waving a red flag at something that might be suspicious is usually not the best solution for security problems, especially if most of the instances that are flagged are legitimate. After a while, we all grow weary of looking past those flags and simply stop seeing them. If it is possible to just shine a light on uses that truly merit a closer look, though, then we might just gain a little security from it.

Comments (78 posted)

Some upcoming memory-management patches

By Jonathan Corbet
November 12, 2021

The memory-management subsystem remains one of the most complex parts of the kernel, with an ongoing reliance on various heuristics for performance. It is thus not surprising that developers continue to try to improve its functionality. A number of memory-management patches are currently in circulation; read on for a look at the freeing of page-table pages, kvmalloc() flags, memory clearing, and NUMA "home nodes".

Freeing page-table pages

When user space allocates memory, the kernel, obviously, must find pages to satisfy that allocation. But it must also allocate page-table pages to handle the mapping of addresses to the newly allocated memory. For a system with 4KB pages and 64-bit addresses, one page-table page is needed for every 512 ordinary pages of memory (assuming huge pages are not in use). For applications with massive address spaces, the amount of memory used just for page-table pages can be significant.
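
The arithmetic behind that overhead is easy to work through. This quick calculation (counting only the lowest level of the page-table hierarchy, using the 4KB pages and 512-entries-per-table ratio mentioned above) shows how it adds up for a large mapping:

    PAGE_SIZE = 4096                             # bytes per page
    PTE_SIZE = 8                                 # bytes per 64-bit page-table entry
    ENTRIES_PER_TABLE = PAGE_SIZE // PTE_SIZE    # 512 entries per page-table page

    def lowest_level_table_bytes(mapping_bytes):
        """Memory consumed by lowest-level page-table pages for a mapping."""
        pages = mapping_bytes // PAGE_SIZE
        tables = -(-pages // ENTRIES_PER_TABLE)  # ceiling division
        return tables * PAGE_SIZE

    # Mapping 1TB with 4KB pages needs about 2GB of lowest-level page-table
    # pages; that memory stays allocated even if the mapped pages themselves
    # are later reclaimed.
    print(lowest_level_table_bytes(1 << 40) / (1 << 30), "GiB")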

That memory can end up being wasted if the user-space pages are reclaimed, perhaps in response to an madvise() call. Those pages will be removed from the working set, but the page-table pages that mapped them will remain allocated. If all of the pages mapped by a given page-table page have been reclaimed, the page-table page itself will be empty and serving no purpose. Applications that allocate, then free, large ranges of memory can accumulate a lot of these useless page-table pages, which is less than optimal.

This patch set from Qi Zheng aims to fix that problem. It works by adding a reference count for page-table pages (via yet another overloaded field in struct page). Users of the page, such as page-fault handlers, will increment that reference count, ensuring that the page stays around while they do their work. Adding a page-table entry for a user-space page will also increase the reference count, while reclaiming a page will cause the reference count to be decremented. If the reference count drops to zero, the kernel knows that the page-table page does not actually contain any page-table entries and can be freed.

Test results included with the patch set show that, indeed, reclaiming empty page-table pages can free up a lot of memory for other uses. On the other hand, there is an impact on the page-fault handlers that makes itself felt in a couple of ways. One, of course, is the overhead of maintaining the reference counts as page-table entries are added. But there is also a cost to freeing a page-table page, then having to allocate a new one should that part of the address space become populated again. Overall, the result is an approximately 1% performance hit in page-fault handling.

That cost may be more than some users want to bear. For now, though, there is no associated knob to turn this behavior off; if this patch set is merged, all systems will free empty page-table pages. Chances are that, in any situation where large numbers of page-table pages are involved, the performance gain from freeing all of that memory will exceed the page-fault costs, but that may not hold for other types of applications.

More flags in vmalloc()

Kernel memory allocated with vmalloc() must be mapped into a special part of the kernel's address space. Unlike memory from kmalloc() or the page allocator, it is not accessed directly via the kernel's direct memory mapping. Use of vmalloc() was, at one time, discouraged; the address space available in the vmalloc() area was small, and there is an extra cost to creating the additional mappings. Over time, though, use of vmalloc() has grown. The advent of 64-bit systems has eliminated the address-space limitation, and there are increasing numbers of places in the code where multi-page allocations are needed. The chances of successfully allocating a multi-page buffer are much higher if the pages involved need not be located in physically contiguous memory.

The vmalloc() interface, however, has never supported the various GFP_* flags passed to kmalloc() to influence how the memory is allocated; this limitation persists in add-on functions like kvmalloc(), which attempts a kmalloc() call with a fallback to vmalloc() on failure. This has proved to be a real problem for some kernel subsystems, especially filesystems, that need to be able to allocate memory with flags like GFP_NOFS, GFP_NOIO, or GFP_NOFAIL. As a result, some filesystems have avoided kvmalloc(), while others, such as Ceph, have rolled their own memory-allocation functions to work around the problem.

Michal Hocko has addressed this problem with a patch set adding support for the above GFP_ flags to the vmalloc() subsystem, and to kvmalloc() specifically. That makes these functions useful in filesystem settings, and allows the removal of Ceph's special allocation function. As of this writing, one of the precursor patches from that set has made it to the mainline, but the rest have not yet been merged; that may well change before the end of the 5.16 merge window, so stay tuned.

Uncached memory clearing

Modern computers make heavy use of memory caches for one simple reason: caches improve performance. So it is interesting to see this patch set from Ankur Arora that claims to improve memory performance by bypassing caching. As one might expect, this is an improvement that only works in specific circumstances.

If the kernel needs to zero out a single page of memory, a series of normal, cached writes will almost certainly be the way to go. That allows the cached writes to be flushed to main memory at the system's convenience. A newly zeroed page is also fairly likely to have other data written to it in short order; having that page in cache will speed those writes, and may eliminate the need to write the initial zeroes out to memory at all. Caching is, thus, a performance win here.

The situation changes, though, when large amounts of memory need to be cleared. If the amount of memory to clear exceeds the size of the last-level cache, it turns out to be faster to just write directly to memory rather than wait for all of those zeroes to be flushed out of the cache. Such a large write will also flush everything else out of the cache, and some of that data is likely to be wanted in the near future. So, for large clearing operations, bypassing the cache seems like the better way to go.

Arora's patch set thus changes the kernel to use uncached writes whenever a huge (2MB) or gigantic (1GB) page is to be cleared. This kind of operation happens frequently in systems running virtualized guests; a new guest starts off with a range of zeroed memory hosted, when possible, on huge or gigantic pages. Test results included with the patch set show performance improvements of 1.6x to 2.7x for virtual-machine creation. That would seem to be good enough to justify making this change.

Setting a home NUMA node

NUMA systems are characterized by the fact that memory located on the local NUMA node (or a nearby node) is faster to access than memory on a remote node. That means there can be considerable scope for performance improvements by controlling which nodes are used for memory allocations. The kernel provides a number of ways of controlling these allocations, including the recently added MPOL_PREFERRED_MANY, which was covered here in July.

Aneesh Kumar K.V. would like to add another one, though, in the form of a new system call:

    int set_mempolicy_home_node(unsigned long start, unsigned long len,
                                unsigned long home_node, unsigned long flags);

This system call will set the "home node" for the len bytes of address space beginning at start to the NUMA node number passed in home_node. The flags argument is currently unused.

The home node is meant to be used in combination with the MPOL_PREFERRED_MANY or MPOL_BIND memory-allocation policies. Those policies can specify a set of nodes that are to be used for new memory allocations, but do not say anything about which of those nodes, if any, is the preferred one. If a home node has been set with set_mempolicy_home_node(), allocations will happen on that node if possible; failing that, the kernel will fall back to one of the other nodes allowed by the in-force policy, preferring the node that is closest to the home node.
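
That fallback behavior can be illustrated with a toy model. This is purely a conceptual sketch of the policy's semantics (not kernel code), and the node-distance table and free-memory check are invented for the example:

    def pick_node(allowed, home, distance, has_free):
        """Prefer the home node; otherwise fall back to the allowed node
        closest to it that still has free memory."""
        candidates = [node for node in allowed if has_free(node)]
        if not candidates:
            return None                  # nothing left in the policy's node set
        if home in candidates:
            return home
        return min(candidates, key=lambda node: distance[home][node])

    # Four nodes, an MPOL_BIND-style node set of {1, 2, 3}, home node 1.
    distance = [
        [10, 20, 30, 40],
        [20, 10, 20, 30],
        [30, 20, 10, 20],
        [40, 30, 20, 10],
    ]
    # With node 1 out of memory, the allocation falls back to node 2, the
    # closest remaining allowed node.
    print(pick_node({1, 2, 3}, home=1, distance=distance,
                    has_free=lambda node: node != 1))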

The intent is to give applications a bit more control over memory allocations while avoiding memory from slow nodes. No word yet on when the NUMA developers will throw in the towel and just have user space provide a BPF program to direct memory-allocation policies.

Comments (22 posted)

5.16 Merge window, part 2

By Jonathan Corbet
November 15, 2021

Linus Torvalds released 5.16-rc1 and ended the 5.16 merge window on November 14, as expected. At that point, 12,321 non-merge changesets had been pulled into the mainline; about 5,500 since our summary of the first half of the merge window was written. As is usually the case, the patch mix in the latter part of the merge window tended more toward fixes, but there were a number of other changes as well.

Changes pulled in the latter part of the merge window include:

Architecture-specific

  • The PowerPC architecture now sets the STRICT_KERNEL_RWX option by default. This prevents memory from being both executable and writable, increasing hardening overall.
  • Memory hotplug is no longer supported on 32-bit x86 systems. This feature was marked as broken over one year ago; seemingly nobody complained.

Core kernel

  • The DAMON operations schemes (DAMOS) patch set has been merged; this mechanism allows user space to use DAMON to control page reclaim. DAMON has also gained the ability to perform physical address-space monitoring.
  • Only the SLUB slab allocator can be selected on systems configured for realtime preemption.

Filesystems and block I/O

  • The fanotify mechanism has gained the ability to provide notifications when filesystem errors happen; this feature is meant for use by monitoring systems. There is some documentation in this commit, and this commit contains a sample program.
  • The F2FS filesystem has two new mount options that instruct the kernel to fragment files across the storage device. Most users are unlikely to want to use these options, but they can be helpful for developers working on the performance of fragmented filesystems.

Hardware support

  • Industrial I/O: Analog Devices ADXL355 and ADXL313 3-axis digital accelerometers, Maxim MAX31865 RTD temperature sensors, Senseair Sunrise 006-0-0007 and SCD4X CO2 sensors, NXP IMX8QXP analog-to-digital converters, and Analog Devices ADRF6780 microwave upconverters.
  • Miscellaneous: Alibaba elastic network interfaces, ASPEED UART routing controllers, Qualcomm QCM2290 global clock controllers, Qualcomm SC7280 low power audio subsystem clock controllers, Qualcomm SC7280 camera clock controllers, MediaTek MT8195 clocks, NXP i.MX8ULP CCM clock controllers, HiSilicon hi3670 PCIe PHYs, Nintendo Switch controllers, Amlogic Meson6/8/8b/8m2 AO ARC remote processors, NXP i.MX DSP remote processors, MStar MSC313 realtime clocks, Cypress StreetFighter touchkey controllers, and Sharp LS060T1SX01 FullHD video mode panels.
  • PCI: MediaTek MT7621 PCIe host controllers and Qualcomm PCIe endpoint controllers.
  • Pin control: Qualcomm SM6350 and QCM2290 pin controllers, UniPhier NX1 SoC pin controllers, ZynqMP ps-mode pin GPIO controllers, Mediatek MT7986 pin controllers, and Apple SoC GPIO pin controllers.
  • Sound: Realtek ALC5682I-VS codecs, NVIDIA Tegra 210 AHUB audio hubs, Nuvoton NAU88L21 audio codecs, Rockchip I2S/TDM audio controllers, Richtek RT9120 stereo class-D amplifiers, Qualcomm asynchronous general packet router buses, Qualcomm Audio Process Manager digital audio interfaces, and Maxim Integrated MAX98520 speaker amplifiers.

Miscellaneous

  • The zstd compression code bundled into the kernel has been updated to version 1.4.10 — the first update in four years. There have been a lot of changes, including the addition of a new, more kernel-like wrapper API. See this merge commit for more information.

Security-related

  • The device-mapper subsystem is now able to generate audit events.
  • The final change pulled before the 5.16-rc1 release completed the task of eliminating implicit fall-throughs in switch statements. Specifically, the -Wimplicit-fallthrough warning has been enabled to flag any new uses of that pattern.

Internal kernel changes

If the usual nine-week schedule is followed, the 5.16 release can be expected to happen on January 2. Given the presence of the holidays just before that date, it would not be entirely surprising to see the schedule slip by a week. Either way, there is a lot of testing and fixing to be done between now and then.

Comments (7 posted)

No Weekly Edition for November 25

November 25 is the US Thanksgiving holiday and, as tradition would have it, one of the two weeks every year we do not publish the weekly edition. Instead, we will probably be overeating and definitely be celebrating with friends and family, though the giant gatherings of the pre-pandemic times are likely contraindicated at this point. In any case, we wish all of our readers the best, whether they celebrate the holiday or not. One of the things we can be thankful for is you folks, so thanks!

The next weekly edition will be on December 2.

Comments (3 posted)

Page editor: Jake Edge


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds