LWN.net Weekly Edition for September 24, 2020
Welcome to the LWN.net Weekly Edition for September 24, 2020
This edition contains the following feature content:
- OpenPGP in Thunderbird: a 21-year-old feature request is finally satisfied.
- Python 3.9 is around the corner: what's coming in the next major Python release.
- Removing run-time disabling for SELinux in Fedora: turning off SELinux may get a little harder.
- The seqcount latch lock type: an introduction to an obscure kernel concurrency mechanism.
- Four short stories about preempt_count(): a change to how preemption is tracked in the kernel has wide implications.
- Accurate timestamps for the ftrace ring buffer: a detailed look at what was required to fix a timestamping problem in the kernel's tracing subsystem.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OpenPGP in Thunderbird
It is a pretty rare event to see a nearly 21-year-old bug be addressed—many projects are nowhere near that old for one thing—but that is just what has occurred for the Mozilla Thunderbird email application. An enhancement request filed at the end of 1999 asked for a plugin to support email encryption, but it has mostly languished since. The Enigmail plugin did come along to fill the gap by providing OpenPGP support using GNU Privacy Guard (GnuPG or GPG), but was never part of Thunderbird. As part of Thunderbird 78, though, OpenPGP is now fully supported within the mail user agent (MUA).
The enhancement request actually asked for Pretty Good Privacy (PGP) support; PGP is, of course, the progenitor of OpenPGP. The standards effort that resulted in OpenPGP started in 1997. Back in 1999, PGP was the only real choice for email encryption, though the initial version of GnuPG had been released a few months before the request.
Early on, the main concerns expressed in the bug tracker were about the legality of shipping cryptographic code. The US government's attempts to restrict the export of cryptographic systems, known as the "crypto wars", were still fresh in the minds of many. It was not entirely clear that adding "munitions-grade crypto" to a MUA like Thunderbird was legal or wise. Early in 2000, the US revised its export-control regulations, which removed that particular concern.
There was work done toward adding support for OpenPGP and Secure/Multipurpose Internet Mail Extensions (S/MIME), which is another email encryption standard, over 2000 and 2001, but the code never actually landed. Thunderbird (called "mailnews" in those days) was in fire-fighting mode; fixing bugs and getting basic functionality working took precedence over new features like encryption. There was also a need to design a reasonable plugin mechanism.
Eventually, Enigmail showed up, which took some of the pressure off the Mozilla developers. Enigmail could be used on all of the supported platforms for Thunderbird to encrypt and decrypt PGP-style email (either inline or PGP/MIME) using GnuPG. Its initial maintainer, Ramalingam Saravanan, updated the bug with new information about Enigmail several times.
In the bug, multiple people suggested that Enigmail be incorporated into Thunderbird and the Enigmail developers were not opposed. In 2003, Patrick Brunschwig, who was a new maintainer for the plugin, said that doing so would help in getting rid of some of the "hacks" that were done to make Enigmail work with Thunderbird. But nothing like that ever happened.
Thunderbird itself has had something of a checkered past with regard to its parent, Mozilla. On two separate occasions Thunderbird has been spun out from the Mozilla nest. In 2007, it left to allow Mozilla to focus on Firefox. That led to the creation of Mozilla Messaging as the new home for Thunderbird, which was reabsorbed in 2011. But in 2012, support for Thunderbird from Mozilla was reduced and in 2015 Thunderbird was given its walking papers again. Then, in 2017, it was determined that the right place for "Thunderbird's legal, fiscal and cultural home" was the Mozilla Foundation.
All of that upheaval was likely not entirely conducive to focused development, but plenty of good work was done on the MUA over the intervening years, including adding S/MIME support along the way. However, integrating Enigmail or otherwise supporting OpenPGP never quite made the list. People would periodically pop up in the bug report to ask that it be resolved and occasionally Brunschwig would note that the decision was in the hands of the Thunderbird developers. That went on for many years, until an October 2019 blog post announcing the project's plans with respect to OpenPGP.
The announcement said that Thunderbird would be releasing a version in (northern hemisphere) summer 2020 with support for OpenPGP built right in. That support would not be based on Enigmail, which will not be updated to the new Thunderbird plugin (or add-on) interface; Enigmail will effectively be in maintenance mode. It will continue to be supported on the then-current Thunderbird 68 release until that reaches end of life, six months after 78 is released. Brunschwig, though, will be working on the OpenPGP support for Thunderbird, and the plan was to help ensure that Enigmail keys and settings could make the transition.
In addition, the project plans to leave GnuPG behind, as explained by Kai Engert on the tb-planning mailing list. It comes down to licensing, at least in part. GnuPG is available under GPLv3, which means that shipping it as part of Thunderbird, which is under the Mozilla Public License 1.1, could be tricky to do right. But there is also a complexity factor:
If Thunderbird decided to distribute GnuPG software, the situation might get even more complicated. If users already have a copy of GnuPG installed on their system, we'd have to be careful to avoid any potential conflicts that might occur by having two competing copies of GnuPG installed on a computer.
It may be possible, eventually, to use GnuPG for Thunderbird cryptographic operations, but that is not a priority—except to support OpenPGP smartcards. The RNP library for OpenPGP, which is what is being used for Thunderbird, does not support smartcards, at least not yet. In the interim, Thunderbird will support using GnuPG for smartcards.
The OpenPGP wiki page lays out the overall vision for the feature. As planned, OpenPGP support was released as part of Thunderbird 78 in early September. It comes with a migration tool to help Enigmail users make the switch. In addition, the Mozilla Open Source Support program provided a grant to fund a security audit of both RNP and the related Thunderbird code. "We are happy to report that no critical or major security issues were found, all identified issues had a medium or low severity rating, and we will publish the results in the future."
There is an extensive HOWTO and FAQ document, a wiki status page, and a discussion forum for the "end-to-end encryption" (e2ee) feature in Thunderbird. The e2ee feature covers both OpenPGP and S/MIME in Thunderbird, though a support document for the feature only covers OpenPGP at the time of this writing.
The main difference, from a user perspective, between OpenPGP and S/MIME is the matter of keys. As with everything in the cryptography world, it seems, key management for email is a difficult problem. S/MIME takes a certificate approach to keys, as with TLS keys for HTTPS; keys are signed by certificate authorities, which can be done in-house or by third parties. OpenPGP depends on the decentralized web of trust, where keys are verified and signed by other users' keys. A key that is signed by a trusted key may also be trusted, and those trust relationships can be extended in a transitive fashion if desired.
Existing users of Enigmail will encounter some changes. For example, Enigmail "junior mode", which was added by the p≡p foundation, is not supported. Also, OpenPGP in Thunderbird does not support the web of trust directly.
It has been a long time coming, but it seems that OpenPGP has made its way into Thunderbird proper. It would be nice to believe that it will help broaden the adoption of email encryption, but that is probably a forlorn hope. Adding the feature will serve to highlight encryption, however, which may eventually pay dividends. But the key-management problem, in particular, is difficult and is likely the largest barrier to widespread adoption of email encryption.
Python 3.9 is around the corner
Python 3.9.0rc2 was released on September 17, with the final version scheduled for October 5, roughly a year after the release of Python 3.8. Python 3.9 will come with new operators for dictionary unions, a new parser, two string operations meant to eliminate some longstanding confusion, as well as improved time-zone handling and type hinting. Developers may need to do some porting for code coming from Python 3.8 or earlier, as the new release has removed several previously-deprecated features still lingering from Python 2.7.
Python 3.9 marks the start of a new release cadence. Up until now, Python has done releases on an 18-month cycle. Starting with Python 3.9, the language has shifted to an annual release cycle as defined by PEP 602 ("Annual Release Cycle for Python").
A table provided by the project shows how Python performance has changed in a number of areas since Python 3.4. It is interesting to note that Python 3.9 is worse than 3.8 on almost every benchmark in that table, though it generally performs better than 3.7. That said, it is claimed that several Python constructs such as range, tuple, list, and dict will see improved performance in Python 3.9, though no specific performance benchmarks are given. The boost is credited to the language making more use of a fast-calling protocol for CPython that is described in PEP 590 ("Vectorcall: a fast calling protocol for CPython").
As the PEP explains, Vectorcall replaces the existing tp_call convention, which has poor performance because it must create intermediate objects for a call. While CPython has special-case optimizations to speed up this process for calls to Python and built-in functions, those do not apply to classes or third-party extension objects. Additionally, tp_call does not provide a function pointer per object (only per class), again requiring the creation of several intermediate objects when making calls to classes. Vectorcall is faster because it does not have the same intermediate-object inefficiencies that are found in tp_call. Vectorcall was introduced in Python 3.8, but starting with version 3.9 it is used for the majority of the Python calling conventions.
New operators and methods
Python 3.9 includes new dictionary union operators, | and |=, which we have previously covered; they are used to merge dictionaries. The | operator evaluates as a union of two dictionaries, while the |= operator stores the result of the union in the left-hand side of the operation:
>>> z = {'a' : 1, 'b' : 2, 'c' : 3}
>>> y = {'c' : 'foo', 'd' : 'bar' }
>>> z | y
{'a': 1, 'b': 2, 'c': 'foo', 'd': 'bar'}
>>> z |= y
>>> z
{'a': 1, 'b': 2, 'c': 'foo', 'd': 'bar'}
There are many ways dictionaries can be merged in Python, but Andrew Barnert said that the operator is designed to address the "copying update":
The problem is the copying update. The only way to spell it is to store a copy in a temporary variable and then update that. Which you can’t do in an expression. You can do _almost_ the same thing with {**a, **b}, but not only is this ugly and hard to discover, it also gives you a dict even if a was some other mapping type, so it’s making your code more fragile, and usually not even doing so intentionally.
In situations where the two dictionaries share a common key, the last-seen value for a key "wins" and is included in the merge as shown above for key c. While the standard union operator | only allows unions between dict types, the assignment variety |= can be used to update a dictionary with new key-value pairs from an iterable object:
>>> z = {'a' : 'foo', 'b' : 'bar', 'c' : 'baz'}
>>> y = ((0, 0), (1, 1), (2, 8))
>>> z |= y
>>> z
{'a': 'foo', 'b': 'bar', 'c': 'baz', 0: 0, 1: 1, 2: 8}
PEP 584 ("Add
Union Operators To dict
") provides complete documentation of the new
operators.
Two new string methods have also been added in version 3.9: removeprefix() and removesuffix(). These convenience methods make it easy to remove an unwanted prefix or suffix from string data. As described in PEP 616 ("String methods to remove prefixes and suffixes"), these functions are being added to address user confusion regarding the str.lstrip() and str.rstrip() methods, which are often mistaken as a means to remove a prefix or suffix from a string. The confusion around str.lstrip() and str.rstrip() comes from their optional string parameter.
According to the PEP, the confusion stems from the fact that the parameter passed to str.lstrip() and str.rstrip() is interpreted as a set of individual characters to remove, rather than as a single substring. With the additions, the project hopes to provide a "cleaner redirection of users to the desired behavior." Using these new methods is straightforward, as shown below:
>>> a = "PEP-616"
>>> a.removeprefix("PEP-")
'616'
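The set-of-characters behavior that motivated the new methods is easy to demonstrate; this short example (in the spirit of the one in the PEP) shows how str.lstrip() can eat into the data itself, while removeprefix() only removes an exact substring match:

```python
s = "Arthur: three!"

# lstrip() removes *any* leading characters found in the set
# {'A', 'r', 't', 'h', 'u', ':', ' '}, so it strips part of "three" too:
print(s.lstrip("Arthur: "))        # ee!

# removeprefix() removes the argument only if it matches as a substring:
print(s.removeprefix("Arthur: "))  # three!
```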
Deprecation and porting
Developers should be aware of some features that are being deprecated and removed in 3.9, as well as some more deprecations that are coming in 3.10. Many Python 2.7 functions that emit a DeprecationWarning in version 3.8 "have been removed or will be removed soon" starting with version 3.9. The project recommends testing applications with the -W default command-line option, which will show these warnings, before upgrading. As we previously covered, certain backward-compatibility layers, such as the aliases to Abstract Base Classes in the collections module, will remain for one last release before being removed in Python 3.10. The complete listing of removals in version 3.9 is available for interested readers.
Further, the release includes numerous new deprecations of language features that will be removed in a future release. An additional recommendation is to run tests in Python Development Mode using the -X dev option to prepare code bases for future changes.
Other goodies
As we reported, Python 3.9 ships with a new parsing expression grammar (PEG) parser to replace the LL(1) parser used through version 3.8. According to PEP 617 ("New PEG parser for CPython"), which describes the change, the switch to the PEG parser will eliminate "multiple 'hacks' that exist in the current grammar to circumvent the LL(1)-limitation." This should help the project substantially reduce the maintenance cost for the parser.
Python introduced type hinting in version 3.5; the 3.9 release allows types like List and Dict to be replaced with the built-in list and dict varieties. Type hints in Python are mostly for linters and code checkers, as they are not enforced at run time by CPython. PEP 585 ("Type Hinting Generics In Standard Collections") provides a listing of collections that have become generics. Note that, with version 3.9, importing the types (from typing) that are now built-in is deprecated. It sounds like developers will have plenty of time to update their code, however, as according to the PEP: "the deprecated functionality will be removed from the typing module in the first Python version released 5 years after the release of Python 3.9.0."
Thanks to flexible function and variable annotations, as described in PEP 593 ("Flexible function and variable annotations"), Python 3.9 has a new Annotated type. This allows the decoration of existing types with context-specific metadata:
charType = Annotated[int, ctype("char")]
This metadata can be used in either static analysis or at run time; it is ignored entirely if it is unused. It is designed to enable tools like mypy to perform static type checking and provides access to the metadata at run time via get_type_hints(). To provide backward compatibility with version 3.8, a new include_extras parameter has been added to the get_type_hints() function with a default value of False, retaining the same behavior as existed in version 3.8. When include_extras is set to True, get_type_hints() will return the defined Annotated type for use.
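A short sketch of how that plays out at run time (ValueRange is a hypothetical metadata class of our own, not part of the standard library):

```python
from typing import Annotated, get_type_hints

class ValueRange:
    """Hypothetical metadata describing the valid range of a value."""
    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi

def set_volume(level: Annotated[int, ValueRange(0, 11)]) -> None:
    pass

# By default (include_extras=False), metadata is stripped, as in 3.8:
print(get_type_hints(set_volume)["level"])   # <class 'int'>

# With include_extras=True, the full Annotated type is returned and
# the metadata is reachable through __metadata__:
hints = get_type_hints(set_volume, include_extras=True)
print(hints["level"].__metadata__[0].hi)     # 11
```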
Various other language changes can be expected in Python 3.9. __import__() now raises ImportError instead of ValueError when a relative import goes past the top-level package. Decorators have also been improved as described in PEP 614 ("Relaxing Grammar Restrictions On Decorators"), allowing any valid expression (defined as "anything that's valid as a test in if, elif, and while blocks") to be used as a decorator. In Python 3.8 and earlier, the expressions available for use as a decorator were limited. While the decorator grammar limitations "were rarely encountered in practice", according to the PEP, they occurred often enough over the years to be worth fixing in 3.9. The PEP has an example showing how PyQt5 currently works around the limitations.
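A minimal sketch of the relaxed grammar, with a made-up Signal class standing in for something like PyQt's signal objects; the subscript expression used as a decorator here was a syntax error before Python 3.9:

```python
class Signal:
    """Hypothetical stand-in for a GUI toolkit's signal object."""
    def __init__(self):
        self.handlers = []
    def connect(self, func):
        self.handlers.append(func)
        return func

signals = {"clicked": Signal()}

# A subscript expression as a decorator: rejected by the grammar
# before Python 3.9, accepted under PEP 614:
@signals["clicked"].connect
def on_click():
    return "clicked"

print(signals["clicked"].handlers[0]())   # clicked
```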
Two new modules are provided as part of the Python 3.9 standard library: zoneinfo and graphlib. The zoneinfo module, which we have previously covered, provides support for the IANA time zone database and includes zoneinfo.ZoneInfo, a concrete datetime.tzinfo implementation that allows users to load time-zone data identified by an IANA name. The graphlib module provides graphlib.TopologicalSorter, a class that implements topological sorting of graphs. In addition to these two new modules, many existing modules were improved in various ways. One notable change involves the asyncio module, which no longer supports the reuse_address parameter of asyncio.loop.create_datagram_endpoint() due to "significant security concerns." The bug report describes a problem when using SO_REUSEADDR on UDP in Linux environments: setting SO_REUSEADDR allows multiple processes to listen on sockets for the same UDP port, with incoming packets distributed randomly between them; setting reuse_address to True in a Python script would enable this behavior.
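A brief taste of both modules (the zoneinfo example assumes the system's IANA database, or the tzdata package, is available; the dependency graph is a made-up example):

```python
from datetime import datetime
from zoneinfo import ZoneInfo
from graphlib import TopologicalSorter

# zoneinfo: attach an IANA time zone, looked up by name, to a datetime
dt = datetime(2020, 10, 5, 12, 0, tzinfo=ZoneInfo("America/New_York"))
print(dt.utcoffset())    # -1 day, 20:00:00 (EDT is UTC-4)

# graphlib: topologically sort a dependency graph; each key depends
# on the items in its set
graph = {"link": {"compile"}, "compile": {"fetch"}}
print(list(TopologicalSorter(graph).static_order()))   # ['fetch', 'compile', 'link']
```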
There are a lot of interesting things worth checking out in Python 3.9, and the project's "What's new in Python 3.9" document is recommended for all the details. Additionally, the changelog provides an itemized list of changes between release candidates. Since no more release candidates of Python 3.9 are expected before the final version, developers may want to start testing their existing code to get a head start on the final release.
Removing run-time disabling for SELinux in Fedora
Disabling SELinux is, perhaps sadly in some ways, a time-honored tradition for users of Fedora, RHEL, and other distributions that feature the security mechanism. Over the years, SELinux has gotten easier to tolerate due to the hard work of its developers and the distributions, but there are still third-party packages that recommend or require disabling SELinux in order to function. Up until fairly recently, the kernel has supported disabling SELinux at run time, but that mechanism has been deprecated—in part due to another kernel security feature. Now Fedora is planning to eliminate the ability to disable SELinux at run time in Fedora 34, which sparked some discussion in its devel mailing list.
SELinux is a Linux Security Module (LSM) for enforcing mandatory access control (MAC) rules. But the "module" part of the LSM name has been a misnomer since a 2007 change to make the interface static and remove the option to load LSMs at run time. So kernels are built with a list of supported LSMs, and they can be enabled or disabled at boot time using kernel command-line options. Certain architectures had bootloaders that made it difficult for users to add parameters to the command line, though, so the SELinux developers added a way to disable it at run time. The need for that functionality has faded, and removing it will allow another kernel hardening feature to be used.
The post-init read-only memory feature provides a way to mark certain kernel data structures as read-only after the kernel has initialized them. The idea is that various data structures are prime targets for kernel exploits; function-pointer structures, like those used by the LSM hooks, are of particular interest. So the LSM hooks were protected that way. However, that hardening is only enabled if the ability to disable SELinux at run time is not present in the kernel. The presence of the SELinux feature is governed by the CONFIG_SECURITY_SELINUX_DISABLE kernel build option.
In order to get that hardening feature, Ben Cotton posted a proposal for Fedora 34 to remove the support for disabling SELinux at run time. The proposal is owned by Petr Lautrbach and Ondrej Mosnacek; it would migrate users to the selinux=0 command-line option if they are currently disabling SELinux via the SELINUX=disabled setting in /etc/selinux/config. The proposal, which has been updated on the Fedora wiki based on feedback, would not change the ability to switch SELinux between enforcing and permissive modes at run time using setenforce.
The 5.6 kernel deprecated the run-time-disable feature for SELinux. The kernel currently prints a message to that effect, but there are plans to make using the feature even more painful by sleeping for five seconds when it is used. It may get even more obnoxious over time; eventually the plan is to remove it altogether. Red Hat distributions (Fedora, CentOS, RHEL) are the only known users of the feature at this point, so once they have all moved away, it can be removed from the kernel. RHEL and CentOS systems will stick around for a lot longer than Fedora systems, since a Fedora release is only supported for a bit over a year. But Red Hat will just continue to maintain the feature in the RHEL/CentOS kernels; removing the run-time disable from Fedora presumably means that the next RHEL/CentOS major release will no longer support it either.
The proposal seeks to smooth the path for users who upgrade but have SELinux already disabled via the config file. The kernel command-line parameter will not be automatically added, but it will be documented as the only real way to disable SELinux. Systems without selinux=0 at boot, but that disable it in the config file, will simply get a system that has SELinux in it, but without any policy loaded, so the run-time impact should be minimal. In addition, the SELinux filesystem (selinuxfs) will not be mounted at /sys/fs/selinux.
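Concretely, the deprecated setting and its replacement look like this (the grubby invocation is one Fedora-specific way to edit the kernel command line; other bootloaders differ):

```
# /etc/selinux/config — the run-time disable being removed:
SELINUX=disabled

# The supported alternative: pass selinux=0 on the kernel command line,
# for example with grubby on Fedora:
#     grubby --update-kernel=ALL --args="selinux=0"
```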
In general, the reaction was positive, though there were some concerns. James Cassell wondered about forcing systems that have SELinux disabled in user space, but not in the kernel, to loudly fail. He was concerned that the performance impact of SELinux without any policy loaded could affect "certain real-time use cases", as he had found with SELinux in permissive mode. But Mosnacek said that the impact should be minimal. He did think it would be useful to alert the system administrator to that situation, but was not sure how to go about it beyond just documenting it in the release notes. Lautrbach suggested possibly adding a systemd unit that would warn users once if that situation was found, but he also pointed out that the existing description in the comments of /etc/selinux/config accurately describes the situation:
# disabled - No SELinux policy is loaded.

^^ this is exactly what will happen when this change is accepted. SELinux will be enabled internally in kernel but no policy will be loaded and as it was already explained for [users] this situation should be almost the same as SELinux disabled.
There were also some suggestions for improvements to the proposal document from Vít Ondruch and Michal Schorm. In particular, Schorm wanted to see more clarity in the document that the change does not affect the ability to switch to SELinux permissive mode at run time. Those changes have been made and the Fedora Engineering Steering Committee (FESCo) ticket shows five "+1" votes at the time of this writing, so its adoption is all-but-assured.
Silent denials
An interesting sidelight came up in the discussion, though. Richard Hughes asked about SELinux denials that do not generate audit messages. Typically, when SELinux denies an operation, it puts out an audit log entry (known as an access vector cache, or AVC, denial), but sometimes that does not happen.
Whilst I'm of course in favour of fixing the lockdown issue, would it also be fair to say that any selinux regression not triggering an AVC (which is fixed using selinux=0) would block this kind of proposal?
It turns out that there is a class of SELinux denials that are marked as "dontaudit", Mosnacek said.
It should not require rebooting to turn off SELinux in order to test whether that is occurring; switching to permissive mode should be sufficient. He also noted that the semodule command can be used to disable the dontaudit rules (and re-enable them). That should at least provide a useful error message, which should help Hughes (and others) figure out mysterious failures from SELinux.
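The debugging workflow Mosnacek described looks roughly like this (the commands require root; -D disables dontaudit rules and -B rebuilds the policy):

```
# Rebuild the policy with dontaudit rules disabled so that normally
# hidden denials appear in the audit log:
semodule -DB

# ... reproduce the mysterious failure, then look for AVC denials:
ausearch -m avc -ts recent

# Restore the default behavior (re-enable dontaudit rules):
semodule -B
```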
Neal Gompa helped fill in some of the background for how these dontaudit rules get added. Based on conversations that he has witnessed, Red Hat customers are often the driving force.
"Hiding" denials is certainly unfortunate—and counterproductive, at least for Fedora. The distribution should perhaps consider its developer audience and disable the dontaudit rules by default. That may risk swamping the audit log with harmless denials, as described by Mosnacek and in this section of the Red Hat SELinux documentation, but it can be seriously maddening to try to debug a problem where SELinux is quietly denying some action. Raising the profile of dontaudit rules is clearly called for; that may help avoid more sudden baldness in frustrated users.
The ability to disable SELinux at run time would seem to be something that is unlikely to be missed. There are other, better ways to accomplish the same goals; removing it paves the way for Fedora systems to get more hardening in the form of read-only protection on the LSM hooks.
The seqcount latch lock type
The kernel contains a wide variety of locking primitives; it can be hard to stay on top of all of them. So even veteran kernel developers might be forgiven for being unaware of the "seqcount latch" lock type or its use. While this lock type has existed in the kernel for several years, it is only being formalized with a proper type declaration in 5.10. So this seems like a good time to look at what these locks are and how they work.
Seqcounts and seqlocks
Seqcounts (and seqlocks, which are built on top of seqcounts) are among the many primitives used to reduce locking overhead in specific situations; their use is most indicated when reads to protected data far outnumber writes, and updates to the data are quick when they do happen. Rather than preventing concurrent access to data, seqcounts and seqlocks work by detecting when a reader and a writer collide and forcing readers to retry in such situations. They were first introduced for the 2.5.60 development kernel in 2003, and have grown considerably in complexity since then.
Seqcounts are the lowest-level piece of this mechanism; at their core, they are a simple counter that is incremented whenever the protected data is modified. Indeed, the counter is incremented twice, once before the process of modifying the data begins with a call to:
static inline void raw_write_seqcount_t_begin(seqcount_t *s)
{
	s->sequence++;
	smp_wmb();
}
and once after modification is complete by calling:
static inline void raw_write_seqcount_t_end(seqcount_t *s)
{
	smp_wmb();
	s->sequence++;
}
(Some debugging instrumentation has been removed from the above). As can be seen, write-side seqcount operations come down to incrementing the counter, plus some carefully placed write barriers (the calls to smp_wmb()) to ensure the correct ordering between changes to the counter and to the protected data. One key point to note here is that the counter, which starts at zero, will be odd while modification is taking place, and even otherwise.
Before a reader can access the protected data, it must enter the critical section with a call to:
static inline unsigned __read_seqcount_t_begin(const seqcount_t *s)
{
	unsigned ret;

repeat:
	ret = READ_ONCE(s->sequence);
	if (unlikely(ret & 1)) {
		cpu_relax();
		goto repeat;
	}
	return ret;
}
(Again, debugging code has been removed; note also that real users will call higher-level functions built on the above). This function starts by checking whether modification is currently taking place (as indicated by the sequence counter having an odd value); if so, it will spin until the sequence count is incremented again (the cpu_relax() call serves a few functions, including inserting a compiler barrier and potentially letting an SMT sibling run). Then the current counter value is returned and the caller can provisionally read the protected data. Once that has been done, the section is exited with a call to:
static inline int __read_seqcount_t_retry(const seqcount_t *s, unsigned start)
{
	return unlikely(READ_ONCE(s->sequence) != start);
}
The return value from this function tells the caller whether modification of the data has occurred while it was being read; if __read_seqcount_t_retry() returns true, the caller must go back to the beginning and try again. For this reason, access to seqcount-protected data is normally coded as a do..while loop that repeats until the data has been successfully read.
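Put together, a typical read-side critical section looks something like this kernel-style sketch (mydata and mydata_seq are hypothetical names; real code uses the higher-level read_seqcount_begin()/read_seqcount_retry() helpers rather than the raw functions shown above):

```
	struct mydata snapshot;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&mydata_seq);
		snapshot = mydata;	/* provisional read of the protected data */
	} while (read_seqcount_retry(&mydata_seq, seq));

	/* snapshot now holds a consistent copy of mydata */
```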
Upon this simple foundation has been built a whole array of variants for specific use cases. Many callers in the kernel use the higher-level seqlock_t type, which, among other things, handles details like concurrency among writers. See include/linux/seqlock.h for lots of details.
The seqcount latch type
While the above interface works in most situations, there is one important case where things fall apart: if a reader ever preempts a writer on the same CPU. For example, if a writer is preempted by an interrupt handler, and that handler attempts to enter a read section for the same data, the CPU will deadlock while the reader spins waiting for an update that will never complete. This situation is normally avoided by disabling preemption and interrupts while the write is taking place; that is one of the many things handled by the higher-level seqlock interfaces.
There are times, though, when it is not possible to completely block interrupts; in particular, code that might be called within a non-maskable interrupt is, as the name suggests, not maskable. Blocking preemption and interrupts also tends to be unwelcome in realtime kernels; another solution must be found for those cases. One such solution, introduced by Mathieu Desnoyers in 2014, is the seqcount latch. It avoids the possibility of an infinite spin at the cost of maintaining two copies of the protected data.
In particular, if a structure of type struct mydata is to be protected with a seqcount latch, that structure will need to be declared as:
struct mydata data[2];
At any given time, one entry in that array will be considered live and available, while the other is reserved for modifications by a writer. The least-significant bit in the sequence counter indicates which element should be read at any given time. Code for the read side now looks something like this:
do {
	seq = raw_read_seqcount_latch(&seqcount);
	index = seq & 0x01;
	do_something_with(data[index]);
} while (read_seqcount_retry(&seqcount, seq));
There is still a loop here, which detects concurrent modification of the data. But if a writer has been interrupted by a reader, the count will not change and there will be no need to retry the access.
To update the protected data, the writer simply makes any modifications to the entry in the data array that is not currently being used by the readers. Nobody should be looking at that entry, so there should be no need for any particular protection (unless concurrent writers are a possibility, of course). When the new data is ready, the writer calls:
static inline void raw_write_seqcount_t_latch(seqcount_t *s)
{
    smp_wmb();  /* prior stores before incrementing "sequence" */
    s->sequence++;
    smp_wmb();  /* increment "sequence" before following stores */
}
After this call, readers will be directed to the new version of the data. For an example of how seqcount latches are used, see the handling of timekeeping data (read side and write side) in kernel/time/sched_clock.c.
The 5.10 kernel will see the merging of a patch series from Ahmed Darwish that formalizes the seqcount latch API. Since it was first introduced, the seqcount latch has been implemented as a sort of "off-label" use of the seqcount type, changing its semantics in ways that, one hopes, all users understand. Darwish, instead, has concluded that the seqcount latch is a separate type of lock that should be handled independently of seqcounts.
Thus, his patch set introduces a new seqcount_latch_t type and changes the prototypes of the relevant functions to expect parameters of that type. That helps to nail down the actual semantics of the seqcount latch and ensures that callers won't mix locks of that type up with ordinary seqcounts. The interface still lives in <linux/seqlock.h>, but it could logically be moved elsewhere at this point.
None of this is likely to make the use of seqcount latch locks more popular; the situations where they are needed are rare indeed. There are only four users in the 5.9 kernel, and one of those is removed in Darwish's patch set as an "abuse" of the type (though, if one counts users of the latch tree type, the number goes up slightly). If a kernel developer is wondering if a seqcount latch is needed in a given situation, the answer is almost certainly "no". But it is illustrative of the lengths to which kernel developers must go in order to provide safe-but-fast access to critical system data in all situations.
Four short stories about preempt_count()
The discussion started out as a straightforward patch set from Thomas Gleixner making a minor change to how preemption counting is handled. The resulting discussion quickly spread out to cover a number of issues relevant to core-kernel development in surprisingly few messages; each of those topics merits a quick look, starting with how the preemption counter itself works. Sometimes a simple count turns out to not be as simple as it seems.
preempt_count()
In a multitasking system like Linux, no thread of execution is guaranteed exclusive access to the processor for as long as it would like to run; the kernel can (almost) always preempt a running thread in favor of one that has a higher priority. That new thread might be a different process, but it might also be a hardware interrupt or other outside event. In order to properly coordinate the running of all of a system's tasks, the kernel must keep track of the current execution state, including anything that has been preempted or which might prevent a thread from being preempted.
One piece of the infrastructure for that tracking is the preemption counter that is stored with every task in the system. That counter is accessed via the preempt_count() function which, in its generic form, looks like this:
static __always_inline int preempt_count(void)
{
    return READ_ONCE(current_thread_info()->preempt_count);
}
The purpose of this counter is to describe the current state of whatever thread has been running, whether it can be preempted, and whether it is allowed to sleep. To do so, it must track a number of different states, so it is split into several sub-fields:
- The least-significant byte tracks the nesting of preempt_disable() calls — the number of times that preemption has been disabled so far.
- The next three fields track the number of times the running thread has been interrupted by software, hardware, and non-maskable interrupts; they are all probably oversized for the number of interruptions that is likely to ever happen in real execution, but bits are not in short supply here.
- Finally, the most-significant bit indicates whether the kernel has decided that the current process needs to be scheduled out at the first opportunity.
A look at this value tells the kernel a fair amount about what is going on at any given time. For example, any non-zero value for preempt_count indicates that the current thread cannot be preempted by the scheduler; either preemption has been disabled explicitly or the CPU is currently servicing some sort of interrupt. In the same way, a non-zero value indicates that the current thread cannot sleep, since it is running in a context that must be allowed to run to completion. The "reschedule needed" bit tells the kernel that there is a higher-priority process that should be given the CPU at the first opportunity. This bit cannot be set unless preempt_count is non-zero; otherwise the kernel would have simply preempted the process rather than setting the bit and waiting.
Code throughout the kernel uses preempt_count to make decisions about which actions are possible at any given time. That, as it turns out, can be a bit of a problem for a few reasons.
Misleading counts
It is worth noting that preempt_disable() only applies when a thread is running within the kernel; user-space code is always preemptible. In the distant past, the kernel did not support preemption of kernel-space code at all; when that feature was added (as a way of improving latency), it was also made configurable. There are, as a result, still systems out there that are running without kernel preemption at all; it is a configuration that might make sense for some throughput-oriented workloads.
If kernel code cannot be preempted, there is little value in tracking calls to preempt_disable(); preemption is always disabled. So the kernel doesn't waste its time maintaining that information; in such kernels, the preempt-disable count portion of preempt_count is always zero. The preemptible() function will always return false, since the kernel is indeed not preemptible. It all seems to make sense.
There are some problems that result from this behavior, though. One is that functions like in_atomic(), which indicates whether the kernel is currently running in atomic context, do not behave in the same way. On a kernel with preemption configured in, calling preempt_disable() will cause in_atomic() to return true; if preemption is configured out, preempt_disable() is a no-op and in_atomic() will return false. This can cause in_atomic() to return false when, for example, spinlocks are held — a situation that is indeed an atomic context.
Gleixner, in his patch set, points out some other problems that result from this inconsistency, arguing that the overall behavior is a problem that needs fixing.
His solution is to remove the conditional compilation for the preemption-disable tracking, causing that counter to be maintained even in kernels that do not support kernel preemption. There is a cost in terms of increased execution time and code size on machines running those configurations, but Gleixner says that his benchmark testing "did not reveal any measurable impact" from the change.
Linus Torvalds was not convinced about the value of this change, noting that the code generation for spinlocks is indeed better when preemption is not possible. Gleixner reiterated that the effect is not measurable, and Torvalds conceded that the patch set does make the code simpler and "has its very clear charm".
Using preempt_count
Torvalds's larger complaint, though, was about code that uses preempt_count to change its behavior depending on the context. Such code, he said, is "always simply fundamentally wrong". Code that changes its behavior depending on the context should have that context passed in as a parameter, he said, so that callers know what to expect. Thus, the GFP_ATOMIC flag to the memory-allocation functions is acceptable, but changing behavior based on the return value from in_atomic() is not.
To an extent, there is general agreement with this position. Gleixner's patch posting included a section with future plans to audit and fix callers of functions like in_atomic() where, he says, "the number of buggy users is clearly the vast majority". Daniel Vetter added that, in his experience, "code that tries to cleverly adjust its behaviour depending upon the context it's running in is harder to understand and blows up in more interesting ways".
Paul McKenney, instead, argued that some code has to be able to operate properly in different contexts; the alternative would be an explosion of the API.
In response, Torvalds clarified that he sees core-kernel code as having different requirements than the rest. Core code has to deal with multiple contexts and should always do the right thing; code in drivers, instead, should not be changing its behavior based on its view of the context.
No hard conclusions were reached in this branch of the discussion. It does seem likely, though, that code with context-dependent behavior will be looked at more closely in the future.
Questioning high memory
One example of questionable use of preempt_count, in the crypto code, was pointed out early in the discussion by Gleixner; it changes a memory allocation mode in strange ways if it thinks that it's not currently preemptible. After some discussion, it turned out, according to Ard Biesheuvel, that the real purpose had been to avoid using kmap_atomic() if possible.
For those who are not immediately familiar with kmap_atomic(), a look at this article on high memory might be helpful. In short: 32-bit machines can only map a limited amount of memory into the kernel's address space; that amount is a little under 1GB on most architectures and configurations. Any memory that is not directly mapped is deemed "high memory"; any page in high memory must be explicitly (and temporarily) mapped into the kernel before the kernel can access its contents. The functions kmap() and kmap_atomic() exist to perform this mapping.
There are a few differences between those two functions, starting with the fact that only kmap_atomic() is callable in atomic context. Beyond that, though, kmap_atomic() is more efficient and is thus seen as being strongly preferable in any situation where it can be used, regardless of whether the caller was running in atomic context before the call (the CPU will always be running in atomic context while the mapping is in place). As Biesheuvel pointed out, though, the documentation doesn't reflect this preference and encourages the use of kmap() instead, so that is what he did.
There is another reason to prefer kmap(), he added; a call to kmap_atomic() disables preemption even on 64-bit architectures, where high memory does not exist and no temporary mapping need be made. Using it would have resulted in much of the WireGuard VPN code running with preemption disabled, entirely unnecessarily. Torvalds pointed out that there is a reason for this behavior: it is there to cause code to fail on 64-bit machines if it does things that would not work on 32-bit machines where high memory does exist. It is essentially a debugging aid that is making up for the fact that few developers run on 32-bit machines.
One way to optimize kmap_atomic() on 64-bit systems, Gleixner said, would be to make kmap_atomic() sections be preemptible — no longer atomic, in other words. This approach has been taken in the realtime kernels, he said, and "it's not that horrible". The cost would be to make kmap_atomic() a little slower on systems where high memory is in use.
That, it seems, is a cost that the development community is increasingly willing to pay; Torvalds replied that he would like to start removing kmap() support entirely. 32-bit systems will be around for some time yet, but they are increasingly unlikely to be used in situations where lots of memory is needed. Or, as Torvalds put it: "It's not that 32-bit is irrelevant, it's that 32-bit with large amounts of memory is irrelevant". Every time that the cost of supporting high memory (which adds a significant amount of complexity to the memory-management subsystem) makes itself felt, the desire to take it out grows.
That said, nobody will be removing high-memory support right away. But a change that penalizes high-memory systems in favor of the systems that are being deployed now, such as making kmap_atomic() no longer be atomic, is increasingly likely to be accepted. Meanwhile, the other issues around preempt_count remain mostly unresolved, but it seems likely that, in the end, changes that bring correctness and reduce complexity will win out.
Accurate timestamps for the ftrace ring buffer
The function tracer (ftrace) subsystem has become an essential part of the kernel's introspection tooling. Like many kernel subsystems, ftrace uses a ring buffer to quickly communicate events to user space; those events include a timestamp to indicate when they occurred. Until recently, the design of the ring buffer has led to the creation of inaccurate timestamps when events are generated from interrupt handlers. That problem has now been solved; read on for an in-depth discussion of how this issue came about and the form of its solution.
The ftrace ring buffer
The ftrace ring buffer was added in 2008 and, a little less than a year later, it became completely lockless. The design of the ring buffer split it into per-CPU buffers; each per-CPU buffer has a series of sub-buffers, the size of which happens to be the architecture's page size. This sizing was not a requirement of the design, but it is a convenient size for the splice() system call. Each sub-buffer begins with a header that includes, among other things, a full timestamp for the first event stored there.
Writes to a specific per-CPU buffer can only happen on the CPU for that buffer. That ensures that any contention between writers will always be in stack order. That is: a write being done in normal context could only have a contending writer running in an interrupt context, and that write must completely finish before returning back to normal context. There is no need to worry about parallel writers that are executing on other CPUs. Interrupted writes are thus always strictly nested within the writes they interrupt.
The design of the ring buffer depends on the fact that writers that interrupt other writers will completely finish before the interrupted writer may continue. This allows for some flexibility in how the writers can remain lockless. Although this simplified the coordination between writes, it added extra complexity to the tracking of time.
Before going into timestamp management, an understanding of how space is reserved on the ring buffer is necessary. An index is used to denote where the last event in the sub-buffer was written. The length of each new event is added to the index with local_add_return() (which can be used since this is a per-CPU index) and the location for the new event is simply the returned value minus the length of the event.
Obviously, if the value returned is greater than the size of the sub-buffer, it means there's no more room on the sub-buffer for this event, and the logic to move to the next page in the ring buffer is invoked.
Timestamps
A 64-bit timestamp requires eight bytes to store. The bigger an event is, the longer it takes to write it and the fewer of them a ring buffer may hold. To keep the event header small, the ring buffer code tries to avoid storing the full timestamp. An event on the ring buffer looks like this:
struct ring_buffer_event {
    u32    type_len:5, time_delta:27;
    u32    array[];
};
The first five bits of the event header determine its type and size: a value of 29 means it is a padding event, 30 is a time-extend event, and 31 is an absolute timestamp. If the value is between one and 28, it represents an event with a data payload that starts at the array field, and the total event size is the type_len times four. If the total event size is greater than 112 (or 4*28) bytes, then type_len is set to zero, the 32-bit array field will hold the length of the event, and the data payload starts immediately after the array field. With most events having a size of 112 bytes or less, this helps keep the events compact. Note that all events are four-byte aligned.
The next 27 bits of the first integer of the event are the time_delta. This field holds the delta of time since the last event (or zero if it's the first event on the sub-buffer, which holds a full eight-byte timestamp in its header). If the timestamp is in nanoseconds, the largest delta that can be stored is 134.217728 milliseconds (2^27 nanoseconds). If an event comes in after 134.217728 milliseconds, a time-extend event is added, which uses both the time_delta and the 32 bits of the array field to create a delta of up to 18 years (2^59 nanoseconds).
Tom Zanussi needed full timestamps from the events at the time they were recorded to get histograms to work. As the events only held deltas, a new event was created to store 59 bits of the full timestamp since boot, allowing the histograms to record the exact timestamp used for an event. The type 31 was used to denote this new event, which has the same makeup as a time extend but, instead of holding a delta, holds the time since boot. In fact, this new timestamp event could replace time extends entirely, since it could only fail if a machine were to run for more than 18 years without a reboot.
The problem with nested writes
Using a delta from the previous event proved to be a troublesome design; it requires saving the timestamp of the last event written into the ring buffer for use in calculating the delta stored in the next event. This put several actions in play that need to be atomic but cannot be:
- Reading the timestamp to use for the current event.
- Reserving a spot on the ring buffer to store the current event.
- Calculating the delta of the timestamp of the current event from the timestamp of the previous event.
- Saving the timestamp used for the current event to calculate the delta for the next event.
Any of the above steps can be interrupted by another context, such as an interrupt or non-maskable interrupt (NMI). This makes it difficult to know if the delta stored for the current event was really the delta since the timestamp of the previous event. After the last timestamp is retrieved for the delta calculation, an interrupt may occur and several events may be injected into the ring buffer before storage is allocated for the current event.
The timestamp for the new event must be taken before the allocation, so that it can be used to calculate deltas for events that may come in via an interrupt that occurs right after the storage was allocated. Even if a full timestamp were written for the interrupt events, the timestamp used for the interrupted event, if retrieved after the space allocation, would be later than the interrupt-event timestamps, even though the interrupted event itself happened first.
Regardless of whether the timestamp is taken before or after the allocation is performed, the interrupt situations described above will cause time to appear to go backward in the ring buffer. That is considered unacceptable because it breaks the merge sort used when all of the per-CPU buffers are shown together as a single output.
The approach taken to avoid this problem was simply to write a zero delta for events that interrupt the writing of another event. Unfortunately, this makes it look as if time stood still. The obvious problem with this approach is that you lose the time between events when they interrupted the writing of another event. The output will look like all the events happened instantaneously. This approach has been satisfactory for the last 12 years, but it was a design flaw that needed to be fixed.
To see this problem in real use, try running a command like:
trace-cmd record -p function
for a while and then running:
trace-cmd report --debug -l -t --ts-diff --cpu 4
on the output file. Here, --debug shows where the sub-buffer breaks are, -l shows latency information (interrupt context), -t keeps the timestamps in nanosecond format (otherwise it will truncate to microseconds), --ts-diff shows the delta between events, and --cpu 4 is used just because I found what I was looking for on CPU 4 (I searched for the time delta of zero). This gives a good idea of the impact of what happens when events occur after interrupting the writing of another event.
trace-cm-1724   4.... 137.210588990: (+84)    function: kfree [84:0xf44:24]
trace-cm-1724   4.... 137.210589078: (+88)    function: wakeup_pipe_writers [88:0xf60:24]
trace-cm-1724   4d.h. 137.210589709: (+631)   function: __sysvec_apic_timer_interrupt [631:0xf7c:24]
trace-cm-1724   4d.h. 137.210589709: (+0)     function: hrtimer_interrupt [0:0xf98:24]
trace-cm-1724   4d.h. 137.210589709: (+0)     function: _raw_spin_lock_irqsave [0:0xfb4:24]
trace-cm-1724   4d.h. 137.210589709: (+0)     function: ktime_get_update_offsets_now [0:0xfd0:24]
CPU:4 [SUBBUFFER START] [137210590461:0x27c53000]
trace-cm-1724   4d.h. 137.210590461: (+752)   function: __hrtimer_run_queues [0:0x10:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: __remove_hrtimer [0:0x2c:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: _raw_spin_unlock_irqrestore [0:0x48:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: tick_sched_timer [0:0x64:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: ktime_get [0:0x80:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: tick_sched_do_timer [0:0x9c:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: tick_sched_handle.isra.0 [0:0xb8:24]
trace-cm-1724   4d.h. 137.210590461: (+0)     function: update_process_times [0:0xd4:24]
[...]
trace-cm-1724   4d.s. 137.210590461: (+0)     function: rcu_segcblist_pend_cbs [0:0x940:24]
trace-cm-1724   4d.s. 137.210590461: (+0)     function: rcu_disable_urgency_upon_qs [0:0x95c:24]
trace-cm-1724   4d.s. 137.210590461: (+0)     function: rcu_report_qs_rnp [0:0x978:24]
trace-cm-1724   4d.s. 137.210590461: (+0)     function: _raw_spin_unlock_irqrestore [0:0x994:24]
trace-cm-1724   4..s. 137.210590461: (+0)     function: rcu_segcblist_ready_cbs [0:0x9b0:24]
trace-cm-1724   4d.s. 137.210590461: (+0)     function: irqtime_account_irq [0:0x9cc:24]
trace-cm-1724   4.... 137.210590461: (+0)     function: kill_fasync [0:0x9e8:24]
trace-cm-1724   4.... 137.210605019: (+14558) function: pipe_unlock [14558:0xa04:24]
trace-cm-1724   4.... 137.210606026: (+700)   function: __x64_sys_splice [700:0xa58:24]
Looking at this output, I can tell that the call to __sysvec_apic_timer_interrupt() happened from an interrupt that came in as the call to kill_fasync() started to be recorded but before it reserved space on the ring buffer. I know this because __sysvec_apic_timer_interrupt() has a time delta; thus it was able to reserve space on the ring buffer before kill_fasync() was able to, but after the processing of the event for kill_fasync() started. Once the processing of nested events happens, only the first event to get on the ring buffer will have a delta timestamp; all events after that (including the one that was interrupted, because its storage comes later) get a zero delta.
The --debug option for trace-cmd report is what caused the extra data to show in the output, which includes this line:
CPU:4 [SUBBUFFER START] [137210590461:0x27c53000]
This output indicates that the trace crossed over a sub-buffer page at this point. As each sub-buffer stores an absolute timestamp, the first event on the sub-buffer will also have a delta as shown above.
Over the years, this flaw really bothered me; I would spend countless hours thinking about how to find a way to reliably make the nested timestamps meaningful. The fact that we only needed to worry about stacked writes and not concurrent writes made me believe there was a solution. As there are only realistically four levels of the stack to worry about, I thought I could make a state for each level and use the above and below states to synchronize the timestamps. Those four levels are: normal context, software-interrupt (softirq) context, interrupt context, and NMI context.
Theoretically, you could have a machine check during an NMI, making a fifth level, but the odds are extremely low: a softirq would have to interrupt the writing of an event and write an event of its own that gets interrupted by an interrupt, which then writes an event during which an NMI triggers and also writes an event; the chance of a machine check on top of all that is lower still. Even with running function tracing, which traces every function in all contexts, I had trouble finding one nested level, let alone four. And since the nesting level can be detected, the worst that could happen is that zero is stored for the delta upon detecting it. This turned out not to be a worry, as my solution does not need to know about the levels.
Avoiding cmpxchg()
In all my prior attempts to solve this problem, I tried hard to avoid the use of local_cmpxchg() (the local CPU version of cmpxchg()). cmpxchg() is an architecture-specific function that will atomically read a value from a given location, compare it with a given value and, if the two are equal, it will write a third value back to that location. If the values do not match, then the location is not updated. The original value read from the location is the return value of the cmpxchg(); it can be used to determine if the cmpxchg() succeeded in updating the location or not.
When I first started working on the ring buffer, all of my benchmarks would show a slight but noticeable overhead when using local_cmpxchg() over local_add_return(). The goal was thus to not use a cmpxchg() and have, instead, a timestamp that would be used for each level of nesting. Starting with a four-element array of timestamps, I tried various approaches of a nesting counter and storing timestamps in each level. Upon detecting nesting, I thought that a context that interrupted another context could fix up the timestamps of the contexts that were interrupted without needing local_cmpxchg(). But this became much more complex and ran into some thorny synchronization issues.
Having to deal with an array of timestamps just added one more variable that needed to be synchronized with the other variables.
Consider a case where an interrupt comes in right after the timestamp was taken and the storage was allocated for the first event, but before the event is actually stored. Then an NMI comes in after the timestamp and storage is allocated for an event happening in interrupt context. At this point, because the allocation during the NMI would not be the first event in the commit, and because two other contexts were interrupted below, it is difficult to know if it should update the timestamp of the event that happened in the interrupt context or not; the timestamp may have already been updated. On top of this, another event is recorded in interrupt context after the NMI added an event, and the state for this event would have to deal with an event injected from another context since the previous event recorded in the interrupt context. The number of states that are added by keeping track of four levels of context and how they relate to inserting events into the ring buffer grew so numerous that it became obvious this was not going to be a viable solution.
The twelve-year-old puzzle solved
Julia Lawall reported a bug where she recorded a trace with trace-cmd and found that time went backward. Looking into it, I discovered that it was due to the addition of the full timestamp used by Zanussi's histograms; the change allowed the time extensions to not be reset to zero if they occurred in a nested event. Writing the fix for that issue triggered another idea for solving the nested timestamp issue.
All my previous attempts tried to avoid using cmpxchg(). While debugging the issue that Lawall reported, I realized that nested events were extremely uncommon and, because they can be detected, it should be possible to separate the slow path from the fast path. A fast path is the common case, which is when an event being written did not interrupt another event, and also was not interrupted itself. Otherwise the slow path is run. cmpxchg() should not be a performance problem if it were only to be used in the slow path. Not restricting what can be done in the slow path allowed me to think about other possible solutions. This gave me new hope, and inspired me to look for a solution in this direction.
While incorporating cmpxchg() back into the solution, I found that the array of four states still added too much complexity. I looked into whether it would be possible to consolidate the array and only care about an event that interrupts another event, or the event being interrupted. Upon interrupting an event in a lower context, it is known that the interrupted event is, in essence, "frozen in time": it will not proceed until the current context returns to it. From the interrupted event's perspective, there are only two states: before being interrupted and after being interrupted. Once processing resumes after an interruption, everything that happened in the interrupt will have run to completion. With these characteristics, a defined set of states can be calculated for every step of the algorithm by keeping track of two different timestamps: one that is written before allocating storage on the ring buffer, and one that is written afterward.
Thus, the solution deals with three players:
- write_tail: the index used to reserve space on the buffer for the event.
- before_stamp: a timestamp saved by all events as they start the recording process.
- write_stamp: a timestamp updated after an event has successfully reserved space.
The following code is run in this order to determine the next decisions to be made:
w = local_read(write_tail);
before = local_read(before_stamp);
after = local_read(write_stamp);
ts = clock();
Before doing anything else, this code saves the current value of write_tail for later use. At this point, we can decide whether this event needs to go into the slow path or not. If before does not equal after, one of two possibilities is indicated: this event interrupted another event while it was updating its timestamps, or this event was interrupted by another context after reading before_stamp and before reading the write_stamp. In either case, the code would fall into the slow path.
if (before != after) {
    event_length += RB_LEN_TIME_EXTEND;
    add_timestamp = RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
} else {
One part of this solution requires injecting absolute timestamps instead of using a delta. For this slow path, the event length is increased by the size of the absolute timestamp event (which is the same size as a time extend). The ADD_STAMP_FORCE and ADD_STAMP_EXTEND flags are saved for later use in the algorithm.
Even if this event did not interrupt another event, a check still must be made to see if the delta since the last event stored can fit in the time_delta portion of the event. Otherwise a time extend is required.
    delta = ts - after;
    if (delta & ~((1 << 27) - 1)) {
        event_length += RB_LEN_TIME_EXTEND;
        add_timestamp = RB_ADD_STAMP_EXTEND;
    }
}
Now write to the before_stamp and allocate storage on the ring buffer by adding to the write_tail.
local_set(before_stamp, ts);
write = local_add_return(event_length, write_tail);
tail = write - event_length;
The start of the event can be found by subtracting its length from write, which is the index of the end of the event. This is stored in tail. Now compare the saved write_tail from the start of this algorithm with the calculated value of the start of the event. If they match, we know that no event interrupted this algorithm between the saving of write_tail and the allocation of the storage for the event. This is the fast path. But we are not out of the woods yet. We still need to update the write_stamp. Note that the before_stamp has already been updated, making it different from the write_stamp. Any nested event that interrupts this event will now fall into the slow path and use an absolute timestamp.
The next step is to simply update the write_stamp:
local_set(write_stamp, ts);
But wait! What if an interrupt came in just before the write to write_stamp, and that interrupt wrote an event? Wouldn't write_stamp then be incorrect, since it would not contain the timestamp of the last event written to the ring buffer? The answer is yes, but we don't care. The reason is that write_stamp is not used in any calculation unless it equals before_stamp; because the two now differ, any nested events will not use it for their calculations.
This is how stacked interrupting events (where all interrupting events finish before this event can continue) help the algorithm. before_stamp is always updated by all events, including nested events that interrupted this one, so before_stamp now contains the timestamp of the last event stored in the ring buffer, which is also the value that write_stamp needs to be set to. Updating write_stamp still needs some care, but it is easy to detect whether this event was interrupted by another; if so, the slow path is entered and cmpxchg() can be taken advantage of:
save_before = local_read(before_stamp);

if (add_timestamp & RB_ADD_STAMP_FORCE)
        delta = ts; // will use the full timestamp
else
        delta = ts - after; // remember, not force means not nested

if (ts != save_before) {
        after = local_read(write_stamp);
        if (save_before > after)
                local_cmpxchg(write_stamp, after, save_before);
}
The above code first re-reads before_stamp; it runs after write_stamp was updated. If another event came in between the reservation of buffer space and the update of write_stamp, then before_stamp will not equal the timestamp read earlier (ts). If it still equals ts, then write_stamp was updated without racing against any interrupting event. At this point, the delta for the event is also calculated: if the ADD_STAMP_FORCE flag is set, this event interrupted another event and an absolute timestamp is required; otherwise, it is safe to calculate the delta from write_stamp and the clock value that was read.
If before_stamp is not equal to the clock value read earlier (ts), then an event came in and updated both before_stamp and write_stamp sometime after the storage for this event was allocated (the update of write_tail). As there is no way of knowing exactly when that happened, it must be assumed that it could have happened before the update to write_stamp. To solve this, write_stamp is re-read and a simple cmpxchg() is performed: if write_stamp is less than the last-read before_stamp, it must be updated. If write_stamp is greater than or equal to the last-read before_stamp, or the cmpxchg() fails, there is nothing to be done; that can only happen if this event was interrupted by another event after the update to write_stamp, and that nested event will have taken care of the correctness of write_stamp.
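The repair step just described can be sketched with ordinary C11 atomics; the kernel uses local_t operations instead, and the function name here is hypothetical:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch of the write_stamp repair described above,
 * using C11 atomics in place of the kernel's local_t operations.
 * Returns true if write_stamp was moved forward to save_before. */
static bool repair_write_stamp(_Atomic uint64_t *write_stamp,
                               uint64_t ts, uint64_t save_before)
{
        uint64_t after;

        if (ts == save_before)
                return false;   /* not interrupted; write_stamp is correct */

        after = atomic_load(write_stamp);
        if (save_before > after)
                /* if this fails, a nested event already fixed things up */
                return atomic_compare_exchange_strong(write_stamp,
                                                      &after, save_before);
        return false;
}
```

A failed compare-and-exchange is deliberately ignored, for the reason given above: failure means a nested event ran after our update and has already left write_stamp correct.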
This is the end of the algorithm for the case of not being interrupted between taking the timestamps and allocating space on the ring buffer. But what happens if this event was interrupted before the allocation of its space on the ring buffer?
The case of an event interrupted before allocating storage
In this path, an interrupt came in and other events were injected into the ring buffer somewhere between the first read of the write_tail and reserving space on the ring buffer for this event. At this moment, nothing can be trusted. Some work needs to be done to get back to some kind of known state.
after = local_read(write_stamp);
ts = clock();
As this event was interrupted and nested events made it into the ring buffer, the original recording of the clock (ts) is useless. Also, because this is the path of being interrupted by another event, the nested event (or events) would make sure that the write_stamp is the timestamp of the last event added to the ring buffer. Thus we reread both the clock and write_stamp to get into some kind of known state.
if (write == local_read(write_tail) && after < ts) {
        delta = ts - after;
        local_cmpxchg(write_stamp, after, ts);
} else {
        delta = 0;
}
If the value returned by local_add_return() does not match the re-read write_tail, then an interrupt came in between the allocation of this event and the re-reading of write_tail. In that case, this event was sandwiched between two interrupts that injected nested events: one before its storage was allocated, and one after. As there is now no way to know what timestamp to use for calculating its delta, there is no choice but to fall back to a zero delta, but this is actually the best thing to do: if this event was sandwiched between two sets of events, its exact timestamp really does not matter for any use case, as long as it is shown to have happened between the two sets of nested events.
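Put together, the interrupted path amounts to something like the following sketch, again with C11 atomics and a hypothetical function name standing in for the kernel's local_t operations:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the interrupted path: after nested events have run, the
 * clock (ts) and write_stamp (via atomic_load) have been re-read. A
 * meaningful delta is possible only if no further event allocated
 * space (write_tail is unchanged since our allocation) and the clock
 * moved forward; otherwise fall back to a zero delta. */
static uint64_t interrupted_path_delta(_Atomic uint64_t *write_stamp,
                                       _Atomic uint64_t *write_tail,
                                       uint64_t write, uint64_t ts)
{
        uint64_t after = atomic_load(write_stamp);

        if (write == atomic_load(write_tail) && after < ts) {
                uint64_t delta = ts - after;

                atomic_compare_exchange_strong(write_stamp, &after, ts);
                return delta;
        }
        return 0;   /* sandwiched between nested events */
}
```

As in the real code, a failed compare-and-exchange is ignored: it means yet another nested event has already updated write_stamp.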
One might think that the above code is a little ambitious; why not simply use a zero delta whenever an interrupt happens between the start of processing and the allocation of the event? The reason is that this case is not that uncommon to hit; while tracing several hackbench runs, it happened a few times. The problem with just using a zero delta is that, if the event recorded in the interrupt happened at the start of the interrupt, and the interrupt itself ran for some time before returning, then the zero delta would make the interrupt seem much shorter than it actually was.
But doesn't that reasoning make any zero delta a problem? Unfortunately, yes. But the case of being interrupted by two different interrupts, just before and just after the storage is allocated, is highly unlikely. It may still happen but, as stated, there is not much to be done about it. After running several traces of hackbench, I could not find a single occurrence of it; the only way I was able to test this last code path was by artificially injecting an event in a "fake" context and seeing whether the algorithm performed as expected.
At this point the problem has been solved — on 64-bit systems. It turns out that there was an additional obstacle to overcome for the 32-bit case; those looking for the details can find them in this supplemental article.
Conclusion
For several years I was afraid that correct timestamps for ftrace ring-buffer events would end up being impossible for a Turing machine to achieve. But as I agonized over the zero-delta flaw, I knew I had only two options to make the pain go away: prove that a solution is impossible and walk away with my tail between my legs, or find a solution that actually works. The first was not really an option, as I also know that impossible problems can have possible solutions if restrictions can be put on the requirements. For instance, we still have one zero-delta path, but that path is so uncommon, and affects only a single event, that it is not worth agonizing over.
What I found most interesting about this experience was that my solution was the least complex of those I tried. That should not be surprising; a lot of problems never get solved because people overthink the solutions. All it took was debugging something slightly related to the issue to keep me from overthinking it, and everything fell into place after that.
Brief items
Security
Security quote of the week
I remember concluding that the most likely, if still rather improbable, explanation was that the 9-less messages were dummy fill traffic and that the random number generator used to create the messages had a bug or developed a defect that prevented 9s from being included. This would be, to say the least, a very serious error, since it would allow a listener to easily distinguish fill traffic from real traffic, completely negating the benefit of having fill traffic in the first place. It would open the door to exactly the kind of traffic analysis that the system was carefully engineered to thwart. The 9-less messages went on for almost ten years. (If I were reporting this as an Internet vulnerability, I would dub it the "Nein Nines" attack; please forgive the linguistic muddle). But I was resigned to the likelihood that I would never know for sure.
And this brings us to the second observation from [Peter] Strzok's book.
Compromised doesn't say anything about missing nueves, but it does mention that the FBI exploited a serious error on the part of the sender: the FBI was able to tell when messages were and weren't being sent during the weekly timeslot when the suspect couple was observed in the room where they copied traffic. Even worse (for the illegals), empty message slots correlated perfectly with times that the suspect couple was traveling and not able to copy messages. This observation helped confirm the FBI's suspicions and ultimately led to their arrest and expulsion (along with the rest of the Russian illegals network).
[...] So remember this story next time someone tries to sell you their super-secure one-time-pad-based crypto scheme. If actual Russian spies can't use it securely, chances are neither can you.
Kernel development
Kernel release status
The current development kernel is 5.9-rc6, released on September 20. "The one thing that does show up in the diffstat is the softscroll removal (both fbcon and vgacon), and there are people who want to save that, but we'll see if some maintainer steps up. I'm not willing to resurrect it in the broken form it was in, so I doubt that will happen in 5.9, but we'll see what happens."
Stable updates: 5.8.10, 5.4.66, and 4.19.146 were released on September 17, followed by 5.8.11, 5.4.67, 4.19.147, 4.14.199, 4.9.237, and 4.4.237 on September 23.
Bottomley: Creating a home IPv6 network
James Bottomley has put together a detailed recounting of what it took to get IPv6 fully working on his network. "One of the things you’d think from the above is that IPv6 always auto configures and, while it is true that if you simply plug your laptop into the ethernet port of a cable modem it will just automatically configure, most people have a more complex home setup involving a router, which needs some special coaxing before it will work. That means you need to obtain additional features from your ISP using special DHCPv6 requests."
Cook: Security things in Linux v5.7
Kees Cook catches up with the security-related changes in the 5.7 kernel. "The kernel’s Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With this, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting)."
Quote of the week
/*
 * The worst case is that all tasks preempt one another in a migrate_disable()
 * region and stack on a single CPU. This then reduces the available bandwidth
 * to a single CPU. And since Real-Time schedulability theory considers the
 * Worst-Case only, all Real-Time analysis shall revert to single-CPU
 * (instantly solving the SMP analysis problem).
 */
Development
Firefox 81.0
Firefox 81.0 is out. This version allows you to control media from the keyboard or headset, introduces the Alpenglow theme, adds AcroForm support to fill in, print, and save supported PDF forms, and more. See the release notes for details.
GNOME's new versioning scheme
The GNOME Project has announced a change to its version-numbering scheme; the next release will be "GNOME 40". "After nearly 10 years of 3.x releases, the minor version number is getting unwieldy. It is also exceedingly clear that we're not going to bump the major version because of technological changes in the core platform, like we did for GNOME 2 and 3, and then piling on a major UX change on top of that. Radical technological and design changes are too disruptive for maintainers, users, and developers; we have become pretty good at iterating design and technologies, to the point that the current GNOME platform, UI, and UX are fairly different from what was released with GNOME 3.0, while still following the same design tenets."
Precursor: an open-source mobile hardware platform
Andrew "bunnie" Huang has announced a new project called "Precursor"; it is meant to be a platform for makers to create interesting new devices. "Precursor is unique in the open source electronics space in that it’s designed from the ground-up to be carried around in your pocket. It’s not just a naked circuit board with connectors hanging off at random locations: it comes fully integrated—with a rechargeable battery, a display, and a keyboard—in a sleek, 7.2 mm (quarter-inch) aluminum case." You can't get one yet, but the crowdfunding push starts soon.
Development quotes of the week
Miscellaneous
Linux Journal is Back
Linux Journal has returned under the ownership of Slashdot Media. "As Linux enthusiasts and long-time fans of Linux Journal, we were disappointed to hear about Linux Journal closing its doors last year. It took some time, but fortunately we were able to get a deal done that allows us to keep Linux Journal alive now and indefinitely. It's important that amazing resources like Linux Journal never disappear."
Page editor: Jake Edge
Announcements
Newsletters
Distributions and system administration
- DistroWatch Weekly (September 21)
- Lunar Linux Weekly News (September 18)
- openSUSE Tumbleweed Review of the Week (September 18)
- Ubuntu Weekly Newsletter (September 19)
Development
- Emacs News (September 21)
- These Weeks in Firefox (September 19)
- These Weeks in Firefox (September 23)
- What's cooking in git.git (September 16)
- What's cooking in git.git (September 18)
- What's cooking in git.git (September 22)
- LLVM Weekly (September 21)
- LXC/LXD/LXCFS Weekly Status (September 21)
- OCaml Weekly News (September 22)
- Perl Weekly (September 21)
- Python Weekly Newsletter (September 17)
- Weekly Rakudo News (September 21)
- Ruby Weekly News (September 17)
- Wikimedia Tech News (September 21)
Meeting minutes
- openSUSE board meeting minutes (September 15)
Miscellaneous
- Free Software Foundation Europe Newsletter (September)
Calls for Presentations
CFP Deadlines: September 24, 2020 to November 23, 2020
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
October 1 | December 1 - December 3 | Open Source Firmware Conference | online |
October 7 | November 28 - November 29 | EmacsConf 2020 | Online |
October 11 | November 7 - November 8 | OpenFest 2020 | online |
October 12 | October 19 - October 23 | EPICS collaboration meeting 2020 | Virtual |
October 14 | October 28 - October 29 | eBPF Summit | online |
October 25 | November 5 - November 7 | Ohio LinuxFest | Online |
October 30 | November 21 - November 22 | MiniDebConf - Gaming Edition | Online |
October 31 | February 6 - February 7 | FOSDEM 2021 | Online |
November 1 | November 14 - November 15 | Battlemesh v13 | online |
November 5 | November 10 | S&T 2020 (SQLite & TCL) | Online |
November 6 | January 23 - January 25 | linux.conf.au 2021 | Online |
November 6 | November 16 - November 22 | Guix Days | Online |
November 6 | February 18 - February 20 | DevConf.CZ | Online |
November 10 | December 3 | Live Embedded Event | Online |
November 11 | March 20 - March 21 | LibrePlanet 2021 | Online |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Events: September 24, 2020 to November 23, 2020
The following event listing is taken from the LWN.net Calendar.
Date(s) | Event | Location |
---|---|---|
September 22 - September 24 | Linaro Virtual Connect | online |
September 29 - October 1 | ApacheCon 2020 | Online |
October 2 - October 5 | PyCon India 2020 | Virtual |
October 2 - October 3 | PyGotham TV | Online |
October 3 - October 4 | Handmade Seattle 2020 | Online |
October 6 - October 8 | 2020 Virtual LLVM Developers' Meeting | online |
October 8 - October 9 | PyConZA 2020 | Online |
October 10 - October 11 | Arch Linux Conf 2020 Online | Online |
October 13 - October 15 | Lustre Administrator and Developer Workshop 2020 | Online |
October 15 - October 17 | openSUSE LibreOffice Conference | Online |
October 19 - October 20 | [Virtual] All Things Open 2020 | Virtual |
October 19 - October 23 | EPICS collaboration meeting 2020 | Virtual |
October 19 - October 23 | Open Infrastructure Summit | Virtual |
October 20 - October 23 | [Canceled] PostgreSQL Conference Europe | Berlin, Germany |
October 24 - October 25 | [Cancelled] T-Dose 2020 | Geldrop (Eindhoven), Netherlands |
October 26 - October 29 | Open Source Summit Europe | online |
October 28 - October 29 | [Canceled] DevOpsDays Berlin 2020 | Berlin, Germany |
October 28 - October 29 | eBPF Summit | online |
October 28 - October 30 | [Virtual] KVM Forum | Virtual |
October 29 - October 30 | [Virtual] Linux Security Summit Europe | Virtual |
November 5 - November 7 | Ohio LinuxFest | Online |
November 7 - November 8 | OpenFest 2020 | online |
November 7 - November 8 | RustFest Global | Online |
November 10 | S&T 2020 (SQLite & TCL) | Online |
November 12 - November 14 | Linux App Summit | Online |
November 14 - November 15 | Battlemesh v13 | online |
November 16 - November 22 | Guix Days | Online |
November 21 - November 22 | MiniDebConf - Gaming Edition | Online |
If your event does not appear here, please tell us about it.
Event Reports
Netdev 0x14: slides and papers posted
The slides and papers from the recent Netdev conference have been posted and are available through the schedule.
Security updates
Alert summary September 17, 2020 to September 23, 2020
Dist. | ID | Release | Package | Date |
---|---|---|---|---|
Arch Linux | ASA-202009-6 | | chromium | 2020-09-17 |
Arch Linux | ASA-202009-7 | | netbeans | 2020-09-17 |
Debian | DLA-2375-1 | LTS | inspircd | 2020-09-20 |
Debian | DSA-4764-1 | stable | inspircd | 2020-09-18 |
Debian | DSA-4765-1 | stable | modsecurity | 2020-09-18 |
Fedora | FEDORA-2020-9b9e8e5306 | F32 | chromium | 2020-09-19 |
Fedora | FEDORA-2020-5ed5af6275 | F31 | cryptsetup | 2020-09-19 |
Fedora | FEDORA-2020-e2deb72e0f | F32 | dotnet3.1 | 2020-09-16 |
Fedora | FEDORA-2020-30cd8d9ad6 | F31 | gnutls | 2020-09-19 |
Fedora | FEDORA-2020-5920a7a0b2 | F31 | kernel | 2020-09-16 |
Fedora | FEDORA-2020-3c6fedeb83 | F32 | kernel | 2020-09-16 |
Fedora | FEDORA-2020-48a1ae610c | F31 | mbedtls | 2020-09-16 |
Fedora | FEDORA-2020-7dd29dacad | F31 | mingw-libxml2 | 2020-09-19 |
Fedora | FEDORA-2020-b60dbdd538 | F32 | mingw-libxml2 | 2020-09-19 |
Fedora | FEDORA-2020-16167a66a2 | F32 | python35 | 2020-09-16 |
Fedora | FEDORA-2020-3813e1317b | F31 | seamonkey | 2020-09-20 |
Fedora | FEDORA-2020-15999f707a | F32 | seamonkey | 2020-09-18 |
Mageia | MGASA-2020-0368 | 7 | libraw | 2020-09-17 |
Mageia | MGASA-2020-0369 | 7 | mysql-connector-java | 2020-09-21 |
openSUSE | openSUSE-SU-2020:1183-2 | | ark | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1310-2 | | ark | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1048-1 | | chromium | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1181-1 | | chromium | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1215-1 | | chromium | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1322-1 | | chromium | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1032-1 | | chromium | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1499-1 | 15.1 15.2 | chromium | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1192-1 | | claws-mail | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1494-1 | 15.2 | curl | 2020-09-21 |
openSUSE | openSUSE-SU-2020:1433-1 | | docker-distribution | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1478-1 | 15.1 15.2 | fossil | 2020-09-20 |
openSUSE | openSUSE-SU-2020:1438-1 | | hylafax+ | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1427-1 | | inn | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1232-1 | | knot | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1505-1 | | libetpan | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1454-1 | 15.2 | libetpan | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1458-1 | 15.2 | libjpeg-turbo | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1500-1 | | libqt4 | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1452-1 | 15.1 | libqt4 | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1501-1 | 15.2 | libqt4 | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1428-1 | | librepo | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1455-1 | 15.2 | libvirt | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1465-1 | 15.2 | libxml2 | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1506-1 | | lilypond | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1453-1 | 15.2 | lilypond | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1439-1 | | mumble | 2020-09-16 |
openSUSE | openSUSE-SU-2020:1439-2 | | mumble | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1459-1 | 15.2 | openldap2 | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1509-1 | | otrs | 2020-09-23 |
openSUSE | openSUSE-SU-2020:1475-1 | 15.1 15.2 | otrs | 2020-09-20 |
openSUSE | openSUSE-SU-2020:1055-1 | | pdns-recursor | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1101-1 | | pdns-recursor | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1502-1 | 15.1 | perl-DBI | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1483-1 | 15.2 | perl-DBI | 2020-09-20 |
openSUSE | openSUSE-SU-2020:1423-1 | | python-Flask-Cors | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1446-1 | | python-Flask-Cors | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1100-1 | | singularity | 2020-09-18 |
openSUSE | openSUSE-SU-2020:1497-1 | 15.1 15.2 | singularity | 2020-09-22 |
openSUSE | openSUSE-SU-2020:1468-1 | 15.2 | slurm_18_08 | 2020-09-19 |
openSUSE | openSUSE-SU-2020:1486-1 | 15.2 | virtualbox | 2020-09-20 |
Oracle | ELSA-2020-3732 | OL8 | mysql:8.0 | 2020-09-17 |
Oracle | ELSA-2020-3631 | OL7 | thunderbird | 2020-09-17 |
Red Hat | RHSA-2020:3803-01 | EL7.4 | bash | 2020-09-22 |
Red Hat | RHSA-2020:3804-01 | EL7.4 | kernel | 2020-09-22 |
Red Hat | RHSA-2020:3810-01 | MRG2 | kernel-rt | 2020-09-22 |
Slackware | SSA:2020-266-01 | | seamonkey | 2020-09-22 |
SUSE | SUSE-SU-2020:2715-1 | SES5 | grafana | 2020-09-22 |
SUSE | SUSE-SU-2020:2690-1 | SLE12 | jasper | 2020-09-21 |
SUSE | SUSE-SU-2020:2689-1 | SLE15 | jasper | 2020-09-21 |
SUSE | SUSE-SU-2020:2687-1 | SLE12 | less | 2020-09-21 |
SUSE | SUSE-SU-2020:2711-1 | SLE12 | libmspack | 2020-09-22 |
SUSE | SUSE-SU-2020:2660-1 | OS8 OS9 SLE12 SES5 | libsolv | 2020-09-16 |
SUSE | SUSE-SU-2020:0079-2 | OS8 OS9 SLE12 SES5 | libzypp | 2020-09-16 |
SUSE | SUSE-SU-2020:2712-1 | SLE15 | openldap2 | 2020-09-22 |
SUSE | SUSE-SU-2020:2714-1 | OS9 SLE12 | ovmf | 2020-09-22 |
SUSE | SUSE-SU-2020:2691-1 | SLE15 | ovmf | 2020-09-21 |
SUSE | SUSE-SU-2020:2713-1 | SLE15 | ovmf | 2020-09-22 |
SUSE | SUSE-SU-2020:2718-1 | OS8 | pdns | 2020-09-23 |
SUSE | SUSE-SU-2020:2661-1 | OS7 OS8 OS9 SLE12 SES5 | perl-DBI | 2020-09-16 |
SUSE | SUSE-SU-2020:2698-1 | OS6 OS7 SLE12 | python-pip | 2020-09-21 |
SUSE | SUSE-SU-2020:2699-1 | OS7 OS8 OS9 SLE12 SES5 | python3 | 2020-09-21 |
SUSE | SUSE-SU-2020:2710-1 | SLE15 | rubygem-actionpack-5_1 | 2020-09-22 |
SUSE | SUSE-SU-2020:2686-1 | OS6 OS7 | rubygem-actionview-4_2 | 2020-09-21 |
SUSE | SUSE-SU-2020:2678-1 | OS7 | rubygem-rack | 2020-09-18 |
SUSE | SUSE-SU-2020:2724-1 | OS7 SLE12 | samba | 2020-09-23 |
SUSE | SUSE-SU-2020:2721-1 | OS8 OS9 SLE12 SES5 | samba | 2020-09-23 |
SUSE | SUSE-SU-2020:2673-1 | SLE12 | samba | 2020-09-17 |
SUSE | SUSE-SU-2020:2720-1 | SLE12 | samba | 2020-09-23 |
SUSE | SUSE-SU-2020:2719-1 | SLE15 | samba | 2020-09-23 |
SUSE | SUSE-SU-2020:2722-1 | SLE15 SES6 | samba | 2020-09-23 |
Ubuntu | USN-4513-1 | 16.04 | apng2gif | 2020-09-17 |
Ubuntu | USN-4531-1 | 18.04 20.04 | busybox | 2020-09-22 |
Ubuntu | USN-4528-1 | 16.04 18.04 | ceph | 2020-09-22 |
Ubuntu | USN-4530-1 | 18.04 | debian-lan-config | 2020-09-22 |
Ubuntu | USN-4529-1 | 18.04 | freeimage | 2020-09-22 |
Ubuntu | USN-4516-1 | 18.04 | gnupg2 | 2020-09-17 |
Ubuntu | USN-4533-1 | 20.04 | ldm | 2020-09-22 |
Ubuntu | USN-4534-1 | 12.04 14.04 16.04 18.04 | libdbi-perl | 2020-09-23 |
Ubuntu | USN-4509-1 | 14.04 | libdbi-perl | 2020-09-16 |
Ubuntu | USN-4517-1 | 16.04 18.04 | libemail-address-list-perl | 2020-09-18 |
Ubuntu | USN-4523-1 | 16.04 | libofx | 2020-09-21 |
Ubuntu | USN-4521-1 | 16.04 18.04 20.04 | libpam-tacplus | 2020-09-21 |
Ubuntu | USN-4505-1 | 18.04 | libphp-phpmailer | 2020-09-16 |
Ubuntu | USN-4514-1 | 16.04 18.04 20.04 | libproxy | 2020-09-17 |
Ubuntu | USN-4526-1 | 14.04 16.04 18.04 | linux, linux-aws, linux-aws-hwe, linux-azure, linux-azure-4.15, linux-gcp, linux-gcp-4.15, linux-gke-4.15, linux-hwe, linux-oem, linux-oracle, linux-raspi2, linux-snapdragon | 2020-09-21 |
Ubuntu | USN-4525-1 | 20.04 | linux, linux-azure, linux-gcp, linux-oracle | 2020-09-21 |
Ubuntu | USN-4506-1 | 16.04 | mcabber | 2020-09-16 |
Ubuntu | USN-4507-1 | 16.04 | ncmpc | 2020-09-16 |
Ubuntu | USN-4532-1 | 18.04 | netty-3.9 | 2020-09-22 |
Ubuntu | USN-4522-1 | 16.04 | novnc | 2020-09-21 |
Ubuntu | USN-4504-1 | 16.04 18.04 | openssl, openssl1.0 | 2020-09-16 |
Ubuntu | USN-4519-1 | 16.04 | pulseaudio | 2020-09-17 |
Ubuntu | USN-4515-1 | 16.04 | pure-ftpd | 2020-09-17 |
Ubuntu | USN-4511-1 | 16.04 18.04 20.04 | qemu | 2020-09-17 |
Ubuntu | USN-4520-1 | 16.04 | sa-exim | 2020-09-18 |
Ubuntu | USN-4510-2 | 14.04 | samba | 2020-09-17 |
Ubuntu | USN-4510-1 | 16.04 18.04 | samba | 2020-09-17 |
Ubuntu | USN-4508-1 | 16.04 18.04 20.04 | storebackup | 2020-09-16 |
Ubuntu | USN-4524-1 | 16.04 | tnef | 2020-09-21 |
Ubuntu | USN-4512-1 | 18.04 | util-linux | 2020-09-17 |
Ubuntu | USN-4518-1 | 16.04 | xawtv | 2020-09-17 |
Kernel patches of interest
Kernel releases
Architecture-specific
Build system
Core kernel
Development tools
Device drivers
Device-driver infrastructure
Filesystems and block layer
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Rebecca Sobol