
LWN.net Weekly Edition for December 3, 2020

Welcome to the LWN.net Weekly Edition for December 3, 2020

This edition contains the following feature content:

  • Python structural pattern matching morphs again
  • Mutt releases version 2.0
  • ID mapping for mounted filesystems
  • epoll_pwait2(), close_range(), and encoded I/O
  • Scheduling for asymmetric Arm systems
  • Challenges in protecting virtual machines from untrusted entities

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Python structural pattern matching morphs again

By Jake Edge
December 2, 2020

A way to specify multiply branched conditionals in the Python language—akin to the C switch statement—has been a longtime feature request. Over the years, various proposals have been mooted, but none has ever crossed the finish line and made it into the language. A highly ambitious proposal that would solve the multi-branch-conditional problem (and quite a bit more) has been discussed—dissected, perhaps—in the Python community over the last six months or so. We have covered some of the discussion in August and September, but the ground has shifted once again so it is time to see where things stand.

It seems quite possible that this could be the last major change that is made to the language—if it is made at all. As with many mature projects, there is a good deal of conservatism that tends to rear its head when big changes are proposed for Python. But this proposal has the backing of project founder (and former benevolent dictator for life) Guido van Rossum and has attracted support from other core developers—as well as opposition from within that group. It may also depend on one's definition of major, of course, but large syntactic and semantic language changes are definitely facing stiff headwinds in the Python community these days.

Background

The basic idea behind the "structural pattern matching" proposal is fairly straightforward, but there are some rather deep aspects to it as well. Our previous coverage, as well as the various Python Enhancement Proposals (PEPs) surrounding the feature—linked below—will be helpful to readers who want to dig in a ways. For those who just want the high-level introduction, this example taken from PEP 622 ("Structural Pattern Matching") gives much of the flavor of the proposed feature:

def make_point_3d(pt):
    match pt:
        case (x, y):
            return Point3d(x, y, 0)
        case (x, y, z):
            return Point3d(x, y, z)
        case Point2d(x, y):
            return Point3d(x, y, 0)
        case Point3d(_, _, _):
            return pt
        case _:
            raise TypeError("not a point we support")

The make_point_3d() function uses the proposed match statement to extract the relevant information from its pt argument, which may be passed as a two-tuple, three-tuple, Point2d, or Point3d. The x, y, and z (if present) are matched in the object passed and assigned to those variables, which are then used to create a Point3d with the right values. The use of "_" as a wildcard is consistent with other languages that have similar constructs, and is even used in a similar fashion as a convention in Python, but is perhaps one of the more contentious parts of the proposal. The final case matches anything at all that has not been matched by an earlier case.

If you squint at that example, it looks ... Python-ish, perhaps. But the case entries have some substantial differences from the existing language. In particular, constructs like Point2d(x, y) do not instantiate a Point2d object, but test if the match argument matches that type. If so, x and y are not looked up in the local scope, but are, instead, assigned to. It is different enough from the usual way of reading Python code that some have called it a domain-specific language inside Python for matching, which is seen (by some) as something to be avoided.

Another contentious part of the proposal is the handling of names, which are always treated as variables that get filled in from the match (called "capture variables"), as opposed to looking the name up and using its current value as a constant to be matched. That does not sit well with some, who mainly think that the capture variables should be indicated with some kind of sigil (e.g. ?var); other uses of names should conform to Python's usual practice. But the long list of authors for PEP 622 unanimously agreed that the common capturing case should not be made "ugly" for consistency with other parts of Python. Part of the reasoning is that other languages which have the feature also default to capture variables for unadorned names.

But programmers will want to be able to use constants in their case entries. The first version of PEP 622 required a sigil in the form of a dot prepended to names that should be used as constants (e.g. .CONSTANT), but that was not wildly popular—to put it mildly. Round two of that PEP switched to requiring constants to be in a namespace, which might be seen as something of a cop-out, since that effectively still requires the dot (e.g. namespace.constant).
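
To make the distinction concrete, here is a minimal sketch using the semantics described above (the Color class and the names are invented for illustration):

    class Color:
        RED = 0
        GREEN = 1

    def describe(val):
        match val:
            case Color.RED:    # dotted name: looked up and compared as a value
                return "red"
            case other:        # bare name: a capture variable, matches anything
                return f"something else: {other!r}"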

Three new PEPs

When last we left the saga, PEP 622 was being handed off to the Python steering council for consideration. The council members discussed the PEP among themselves as well as with the PEP's authors. The result of that was announced by one of those authors, Van Rossum, toward the end of October. It turned out that "there were a lot of problems with the text" of PEP 622, so the authors abandoned it in favor of three new PEPs:

  • PEP 634: "Structural Pattern Matching: Specification"
  • PEP 635: "Structural Pattern Matching: Motivation and Rationale"
  • PEP 636: "Structural Pattern Matching: Tutorial"

He also summarized the changes to the proposal that came in the new PEPs. Some of the details were changed and some problems that many users were likely to run into (i.e. "footguns") were turned into errors, but most of the contentious pieces were left unchanged. In particular, no changes were made for the interpretation of names in case entries (so they are capture variables unless they are in a namespace) or for the wildcard character (it remained as "_").

Make that four

A few days before Van Rossum's announcement, steering council member Thomas Wouters posted a PEP addressing the use of "_": PEP 640 ("Unused variable syntax"). It would create a new unused variable that can be assigned to, though the binding (or assignment) is not actually performed and that variable cannot be used in any other way. The PEP proposes to use "?" as that variable.

Currently, some Python code conventionally uses "_" for unused variables, though that name has no special treatment in the language. In particular, the "unused" value does get bound to the name "_". It is often used as follows:

    x, _, z = (2, 3, 4)    # x=2, z=4 (but _=3 as well)

    for _ in range(10):
        do_something()
    # _=9 here

Using "unused", "dummy", or other regular names is possible too, of course. The problem that Wouters (and others) see is that the structural pattern matching proposal gives an additional meaning to "_", but does not extend it to the rest of the language. It is this inconsistency that led to the PEP:

[...] However, the special-casing of ``"_"`` for this wildcard pattern purpose is still problematic: the different semantics *and meaning* of ``"_"`` inside pattern matching and outside of it means a break in consistency in Python.

Introducing ``?`` as special syntax for unused variables *both inside and outside pattern matching* allows us to retain that consistency. It avoids the conflict with internationalization *or any other uses of _ as a variable*. It makes unpacking assignment align more closely with pattern matching, making it easier to explain pattern matching as an extension of unpacking assignment.

There is one other oddity with "_": it has ... interesting ... behavior in the Python read-eval-print loop (REPL), where "_" is normally assigned to the value of the last-executed expression.

    >>> 2+2
    4
    >>> _
    4

If any of that is done in the REPL after the user explicitly assigns to "_", though, it always holds the last value that was assigned. So there is a fair amount of established usage of "_" that PEP 640 is trying to sidestep.
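
Under PEP 640, the examples above would instead be written roughly as follows (a sketch of the proposed syntax, which exists in no released Python):

    x, ?, z = (2, 3, 4)    # x=2, z=4; nothing at all is bound for the middle element

    for ? in range(10):
        do_something()
    # no loop variable lingers here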

In Wouters's posting, he noted that adding "?" as the unused variable had benefits entirely independent of the pattern matching proposal, but he believes those benefits are too small to justify the change on its own if PEP 634 is not adopted. So he thinks that PEP 640 should be rejected in that case. The reaction to the PEP was generally somewhat negative, though there was not a lot of discussion of the PEP itself in that thread. The main objection is that debugging will be difficult when the unused variable's value cannot be queried.

Or five

Van Rossum's announcement of the three PEPs was also met with a fairly abbreviated thread (at least by the standards set in earlier rounds) that mostly consisted of tangential discussions on various pieces. But, as he was with PEP 622, Mark Shannon is not convinced that this form of pattern matching is needed at all in the language. He argued that it is a bad fit for a dynamically-typed procedural language like Python and that PEP 635 fails to offer a convincing case for the value of the feature (though the arguments have improved since PEP 622, he said).

Shannon had a number of specific areas where he believes that the proposal falls short, which were mostly met with disagreement, but Nick Coghlan noted that he shared some of Shannon's concerns. In fact, Coghlan had just posted an announcement of PEP 642 ("Constraint Pattern Syntax for Structural Pattern Matching") addressing some of those problems. His idea is that the existing assignment syntax can be tweaked slightly to accommodate pattern matching, while retaining the possibility that it could be used elsewhere in the language down the road.

In the original version of the PEP, Coghlan combines literal and value (e.g. namespace.constant) patterns from PEP 634 into "constraint patterns". These constraint patterns can be tested either for equality or identity in a case. He used "?" as a prefix for equality and "?is" for identity and replaced the non-binding "_" wildcard with "?". The end result is that names are looked up and literals used if they are marked with "?"; literals that are not marked would raise a SyntaxError. It would look something like:

    MISSING = 404
    match foo:
        case ?0:
            print('foo equals zero')
        case ?is None:
            print('foo is None')
        case ?MISSING:
            print('foo not found (404)')
        case (a, b):
            print(f'foo is a two-tuple: {a} {b}')
        case _:      # still works, _ is just a normal capture variable
            print('foo is something wildly unexpected')

Steven D'Aprano did not like the PEP, but he had several suggestions, some of which were subsequently adopted by Coghlan. In particular, he dropped the need to have equality markers for literal values and switched away from using "?" entirely. Literal patterns are simply "case 0:", equality uses "==", and identity uses "is". D'Aprano also suggested that the problem with "_" in match is overblown:

I really don't get why so many people are hung up over this minuscule issue of giving `_` special meaning inside match statements. IMO, consistency with other languages' pattern matching is more useful than the ability to capture using `_` as a variable name.

Wouters sees things differently, however:

Allow me to explain, then: structured pattern matching is (even by admission of PEPs 634-636) an extension of iterable unpacking. The use of '_' as a wildcard pattern is a sharp break in that extension. In the structured pattern matching proposal, '_' is special syntax (and not in any way less so than '?') but *only* in cases in match statements, not in iterable unpacking. It *already* isn't consistent with '_' in other languages, and we can't fix that without breaking uses of _ for gettext, not to mention other situations existing code uses '_' as something other than an assign-only variable.

[...] The use of something else, like '?', leaves existing uses of '_' unambiguous, and allows structured pattern matching and iterable unpacking to be thought of the same. It reduces the complexity of the language because it no longer uses the same syntax for disparate things.

Tobias Kohn, one of the PEP 622 authors and co-author of PEP 635 with Van Rossum, noted that the idea of "load sigils" had been discussed and, in fact, the authors had settled on dot (".") for that case, but it proved to be unpopular. Kohn said that there is nothing in the current structural pattern matching proposal that precludes adding, say, "?" as a load sigil in the future. But he thinks those kinds of things can wait:

You might have noticed that the original PEP 622 contained a lot more than the current PEPs 634-636. This is intentional: with the current pattern matching PEPs, we boiled down the entire concept to the basic infrastructure that we need in order to get it going; a basic "starter kit" if you will. [...] But let us perhaps just start with pattern matching---hopefully in 3.10 :)---and then gradually build on that. Otherwise, I am afraid we will just keep running in circles and never get it to lift off.

Deciding

While there are five PEPs floating around, two of them are informational in nature (635 and 636), so the steering council needs to decide if it will accept PEP 634 and add structural pattern matching to the language. It also needs to decide whether to augment or modify the feature with either PEP 640 to add "?" as an unused variable or PEP 642 to add constraint patterns and, effectively, load sigils. It could choose to adopt all three since Coghlan had switched PEP 642 to use "__" (double underscore) as its wildcard matching variable.

It is a complicated set of questions; if anything is adopted, it seems likely to have a significant impact for the language for a long time to come. The 2020 steering council will not be making the decision, however. The election for the 2021 steering council is currently underway; it completes on December 16. As reported by Wouters in early November, the current council will make a strong recommendation on the PEPs to the incoming council, which will make the final determination. There is no huge rush since the schedule for Python 3.10 shows the first beta, which is also the feature-freeze date, in early May 2021.

As part of the effort to make that recommendation, steering council member Brett Cannon posted a poll to the Python Discourse instance. He posted to the "Committers" category, where only core developers can comment and answer the poll. There were five options, one rejecting pattern matching entirely, three accepting PEP 634 with and without the other PEPs, and one for those who want pattern matching but not as defined in any of the PEPs.

When the voting closed on November 23, the split among core developers was clear. Half of the 34 voters wanted to accept PEP 634 in some form, while 44% (15 voters) did not want pattern matching at all and two voters (6%) wanted pattern matching but not as proposed. The poll is not binding in any way, of course, but it is indicative of the fault lines in the community with regard to the feature. Whichever way the council decides, it is likely to leave a sizable contingent unhappy.

Several commented in the poll thread about why they were voting one way or another; those in favor tended to see ways they could use the feature in their own code and were not overly bothered by any perceived inconsistencies. For the "no pattern matching" folks, Larry Hastings may have spoken for many of them when he said:

[...] The bigger the new syntax, the higher the bar should become, and so the bigger payoff the new syntax has to provide. To me, pattern matching doesn’t seem like it’s anywhere near big enough a win to be worth its enormous new conceptual load.

I can see how the PEP authors arrived at this approach, and I believe them when they say they thought long and hard about it and they really think this is the best solution. Therefore, since I dislike this approach so much, I’m pessimistic that anybody could come up with a syntax for pattern matching in Python that I would like. That’s why I voted for I don’t want pattern matching rather than I want pattern matching, but not as defined in those PEPs. It’s not that I’m against the whole concept of pattern matching, but I now believe it’s impossible to add it to Python today in a way that I would want.

There is a great deal more discussion in the python-dev mailing list for those who might want to dig in further. Coghlan's post of version two of PEP 642 and a suggestion by David Mertz to use words rather than sigils both led to interesting discussions. Paul Sokolovsky pointed participants to a recent academic paper [PDF] written by the authors of PEP 622 about pattern matching for Python; the paper sparked some discussion. Shannon also posted about some work he has been doing to define the precise semantics of pattern matching, which is something that is currently lacking. And so on.

It is, in short, one of the most heavily discussed Python features of all time. It seems likely that it even surpasses the discussion in the "PEP 572 mess", which brought the walrus operator (":=") to Python, but also led to Van Rossum's retirement. But maybe it only seems that large. In any case, the soon-to-be-elected steering council is in something of an unenviable position, but it seems clear that the question of this style of pattern matching for Python will finally be laid to rest early in 2021—one way or the other.

Comments (25 posted)

Mutt releases version 2.0

November 25, 2020

This article was contributed by Lee Phillips

The venerable email client Mutt has just reached version 2.0. Mutt is different from the type of client that has come to dominate the email landscape—for one thing, it has no graphical interface. It has a long history that is worth a bit of a look, as are its feature set and extensive customizability. Version 2.0 brings several enhancements to Mutt's interface, configurability, and convenience. In this article, readers who are unfamiliar with Mutt will learn about a different way to deal with the daily chore of wrangling their inboxes, while Mutt experts may discover some new sides to an old friend.

What is Mutt?

Mutt is an open-source (Gitlab repository), GPL-licensed email client that runs in a terminal. When it was first released as Mutt 1.0 in October 1999, it adhered to the Unix philosophy of doing one thing well: being a mail user agent (MUA). Therefore it did not include an editor, image viewer, or an HTML renderer, farming these functions out to external programs. It did include POP and IMAP support, so that it could retrieve mail, but it did not deliver mail over the internet, handing that job to a locally-installed mail submission agent (MSA) instead.

Most email clients take on functions beyond their role as MUAs. Programs such as Thunderbird or the commercial offerings for macOS and Windows incorporate editors, viewers for attachments, and address books, and can directly speak the Simple Mail Transfer Protocol (SMTP). They often include even more, such as task management or calendaring. The widely-used Gmail service is a web application that automatically assumes all of the media display features of the browser and is closely integrated with many of Google's other services.

Mutt retained its tight focus for many years, but made one major compromise in April 2007. With v1.5.15, Mutt could deliver email using the SMTP protocol. No longer would users need to have a local copy of Sendmail or Postfix installed, because now Mutt could authenticate with and talk directly to their ISPs' mail servers in order to send email.

Mutt was written by Michael Elkins in 1995 with an interface inspired by Elm, incorporating features from Pine and Mush, resulting in a mixed breed, hence the name. Mutt has an official slogan: "All mail clients suck. This one just sucks less."

Why use Mutt?

Mutt has two main advantages: efficiency and configurability. It appeals to those who spend hours tuning their .vimrc files (mea maxima culpa) and who demand that their programs start up and respond instantly.

Mutt is extremely configurable in the way that it presents information and in its command interface. In order to take advantage of this, the user must become familiar with the basics of Mutt's pattern language, which is a set of codes that refer to properties of messages. All the details are in Mutt's exhaustive manual; in this article I'll provide some examples, to give a bit of the flavor of what it is like to work with Mutt.

There are two main places where Mutt presents information: the index is the list of messages, and the pager shows the contents of a message. Upon startup, the user sees the index; when a message is selected for viewing, the window is split: a small portion on top to see the index entry of the message with some context, and a larger portion with the email.

The following screen shot shows Mutt with one message selected for viewing. The colors of each email header, as well as each level of quoting, can be specified in the startup file. The screen shot shows the command reminder on the top line, with the index pane beneath it. This has the "cursor", indicated by black text on a cyan background, on message four, which is the message displayed in the large pane at the bottom.

[Mutt interface]

One of Mutt's unusual strengths is that it actually displays message threads correctly. When people complain that they have trouble following email conversations involving groups of people, what they typically mean is that they are using a client such as Gmail that does not properly display the reply structure of email messages. When an email user hits the reply key, an In-Reply-To header should be added to the email. Mutt refers to these headers, and to the related References header, to keep track of which messages replied to what, and builds a tree to show the structure of the conversation. Some other mail clients attempt to construct this tree using only the Subject header, which can wind up joining unrelated messages and incorrectly breaking a thread when a respondent alters the subject of a reply.

Here is a screen shot of the same inbox, with a little conversation about a performance review added. Mutt displays a visual representation of the conversation tree in the index, showing which messages reply to which, even when participants change the Subject line. Message six is a reply from the CEO to the boss's first email, for example. In message seven I'm writing to someone else about my performance review, in a new message, so it's not part of the tree.

[Tree of replies]

After importing this group of messages into my Gmail account, here is what I see:

[Gmail 'tree']

I have activated the "Conversation view" setting, which is supposed to group messages into conversations. But Gmail fails to do this correctly. It has created two pairs of related messages, shown by the numeral "2", requiring the user to click on those lines to see the individual messages. But it has lost the thread of all the other messages, displaying them as unrelated. I have dwelt on this comparative example at some length because it is a good answer to the question in the heading: Mutt's conversation threading is sufficient reason for some to stick with it in preference to more "modern" clients. Perhaps a third quality, "correctness", should be added to the efficiency and configurability mentioned above.

Mutt goes beyond simply displaying the conversation tree, also allowing the user to edit it. There are commands to detach a message from the conversation or mark a message as being a reply to another message. This is helpful because some messages will come from correspondents whose clients fail to set the In-Reply-To header, and others will commit the offense of "thread hijacking," often seen on mailing lists, where a user replies to a message with something that is not an actual reply, but the beginning of a new topic.

The user can configure the colors of lines in the index based on almost any property of the message, including its origin, subject, and status (replied to, read, unread, marked, etc.), or even depending on the contents of the message body. Effective use of Mutt's coloring requires a terminal with 256-color support. For fine-grained control over colors, I find it best to use the terminal color codes. Here is an example of a coloring command; it uses the pattern ~F to indicate a message that has been "flagged", or marked as important:

    color index color124 color234 ~F

If this line is present in Mutt's startup file, .muttrc in the user's home directory, then lines in the index for flagged messages will appear with red (color124) text on a nearly black (color234) background.

Colors can also be controlled by a scoring system, where points can be added to or subtracted from a message based on header criteria, and a color assigned based on the total score. For example, these configuration commands:

    score '~f boss' +10
    score '~s meeting' -5
    score '~s party' +20
    color index color208 color234 "~n 10"
    color index black red  "~n 20"
    color index magenta black "~n 5"

give email from the boss (~f boss) a ten point bonus, but any message with "meeting" in the subject line (~s meeting) gets knocked down by five points. Really important emails, about parties, get a 20 point boost. The last three commands set colors in the index, based on the message scoring, and using both color names and numbers; the text color comes first in each command, followed by the background color.

Here is the result:

[Scored headers]

The last row, with the default black text on a cyan background, shows the position of the cursor.

Mutt puts a lot of expressive power in the user's fingertips. Its pattern language can be used to construct searches using any property of the messages. For example, after pressing "/" to enter the search prompt, the command:

    ~fboss~d<2d~smeeting

searches for emails from boss, less than two days old, about meetings, placing the cursor on the first message found. Hitting "n" takes the user to subsequent matches. The same commands entered after typing "l" limit the index display to messages satisfying the criteria, giving the user a filtered view of their mailbox; messages not matching the search criteria are temporarily hidden.

In addition to manually entered commands, hooks can be defined in Mutt's startup file that are executed before taking certain actions. A hook can be set to run before replying to a specific user, when changing mailboxes or accounts, when editing an email, and more. The user can also define macros, binding arbitrary sequences of commands to single keystrokes, and these bindings can have different meanings in different contexts; a keystroke can do one thing while reading an email, something else when looking at the mailbox index, and a third thing when viewing a message's attachment index.
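
For example, a hook and a macro in the startup file might look like this (a sketch; the address and key choice are invented, though send-hook and macro are standard Mutt commands):

    # use a work signature when writing to anyone at bigco
    send-hook '~t @bigco\.xyz' 'set signature=~/.sig-work'

    # in the index, "A" tags everything older than 30 days and saves it to =archive
    macro index A "<tag-pattern>~d>30d<enter><tag-prefix><save-message>=archive<enter>"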

In addition to Mutt's two main information panes, Mutt has an attachment pane, shown in the next figure, to display an index of the attachments for a particular message. Here the user can view, delete, save, pipe to a shell command, and do other things to attachments individually or in groups. When "viewing" an attachment, Mutt will refer to the user's .mailcap file, or, if that doesn't exist, to the system-wide one, to find out which program—or, in fact, shell pipeline—to use to handle it.

[Attachment index]

Mutt can also be instructed to automatically convert some attachment types to text for direct viewing in the pager. This is most commonly done in order to read HTML email using programs such as w3m or Lynx. If the text view is not enough, a couple of keystrokes pops the attachment over to the user's graphical browser. Mutt is also designed to work with OpenPGP to sign and encrypt email, which will become a useful feature when people decide to use encryption.
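
A typical setup for this (a sketch, assuming w3m is the chosen renderer) pairs an auto_view command in .muttrc with a .mailcap entry marked copiousoutput, which tells Mutt that the program writes plain text to standard output:

    # in ~/.muttrc
    auto_view text/html

    # in ~/.mailcap
    text/html; w3m -dump -T text/html %s; copiousoutput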

In addition to its curses-based interface, Mutt can be used from the command line for batch processing of email. Here is the command to send an email to the boss with a subject of "Performance review", a PDF file as an attachment, and using the contents of the file reviewComments.txt as the message body:

    mutt -s "Performance review" boss@bigco.xyz -a signedreview.pdf < reviewComments.txt

Mutt's remarkable efficiency can change one's approach to email. The program can start up and present a properly sorted and threaded index of thousands of emails in under a second. Searching and filtering are effortless and practically instantaneous. For these reasons, I don't bother maintaining separate mailboxes; there is no need to decide how email should be organized, as I can create any view into my inbox to serve the needs of the moment, on the fly. Instead of "inbox zero", my guideline is "inbox 10,000", because that is the point where Mutt may take longer than two seconds to open my mailbox. At that point, with just a few keystrokes I can archive, say, the oldest non-flagged 5,000 messages, an operation that is also nearly instantaneous.

For some, Mutt's advantages will not outweigh what they consider to be its inconveniences. The chief one of these is its lack of integrated HTML rendering, made more acute by the current preponderance of HTML email. Although Mutt handles HTML attachments as described above, there are those who dislike switching to another program, and prefer everything to be included. Such users may also prefer programs with integrated calendaring, task management, GUI message composition, spam detection, auto-replies, and other features that will likely never be part of Mutt, as they are alien to its MUA-focused nature. However, these users could consider enhancing Mutt with some of the wide variety of patches and add-ons available.

New in 2.0

Version 2.0 of Mutt was released on November 7, 2020. There are a few minor incompatibilities with previous releases, which means that existing startup files may not work, or may work differently. These incompatibilities are the reason for the major version bump to 2, although, as one data point, my existing setup worked without change. The new version introduces a few new features that make Mutt even easier to work with.

Those who want the new version now will probably need to download the source and compile it, except for users of distributions such as Arch that already incorporate the new release. Compiling is quick and there are no unusual dependencies.

The screen shots above illustrate the default background and foreground colors for the "indicator", showing where the cursor is in the index. Of course any colors can be chosen for this purpose, but in versions before 2.0 those colors replaced the user's carefully chosen colors for showing the status and properties of the message. I've always found this annoying, and, apparently, I was not alone, because in the new version the cursor can be made less obtrusive. The following figure shows the mailbox in the first figure again, but now using Mutt 2.0 with the indicator set to be an underline and the new option set cursor_overlay added to the startup file. In the figure, the cursor is on the first message, but the color can still be seen:

[Underline cursor]

It may seem like a small thing, but an email program is something that many of us use all day long, so any improvement that makes it easier to see what is going on is most welcome.

Another interesting addition is something the developers are calling "MuttLisp". This is not a full-blown language like Emacs Lisp, but an enhancement to the configuration syntax, allowing the use of S-expressions for making settings based on conditions, and constructing commands by manipulating strings. Users are warned that this feature is experimental and the details are expected to change in future releases.
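
As a taste of what this looks like, something along these lines, based on the release documentation (so treat the exact forms as approximate), chooses a configuration file by host:

    # source a host-specific configuration file at startup
    run (if (equal $hostname "work") "source ~/.mutt/work" "source ~/.mutt/home")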

In version 2.0, the user can get a reminder of the 46 possible things in the pattern language that can follow a "~" when entering a search or limit command, by hitting the TAB key. Mutt users can also now address mail directly to IPv6 addresses, such as:

    To: toaster@[IPv6:ffbc:b47:ba4f:94ed:4425:492:d948:8aee]

This may be useful in communicating with Internet of Things devices.

Another enhancement in the latest version will be welcomed by those whose IMAP connection may not always be reliable. Now Mutt will automatically attempt to reconnect if the connection drops or if the server stops responding. The other additions in version 2.0 are nearly all additional configuration parameters whose description would take us pretty far into technical Mutt minutiae.

Getting started

It can be a little daunting to get started with Mutt. The manual is a great place to look things up when you're already familiar with the program; it is complete to a fault. But it is full of terms and ideas used without explanation. Fortunately, many helpful articles about Mutt have been written over its long lifetime, including this getting-started guide. Two others, "A Quick Guide to Mutt" and this Linux Journal article are a bit older, but still useful. When getting started, it may be helpful to have examples of other users' configurations; there is a list of those here. There are mailing lists, including mutt-users where pleas for help are welcome, as well as others for announcements and for developers.

When I was learning Mutt some decades ago I discovered that it allowed me to freely edit message headers, and of course I immediately amused myself by sending off several emails to friends from "God@heaven.com" and Santa Claus. Today, although I've mostly gotten over this phase, I still routinely do things in Mutt that cannot be done in other email clients, such as Gmail, because there are no commands to do them. Mutt is one of those power tools that demand a significant investment from the user to learn its ways and to get it set up and configured. For some of us, who "live in email," this investment pays off handsomely—but it's certainly not for everyone.

Comments (31 posted)

ID mapping for mounted filesystems

By Jonathan Corbet
November 19, 2020

Almost every filesystem (excepting relics like VFAT) implements the concept of the owner and group of each file; the higher levels of the operating system then use that information to control access to those files. For decades, it has usually sufficed to track a single owner and group for each file, but there is an increasing number of use cases wanting to make that ownership relative to the environment any given process is running in. Developers have been working for a few years to find solutions to this problem; the latest attempt is the ID-mapped mounts patch set from Christian Brauner.

In truth, the ID-mapping problem is not exactly new. User and group IDs for files only make sense across a management domain if there is a single authority controlling the assignment of those IDs. Since that is often not the case, network filesystems like NFS have had the ability to remap IDs for many years. The growth of virtualization and container technologies has brought the problem closer to home; there can be multiple management domains running on a single machine. The NFS ID-remapping mechanism is of little use if NFS itself is not being used.

For example, container runtime systems may want to provide a common root image to each container. User namespaces may be used to ensure that each container is running with a set of nonprivileged IDs on the host system, but those containers should be able to access their root images with root privileges. Mounting that image with ID remapping would make this possible. Similarly, ID remapping would make it easier to share filesystems between containers regardless of the IDs in use within each container. Or consider systemd-homed, which provides consistent access to a user's home directory across machines. If a user logs into a system and is given a user ID that doesn't match the ownership of their home directory, systemd-homed will change the ownership of all files in and below the home directory — not an especially efficient operation. ID remapping would solve the problem in a more satisfying way.

There have been a number of previous attempts to address these use cases. The shiftfs filesystem was designed to be stacked on top of an ordinary filesystem; it would then remap user and group IDs in operations as they passed through. That idea then evolved into shifting bind mounts, which moved the ID-mapping function into the virtual filesystem (VFS) layer. Shortly after that, Brauner proposed FSID mappings, which repurposed the kernel's filesystem-ID abstraction to perform the remapping. Now, with ID-mapped mounts, the remapping is again handled within the VFS, but with a twist.

This patch set adds a new pointer to the vfsmount structure that represents a mounted filesystem; this pointer, called mnt_user_ns, points to a user namespace. One of the key features of user namespaces is, of course, ID remapping; a process that is running within a user namespace will already have its user and group IDs remapped for any operation, including filesystem operations, that reaches outside of the namespace. But user namespaces have a single map that applies to all operations, and to all mounted filesystems; attaching a user namespace to the vfsmount structure allows every mounted filesystem to have a different mapping.

Setting up ID-mapped mounts, thus, involves the creation of user namespaces to contain the ID-mapping tables. These user namespaces will, most likely, never have processes running within them; in a sense, much of their functionality is wasted in this context. But this approach made it possible to use all of the existing ID-mapping helpers, while creating a more focused ID-mapping abstraction would require duplicating much of that functionality.

By default, mounted filesystems will point to the initial user namespace, which is taken as an indication that no remapping is to be done at that layer. Code that wants to add ID mapping to a mounted filesystem has to start by creating a new user namespace; this is a bit of a roundabout procedure that is not directly supported by the kernel. In a sample mount-idmapped tool written by Brauner, this task is done by creating a new process within its own user namespace. The child process does nothing but suspend itself with a SIGSTOP signal while the parent creates a reference to the child's user namespace by opening the associated /proc file.

The next step is to establish the ID mapping in the newly created user namespace; this is done by writing appropriate values to the uid_map and gid_map files in the child process's /proc directory. Once that has been done, the child can just be killed off; the open file descriptor to its user namespace will ensure that it will stay around after the process is gone.
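
Condensed into code, that dance looks something like this (a sketch with error handling omitted; Brauner's mount-idmapped tool performs these steps more carefully):

    /* Sketch: create a user namespace that holds an ID mapping and
     * return an open file descriptor referring to it. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int get_mapped_userns_fd(void)
    {
        char path[64];
        pid_t pid = fork();

        if (pid == 0) {
            unshare(CLONE_NEWUSER);     /* child enters a fresh user namespace */
            raise(SIGSTOP);             /* park until the parent is done */
            _exit(0);
        }
        waitpid(pid, NULL, WUNTRACED);  /* wait for the child to stop */

        /* Map ID 0 inside the namespace to ID 100000 outside, range of 1;
         * gid_map is set the same way (after handling setgroups). */
        snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
        int mapfd = open(path, O_WRONLY);
        write(mapfd, "0 100000 1", 10);
        close(mapfd);

        /* an open fd keeps the namespace alive after the child dies */
        snprintf(path, sizeof(path), "/proc/%d/ns/user", pid);
        int userns_fd = open(path, O_RDONLY);

        kill(pid, SIGKILL);             /* the helper child is no longer needed */
        waitpid(pid, NULL, 0);
        return userns_fd;
    }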

Actually associating the user namespace is done with the mount_setattr() system call, which is also added by this patch set:

    struct mount_attr {
        __u64 attr_set;
        __u64 attr_clr;
        __u64 propagation;
        __u64 userns_fd;
    };

    int mount_setattr(int dfd, const char *path, unsigned int flags,
                      struct mount_attr *attr, size_t attr_size);

The attr_set and attr_clr fields of the mount_attr structure describe the attributes to be set and cleared, respectively; propagation controls whether this operation affects only the filesystem indicated by dfd and path or whether it also affects all filesystems currently mounted underneath it. To add ID mapping to a filesystem, the caller (who must have the CAP_SYS_ADMIN capability in the current patches) should set MOUNT_ATTR_IDMAP in attr_set, and set userns_fd to the file descriptor for the relevant user namespace.
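
Attaching the mapping created by the helper above might then look like this (a sketch against the proposed API; the mount point is illustrative):

    struct mount_attr attr = {
        .attr_set  = MOUNT_ATTR_IDMAP,
        .userns_fd = get_mapped_userns_fd(),  /* helper from the sketch above */
    };

    /* remap IDs on the filesystem mounted at /mnt/image */
    if (mount_setattr(AT_FDCWD, "/mnt/image", 0, &attr, sizeof(attr)) < 0)
        perror("mount_setattr");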

While ID mapping can apparently be set up for any filesystem mount, the feature is expected to be mostly used with bind mounts, which create a new view of an existing filesystem. The above-linked cover letter for the patch series gives a number of examples of how this capability could be used. A simple one involves just providing a view of a directory with the files owned by a different user ID. Another creates an identity mapping (so IDs don't change), but that mapping lacks user ID 0, preventing access as root. Filesystems without the concept of user IDs (such as VFAT) can have those IDs grafted onto them with ID-mapped mounts. And so on.

The previous posting of this patch set generated a certain amount of interest. This work seems to have the approval of the VFS developers, which is a significant hurdle that any patch in this area must overcome. So it might just be that a solution to the ID-mapping problem has finally been found and there will be no need for yet another attempt — maybe.

Comments (8 posted)

epoll_pwait2(), close_range(), and encoded I/O

By Jonathan Corbet
November 20, 2020
The various system calls and other APIs that the kernel provides for access to files and filesystems have grown increasingly comprehensive over the years. That does not mean, though, that there is no need or room for improvement. Several relatively small additions to the kernel's filesystem-related API are under consideration in the development community; read on for a survey of some of this work.

Higher-resolution epoll_wait() timeouts

The kernel's "epoll" subsystem provides a high-performance mechanism for a process to wait on events from a large number of open file descriptors. Using it involves creating an epoll file descriptor with epoll_create(), adding file descriptors of interest with epoll_ctl(), then finally waiting on events with epoll_wait() or epoll_pwait(). When waiting, the caller can specify a timeout as an integer number of milliseconds.
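
In outline, the sequence looks like this (a minimal sketch; sock stands for some already-open file descriptor of interest):

    #include <sys/epoll.h>

    /* wait up to 100ms for input on the already-open descriptor "sock" */
    int wait_for_input(int sock)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };

        epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);   /* register interest */

        struct epoll_event events[64];
        return epoll_wait(epfd, events, 64, 100);    /* timeout in milliseconds */
    }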

The epoll mechanism was added during the 2.5 development series, and became available in the 2.6 release at the end of 2003. Nearly 20 years ago, when this work was being done, a millisecond timeout seemed like enough resolution; the kernel couldn't reliably do shorter timeouts in any case. In 2020, though, one millisecond can be an eternity; there are users who would benefit from much shorter timeouts than that. Thus, it seems it is time for another update to the epoll API.

Willem de Bruijn duly showed up with a patch set adding nanosecond timeout support to epoll_wait(), but it took a bit of a roundabout path. Since there is no "flags" argument to epoll_wait(), there is no way to ask for high-resolution timeouts directly. So the patch set instead added a new flag (EPOLL_NSTIMEO) to epoll_create() (actually, to epoll_create1(), which was added in 2.6.27 since epoll_create() also lacks a "flags" argument). If an epoll file descriptor was created with that flag set, then the timeout value for epoll_wait() would be interpreted as being in nanoseconds rather than milliseconds.

Andrew Morton, however, complained about this API. Having one system call set a flag to change how arguments to a different system call would be interpreted was "not very nice" in his view; he suggested adding a new system call instead. After a bit of back and forth, that is what happened; the current version of the patch set adds epoll_pwait2():

    int epoll_pwait2(int fd, struct epoll_event *events, int maxevents,
                     const struct timespec *timeout, const sigset_t *sigset);

In this version, the timeout is passed as a timespec structure, which includes a field for nanoseconds.
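
With the new call, a sub-millisecond timeout becomes straightforward (a sketch against the proposed prototype, reusing epfd and events from the sketch above):

    /* wait with a 250µs timeout, which milliseconds cannot express */
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 250000 };
    int n = epoll_pwait2(epfd, events, 64, &ts, NULL);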

There has been some discussion of the implementation of this system call, but not a lot of comments on the API, so perhaps this work will go forward in this form. Your editor cannot help but note, however, that this system call, too, lacks a "flags" argument, so the eventual need for an epoll_pwait3() can be readily foreseen.

close_range() — eventually

The close_range() system call was added in the 5.9 release as a way to efficiently close a whole list of file descriptors:

    int close_range(unsigned int first, unsigned int last, unsigned int flags);

This call will close all file descriptors between first and last, inclusive. There is currently one flags value defined: CLOSE_RANGE_UNSHARE, which causes the calling process's file-descriptor table to be unshared from any other processes before the indicated descriptors are closed.

In this patch set, Giuseppe Scrivano adds another flag, CLOSE_RANGE_CLOEXEC. This flag causes close_range() to set the "close on exec()" flag on each of the indicated file descriptors rather than closing them; they will then be closed if and when the calling process does an exec() in the future. This is, presumably, faster than executing a loop and setting the flag with fcntl() on each file descriptor individually.

The functionality seems useful, and there have not really been any complaints about the API (there were some issues with the implementation in previous versions of the patch set). Given that close_range() is taking on more functionality that does not involve actually closing files, though, it seems increasingly clear that this system call is misnamed. It has only been available since the 5.9 release on October 11, so there are not yet C-library wrappers for it in circulation. So there is time to come up with a better name for this system call, should the desire to do so arise.
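
Until wrappers appear, using the new flag means invoking the system call directly; here is a sketch (the CLOSE_RANGE_CLOEXEC value is taken from the patch posting and should be treated as an assumption):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef CLOSE_RANGE_CLOEXEC
    #define CLOSE_RANGE_CLOEXEC (1U << 2)   /* from the patch set; an assumption */
    #endif

    /* mark everything above stderr close-on-exec, without closing it now */
    syscall(__NR_close_range, 3U, ~0U, CLOSE_RANGE_CLOEXEC);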

Encoded I/O

Some filesystems have the ability to compress and/or encrypt data written to files. Normally, this data will be restored to its original form when read from those files, so users may be entirely unaware that this transformation is taking place at all. What if, however, somebody wanted the ability to work with this "encoded" data directly, bypassing the processing steps within the filesystem code? Omar Sandoval has a patch set making that possible.

The main motivation for this work appears to be backups and, in particular, the transmission of partial or full filesystem images with the Btrfs send and receive operations. The whole point of using this mechanism is to create an identical copy of a Btrfs subvolume on another device. If the subvolume is using compression, a send will currently decompress the data, which must then be recompressed on the receive side, ending up in its original form. If there is a lot of data involved, this is a somewhat wasteful operation; it would be more efficient to just transmit the compressed data.

With this patch set applied, it becomes possible to read the compressed and/or encrypted data directly and write it directly, with no intervening processing. The first step is to open the subvolume with the new O_ALLOW_ENCODED flag. The CAP_SYS_ADMIN capability is needed to open a subvolume in this mode; imagine what could happen if an attacker were to write corrupt compressed data to a file, for example. Dave Chinner argued early on that corrupt data should just be treated as bad data and this operation could be unprivileged, but that view did not win out.

Then, encoded data can be read or written using the preadv2() and pwritev2() system calls (the variants that accept a flags argument). The new RWF_ENCODED flag must be used to indicate that encoded data is being transferred. A normal invocation of these system calls takes an array of pointers to iovec structures describing the buffers to be transferred; when encoded I/O is being done, though, the first pointer instead refers to an instance of the new encoded_iov structure type:

    struct encoded_iov {
        __aligned_u64 len;
        __aligned_u64 unencoded_len;
        __aligned_u64 unencoded_offset;
        __u32 compression;
        __u32 encryption;
    };

The len field must contain the length of this structure; it is there in case new fields are added in the future. The unencoded_len and unencoded_offset fields describe the portion of the file affected by this operation; the compression and encryption fields contain filesystem-dependent values describing the type of compression and encryption applied. All other pointers in the iovec array point to actual iovec structures describing the data to transfer.
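
Assembled into a call, a compressed write might look like this (a sketch against the proposed interface; the compression constant is invented for illustration, and compressed_buf holds data already in the filesystem's on-disk format):

    /* the first iovec entry carries the metadata, the rest the data */
    struct encoded_iov enc = {
        .len              = sizeof(enc),
        .unencoded_len    = 128 * 1024,   /* what the data decodes to */
        .unencoded_offset = 0,
        .compression      = FS_COMPRESSION_ZSTD,  /* hypothetical constant */
    };
    struct iovec iov[] = {
        { &enc,           sizeof(enc)    },
        { compressed_buf, compressed_len },
    };
    ssize_t ret = pwritev2(fd, iov, 2, offset, RWF_ENCODED);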

The patch set includes support for reading and writing compressed data from a Btrfs filesystem. There is also a follow-on patch set working this support into the send and receive operations. Benchmarks included there show a significant reduction in bandwidth required to transmit the data, reduced CPU time usage and, in some cases, reduced elapsed time as well.

This patch series has been through six revisions as of this writing; the first version was posted in September 2019. Various implementation issues have been addressed, and the work appears to be converging on something that should be ready to merge soon.

Comments (26 posted)

Scheduling for asymmetric Arm systems

By Jonathan Corbet
November 30, 2020

The Arm processor architecture has pushed the boundaries in a number of ways, some of which have required significant kernel changes in response. For example, the big.LITTLE architecture placed fast (but power-hungry) and slower (but more power-efficient) CPUs in the same system-on-chip (SoC); significant scheduler changes were needed for Linux to be able to properly distribute tasks on such systems. For all their quirkiness, big.LITTLE systems still feature CPUs that are in some sense identical: they can all run any task in the system. What is the scheduler to do, though, if confronted with a system where that is no longer true?

Multiprocessor support on Linux was born in the era of symmetric multiprocessing — systems where all CPUs are, to a first approximation, identical. Any CPU can run any task with essentially the same performance; the scheduler's main concern on SMP systems is keeping all of the CPUs busy. While cache effects and NUMA locality discourage moving tasks between CPUs, the specific CPU chosen for any given task is usually a matter of indifference otherwise.

Big.LITTLE changed that assumption by bundling together CPUs with different performance characteristics; as a result, the specific CPU chosen for each task became more important. Putting tasks on the wrong CPU can result in poor performance or excessive power consumption, so it is unsurprising that a lot of work has gone into the problem of optimally distributing workloads on big.LITTLE systems. When the scheduler gets it wrong, though, performance will suffer, but things will still work.

Future Arm designs, though, include systems where some CPUs can run both 64-bit and 32-bit tasks, while others are limited to 64-bit tasks only. The advantage of such a design will be reduced chip area devoted to 32-bit support which, on many systems, may never actually be used at all; meanwhile, the ability to run the occasional 32-bit program still exists. The cost, though, is the creation of a system where some CPUs cannot run some tasks at all. The result of an incorrect scheduling choice is no longer a matter of performance; it could be catastrophic for the workload involved.

An initial attempt to address this problem was posted by Qais Yousef in October. The bulk of this work — and of the ensuing discussion — was focused on what should happen if a 32-bit task attempts to run on a 64-bit-only CPU. Yousef initially had the kernel just kill such tasks outright, but added an optional patch that would, in such cases, recalculate the task's CPU-affinity mask (a user-controllable bitmask indicating which CPUs the task can run on) to include only 32-bit-capable CPUs. If user space could be trusted to properly set the CPU affinity of 32-bit tasks, he said, that last patch would be unnecessary.

Scheduler maintainer Peter Zijlstra responded that the affinity-mask tweaking was "not going to happen"; that mask is under user-space control, and should not be changed by the kernel, he said. Will Deacon added that the kernel should not try to hide the system's asymmetry from user space: "I'd be *much* happier to let the scheduler do its thing, and if one of these 32-bit tasks ends up on a core that can't deal with it, then tough, it gets killed".

Toward the end of October, Deacon posted a patch set of his own addressing a number of problems he saw with Yousef's implementation. It removed the affinity-mask manipulation in favor of just killing tasks that attempt to run on CPUs that cannot support them. To help user space set affinity masks properly, the patch added a sysfs file indicating which CPUs can run 32-bit tasks.
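
A launcher aware of such systems might consult that file and pin itself before calling execve() (a sketch; the sysfs path and its range format are hypothetical stand-ins, since the article does not name them):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* restrict the current task to 32-bit-capable CPUs before exec()ing
     * a 32-bit binary */
    static void pin_to_32bit_cpus(void)
    {
        cpu_set_t set;
        int first, last;
        FILE *f = fopen("/sys/devices/system/cpu/aarch32_el0", "r");  /* hypothetical */

        if (f && fscanf(f, "%d-%d", &first, &last) == 2) {
            CPU_ZERO(&set);
            for (int cpu = first; cpu <= last; cpu++)
                CPU_SET(cpu, &set);
            sched_setaffinity(0, sizeof(set), &set);  /* 0 means the current task */
        }
        if (f)
            fclose(f);
    }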

By the time this patch series hit version 3 in mid-November, though, that behavior had changed. If a 32-bit task attempts to run on a 64-bit-only CPU, its affinity mask will be narrowed as with Yousef's first patch. If, however, the original affinity mask included no 32-bit-capable CPUs, this operation will zero the mask entirely, leaving the task no CPU to run on. In that case, a fallback mask will be used; its definition is architecture-specific but, on Arm (the only architecture that needs this feature currently), the fallback mask contains the set of CPUs that can run 32-bit tasks. This can have the effect of enabling the task to run on CPUs outside of its original mask.

Zijlstra questioned the move away from killing misplaced tasks: "I thought we were okay with that... User does stupid, user gets SIGKILL. What changed?" The problem, it turns out, was finding the right response when a 64-bit task calls execve() to run a 32-bit program — while running on a 64-bit-only CPU. The 64-bit code may not know that the new executable is incompatible with the current CPU, so it is hard to expect that task to set the CPU affinity properly. The new program cannot even run to call sched_setaffinity() to fix the problem, even if it was written with an awareness of such systems. In fact, by the time the problem is found, it cannot even run to have the SIGKILL signal delivered to it. Rather than try to handle all of that, Deacon decided to just override the affinity mask if need be.

The result is arguably a violation of the kernel's ABI rules, which say that the CPU-affinity mask is supposed to survive across an execve() call (and not be modified by the kernel in general). The alternative, as Marc Zyngier pointed out, "'only' results in an unreliable system". Bending the ABI rules seems preferable to unreliability, even if the other issues can be worked out.

So, most likely, some variant of this behavior will be in the patch set when it eventually makes its way upstream. Yousef endorsed Deacon's approach, saying: "My only worry is that this approach might be too elegant to deter these SoCs from proliferating". It remains to be seen how widespread this hardware will eventually be but, once it's in use, Linux should be ready for it. Stay tuned to see what the next interesting asymmetry dreamed up by CPU designers will be.

Comments (52 posted)

Challenges in protecting virtual machines from untrusted entities

December 1, 2020

This article was contributed by Kashyap Chamarthy

KVM Forum

As an ever-growing number of workloads are being moved to the cloud, CPU vendors have begun to roll out purpose-built hardware features to isolate virtual machines (VMs) from potentially hostile parties. These processor features, and their extensions, enable the notion of "secure VMs" (or "confidential VMs") — where a VM's "sensitive state" needs to be protected from untrusted entities. Drawing from his experience contributing to the secure VM implementation for the s390 architecture, Janosch Frank described the challenges involved in a talk at the 2020 (virtual) KVM Forum. Though the implementations across CPU vendors may vary, there are many shared problems, which opens up possibilities for collaboration.

Secure Encrypted Virtualization (SEV) from AMD (more information is available in the slides [PDF] from a talk at last year's KVM Forum and LWN's brief recap of it), Trust Domain Extensions (TDX) by Intel, and IBM's Secure Execution for s390 (last year's KVM Forum talk [YouTube] about it) and Power are some of the hardware technologies that aim to protect virtual machines from potential malicious entities. Other architectures, such as Arm, are expected to follow suit.

The sensitive state of a secure VM should not be accessible from the hypervisor; instead, a "trusted entity" — a combination of software, CPU firmware, and hardware — manages it. But this raises a question: What counts as "sensitive state"? The lion's share comprises the guest's memory contents, which can contain disk encryption keys and other sensitive data. In addition, guest CPU registers can hold sensitive cryptographic key fragments. The VM's execution path is another piece of sensitive state; a rogue hypervisor can potentially change the execution flow of a VM — e.g. it can inject an exception into the guest, which is highly undesirable. Therefore, effective "VM controls" that decide which instructions to execute, and how they're executed, must be protected. Furthermore, a hostile hypervisor, even if it can't extract any information from its guests, can still mount a denial-of-service (DoS) attack on them.

Then there is "data at rest" (i.e. guest data stored on disk), which is often not protected by the trusted entity; it is the VM's responsibility to protect it with common techniques such as disk encryption. Successfully protecting VMs and their data allows users to deploy sensitive workloads in public clouds.

Threat vectors

One way to narrow down the threat vectors to defend against is to define the nature of trust, a process more commonly known as "threat modeling". In a public-cloud setup, co-located VMs and the host hypervisor are to be considered completely untrusted. AMD's SEV [PDF] uses the fuzzily defined "benign but vulnerable hypervisor" model, where "the hypervisor is not believed to be 100% secure, but it is trusted to act with benign intent" — i.e. the hypervisor might not actively try to compromise SEV-enabled VMs, but it could contain exploitable vulnerabilities. The stricter model used by AMD's SEV-SNP extension, by contrast, treats only the processor hardware and firmware, along with the "secure VM" itself, as fully trusted.

And what is not trusted? Cloud operators, host firmware, SMM (System Management Mode), the host OS and its hypervisor, all external PCI(e) devices, and more. "Untrusted" here means that these components are assumed to be "malicious, potentially conspiring with other untrusted components in an effort to compromise the security guarantees of an SEV-SNP VM". In a similar vein, Intel's TDX [PDF] defines its "trust boundaries": Intel's hardware, including its TDX module, is trusted, as are its Authenticated Code Modules in firmware; the rest is all untrusted.

Frank outlined some common building blocks that guard against attacks from the untrusted entities and protect the sensitive state of a VM. The first is to encrypt the VM's memory and the other state the VM can modify (e.g. its CPU registers), so that an attacker only sees gibberish without the key. The second is to restrict access to sensitive guest data via access controls. The third is "integrity verification", to make sure a VM only accesses state that has not been altered by a hostile party.

Protecting memory, vCPU registers, and boot integrity

Making guest memory unreadable, by encrypting it so that it is inaccessible to every entity except the guest itself and the trusted entity, provides "memory confidentiality". One way to achieve this is to let the CPU's memory controller do the heavy lifting for memory encryption: each guest gets its own key, which is stored in hardware and never leaves it.

Storing encryption keys in hardware also protects against "cold-boot attacks", a kind of side-channel attack that requires physical access to the hardware in order to dump sensitive information from RAM. However, storing keys in hardware means that there is a limit on how many can be held — e.g. the first-generation AMD EPYC ("Zen") CPUs had a hard limit of 15 encryption keys, which severely limits the number of secure VMs. A later generation (AMD "Zen 2") extended that to 509 keys.

Encrypting guest memory alone will not suffice; the memory also needs to be tamper-proof, because the hypervisor can still corrupt guest RAM despite the encryption. Tampering with guest RAM can be prevented by means of architecture-specific hardware access controls: reads and writes originating from outside the secure VM result in an exception, which protects the integrity of the memory, assuming it never leaves the protected state. Further, this also allows "rogue accesses" to be traced and logged, Frank noted.
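
The last few paragraphs can be tied together with a small, runnable model. Everything in this Python sketch is hypothetical and grossly simplified: a toy XOR "cipher" stands in for hardware memory encryption, and a dictionary stands in for hardware key storage. It only illustrates the observable behavior described above: outsiders read gibberish, outside writes trap, and the supply of key slots is finite.

import os

class ToyTrustedEntity:
    """Toy model of per-guest memory encryption and access control."""
    MAX_KEYS = 15   # cf. the first-generation EPYC limit mentioned above

    def __init__(self):
        self.keys = {}   # stands in for hardware key slots

    def register_guest(self, guest):
        if len(self.keys) >= self.MAX_KEYS:
            raise RuntimeError("out of hardware key slots")
        self.keys[guest] = os.urandom(16)

    def guest_write(self, guest, requester, data):
        if requester != guest:
            # Hardware access controls: outside writes raise an
            # exception and can be traced and logged.
            raise PermissionError("rogue access trapped")
        # Toy XOR "cipher"; data must fit within the 16-byte key.
        return bytes(d ^ k for d, k in zip(data, self.keys[guest]))

    def guest_read(self, guest, requester, ciphertext):
        if requester != guest:
            return ciphertext   # outsiders only ever see gibberish
        return bytes(c ^ k for c, k in zip(ciphertext, self.keys[guest]))

te = ToyTrustedEntity()
te.register_guest("guest-1")
page = te.guest_write("guest-1", "guest-1", b"disk crypto key")
print(te.guest_read("guest-1", "hypervisor", page))   # gibberish
print(te.guest_read("guest-1", "guest-1", page))      # b'disk crypto key'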

Guest CPU registers also need to be unreadable by external parties; the hypervisor should only be able to read from or write to the specific registers it needs when emulating a CPU instruction. The trusted entity must therefore encrypt all (or specific) guest CPU registers. The state of a VM, both while it is being initialized and while it is running, is stored in a vendor-specific data structure known as the VM "control block". The guest CPU registers are isolated by providing a dummy control block to the hypervisor, while the trusted entity manages the real one.
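
The dummy-control-block idea might be sketched like this; all of the register and structure names below are invented, and real control blocks are vendor-specific hardware data structures.

# Toy model: the trusted entity keeps the real register state and hands
# the hypervisor a dummy control block that exposes only the registers
# needed for instruction emulation. All names here are invented.

REAL_CONTROL_BLOCK = {"pc": 0x1000, "gpr0": 0xdead, "key_fragment": 0x42}
EMULATION_REGISTERS = {"pc", "gpr0"}   # what the hypervisor may touch

def hypervisor_view(control_block):
    """Build the dummy control block given to the hypervisor."""
    return {reg: val for reg, val in control_block.items()
            if reg in EMULATION_REGISTERS}

print(hypervisor_view(REAL_CONTROL_BLOCK))  # key_fragment never appears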

Yet another challenge is that only user-approved executables should be allowed to boot in the guest. There are two ways to handle this problem. One is boot-data encryption: the executable (e.g. a guest kernel) is encrypted and gets a header that holds a key slot for each physical machine the executable is allowed to run on, plus some integrity-verification data. The processor's firmware (a trusted entity) then searches for a key slot it can unlock in order to retrieve the executable's encryption key. The other is "remote attestation" — an idea that has existed for ages, but is fiendishly difficult to implement and manage — which allows a virtual machine to authenticate itself to a remote party, the owner of the VM, by proving that it is indeed running approved executables. This proof gives the owner confidence that the guest is executing authorized workloads on genuine and authenticated hardware.
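
A toy version of the key-slot scheme makes this concrete. In the following sketch, all names are invented; an HMAC tag identifies the slot a given machine can unlock, and a hash-derived XOR pad stands in for real key wrapping, which would use proper authenticated encryption.

import hashlib, hmac, os

def make_key_slots(exec_key, machine_keys):
    """Create one key slot per machine allowed to run the executable."""
    slots = []
    for mk in machine_keys:
        pad = hashlib.sha256(mk).digest()[:len(exec_key)]
        wrapped = bytes(a ^ b for a, b in zip(exec_key, pad))
        tag = hmac.new(mk, wrapped, "sha256").digest()
        slots.append((wrapped, tag))
    return slots

def firmware_unlock(slots, machine_key):
    """The trusted firmware searches for a slot it can unlock."""
    for wrapped, tag in slots:
        expect = hmac.new(machine_key, wrapped, "sha256").digest()
        if hmac.compare_digest(tag, expect):
            pad = hashlib.sha256(machine_key).digest()[:len(wrapped)]
            return bytes(a ^ b for a, b in zip(wrapped, pad))
    raise PermissionError("this machine may not run this executable")

exec_key = os.urandom(16)                          # the kernel's key
machines = [os.urandom(16) for _ in range(3)]      # three allowed hosts
slots = make_key_slots(exec_key, machines)
assert firmware_unlock(slots, machines[1]) == exec_key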

Attestation allows authorization rules for VMs and other entities to be changed quickly, whereas boot-data encryption is easier to implement and doesn't require network connectivity. But relying only on boot-data encryption has its problems: authorizing a new machine involves rebuilding the executable with a new key slot, and updating the executable may require user intervention. There is also the perennial problem of distributing and verifying public/private key pairs.

Often, combining all of these techniques yields the best results, Frank emphasized.

What about I/O and swap?

Once a secure VM is up and running, it will want to perform I/O; the host, for its part, may want to swap the VM's memory. A device trying to perform I/O on an encrypted guest memory page will only see garbled data or get access exceptions, so there needs to be a way to "unprotect" some special I/O pages; this ought to be done only on an explicit request from the guest. Guest I/O is bounced through these shared pages, so sensitive data must be encrypted — using common mechanisms such as SSH, HTTPS, LUKS disk encryption, and so on — before the guest does any I/O with it. This special handling for I/O means degraded performance, but Frank was confident that the degradation can be significantly reduced in the future.
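
The bounce-page idea can be sketched as follows. The page-sharing interface here is invented for illustration; the real mechanisms by which a guest marks pages as shared are architecture-specific.

# Toy model of I/O through shared pages: the guest explicitly
# unprotects a page and copies (already application-encrypted) data
# through it, since a device reading private pages sees only gibberish.

shared_pages = set()          # pages the guest has explicitly shared
memory = {0: b"TLS/LUKS-protected payload", 1: b"secret guest state"}

def guest_share_page(page_id):
    """The guest explicitly requests that a page be unprotected."""
    shared_pages.add(page_id)

def device_dma_read(page_id):
    """A device (or the host) trying to read guest memory."""
    if page_id not in shared_pages:
        raise PermissionError("access exception: page is protected")
    return memory[page_id]

guest_share_page(0)           # page 0 becomes the bounce page
print(device_dma_read(0))     # works; the payload is already encrypted
# device_dma_read(1) would raise an access exception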

Getting swap to work is another challenge. During swap-out, when a memory page is pushed out to a storage device, the page needs to be made accessible to the host in encrypted form so that it can be written to the device. On swap-in, when the page is pulled back into main memory, it must be integrity-checked and decrypted before the guest can access it again. There also needs to be protection against "replay" attacks, in which an attacker replaces a VM's memory with a stale copy of itself. The hypervisor mediates the swap-out and swap-in process with help from the trusted entity, which safeguards the entire operation.
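
Here is a minimal sketch of such a scheme, assuming a per-page version counter for replay protection. The XOR "cipher", the HMAC construction, and all of the names are illustrative only and do not reflect any vendor's actual format.

import hmac, os

KEY = os.urandom(32)   # held by the trusted entity, never by the host
page_versions = {}     # the trusted entity's per-page counters

def swap_out(page_id, plaintext):
    """Hand the host an encrypted page plus an integrity tag."""
    version = page_versions.get(page_id, 0) + 1
    page_versions[page_id] = version
    pad = hmac.new(KEY, b"pad%d" % version, "sha256").digest()
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, pad))
    tag = hmac.new(KEY, b"%d:%d:" % (page_id, version) + ciphertext,
                   "sha256").digest()
    return ciphertext, tag, version

def swap_in(page_id, ciphertext, tag, version):
    """Verify and decrypt a page the host brings back from disk."""
    if version != page_versions.get(page_id):
        raise PermissionError("replay detected: stale page version")
    expect = hmac.new(KEY, b"%d:%d:" % (page_id, version) + ciphertext,
                      "sha256").digest()
    if not hmac.compare_digest(tag, expect):
        raise PermissionError("integrity check failed")
    pad = hmac.new(KEY, b"pad%d" % version, "sha256").digest()
    return bytes(c ^ k for c, k in zip(ciphertext, pad))

old = swap_out(7, b"guest page contents")
new = swap_out(7, b"guest page contents v2")
assert swap_in(7, *new) == b"guest page contents v2"
# swap_in(7, *old) now raises: the stale copy can no longer be replayed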

But, as is the case with special I/O handling, swapping incurs a significant performance penalty; secure VMs should thus be avoided in environments where memory is over-committed.

Current development efforts

A proposed approach to "generalize memory encryption models" proved to be difficult to find common ground on, as CPU architectures differ a tad too much in their implementations of secure VMs. Intel is still in the process of adding support for secure VMs (see also Sean Christopherson's slides from a KVM Forum talk on Intel TDX [PDF]). Frank noted that s390 already hooks into the Linux memory-management subsystem to pin I/O pages; other platforms will need similar hooks, but he expects the resulting mainline kernel inclusion to take more time, as it requires common-code changes. IBM's secure VM support has been available since Linux 5.4 for the Power architecture and since Linux 5.7 for s390.

Support for AMD's SEV was introduced in Linux 4.16 for KVM-based guests; Linux 5.10 has support for AMD's Encrypted State extension, and support for Secure Nested Paging is underway. Further, there is an in-progress effort to add support for SEV's "address space IDs" (ASIDs) to control groups in Linux; the problem being addressed is the hard limit, mentioned earlier, on the number of hardware encryption keys. Developers for other CPU architectures quickly expressed interest in finding a common approach to this problem.
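
The shape of such a controller can be sketched in a few lines. This is purely illustrative: the group names, quotas, and interface below are invented, and the real work is an in-kernel control-group controller rather than user-space code.

# Toy model: ASIDs are a scarce hardware resource, so a controller
# charges each group for the ASIDs its VMs consume and rejects
# launches that would exceed the group's quota. Numbers are invented.

TOTAL_ASIDS = 509            # cf. the "Zen 2" limit mentioned earlier

class AsidController:
    def __init__(self, quotas):
        self.quotas = quotas                  # group -> max ASIDs
        self.usage = {g: 0 for g in quotas}
        self.free = TOTAL_ASIDS

    def launch_vm(self, group):
        if self.usage[group] >= self.quotas[group]:
            raise RuntimeError(f"{group}: ASID quota exhausted")
        if self.free == 0:
            raise RuntimeError("no hardware ASIDs left")
        self.usage[group] += 1
        self.free -= 1

ctrl = AsidController({"tenant-a": 2, "tenant-b": 400})
ctrl.launch_vm("tenant-a")
ctrl.launch_vm("tenant-a")
# ctrl.launch_vm("tenant-a") would now fail with a quota error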

A pressing concern that Frank pointed out is the complexity of testing: setting up boot-data encryption and configuring attestation environments are both cumbersome. There will be more compile-time options for the kernel, new KVM ioctl() calls, new interfaces to the trusted entity, and changes to user-space components (e.g. QEMU, libvirt). All of this increases the testing burden.

Future

Live migration is expected to be tackled down the road. During migration, the entirety of the VM's state needs to be encrypted, and its integrity verified on the destination. Frank expects backward compatibility to be a major challenge: migration logic that has traditionally been handled by the hypervisor now moves, in part, to the trusted entity. Further, migration policies need to determine the possible target hosts a secure VM can be migrated to, which involves many variables; there is plenty of work there for the coming years. Additionally, we might see "secure I/O devices" that only respond to I/O requests from an authorized secure VM and do not speak to the host at all.

Another potentially tricky topic is the ability to capture memory contents in the event of a guest kernel crash. With kdump, it is possible to capture a crash dump to an encrypted disk, but it only works if the code to do so can still be executed inside the guest. It needs to be possible to boot the "capture kernel" that kdump loads (via the kexec subsystem) to actually write the memory contents to disk.

Not least of all, Frank noted that further protections against side-channel attacks, such as disabling simultaneous multithreading (SMT) and performing extra cache flushing, need to be enforced by the trusted entity.

Overall, the basic blocks that underpin secure VMs are common across various CPU architectures. One concrete area where vendors can work together is on boot-related tooling. Both remote attestation and encrypting boot executables require extensive tooling. If CPU vendors can manage to not overly diverge in this area, they can work together on common tooling, instead of everyone maintaining their own bespoke tooling — e.g. sev-tool from AMD, genprotimg for s390, and so forth.

Enarx is a relatively recent project that wants to "make it simple to deploy [sensitive] workloads to a variety of different Trusted Execution Environments" (TEEs). It is CPU-architecture independent and thus aims to create an abstraction for the different TEEs from processor vendors. More specifically, Enarx provides encryption for "data in use" (as opposed to data at rest or in transit) and manages attestation — all without having to start from scratch for every hardware platform.

"The importance and complexity of secure VMs will continue to increase with each [hardware] extension that is released for an architecture. Fortunately, we still have time to come together and discuss the collaboration possibilities," Frank concluded. Since several CPU architectures have introduced the idea and developers are working on the implementation of secure VMs, it is the perfect time for all of the vendors to work together.

[I'd like to thank Janosch Frank for a careful review of this article.]

Comments (5 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Let's Encrypt certs; Hardware security; Heap quarantine; Guix 1.2; Perl governance; PHP 8; Rust 1.48; Quotes; ...
  • Announcements: Newsletters; conferences; security updates; kernel patches; ...
Next page: Brief items>>

Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds