
LWN.net Weekly Edition for October 18, 2018

Welcome to the LWN.net Weekly Edition for October 18, 2018

This edition contains the following feature content:

  • A farewell to email: some free-software communities are experimenting with moving their discussions from mailing lists to systems like Discourse.
  • A new direction for i965: Intel rethinks its Mesa 3D driver with an experimental Gallium3D-based architecture.
  • Fighting Spectre with cache flushes: a proposed mitigation that flushes the L1 cache on suspicious system-call errors.
  • OpenPGP signature spoofing using HTML: fooling mail clients into indicating a valid signature.
  • I/O scheduling for single-queue devices: should BFQ become the default I/O scheduler for slow devices?
  • Secure key handling using the TPM: James Bottomley on using the TPM to protect private keys.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A farewell to email

By Jonathan Corbet
October 16, 2018
The free-software community was built on email, a distributed technology that allows people worldwide to communicate regardless of their particular software environment. While email remains at the core of many projects' workflow, others are increasingly trying to move away from it. A couple of recent examples show what is driving this move and where it may be headed.

Email is certainly not without its problems. For many of us, plowing through the daily email stream is an ongoing chore. Development lists like linux-kernel can easily exceed 1,000 messages per day; it is thus unsurprising that the number of kernel developers who actually follow such lists has been dropping over time. Email is based on a trust model from a simpler time; now it is overwhelmed by spam, trolls, sock puppets, and more. Dealing with the spam problem alone is a constant headache for mailing-list administrators. Interacting productively via email requires acquiring a set of habits and disciplines that many find counterintuitive and tiresome. Your editor's offspring see email as something to use to communicate with their grandparents, and not much more.

It is thus not surprising that some projects are thinking about alternative ways of communicating. Even projects like the kernel, which remains resolutely tied to email, are seeing some experimentation around the edges. Some, though, are diving in more seriously, with a couple of recent experiments being found in the Fedora and Python projects.

On the Fedora side, project leader Matthew Miller recently proposed moving the council-discuss list to Fedora's Discourse instance. The idea was proposed as a sort of trial, with the hope that it can "help increase engagement" in council discussions. Sanja Bonic added some reasons why one might expect that to happen:

It's important to look at the user base of who is preferring mailing lists and how it holds up for new users. Mailing lists are the preferred way for most people who've used them for a long time, but new users (newcomers to open source, younger people) definitely prefer to write on a forum and for that experience, it's better if it's a more modern one with relevant features that enhance user experience.

She also added that mailing lists provide no control over where messages are kept and make it impossible to delete material — attributes that others will, instead, see as an advantage of email.

Meanwhile, a significant part of the Python community was surprised at the end of September when Łukasz Langa posted a missive titled python-committers is dead, long live discuss.python.org. It stated that "we have reached the limits of what is possible with mailing lists" and fingered email as contributing to some of the project's community-management problems. He pitched the advantages of Python's Discourse instance, including the mobile application, community moderation, the ability to move discussions between categories, rich media support, code syntax highlighting, dynamic notifications, social login support, and "emojis". The email asked all members of the python-committers list to cease using it immediately and start talking on Discourse instead.

This move was not a welcome surprise to everybody involved; some thought that the timing — while the project is trying to conclude a fundamental discussion on its leadership model — was not ideal, and that the new system was being imposed on the community without discussion. The result is that the conversation split, with some posters moving over to Discourse and others remaining on the list. Here, too, there has been discussion of the advantages and disadvantages of each mode of communication.

Proponents of email value the ability to choose their own tools and workflows; many of us who deal with large amounts of email have come up with highly optimized setups that allow this work to be done efficiently. Email can often be processed with scripts, and the ability to maintain local archives can be highly useful. Mailing lists sometimes offer other mechanisms, such as NNTP access, that facilitate quick reading. Many people also appreciate the simple fact that email comes to the reader rather than having to be explicitly looked for on a discussion site; as Máirín Duffy commented in the Fedora discussion:

The age of having people manually poll web-based systems is the past. The main methods I can think of to maintain that is dopamine hits or charging money so they need to connect regularly to get their money's worth. We don't have the creepiness factor to do dopamine hits, and we're not going to charge money.

For that and other reasons, she worried that a switch to a system like Discourse could make bringing in new contributors harder rather than easier.

Proponents of a switch note (though not in so many words) that the indicated dopamine hits have been thoughtfully provided: there are various email notification mechanisms for new topics and such, and users by default get handy notifications every time somebody "likes" one of their posts. The mobile app makes engagement from handsets easier. There is a graduated trust model that allows proven community members to take on some of the moderation task, taking the load off of list administrators and community managers; Python developer Brett Cannon looks forward to having such features available:

But these past three weeks have been hell for me. I now dread checking my email because of what's going to be there. And the fact that fighting these CoC [code of conduct] fires on multiple mailing lists with the tools they provide have not helped to improve my situation. The difficulty in locking down threads, the fact that there is no shared burden on each of these mailing lists because they are each viewed as independent entities when it comes to administration, and the barrier for people to feel comfortable in sending an email notifying admins when they feel a post has crossed a line has shown me that we have a problem.

He concluded by suggesting that the project has "simply gotten too big for email".

In both the Fedora and Python cases, the move to Discourse has been put forward as an experiment that may or may not be rolled back, but going back can be hard. Neal Gompa suggested that OpenMandriva's shift to Discourse was "largely a failure" when it comes to developer discussions (it worked better for user support), but the project still has not gone back to using a mailing list for those discussions. Moving a project to a new communication medium is hard; declaring defeat and moving back can be even harder, especially since people will have become attached to the numerous aspects of the new system that do actually work better.

Regardless of how these specific experiments work out, one conclusion is clear. Even the people who are most tied to email are finding it increasingly unworkable in the world we have grown into. Administering an email installation while blocking spam and ensuring that one's emails actually get delivered is getting harder; fewer people and organizations want to take it on. As a result, our formerly decentralized email system is becoming more centralized at a tiny number of email providers. If "email" means "Gmail", it starts to lose its advantage over other centralized sites.

As others have often said, what we need is a modern replacement for email — some sort of decentralized solution that preserves the advantages of email while being suitable to the 2018 Internet and attractive to our children. Various projects have been working in this area for years, and some, like Mastodon, appear to be gaining some traction. But none have made any real progress toward replacing email as a large-scale collaboration mechanism.

Your editor's free-software experience began by installing software and applying patches received via email over Usenet. In the far-too-many intervening years, some things have changed significantly, but the primacy of email for development discussions remains unchallenged for many projects. But the writing is on the wall; many mailing lists have already gone dark as patch discussions have moved to centralized hosting systems, and other types of conversation are starting to shift to systems like Discourse. These systems may not be everything that we could hope for, and they are likely to significantly slow down work for many of us. But they also represent important experiments that will, with luck, lead to better communications in the future. Email is not dead, but neither is FTP; soon we may look at them both in about the same way.

Comments (112 posted)

A new direction for i965

By Jake Edge
October 17, 2018

X.Org Developers Conference

Graphical applications are always pushing the limits of what the hardware can do and recent developments in the graphics world have caused Intel to rethink its 3D graphics driver. In particular, the lower CPU overhead that the Vulkan driver on Intel hardware can provide is becoming more attractive for OpenGL as well. At the 2018 X.Org Developers Conference Kenneth Graunke talked about an experimental re-architecting of the i965 driver using Gallium3D—a development that came as something of a surprise to many, including him.

Graunke has been working on the Mesa project for eight years or so; most of that time, he has focused on the Intel 3D drivers. There are some "exciting changes" in the Intel world that he wanted to present to the attendees, he said.

CPU overhead has become more of a problem over the last few years. Any time that the driver spends doing its work is time that is taken away from the application. There has been a lot of Vulkan adoption, with its lower CPU overhead, but there are still lots of OpenGL applications out there. So he wondered if the CPU overhead for OpenGL could be reduced.

Another motivation is virtual reality (VR). Presenting VR content is a race against time, so there is no time to waste on driver overhead. In addition, Intel has integrated graphics, where the CPU and GPU share the same power envelope; if the CPU needs more power, the GPU cannot be clocked as high as it could be. Using less CPU leads to more watts available for GPU processing.

For the Intel drivers, profilers show that "draw-time has always been [...] the volcanically hot path" and, in particular, state upload (sending the state of the OpenGL context to the GPU) is the major component of that. There are three different approaches to handling state upload in an OpenGL driver that he wanted to compare, he said. OpenGL is often seen as a "mutable state machine"; it has a context with a "million different settings that you can tweak". He likened it to an audio mixing board, which has lots of different knobs that each do something different. At their heart, OpenGL programs set these knobs, draw, then set them and draw again—over and over.

Handling state

The first way to handle state tracking is with "state streaming": translate the knobs that have been changed and send those down to the GPU. This assumes that it is not worth reusing any state from previous draws; since state is changing all the time and every draw call could be completely new, these drivers just translate as little as possible of the state changes, as efficiently as possible, before sending them to the GPU.

Early on, he asked his mentor about why state was not being reused and was told that cache lookups are too expensive. Essentially the context and state have to be turned into some kind of hash key that gets looked up in the cache. If there is a miss, that state needs to be recalculated and sent to the GPU, so you may as well just have done the translation. This is what i965 does "and it works OK", but it does make draw-time dominate, which leads to efforts to shave microseconds off the draw-time.

But he started thinking about this idea of "not reusing state" some. He put up an image from a game he has been playing lately and noted that it is drawn 60 times per second; the scene is fairly static. So the same objects get drawn many, many times. It is not all static, of course, so maybe a character walks into the scene in shiny armor and you need to figure out how to display that in the first frame, but the next 59 frames can reuse that. "This 'I've never seen your state before' idea is kinda bunk", he said.

The second mechanism is the one that Vulkan uses, which is to have "pre-baked pipelines". The idea is to create pipeline objects for each kind of object displayed in the scene. That makes draw-time "dirt cheap" because you just bind a pipeline and draw, over and over again. If the applications are set up to do this, "it is wonderful", but OpenGL applications are not, so it is not really an option.

The third option, surprisingly, is Gallium3D (or simply "Gallium"), he said. He has been learning that it is basically a hybrid approach between state streaming and pre-baked pipelines. It uses "constant state objects" (CSOs), which are immutable objects that capture a part of the GPU state and can be cached across multiple draw operations. CSOs are essentially a Vulkan pipeline that has been chopped up into pieces that can be mixed and matched as needed. Things like the blending state, rasterization mode, viewport, and shader would each have their own CSO. The driver can associate the actual GPU commands needed to achieve that state with the CSO.

The Gallium state tracker essentially converts the OpenGL mutable API into the immutable CSOs that make up the Gallium world. That means the driver really only has to deal with CSOs, and Gallium handles the messy OpenGL context. The state tracker looks at the OpenGL context, tracks what's dirty, and ideally rediscovers cached CSOs for the new state. The state tracker helps "get rid of a bunch of nonsense from the API", Graunke said. For example, handling the different coordinate systems between the driver, GPU, and the window system is much simplified using Gallium. That simplification can be done before the cache lookup occurs, which may mean more cache hits for CSOs.

A consistent critique of Gallium is that it adds an extra layer into the driver path. While that's true, there is a lot of work done in a Classic (or state-streaming) driver to convert the context into GPU commands. There is far less work needed to set things up to be looked up in the cache for a Gallium driver, but if there is a cache miss, there will be additional work needed to create (and cache) the new CSO. But even if the sum of those two steps is larger for Gallium, and it generally is, the second step is skipped much of the time, which means that a Gallium-based driver may well be more efficient.
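The economics are easier to see in miniature. Below is a self-contained C sketch of the lookup-or-create pattern for a single state group; every name in it is hypothetical and deliberately simplified — it is not Mesa's actual code:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct blend_key {          /* the mutable OpenGL "knobs" for blending */
        uint8_t enable, src_factor, dst_factor, func;
    };

    struct cso {                /* immutable, pre-baked GPU state */
        struct blend_key key;
        uint32_t hw_cmds[4];    /* stand-in for real GPU commands */
        struct cso *next;
    };

    static struct cso *cache[64];

    static uint32_t hash_key(const struct blend_key *k)
    {
        const uint8_t *p = (const uint8_t *)k;
        uint32_t h = 2166136261u;               /* FNV-1a */
        for (size_t i = 0; i < sizeof(*k); i++)
            h = (h ^ p[i]) * 16777619u;
        return h;
    }

    static struct cso *bind_blend_state(const struct blend_key *k)
    {
        uint32_t bucket = hash_key(k) % 64;
        struct cso *c;

        for (c = cache[bucket]; c != NULL; c = c->next)
            if (memcmp(&c->key, k, sizeof(*k)) == 0)
                return c;       /* hit: reuse the pre-baked state */

        c = calloc(1, sizeof(*c));              /* miss: translate once */
        if (c == NULL)
            return NULL;
        c->key = *k;
        c->hw_cmds[0] = k->enable;              /* stand-in for translation */
        c->next = cache[bucket];
        cache[bucket] = c;
        return c;
    }

On a hit, the draw call binds a handful of pre-baked words; only a miss pays the full translation cost — which is exactly the asymmetry described above.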

i965

The i965 driver is one of the only non-Gallium Mesa drivers. Graunke said that the developers knew it could be better than it is in terms of its CPU usage. The code itself is pretty efficient, but the state tracking is too coarse-grained, which means that this efficient code executes too often. Most of the workloads they see are GPU bound, so they spent a lot of time improving the compiler, adding color compression, improving cache utilization, and the like to make the GPU processing more efficient.

But, CPU usage could be improved and "people loved to point that out", he said. It was a source of criticism from various Intel teams internally, but there was also Twitter shaming from Vulkan fans. The last straw for him was data showing that the i965 driver was "obliterated" by the performance of the AMD RadeonSI driver on a microbenchmark. That started him on the path toward seeing what could be done to fix the CPU side of the equation.

A worst case for i965 is when an application binds a new texture or modifies one. The i965 driver does not have good tracking for texture state, so it has to redo the translation for every other texture and image bound in any shader stage. That is a lot of work for a relatively simple operation. Reusing some state would help a lot, but it is hard to do for surprising reasons.

Back in the "bad old days of Intel hardware", there was one virtual GPU address space for all processes. The driver told the kernel about its buffers and the kernel allocated addresses for them. But the commands needed to refer to those buffers using pointers when the addresses had not yet been assigned, so the driver gave the kernel a list of pointers in those buffers that needed to be patched up when the buffer was assigned to an address. Intel GPUs save the last known GPU state in a hardware context that could potentially be reused, but it includes pointers to unpatched addresses, so no state that involves pointers can be saved and reused. The texture state has pointers, which leads to the worst case he described.

Modern hardware does not suffer from these constraints. More recent Intel GPUs have 256TB of virtual GPU address space per process. The "softpin" kernel feature available since Linux 4.5 allows user space to assign the virtual addresses and never have to worry about them being changed. That allows the state to be pre-baked or inherited even if it has pointers. Other changes also make state reuse much easier.
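As a rough illustration of what softpin looks like from user space, the following C sketch pins a batch buffer at a driver-chosen GPU virtual address using the i915 execbuffer interface; the address and the surrounding details are illustrative assumptions, not code from the Iris driver:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Submit a batch at a fixed, user-chosen GPU virtual address; the
     * kernel honors .offset instead of assigning (and patching) one. */
    static int submit_pinned(int drm_fd, uint32_t handle, uint32_t batch_len)
    {
        struct drm_i915_gem_exec_object2 obj;
        struct drm_i915_gem_execbuffer2 execbuf;

        memset(&obj, 0, sizeof(obj));
        obj.handle = handle;
        obj.offset = 0x100000000ull;        /* address chosen by user space */
        obj.flags = EXEC_OBJECT_PINNED |
                    EXEC_OBJECT_SUPPORTS_48B_ADDRESS;

        memset(&execbuf, 0, sizeof(execbuf));
        execbuf.buffers_ptr = (uintptr_t)&obj;
        execbuf.buffer_count = 1;
        execbuf.batch_len = batch_len;

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
    }

Because the address can never change behind the driver's back, state that embeds pointers can be baked once and reused.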

So that led to the obvious conclusion that state reuse needed to be added to the i965 driver. But in order to do that, an architectural overhaul was needed. It required a fundamental reworking of the state upload code. Trying to do that in the production driver was "kinda miserable". Adding these kinds of changes incrementally is difficult. Enterprise kernels also complicate things, since there are customers using new hardware on way-old kernels—features like softpin are not available on those. Beyond that, the driver needs to support old hardware that still has all of the memory constraints, so he would need to support both old and new memory management and state management in the same driver.

Enter Gallium

He finally realized that Gallium gave him "a way out of all of these problems". In retrospect that all may seem kind of obvious, so why didn't Intel do this switch earlier? In looking back, though, Gallium never seemed to be the answer to the problems the developers were facing along the way. It didn't magically get them from OpenGL 2.1 to 4.5; that required lots of feature work. The shader compiler story was lacking or not really viable (because it was based on TGSI). It also didn't solve their driver performance problems or enable new hardware. Switching to Gallium would have allowed the driver to support multiple APIs, but that was not really of interest at the time. All in all, it looked like a huge pile of work for no real gain. But things have changed, Graunke said.

Gallium has gotten a lot better due to lots of work by the Mesa community. Threading support has been added. NIR is a viable option instead of TGSI. In addition, the i965 driver has gotten more modular due to Vulkan. It seemed possible to switch to Gallium, but it was still a huge effort, and he wanted to be sure that it would actually pay off to do so.

That led to his big experiment. In November 2017, he started over from scratch using the noop driver as a template. He borrowed ideas from the Vulkan driver and focused on just the latest hardware and kernel versions. He wanted to be free to experiment so he did not publicize his work, though he did put it in his Git repository in January. If it did not work out, he wanted to be able to scrap it without any fanfare.

After ten months, he is ready to declare the experiment a success. The Iris driver is available in his repository (this list of commits is what he pointed to). It is primarily of interest to driver developers at this point and is not ready for regular users. One interesting note is that Iris uses no TGSI; most drivers use some TGSI, rather than translating everything to NIR, but Iris did not take that approach. It only supports Skylake (Gen9+) hardware and kernel 4.16+, though he could make it work for any 4.5+ kernel if needed.

The driver passes 87% of the piglit OpenGL tests. It can run some applications, but others run into bugs. There are several missing features that are being worked on or at least thought about at this point. But enough is there that he can finally get an answer to whether switching to Gallium makes sense; the performance can be measured and the numbers will be in the right ballpark to allow conclusions to be drawn.

Results

He put up performance numbers on the draw overhead using piglit for five different scenarios (which can be seen in his slides [PDF] and there is more description of the benchmark and results in the YouTube video of the talk). It went from roughly two million draw calls per second to over nine million in the simplest case; in the worst case (from the perspective of the i965 driver) it went from half a million to almost nine million. On average, Iris can do 5.45x more draw calls per second than i965. Those are good numbers, but using threaded contexts results in even better numbers (6.48x for the simple case and 20.8x for the worst), though he cautioned that support for threaded contexts in Iris is still not stable, so those numbers should be taken with a grain of salt.

The question "is Iris worthwhile?" can be answered with a "yes". But reducing the CPU overhead when most workloads are GPU bound may not truly reflect reality. The microbenchmark he used is "kind of the ideal case for a Gallium driver" since it uses back-to-back draws that are hitting the CSO cache. That said, going back to his observations about displaying a game, it may well be representative for the 59 of 60 frames per second where little changes. There is a need to measure real programs; one demo he ran on a low-power Atom-based system was 19% faster, but a bunch of others showed no difference with i965 at all. "So, your mileage may vary, but it's still pretty exciting", he said.

Graunke believes that this work has settled the Classic versus Gallium driver debate in favor of the latter. Also, the Gallium interface is much nicer to work with than the Classic interface. He does not regret the path Intel took, but he is excited about the future; Iris is a much better architecture for that future. In addition, he believes that RadeonSI, and now Iris, have basically debunked the myth that Mesa itself is slow. i965 may be slow, but that is not really an indictment of Mesa.

There is a lot of work left to do and lots of bugs to fix. He needs to finish getting piglit passing, and to do the same for the OpenGL conformance test suites (CTS). He has started running the CTS, which is looking good so far. He still needs to test lots of applications, and there is work to be done cleaning up some of his Gallium hacks before the driver can go upstream. Beyond that, he wants to look at Iris performance on real applications and compare it to i965 to see if there are places where Iris can be made even faster. He would like to use FrameRetrace on application data as part of that process.

Now that Intel has joined the rest of the community in using Gallium, that is probably a good opportunity for the whole community to think about where Mesa should go from here. All of the drivers will be Gallium-based moving forward, so the community can collaborate with a focus on further Gallium enhancements. Gallium is not the ideal infrastructure (nor is Classic), but by dreaming about what could come in the future, the Mesa community can evolve it into something awesome, he said.

In the Q&A, he was asked about moving more toward a Vulkan-style driver. Graunke noted that there are several efforts to implement OpenGL on Vulkan and that he is interested to see where they go. There is something of an impedance mismatch between the baked-in pipelines of Vulkan and the wildly mutable OpenGL world and it is not clear to him whether that can be resolved reasonably. For Iris, he chose the route that RadeonSI had taken and proven, but if the Vulkan efforts pan out, that could be something to look at down the road.

[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to A Coruña for XDC.]

Comments (8 posted)

Fighting Spectre with cache flushes

By Jonathan Corbet
October 15, 2018
One of the more difficult aspects of the Spectre hardware vulnerability is finding all of the locations in the code that might be exploitable. There are many locations that look vulnerable that aren't, and others that are exploitable without being obvious. It has long been clear that finding all of the exploitable spots is a long-term task, and keeping new ones from being introduced will not be easy. But there may be a simple technique that can block a large subset of the possible exploits with a minimal cost.

Speculative-execution vulnerabilities are only exploitable if they leave a sign somewhere else in the system. As a general rule, that "somewhere else" is the CPU's memory cache. Speculative execution can be used to load data into the cache (or not) depending on the value of the data the attacker is trying to exfiltrate; timing attacks can then be employed to query the state of the cache and complete the attack. This side channel is a necessary part of any speculative-execution exploit.
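The probing step is simple enough to sketch. The C fragment below shows a minimal Flush+Reload-style timing probe; it illustrates the general technique only, and the 100-cycle threshold is an assumed value, not taken from any exploit discussed here:

    #include <stdint.h>
    #include <x86intrin.h>

    /* Evict the target cache line before letting the victim run. */
    static void flush_line(const void *addr)
    {
        _mm_clflush(addr);
    }

    /* Afterward, time a reload: a fast access means the victim's
     * speculative execution brought the line back into the cache. */
    static int was_accessed(const volatile uint8_t *addr)
    {
        unsigned int aux;
        uint64_t start, end;
        uint8_t v;

        start = __rdtscp(&aux);
        v = *addr;
        end = __rdtscp(&aux);
        (void)v;

        return (end - start) < 100;     /* assumed cached-vs-DRAM threshold */
    }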

It has thus been clear from the beginning that one way of blocking these attacks is to flush the memory caches at well-chosen times, clearing out the exfiltrated information before the attacker can get to it. That is, unfortunately, an expensive thing to do. Flushing the cache after every system call would likely block a wide range of speculative attacks, but it would also slow the system to the point that users would be looking for ways to turn the mechanism off. Security is all-important — except when you have to get some work done.

Kristen Carlson Accardi recently posted a patch that is based on an interesting observation. Attacks using speculative execution involve convincing the processor to speculate down a path that non-speculative execution will not follow. For example, a kernel function may contain a bounds check that will prevent the code from accessing beyond the end of an array, causing an error to be returned instead. An attack using the Spectre vulnerability will bypass that check speculatively, accessing data that the code was specifically (and correctly) written not to access.

In other words, the attack is doing something speculatively that, when the speculation is unwound, results in an error return to the calling program — but, by then, the damage is done. The error return is a clue that there may be something inappropriate going on. So Accardi's patch will, in the case of certain error returns from system calls, flush the L1 processor cache before returning to user space. In particular, the core of the change looks like this:

    __visible inline void l1_cache_flush(struct pt_regs *regs)
    {
	if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
	    static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
	    if (regs->ax == 0 || regs->ax == -EAGAIN ||
		regs->ax == -EEXIST || regs->ax == -ENOENT ||
		regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
		regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
			return;

	    wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
	}
    }

The code exempts some of the most common errors from the cache-flush policy, which makes sense. Errors like EAGAIN and ENOENT are common in normal program execution but are not the sort of errors that are likely to be generated by speculative attacks; one would expect an error like EINVAL in such cases. So exempting those errors should significantly reduce the cost of this mitigation without significantly reducing the protection that it provides.

(Of course, the code as written above doesn't quite work right, as was pointed out by Thomas Gleixner, but the fix is easy and the posted patch shows the desired result.)

Alan Cox argued for this patch, saying:

The current process of trying to find them all with smatch and the like is a game of whack-a-mole that will go on for a long long time. In the meantime (and until the tools get better) it's nice to have an option that takes a totally non-hot path (the fast path change is a single test for >= 0) and provides additional defence in depth.
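The "single test for >= 0" that Cox mentions points at the problem with the code as posted: successful system calls often return positive values, which would needlessly trigger a flush. A plausible sketch of the corrected check, based on that remark rather than on an actual revised patch, would be:

	if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
	    static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
	    /* Any non-negative return is a success: no flush needed. */
	    if ((long)regs->ax >= 0 ||
		regs->ax == -EAGAIN || regs->ax == -EEXIST ||
		regs->ax == -ENOENT || regs->ax == -EXDEV ||
		regs->ax == -ETIMEDOUT || regs->ax == -ENOTCONN ||
		regs->ax == -EINPROGRESS)
			return;

	    wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
	}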

Andy Lutomirski is not convinced, though. He argued that there are a number of possible ways around this protection. An attacker running on a hyperthreaded sibling could attempt to get the data out of the L1 cache between the speculative exploit and the cache flush, though Cox said that the time window available would be difficult to hit. Fancier techniques, such as loading the cache lines of interest onto a different CPU and watching to see when they are "stolen" by the CPU running the attack could be attempted. Or perhaps the data of interest is still in the L2 cache and could be probed for there. In the end, he said:

Adding ~1600 cycles plus the slowdown due to the fact that the cache got flushed to a code path that we hope isn't hot to mitigate one particular means of exploiting potential bugs seems a bit dubious to me.

Answering Lutomirski's criticisms is probably necessary to get this patch set merged. Doing so would require providing some numbers for what the overhead of this change really is; Cox claimed that it is "pretty much zero" but no hard numbers have been posted. The other useful piece would be to show some current exploits that would be blocked by this change. If that information can be provided, though (and the bug in the patch fixed), flushing the L1 cache could yet prove to be a relatively cheap and effective way to block Spectre exploits that have not yet been fixed by more direct means. As a way of hardening the system overall, it seems worthy of consideration.

Comments (23 posted)

OpenPGP signature spoofing using HTML

October 11, 2018

This article was contributed by Hanno Böck

Beyond just encrypting messages, and thus providing secrecy, the OpenPGP standard also enables digitally signing messages to authenticate the sender. Email applications and plugins usually verify these signatures automatically and will show whether an email contains a valid signature. However, with a surprisingly simple attack, it's often possible to fool users by faking — or spoofing — the indication of a valid signature using HTML email.

For example, until version 2.0.7, the Enigmail plugin for Mozilla Thunderbird displayed a correct and fully trusted signature as a green bar above the actual mail content. The problem: when HTML mails are enabled, this part of the user interface can be fully controlled by the mail sender. Below is an image of a real signed message in Enigmail:

[Enigmail real signature]

The signature below has been faked:

[Enigmail fake signature]

Thus an attacker can simply fake a signature by crafting an HTML mail that will display the green bar. This can, for example, be done by using a screenshot of a valid signature and embedding it as an image. The attack isn't perfect: in addition to the green bar, Enigmail shows an icon of a sealed letter in the header part of the mail. That part can't be faked. However, the green bar is the much more visible indicator. It's plausible to assume that many users would fall for such a fake.

After learning about this attack, the Enigmail developers changed the behavior in version 2.0.8, which was released in August. Enigmail now displays the security indicator above the mail headers and thus outside of an attacker's control.

[Enigmail fix]

Perfect fakes

An even better fake can be achieved in the KDE application KMail. Signed mail in KMail is framed with a green bar above and below the mail. This can easily be recreated with an HTML table. If the message is viewed in a separate window, the fake is indistinguishable from a proper signature. Only a look at the message-list overview, where an icon indicates a signed mail, can uncover the fake.

Similar attacks are possible in other mail clients, too. In GNOME's Evolution, a green bar is displayed below the mail, but a faked signature has some minor display issues due to a border drawn around the mail content. A plugin for the Roundcube web-mail client is similarly vulnerable. In many cases, signatures using S/MIME are also vulnerable.

The examples mentioned all rely on HTML mail to fake signature indicators. HTML mail is often seen as a security risk, which was particularly highlighted by the EFAIL attack that was published by a research team earlier this year. EFAIL relied largely on HTML mail as a way to exfiltrate decrypted message content. Security-conscious users often disable the display of HTML mail, but all of the mail clients mentioned have the automatic display of HTML messages enabled by default.

In the GPGTools plugin for Apple Mail, this kind of attack was not possible using the same technique. The indication of a valid signature is displayed as a pseudo-header. A correctly signed mail contains a header indicating "Security: Signed" below the "To:" field. Since the mail headers are clearly separate from the mail content it's not possible to use HTML mail to achieve anything here. However, it turns out that it's possible to inject additional lines into the "Subject:" header that look like an additional mail header. This can be achieved in two ways: either by encoding a newline character in the subject line or by sending multiple subject headers within one mail.
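An illustrative sketch of the encoded-newline variant (not the exact message used in this research) might be a subject header like:

    Subject: =?utf-8?Q?Lunch_tomorrow=3F=0ASecurity:_Signed?=

Here the MIME encoded-word hides a line feed (=0A), so a client that decodes the subject before rendering it shows a second line that looks like a "Security: Signed" pseudo-header.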

Despite being able to inject additional fake mail headers, the attacker can't inject headers below the "To:" line. Thus, the order of the headers is not correct in a fake created with this trick. The pseudo-header also contains an icon: a check mark in a circle. While some similar characters exist in the Unicode standard, this exact icon cannot be replicated by an attacker.

Despite the shortcomings of this attack, the developers of GPGTools released an update that avoids this attack vector. Apple has not yet commented on the issue that, at its core, is a bug in Apple Mail, which should not allow multi-line subjects.

Text fakes

The text-mode client Mutt naturally isn't vulnerable to HTML-based attacks. Yet the most prominent indication of a signed message in Mutt is simply the output of the GPG verification command within the mail. One can easily send this output as part of the mail body. However, at the bottom of the screen, there is a message confirming the signature that cannot be faked. Here we see a real signature verification in Mutt:

[Mutt real signature]

And here is an example of a faked signature verification:

[Mutt fake signature]

When asked about this in email, a Mutt developer confirmed that the developers are aware of the issue and that several mitigations exist. Notably, enabling colors for signed messages makes this attack almost useless, as an attacker can't change colors of a mail in Mutt's standard settings.

It's worth mentioning that all these attacks require the attacker to know details about the victim's system, in particular, the mail client they use. However this information often isn't secret. Many mail clients routinely include the "X-Mailer:" header that specifies which mail client and version was used.

User-interface attacks

The attacks on signatures are a case of user-interface (UI) attacks. These don't just affect mail clients, they can happen anywhere that an attacker can potentially influence the look of some UI element.

In the past it was possible to fake a browser's URL-entry bar in a popup. This is prevented in modern browsers; all browser windows — including popups — always contain a real URL bar. Yet similar attacks remain possible on mobile browsers [PDF]. Other attacks of that kind are still possible. One is the picture-in-picture attack, where a web page may simply draw a new window that looks just like a whole browser within the page. Such a trick was recently used in an attempt to phish Steam user accounts.

UI attacks are nothing new, but they haven't been properly investigated in the email space until recently. The first defense is obvious: the security indicators need to be moved out of the attacker-controlled space. For users, disabling HTML mail display and promptly applying security updates to mail-handling programs — good security practices in any case — are the best ways to avoid these fakes.

[I presented this attack in a lightning talk at the SEC-T conference [YouTube] and was interviewed afterward [YouTube] for a podcast.]

Comments (11 posted)

I/O scheduling for single-queue devices

By Jonathan Corbet
October 12, 2018
Block I/O performance can be one of the determining factors for the performance of a system as a whole, especially on systems with slower drives. The need to optimize I/O patterns has led to the development of a long series of I/O schedulers over the years; one of the most recent of those is BFQ, which was merged during the 4.12 development cycle. BFQ incorporates an impressive set of heuristics designed to improve interactive performance, but it has, thus far, seen relatively little uptake in deployed systems. An attempt to make BFQ the default I/O scheduler for some types of storage devices has raised some interesting questions, though, on how such decisions should be made.

A bit of review for those who haven't been following the block layer closely may be in order. There are two generations of the internal API used between the block layer and the underlying device drivers, which we can call "legacy" and "multiqueue". Unsurprisingly, the legacy API is older, while the multiqueue API was first merged in 3.13. The conversion of block drivers to the multiqueue API has been ongoing since then, with the SCSI subsystem only switching over, after a false start, in the upcoming 4.19 release. Most of the remaining holdout legacy drivers will be converted to multiqueue in the near future, at which point the legacy API can be expected to go away.

Several I/O schedulers exist for the legacy interface but, in practice, only two are in common use: cfq for slower drives and noop for fast, solid-state devices. The multiqueue interface was aimed at fast devices from the outset; initially, it was not able to support an I/O scheduler at all. That capability was added later, along with the mq-deadline scheduler, which was essentially a forward port of one of the simpler legacy schedulers (deadline). BFQ, which came later, is also a multiqueue-API scheduler.

In early October Linus Walleij posted a patch making BFQ the default I/O scheduler for single-queue devices driven by way of the multiqueue API. The idea of a single-queue multiqueue device may seem a bit contradictory at a first encounter, but one should remember that "multiqueue" refers to the API which, unlike the legacy API, is capable of handling block devices that implement more than one queue in hardware (but does not require multiple queues). As more drivers move to this API, more single-queue devices will be driven using it. In this particular case, Walleij is concerned with SD cards and similar devices, the kind often found in mobile systems. The expectation is that devices with a single hardware queue can be expected to be relatively slow, and that BFQ will extract better performance from those devices.

The initial response from block subsystem maintainer Jens Axboe was not entirely positive: "I think this should just be done with udev rules, and I'd prefer if the distros would lead the way on this". This approach is not inconsistent with how the kernel tries to do things in general, leaving policy decisions to user space. But, of course, current kernels, by selecting mq-deadline for such devices, are already implementing a specific policy.

There were a few objections made to Axboe's position. Paolo Valente, the creator of BFQ, asserted that almost nobody understands I/O schedulers or how to choose one, so almost everybody will stick with whatever default the system gives them. And mq-deadline, as a default, is far worse than BFQ for such devices, he said. Walleij added that there are quite a few systems out there that do not use udev at all, so a rule-based approach will not work for them. On embedded systems where initramfs is not in use, it's currently not possible to mount the root filesystem using BFQ at all. As an additional practical difficulty, the number of hardware queues provided by a device is currently not made available to udev, so it could not effect this particular policy choice in any case (though that would be relatively straightforward to fix).
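For reference, the sort of udev rule Axboe has in mind might look like the following sketch (illustrative, not a rule shipped by any distribution). Since the hardware-queue count is not available to udev, it must key on proxies such as the device type or the rotational flag:

    # Use BFQ for MMC/SD devices and for rotational SCSI disks.
    ACTION=="add|change", KERNEL=="mmcblk[0-9]*", ATTR{queue/scheduler}="bfq"
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"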

Oleksandr Natalenko was not impressed by the embedded-systems argument; he said that the people building such systems know which I/O scheduler they should use and can build their systems accordingly. Mark Brown took issue with that view of things, though:

That's not an entirely realistic assessment of a lot of practical embedded development - while people *can* go in and tweak things to their heart's content and some will have the time to do that there's a lot of small teams pulling together entire systems who rely fairly heavily on defaults, focusing most of their effort on the bits of code they directly wrote.

Walleij echoed that view, and added that there have been many times in kernel history where the decision was made to try to do the right thing automatically when possible, without requiring intervention from user space.

Bart Van Assche, instead, questioned the superiority of the BFQ scheduler. He initially claimed that it would slow down kernel builds (a sure way to prevent your code from being merged), but Valente challenged that assessment. Van Assche's other concern, though, had to do with fast solid-state SATA drives. Once SCSI switches over to the multiqueue API, those drives will show up with a single queue, and will thus be affected by this change. He questioned whether BFQ could be as fast as mq-deadline in that situation, but did not present any test results.

One other potential problem, as pointed out by Damien Le Moal, is shingled magnetic recording (SMR) disks, which often require that write operations arrive in a specific order. BFQ does not provide the same ordering guarantees that mq-deadline does, so attempts to use it with SMR drives are unlikely to lead to a high level of user satisfaction. Valente has a plan for how to support those drives in BFQ, but he acknowledged that they will not work correctly now.

The discussion wound down without reaching any sort of clear conclusion. It would appear that, before being merged, a patch of this nature would need to gain some additional checks to ensure, at a minimum, that BFQ is not selected for hardware that it cannot schedule properly. No such revision has been posted as of this writing. The proponents of BFQ seem unlikely to give up in the near future, though, so this topic seems like one that can be expected to arise again.

Comments (8 posted)

Secure key handling using the TPM

October 17, 2018

This article was contributed by Tom Yates


Kernel Recipes

Trusted Computing has not had the best reputation over the years — Richard Stallman dubbing it "Treacherous Computing" probably hasn't helped — though those fears of taking away users' control of their computers have not proven to be founded, at least yet. But the Trusted Platform Module, or TPM, inside your computer can do more than just potentially enable lockdown. In our second report from Kernel Recipes 2018, we look at a talk from James Bottomley about how the TPM works, how to talk to it, and how he's using it to improve his key handling.

Everyone wants to protect their secrets and, in a modern cryptographic context, this means protecting private keys. In the most common use of asymmetric cryptography, private keys are used to prove identity online, so control of a private key means control of that online identity. How damaging this can be depends on how much trust is placed in a particular key: in some cases those keys are used to sign contracts, in which case someone who absconds with a private key can impersonate someone on legal documents — this is bad.

The usual solution to this is hardware security modules, nearly all of which are USB dongles or smart cards accessed via USB. Bottomley sees the problem with these as capacity: most USB devices can only cope with one or two key pairs, and smart cards tend to only hold three. His poster child in this regard is Ted Ts'o, whose physical keyring apparently has about eleven YubiKeys on it. Bottomley's laptop has two VPN keys, four SSH keys, three GPG keys (because of the way he uses subkeys) and about three other keys. Twelve keys is beyond the capacity of any USB device that he knows of.

[James Bottomley]

But nearly all modern general-purpose computing hardware has a TPM in it; laptops have had them for over ten years. The only modern computers that don't have one, said Bottomley, are Apple devices "and nobody here would use one of those". The TPM is a small integrated circuit on the motherboard that the CPU can talk to, which can perform cryptographic operations. The advantage of using the TPM for key management is that the TPM can scale to thousands of keys, though the mechanism by which it does this is, as we shall see, interesting. Most of the TPMs currently in the field are TPM 1.2 compliant. Thanks largely to Microsoft, he added, TPM 2.0 is the up-and-coming standard; some modern Dell laptops are already shipping with TPM 2.0, even though the standard isn't actually finalized yet.

The main problem with TPMs under Linux is that accessing them has been a "horrifically bad" experience; anyone who tried to use them has ended up never wanting to deal with them again. The mandated model for dealing with TPMs is the Trusted Computing Group (TCG) Software Stack (TSS), which is "pretty complicated". The Linux implementation of TSS 1.2 is TrouSerS; it was completed in 2012, and has "bit-rotted naturally since then", but remains the only way to use TPM 1.2 under Linux. The architecture involves a TrouSerS library that is linked to any relevant application, which talks to a user-space daemon (tcsd), which in turn talks to the TPM through a kernel driver. One of the many issues Bottomley has with this is that the daemon is a natural point of attack for anyone looking to run off with TPM secrets. For his day job, which is running containers in the cloud, the design is completely unacceptable: cloud tenants will not tolerate "anything daft in security terms", and a single daemon through which the secrets of all the system's tenants pass is certainly that.

The added functionality in TPM 2.0 means we can do much better than the TCG model. For many people in the industry, departure from the TCG model is heresy, but fortunately, said Bottomley, there are large free-wheeling sections of the industry (i.e. Linux people) who are happy to use any programming model that actually works. As with 1.2, the TCG is writing a software stack, but it's naturally even more complex than 1.2's TSS. IBM has had a fully-functional TPM 2.0 stack since May 2015 and Intel is writing its own. As of 4.12, the Linux kernel has a TPM 2.0 resource manager.

Under TPM 1.2, the asymmetric encryption algorithm was 2048-bit RSA (RSA2048), which is still good enough today. The hashing function was SHA-1, which isn't good enough any more. To avoid falling into the trap of mandating an algorithm that ages badly, TPM 2.0 features "algorithm agility", whereby it can support multiple hashing functions and encryption algorithms, and will use whichever the user requests. In practice, this usually means that in addition to SHA-1 and RSA2048, SHA-256 and a few elliptic curves are supported, and this is good enough for another few years. When SHA-256 starts to show its age, TPM 2 can support a move to, for example, the SHA-3 family of algorithms.

There are problems with elliptic curves, however; most TPMs support only BN-256 and NIST P-256, but nobody uses the former, and the NSA had a hand in the creation of the latter, so nobody trusts it. Attempts to get more curves added tend to run into borders; some nations don't like crypto being exported, and others will only allow it to be imported if it's crypto they like. Curve25519 has not apparently even been on the TCG's radar, although Bottomley said there's now a request into the TCG to approve it, and since it's already been published worldwide, there is some chance that it may be generally approved as a standard addition to the TPM.

The TPM's functions include shielded key handling, things related to trusted operating systems including measurement and attestation, and data sealing; Bottomley focused on the key-handling capabilities. The TPM organizes keys in hierarchies with a well-known key at the top; this is a key, generated by the TPM, the private half of which is known only to the TPM itself, and the public half of which it will give to anyone who asks. This enables the secure exchange of secrets with the TPM. If the TPM is reinitialized, the random internal seed from which the TPM derives its well-known key is regenerated, and anything the TPM has ever encrypted becomes unusable. This is important in terms of decommissioning devices; reinitializing the TPM is enough to render it, and anything it has ever encrypted, harmless.

Bottomley said that the TPM is capable of scaling to thousands of keys. While that's not exactly false, the TPM accomplishes this by not storing any of your keys at all. The TPM has very little memory and can only hold around three transient keys at any one time. Key handling is initially a matter of feeding a secret key to the TPM, which encrypts it into a "key blob" that only the TPM can read, which the TPM then gives back to you. Key storage is your responsibility, but with the safety net that these key blobs are meaningless to any TPM save yours, so they are not the difficult-to-store dynamite that an unprotected private key is. You can safely pass the blob in plaintext to the TPM any time you want the TPM to perform a key operation. The TPM will not decrypt a blob and give you the private key inside it; it will only perform operations with such a key. This also means that if you want to make backups of a private key, it is quite important to do it before you feed it to the TPM. If you ever get a new system, migrating your identity to it will not be possible unless you have these backups stored safely.

Prior to about March 2018, it was generally thought that the physical closeness of the CPU and the TPM made the channel between them effectively secure, at least without physical access to the hardware. Unfortunately, this has turned out not to be true, and a software tool called TPM Genie, which can attack this channel, was released. TCG's response was to develop the Enhanced System API (ESAPI), which allows for encrypted parameters in communication with the TPM (and, for those of a suspicious mindset, it also allows the TPM to prove it's really a TPM).

ESAPI has its wrinkles: encrypted commands support only one encrypted argument, which must be the first, but some secret-handling commands, such as the one to import a private key in the first place, pass that key as the second argument, which was apparently overlooked by the TCG. But these problems were solved (in that specific case, by modifying the import command so that a randomly generated single-use symmetric key was used to encrypt the private key, and passed in the secure first field so that the TPM can decrypt the private key).

Having hardware support is all very well, but if user space can't easily come to grips with it, it's useless. Support in user space has come by recognizing that existing cryptosystems tend to use passphrase-protected (i.e. passphrase-encrypted) key files. With fairly small modifications, these can become passphrase-protected key blobs; possession of the passphrase allows the OS to feed the blob to the TPM. This turns key-file-based systems into full two-factor authentication: if your laptop is stolen, each blob is passphrase-encrypted and cannot be fed to the TPM without that passphrase, and if your passphrase is compromised it is of no use to an attacker without your physical laptop.

OpenSSL now has a TPM 2.0 engine, though there is a problem: many projects that use OpenSSL don't load the engine part of the API and thus cannot use engine-based keys. However, the fix is usually just a couple of lines of extra code in any program that uses OpenSSL, to ensure that the engine is loaded when OpenSSL is initialized. OpenVPN, for example, has been converted this way and can use the TPM. OpenSSH has been converted, though the agent requires additional patching. GnuPG uses libgcrypt, not OpenSSL, but after discussion at last year's Kernel Recipes, Bottomley has written a TPM module for GnuPG, which he demonstrated as part of the talk. Sbsigntools and gnome-keyring also now have TPM support. Some use cases remain unsupported: the TPM is a slow, low-power device that cannot handle hundreds of key operations per second, so it will likely remain unusable for things like full-disk encryption.
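The "couple of lines" in question might look like the following C sketch; the "tpm2" engine id and the helper name are assumptions for illustration, not code from any of the converted programs:

    #include <openssl/engine.h>
    #include <openssl/evp.h>

    /* Hypothetical helper: load a TPM-wrapped private key through an
     * OpenSSL engine; the "tpm2" engine id is an assumption. */
    EVP_PKEY *load_tpm_key(const char *blob_file)
    {
        ENGINE *e;

        ENGINE_load_builtin_engines();
        e = ENGINE_by_id("tpm2");
        if (e == NULL || !ENGINE_init(e))
            return NULL;

        /* blob_file holds a TPM key blob, not a plaintext private key;
         * the TPM will use the key without ever revealing it. */
        return ENGINE_load_private_key(e, blob_file, NULL, NULL);
    }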

In response to an audience question about audit, Bottomley accepted that TPM 1.2 was poor in this regard. This allowed a number of issues to slip through the net, such as weak prime generation for Infineon TPMs. For TPM 2.0, the TCG has upped its game; there is a certification program involving a standardized test suite to check correct operation. There is also a reference implementation, written by Microsoft, but available under a BSD license, though apparently it is not widely used by TPM vendors.

The days of practical username/password authentication have already gone. The more we can support two-factor authentication, the more secure a future we can reasonably expect. The TPM is by no means the only way of doing this — I shall continue to use my YubiKey over NFC, and my GPG smartcard — but as TPM 2.0 hardware becomes more widespread it's an increasingly practical way of doing it, and it gets utility from a part of your computer that until now may have largely been looked on with disdain.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

Comments (35 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Tutanota; Bro becomes Zeek; SFLC on automotive software and copyleft; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds