
5.10 Merge window, part 1

By Jonathan Corbet
October 16, 2020
As of this writing, 7,153 non-merge changesets have been pulled into the mainline Git repository for the 5.10 release — over a period of four days. This development cycle is clearly off to a strong start. Read on for an overview of the significant changes merged thus far for the 5.10 kernel release.

Architecture-specific

  • The arm64 architecture can now do performance-events monitoring over Arm's CMN-600 interconnect.
  • The Arm v8.5 memory tagging extension [PDF] is now supported. This feature allows a four-bit tag to be assigned to every 16-byte "granule" in physical memory; whenever a pointer into a specific granule is dereferenced, the CPU will ensure that the tag stored in the pointer matches that assigned to the granule. Proper use of this extension can trap use-after-free and buffer-overflow bugs (and attempts to exploit those bugs). See this article for more information on this feature.
  • The ia64 performance-monitoring code has been removed. Few are likely to miss it, since it hasn't worked for years.
  • AMD's "secure encrypted virtualization" (SEV) feature encrypts memory assigned to virtualized guests; Linux has supported SEV for a while. The new SEV-ES feature, merged for 5.10, expands SEV by encrypting the guest's processor registers as well, making them unavailable to the host unless the guest explicitly shares them.

Core kernel

Filesystems and block I/O

  • The Btrfs filesystem has gained some significant performance improvements in fsync() operations.
  • XFS has seen a bunch of work to resolve its year-2038 problems; timestamps in this filesystem are now good through the year 2486. Developers now have clear warning of a problem coming in 466 years, but chances are they will procrastinate on addressing it for at least 458 of them.

Hardware support

  • Graphics: Lontium LT9611 DSI/HDMI bridges, Toshiba TC358775 DSI/LVDS bridges, Toshiba TC358762 DSI/DPI bridges, Mantix MLAF057WE51-X MIPI-DSI LCD panels, Cadence DPI/DP bridges, Samsung S6E63M0 RGB DSI interfaces, and NXP i.MX8MQ display controllers.
  • Hardware monitoring: Analog Devices ADM1266 sequencers, MPS MP2975 multi-phase controllers, Intel MAX10 BMC monitoring chips, and Moortec Semiconductor MR75203 PVT controllers.
  • Industrial I/O: Analog Devices ADXRS290 dual-axis MEMS gyroscopes, AMS AS73211 XYZ color sensors, and TI HDC2010 relative humidity and temperature sensors.
  • Miscellaneous: Amazon Annapurna Lab memory controllers, MediaTek MStar interrupt controllers, Xiphera XIP8001B true random number generators, Ingenic true random number generators, MCHP Sparx5 SDHC interfaces, Baikal-T1 SPI controllers, TI LP5036/30/24/18/12/9 LED controllers, Kontron sl28cpld interrupt/watchdog/PWM controllers, ENE KB3930 embedded controllers, Intel MAX 10 board management controllers, Kinetic KTD253 backlight drivers, Hisilicon 3670 SPMI controllers, Hisilicon Hi6421v600 SPMI power-management ICs, and Qualcomm SM8150 and SM8250 interconnect buses.
  • Pin control: Qualcomm 8226 pin controllers, Actions Semi S500 pin controllers, Mediatek MT8192 and MT8167 pin controllers, Toshiba Visconti TMPV7700 series pin controllers, and Allwinner A100 pin controllers.
  • Regulator: Richtek RT4801 regulators, Raspberry Pi 7-inch touchscreen panel ATTINY regulators, MediaTek MT6360 SubPMIC regulators, and Richtek RTMV20 load switch regulators.
  • Sound: MediaTek MT6359 codecs, Microchip S/PDIF controllers, Cirrus Logic CS4234 codecs, and Texas Instruments TAS2764 mono audio amplifiers.
  • USB: Hisilicon hi3670 USB PHYs, Mediatek MT6360 Type-C controllers, UniPhier AHCI PHYs, Intel Lightning Mountain USB PHYs, STMicroelectronics STUSB160x Type-C controllers, Maxim TCPCI based Type-C chips, and Qualcomm PMIC USB Type-C detectors.
  • There is an entirely new user-space API for GPIO lines. Naturally, this API is rigorously undocumented, but some information can be gleaned from this commit.
  • The "raw" char device, which provides a classic Unix-style char interface to block devices, has been deprecated with the intention of removing it in the 5.14 release. The raw device was primarily used for direct I/O, which has been supported via the O_DIRECT flag since 2002.

Security-related

  • Support for the RC4-HMAC-MD5 KerberosV algorithm has been removed from the crypto subsystem. This algorithm was created for compatibility with Windows 2000; according to the commit, its removal "should only adversely affect interoperability with Windows NT/2000 systems that have not received any updates since 2008 (but are connected to a network nonetheless)".
  • The SM2 digital-signature algorithm is now supported.
  • It is now possible to remove security.selinux extended attributes from files, but only before the SELinux policy is loaded. This makes it possible to "unlabel" files when SELinux is not being used.

Virtualization and containers

Internal kernel changes

  • The seqcount latch specialized lock type has been added.
  • "Orphan sections" — code or data sections that find their way into the kernel image without having been explicitly put there by the linker script — will now generate warnings during the kernel build. This change was made to protect the kernel from the possibility of unwanted changes in how linkers place those sections; the merge changelog describes orphan sections as "a long-standing source of obscure bugs".
  • Static calls are a mechanism for performing indirect function calls with better performance, especially on systems where retpolines would otherwise have to be used to protect against Spectre vulnerabilities. This mechanism has been under development since 2018; it was finally merged for 5.10.
  • The printk() subsystem has gained a new lockless ring buffer meant to be a first step in resolving a number of problems in this area. See this article for an overall description of the printk() work, including the new ring buffer.
  • The minimum version of Clang needed to build the kernel is now 10.0.1.

The 5.10 merge window can be expected to close on October 25, after which the stabilization portion of the development cycle will begin. Stay tuned for LWN's coverage of the second half of the 5.10 merge window, to be published shortly after the window closes.

Index entries for this article
KernelReleases/5.10



5.10 Merge window, part 1

Posted Oct 17, 2020 4:42 UTC (Sat) by roc (subscriber, #30627) [Link] (16 responses)

Nitro Enclaves are interesting. I guess this means Amazon isn't planning to follow Azure into nested virtualization. That's probably a good decision.

5.10 Merge window, part 1

Posted Oct 17, 2020 4:53 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

I've heard from Amazon engineers at the last re:Invent that they reasonably believe that nested virtualization is unsafe on the current Intel CPUs. Their _current_ (with emphasis on "current") Graviton2 ARM family also doesn't support it.

5.10 Merge window, part 1

Posted Oct 17, 2020 11:40 UTC (Sat) by mss (subscriber, #138799) [Link] (6 responses)

> that they reasonably believe that nested virtualization is unsafe on the current Intel CPUs.

Ohh, that's interesting.

Do you know whether they meant the current KVM nVMX implementation (which is tricky to get right with issues getting fixed all the time) or the VMX support itself in the CPU (as the expression "current CPUs" in your comment would suggest)?

5.10 Merge window, part 1

Posted Oct 17, 2020 19:54 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

No idea. Back when I was working at Amazon, there were several huge hardware-level scares that required mass reboots of client VMs (both for legacy Xen and newer KVM-based VMs).

5.10 Merge window, part 1

Posted Oct 18, 2020 16:05 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (4 responses)

Google and Oracle have been offering nVMX for years and have contributed lots of changes. nVMX is definitely ready for production use and nSVM is getting there (though there is one processor erratum that complicates things).

What follows is my guess on what things really look like. First, Amazon plays complicated games with /dev/mem and memremap for EC2 in order to save the price of "struct page" for guest memory (that's 1.5%, so nothing to sneeze at), and that makes nested virtualization slower. Second, their kernel is probably based on older versions of Linux and thus lacks a lot of the improvements made to nested virtualization lately. And finally, Amazon sells bare-metal instances at a higher price, so they have no interest in covering virtualization workloads.

5.10 Merge window, part 1

Posted Oct 18, 2020 21:35 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

VMX seems so complicated to me that I expect there to be lots of hardware bugs in it accessible from the hypervisor. That's not a big deal if you trust the hypervisor, which you mostly do at the root, but it's deeply problematic for nested virtualization.

*Maybe* so much bug hunting has been done by people with variously-coloured hats, and so many bugs fixed, that this risk has been reduced to an acceptably low level. But if this has been done then I would expect some of those bugs to have been published, and I haven't seen that, not like we have for other attack surfaces.

5.10 Merge window, part 1

Posted Oct 18, 2020 22:02 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

It is not *that* complicated actually, once you get familiar with it. Also, large parts of the state are not passed through to the processor by the root hypervisor, so the configuration for the nested guest ends up being not very different from that for the non-nested case.

5.10 Merge window, part 1

Posted Oct 19, 2020 4:44 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

> though there is one processor erratum that complicates things

What's the nSVM-related erratum on AMD?

5.10 Merge window, part 1

Posted Oct 19, 2020 9:25 UTC (Mon) by pbonzini (subscriber, #60935) [Link]

VMLOAD/VMRUN/VMSAVE instructions check their operand against a range of restricted addresses (such as the SMM TSeg) and generate an exception if the operand is within that range. When nesting is on, the address should be checked after it has gone through the nested page tables, but instead it is checked as-is.

If the nested hypervisor is unlucky enough to place its VMCB at an address that the processor rejects, it will fail to enter the nested guest. There are various possible workarounds though (the simplest is to reduce the amount of memory below 4GB in the nested hypervisor to 1GB, because usually SMM TSeg is somewhere between 0x40000000 and 0xC0000000).

Apparently it's been there since the first SVM processors but we only noticed last year and it took a few months to find the root cause.

5.10 Merge window, part 1

Posted Oct 18, 2020 19:06 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (7 responses)

Nitro enclaves are not a replacement for nested virtualization. The idea is that you offload part of the work, for example logging, to the enclave so that it is impossible to tamper with that data in case the parent VM is taken over. The parent is not necessarily trusted. With nested virtualization, instead, the parent has total control over the guest and must be trusted.

5.10 Merge window, part 1

Posted Oct 18, 2020 21:27 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

Thanks for clarifying that; I guess it makes sense as an alternative to SGX that is simpler and actually works.

But, is "logging" actually a valid use-case for Nitro enclaves?

> Enclaves are virtual machines attached to EC2 instances that come with no persistent storage, no administrator or operator access, and only secure local connectivity to your EC2 instance.

Not clear to me how a Nitro enclave can improve the security of logging if it can't actually write logs anywhere without cooperation from the untrusted parent instance.

5.10 Merge window, part 1

Posted Oct 18, 2020 22:11 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

You could encrypt logs within the enclave and send them back to the parent VM that immediately sends them somewhere for storage. This doesn't guarantee that logs are actually forwarded and stored but it does guarantee confidentiality and integrity; in case of an attack this means that the attacker cannot wipe all of its traces. And at least it's more interesting than DRM. :-)

The enclave only gets a vsock connection to the outer world. One interesting feature would be the ability for the parent to configure (at enclave startup) a mapping from a vsock port to a remote TCP address/port, with the forwarding being done by the host so that the link cannot be broken.

5.10 Merge window, part 1

Posted Oct 19, 2020 4:47 UTC (Mon) by josh (subscriber, #17465) [Link] (4 responses)

> Nitro enclaves are not a replacement for nested virtualization

If you don't care about the tamper-proof attestation that Nitro Enclaves provide, what's the difference between a Nitro enclave and a locked-down KVM using nested virtualization with only a vsock, other than that nested virtualization seems likely to be slower?

5.10 Merge window, part 1

Posted Oct 19, 2020 9:28 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (3 responses)

None, but that's a fairly big "if". The point of Nitro enclaves is that you don't trust the host.

5.10 Merge window, part 1

Posted Oct 20, 2020 21:41 UTC (Tue) by josh (subscriber, #17465) [Link] (2 responses)

Sure, but they're also the only approximation to "nested virtualization" on AWS, so they're wildly useful even if you don't care about keeping the host untrusted.

5.10 Merge window, part 1

Posted Oct 21, 2020 7:21 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (1 responses)

You have to rewrite your application to use them. Nested virtualization is mostly useful for test deployments of Kubernetes clusters and the like. If you want virtualization on AWS you have to fork out the money and get bare-metal instances, period.

5.10 Merge window, part 1

Posted Oct 21, 2020 21:40 UTC (Wed) by josh (subscriber, #17465) [Link]

> You have to rewrite your application to use them.

Only if your application started out assuming more devices than just a vsock.

> If you want virtualization on AWS you have to fork out the money and get bare metal instances, period.

Money isn't the issue. Bare-metal instances take an incredibly long time to start up (~10 minutes), compared to virtualized instances (~5-10 seconds, which is already far too long).

5.10 Merge window, part 1

Posted Oct 17, 2020 5:07 UTC (Sat) by re:fi.64 (subscriber, #132628) [Link] (4 responses)

> Developers now have clear warning of a problem coming in 448 years, but chances are they will procrastinate on addressing it for at least 440 of them.

Terrifyingly accurate.

5.10 Merge window, part 1

Posted Oct 18, 2020 16:06 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (1 responses)

430 based on experience with 2038.

5.10 Merge window, part 1

Posted Oct 19, 2020 4:06 UTC (Mon) by gus3 (guest, #61103) [Link]

If I'm still around in 2448, I promise I'll bring attention to this issue in XFS.

5.10 Merge window, part 1

Posted Oct 22, 2020 11:13 UTC (Thu) by eru (subscriber, #2753) [Link] (1 responses)

Such an odd extension (not the "virtual eternity" of 64 bits). I guess they tried to shoe-horn the extended time into some existing on-disk bits for backward compatibility, so existing XFS installations presumably work seamlessly with this. Right?

5.10 Merge window, part 1

Posted Oct 28, 2020 18:53 UTC (Wed) by BenHutchings (subscriber, #37955) [Link]

ext4 had 32-bit seconds + 32-bit nanoseconds fields, and has reassigned 2 bits from nanoseconds to seconds. I would guess that XFS has made a similar change.

5.10 Merge window, part 1

Posted Oct 17, 2020 8:16 UTC (Sat) by darwi (subscriber, #131202) [Link] (10 responses)

> There is an entirely new user-space API for GPIO lines. Naturally, this API is rigorously undocumented.

Unfortunately, this is a sign of the maintainer not properly doing his or her job. Proper documentation should be a prerequisite of merging such patches.

5.10 Merge window, part 1

Posted Oct 17, 2020 11:37 UTC (Sat) by khim (subscriber, #9252) [Link] (8 responses)

No. That would be a horrible waste of resources. APIs are changed so radically during the review process that writing adequate documentation for them (with examples and diagrams) is usually pointless. It could be redone 2, 3, even 10 times in extreme cases.

But of course after the new API is merged it's too late to write documentation because, well, it's merged, why bother?

Not sure how that issue could be adequately resolved…

5.10 Merge window, part 1

Posted Oct 17, 2020 13:47 UTC (Sat) by magfr (subscriber, #16052) [Link]

Process.
Do not require (but do encourage) documentation for a proposed API but demand it in order to merge the API.

There is actually another benefit of this - forcing documentation is kind of an "explain it to the duck" session for developers that forces them to think about their new shiny interface from another angle.

5.10 Merge window, part 1

Posted Oct 17, 2020 14:34 UTC (Sat) by dxin (guest, #136611) [Link] (1 responses)

Don't reviewers need documentation as well?

5.10 Merge window, part 1

Posted Oct 18, 2020 21:39 UTC (Sun) by roc (subscriber, #30627) [Link]

Yes, exactly. Good documentation eases the review process for maintainers.

Writing good documentation is also a clarifying process for everyone to understand exactly what the feature does and how it can and cannot be used. It's too easy to write some code and only later realize that the API is actually very difficult to use correctly; writing the documentation usually exposes this.

5.10 Merge window, part 1

Posted Oct 17, 2020 17:11 UTC (Sat) by mkubecek (guest, #130791) [Link]

With ethtool netlink interface, I tried to go even farther: I tried to document each part of the API first, before starting with the implementation on either side. I don't think it was a waste of resources, I did actually find it quite helpful at times.

5.10 Merge window, part 1

Posted Oct 19, 2020 20:33 UTC (Mon) by darwi (subscriber, #131202) [Link] (3 responses)

> No. That would be horrible waste of resources. APIs are changed so radically during review process that writing adequate documentation for them (with examples and diagrams) is usually pointless. It could be redone 2, 3, even 10 times in extreme cases.

That’s a BS argument, sorry.

If the newly added user space API is really **that** contentious, then by the time the maintainer is going to merge the changes, he asks for a final iteration with proper documentation. Simply because if it’s not done at that moment, it might never be done in the future.

Stop justifying bad behavior please.

5.10 Merge window, part 1

Posted Oct 19, 2020 23:38 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

The other thing is, you should document BEFORE you write code. Without a plan, how do you know if your API is going to work / do what it should / be of any use?

They say that no battle plan survives five minutes' contact with the enemy, but without a plan you don't stand a chance.

Cheers,
Wol

5.10 Merge window, part 1

Posted Oct 20, 2020 8:32 UTC (Tue) by farnz (subscriber, #17727) [Link]

This is where kerneldoc is great - documentation for the API sits with the implementation, and if you're working on the implementation, the API docs are right there, staring you in the face with the code you're reviewing.

See the DRM top-level, which gets turned into this document based on both the top-level and the code included by the top-level.

5.10 Merge window, part 1

Posted Oct 23, 2020 20:55 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link]

> If the newly added user space API is really **that** contentious, then by the time the maintainer is going to merge the changes, he asks for a final iteration with proper documentation.

This sounds very reasonable. No documentation, no merge. The hard part is having the resolve to make it stick the first few times. Once everyone knows that's the rule and it will be enforced, they'll know to have the documentation done before the merge. Then the only hard part is keeping the documentation up to date as the code changes.

5.10 Merge window, part 1

Posted Oct 18, 2020 23:29 UTC (Sun) by deater (subscriber, #11746) [Link]

Urgh, I just finished re-writing the examples and homework in my embedded-systems class to use the "new" GPIO interface, on the theory that the old sysfs interface is deprecated and scheduled for removal in 2020.

only to find out it was a complete waste of time due to yet another GPIO interface being added.

although I guess if the new interface has debounce support I won't have to spend a lot of time trying to convince students that sprinkling random usleep()s all over their code somehow counts as debouncing.

5.10 Merge window, part 1

Posted Oct 17, 2020 12:53 UTC (Sat) by flussence (guest, #85566) [Link]

Is it still the case that AMD's on-cpu memory encryption is incompatible with AMD's own (potentially on-cpu) GPUs? That's a bit of an embarrassment.

5.10 Merge window, part 1

Posted Oct 17, 2020 18:46 UTC (Sat) by jrtc27 (subscriber, #107748) [Link] (3 responses)

> Proper use of this extension can trap use-after-free and buffer-overflow bugs (and attempts to exploit those bugs).

It's worth noting that this protection is only probabilistic. For use-after-free, the memory tag will eventually cycle back around and be usable again with an old pointer; for buffer overflows, you in general have a 1/16 chance of the out-of-bounds memory having a matching tag. The exception is that overflows into adjacent memory can be caught deterministically by choosing tags such that adjacent allocations' tags always differ. Note also that using the tags for both purposes simultaneously multiplies the constraints together, making both mitigations weaker (a higher probability of an invalid memory access going undetected) than enabling either one exclusively.

5.10 Merge window, part 1

Posted Oct 18, 2020 8:43 UTC (Sun) by Jandar (subscriber, #85683) [Link] (2 responses)

> for use-after-free the memory tag will eventually cycle back round and be usable again with an old pointer

One could reserve one of the 16 tags for unallocated or freed memory, so no cycling would be required.

> for buffer overflows you in general have a 1/16 chance of the out-of-bounds memory having a matching tag

If the n tags used for this are assigned in a cycling pattern, then small overflows are detected reliably, and if n isn't divisible by 2, all offsets by any x-order page are detected.

While this doesn't detect 100% of overflows, it's nevertheless a good defense.

5.10 Merge window, part 1

Posted Oct 18, 2020 18:01 UTC (Sun) by jrtc27 (subscriber, #107748) [Link] (1 responses)

> > for use-after-free the memory tag will eventually cycle back round and be usable again with an old pointer
>
> One could reserve 1 of the 16 tags for un-allocated or freed memory, so no cycling required.

So, the important thing about use-after-free is that it's actually not the real problem. If you free memory but continue to use it whilst it's still free, so long as the entire page isn't unmapped by the allocator (and so long as your allocator doesn't go and pattern-fill the memory to try and catch these things), you don't actually have a vulnerability. Where the vulnerabilities come from is use-after-*reallocation*, i.e. when you continue to use memory that has since been allocated to someone else. That allows you to then read and/or write to whatever their data structures are and do things like rewrite C++ vtables or conduct data-oriented attacks. Using a reserved tag for unallocated memory doesn't solve that problem, as when malloc reuses freed memory it needs to assign it a different tag from whatever you've reserved, which it can't do forever without eventually re-tagging that region of memory with the same as the original tag.

> > for buffer overflows you in general have a 1/16 chance of the out-of-bounds memory having a matching tag
>
> If the n tags used for this are assigned in a cycling pattern, than small overflows are detected reliable, and if n isn't divisible by 2 all offsets by any x-order page are detected.
>
> While this doesn't detect overflows to 100% it's nevertheless a good defense.

Yeah, you can do things like that, but once you get significantly out-of-bounds there won't be much of a useful pattern any more, both because the offset is likely fairly arbitrary and because you may be crossing into an allocation pool for a different bucket size, so for anything that's a few pages out of bounds it's highly likely to just degrade into a 1/n probability. It's better than nothing, but it doesn't solve the problem, it just creates additional work for attackers to ensure they get an offset that lands on a correctly-tagged region.

5.10 Merge window, part 1

Posted Oct 18, 2020 21:45 UTC (Sun) by roc (subscriber, #30627) [Link]

I bet attackers will very quickly learn how to groom the heap to ensure they get the right tag values for any fixed tag recycling scheme, at least when their wild-read/wild-write primitive allows an arbitrary offset. So I think allocators will have to use random tags.

5.10 Merge window, part 1

Posted Oct 18, 2020 11:24 UTC (Sun) by thumperward (guest, #34368) [Link] (2 responses)

Is there a year 2486 problem?

5.10 Merge window, part 1

Posted Oct 19, 2020 5:23 UTC (Mon) by matthias (subscriber, #94967) [Link]

Now, there is. XFS has chosen to use a 64 bit nanosecond counter with the minimal value being in December 1901.

https://www.phoronix.com/scan.php?page=news_item&px=X...

5.10 Merge window, part 1

Posted Oct 19, 2020 14:34 UTC (Mon) by willy (subscriber, #9762) [Link]

The art of choosing new limits is finding one which will not force a format change. I'm anticipating a larger format change in the next 50 years, and at that time XFS can go to larger timestamps.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds