5.8 Merge window, part 1
Architecture-specific
- Branch-target identification and shadow call stacks (both described in this article) have been added to the Arm64 architecture. Both are hardening technologies that, with luck, will make Arm64 systems more resistant to attack. The shadow call stack support is likely to spread to other architectures in the near future.
Core kernel
- The new faccessat2() system call adds the flags argument that POSIX has always said should be there. The current support for faccessat() on Linux systems depends on emulation of the flags argument by the C library; faccessat2() will allow a better implementation in the kernel.
- Memory control groups have a new knob, memory.swap.high, which can be used to slow down tasks that are using large amounts of swap space; see this commit for a bit more information.
- The io_uring subsystem now supports the tee() system call.
- It is now possible to pass a pidfd to the setns() system call; in that case, it is possible to specify multiple namespace types. The calling process will be moved to all of the applicable namespaces in an atomic manner.
- The "BPF iterator" mechanism, which facilitates the dumping of kernel data structures to user space, has been merged; this feature was covered in this article in April.
- There is a new ring buffer for communicating data from BPF programs. It is intended to resemble the perf ring buffer while allowing sharing of the buffer across multiple CPUs. See this documentation commit for more information.
- The padata mechanism now supports multi-threaded jobs with load balancing; see this documentation commit for details.
- The kernel's swappiness tuning knob, which sets the balance between reclaiming file-backed and anonymous pages, has traditionally been used to bias the system away from swapping anonymous pages. With fast I/O devices, though, swapping may be faster than filesystem access, so it may be useful to bias the system toward swapping. Now swappiness can take values up to 200 to push things in that direction; see this commit for details.
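To make the extended swappiness range concrete, here is a rough Python sketch of how a swappiness value divides reclaim pressure between anonymous and file-backed pages. It assumes a simple weighting in which the anonymous weight equals the swappiness value and the file weight is 200 minus that value. The function names are hypothetical, and the kernel's reclaim code also takes observed reclaim cost into account, so this illustrates the idea rather than the actual implementation.

```python
MAX_SWAPPINESS = 200  # upper bound as of 5.8, per the item above


def reclaim_weights(swappiness):
    """Return (anon_weight, file_weight) for a swappiness value.

    Assumption: anon weight = swappiness, file weight = 200 - swappiness.
    """
    if not 0 <= swappiness <= MAX_SWAPPINESS:
        raise ValueError("swappiness must be between 0 and 200")
    return swappiness, MAX_SWAPPINESS - swappiness


def anon_share(swappiness):
    """Fraction of reclaim pressure aimed at anonymous pages (0.0 to 1.0)."""
    anon, file_backed = reclaim_weights(swappiness)
    return anon / (anon + file_backed)
```

Under this model a value of 100 treats both kinds of pages equally, values below 100 favor reclaiming file-backed pages, and values above 100 (newly allowed) favor swapping anonymous pages.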
Filesystems and block I/O
- Low-level support for inline encryption has been added to the block layer. Inline encryption is a hardware feature that encrypts (and decrypts) data moving between a block storage device and the CPU using a key provided by the CPU. Some more information can be found in this commit.
- There is a new statx() flag (STATX_ATTR_DAX) that indicates that the file in question is being accessed directly via the DAX mechanism. There is also a documentation patch that attempts to specify just how filesystems will behave when DAX is in use. More DAX-related changes can be expected during this merge window.
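As an illustration of how the new flag might be consumed, the sketch below interprets the `stx_attributes` and `stx_attributes_mask` fields that statx() returns. The helper name `dax_state` is hypothetical, and fetching the fields from the system call is left out; the constant is the value defined in the kernel's `include/uapi/linux/stat.h`. An attribute bit is only meaningful when the filesystem also sets it in the mask, hence the three-way result.

```python
STATX_ATTR_DAX = 0x00200000  # from include/uapi/linux/stat.h


def dax_state(stx_attributes, stx_attributes_mask):
    """Interpret the DAX bit of a statx() result.

    Returns True or False when the filesystem reports the attribute,
    and None when the bit is absent from the mask (state unknown).
    """
    if not stx_attributes_mask & STATX_ATTR_DAX:
        return None
    return bool(stx_attributes & STATX_ATTR_DAX)
```

A caller would pass in the two fields from a statx() call made with whatever binding is available on the system.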
Hardware support
- Graphics: Leadtek LTK050H3146W panels, Northwest Logic MIPI DSI host controllers, Chrontel CH7033 video encoders, Visionox RM69299 panels, and ASUS Z00T TM5P5 NT35596 panels.
- Hardware monitoring: Maxim MAX16601 voltage regulators, AMD RAPL MSR-based energy sensors, Gateworks System Controller analog-to-digital converters, and Baikal-T1 process, voltage, and temperature sensors.
- Interrupt control: Loongson3 HyperTransport interrupt vector controllers, Loongson PCH programmable interrupt controllers, and Loongson PCH MSI controllers.
- Media: Rockchip video decoders and OmniVision OV2740 sensors. The "atomisp" driver has also been resurrected in the staging tree and seen vast amounts of cleanup work.
- Miscellaneous: AMD SPI controllers, Maxim 77826 regulators, Arm CryptoCell true random number generators, Amlogic Meson SDHC host controllers, Freescale eSDHC ColdFire controllers, and Loongson PCI controllers.
- Networking: Broadcom BCM54140 PHYs, Qualcomm IPQ4019 MDIO interfaces, MediaTek STAR Ethernet MACs, Realtek 8723DE PCI wireless network adapters, and MediaTek MT7915E wireless interfaces.
Miscellaneous
- The new initrdmem= boot-time option specifies an initial disk image found in RAM; see this commit for more information.
Networking
- The bridge code now supports the media redundancy protocol, where a ring of Ethernet switches can be used to survive single-unit failures. See this commit for more information.
- The new "gate" action for the traffic-control subsystem allows specific packets to be passed into the system during specified time slots. This action is naturally undocumented, but some information can be found in this commit.
- Some network devices can perform testing of attached network cables; the kernel and ethtool utility now support that functionality when it is available.
- The multiprotocol label switching routing algorithm is now available for IPv6 as well as IPv4.
- RFC 8229, which describes encapsulation of key-exchange and IPSec packets, is now supported.
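The time-slot idea behind the "gate" action can be modeled in a few lines of Python. This is a hypothetical model, not the kernel's code or the tc syntax: it assumes a schedule made of (open-or-closed, duration) entries that repeats cyclically from a base time, and it treats the gate as closed before the base time, which is a simplification.

```python
def gate_is_open(schedule, base_time_ns, now_ns):
    """Tell whether a cyclic gate schedule admits packets at now_ns.

    schedule: list of (is_open, interval_ns) entries, repeated forever
    starting at base_time_ns. All names here are illustrative.
    """
    cycle = sum(interval for _, interval in schedule)
    if cycle <= 0:
        raise ValueError("schedule must cover a positive amount of time")
    if now_ns < base_time_ns:
        return False  # simplification: closed until the schedule starts
    # Position within the current cycle.
    offset = (now_ns - base_time_ns) % cycle
    for is_open, interval in schedule:
        if offset < interval:
            return is_open
        offset -= interval
    raise AssertionError("unreachable: offset is always inside the cycle")
```

For example, a schedule of `[(True, 200), (False, 100)]` passes packets for the first 200ns of every 300ns cycle and drops them for the remainder.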
Security-related
- The CAP_PERFMON capability has been added; a process with this capability can do performance monitoring with the perf events subsystem.
- The new CAP_BPF capability covers some BPF operations that previously required CAP_SYS_ADMIN. In general, most BPF operations will also require either CAP_PERFMON (for tracing and such) or CAP_NET_ADMIN; this commit gives a terse overview of which operations require which capabilities.
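One way to see which of these capabilities a process holds is to decode the `CapEff` mask in `/proc/<pid>/status`. The Python sketch below does that; the helper names are hypothetical, while the capability numbers are those in the kernel's `include/uapi/linux/capability.h`. It also assumes that CAP_SYS_ADMIN continues to be accepted wherever the new capabilities are checked, so that existing privileged programs keep working.

```python
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # new in 5.8
CAP_BPF = 39      # new in 5.8


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """True if capability number `cap` is set in the bitmask."""
    return bool((mask >> cap) & 1)


def may_use_perf(mask):
    # Assumption: CAP_SYS_ADMIN remains a fallback for CAP_PERFMON.
    return has_cap(mask, CAP_PERFMON) or has_cap(mask, CAP_SYS_ADMIN)


def may_use_bpf(mask):
    # Assumption: CAP_SYS_ADMIN remains a fallback for CAP_BPF.
    return has_cap(mask, CAP_BPF) or has_cap(mask, CAP_SYS_ADMIN)
```

On a live system the mask could be obtained with `parse_cap_eff(open("/proc/self/status").read())`. As the item above notes, many BPF operations need CAP_PERFMON or CAP_NET_ADMIN in addition to CAP_BPF, so a real check would combine these tests.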
Internal kernel changes
- The "pstore" mechanism, which stashes away system-state information in case of a panic, has gained a new back-end that stores data to a block device. See this commit for documentation.
- There is a new read-copy-update (RCU) variant called "RCU rude"; it delineates grace periods only at context switches. Those wondering about the name might see the comment in this commit, which reads: "It forces IPIs and context switches on all online CPUs, including idle ones, so use with caution".
- The RCU-tasks subsystem has a new "RCU tasks trace" variant suited to the needs of tracing and BPF programs; see this commit for details.
- "Local locks" have been brought over from the realtime preemption tree. These locks are intended to replace code that disables preemption and/or interrupts on a single processor. Advantages include a better realtime implementation and the ability to properly instrument locking; see this commit for more information.
- The API for managing file readahead has changed significantly; see this patch series for details.
- The kgdb kernel debugger is now able to work with the boot console, enabling debugging much earlier in the boot process; see this commit and this documentation patch for more information.
- There is a new buffer-allocation API intended to make the writing of XDP network drivers easier. Documentation is too much to hope for, but the API can be seen in this commit.
The 5.8 merge window can be expected to remain open until June 14;
after that, the actual 5.8 release should happen in early August. Stay
tuned; LWN will provide an update on the rest of this merge window after it
closes.
| Index entries for this article | |
|---|---|
| Kernel | Releases/5.8 |
Posted Jun 5, 2020 18:03 UTC (Fri)
by knurd (subscriber, #113424)
[Link]
TWIMC: documentation for features like this often gets added to iproute2 (together with the userland support for the feature), that's why the man page for the gate action can be found here: https://git.kernel.org/pub/scm/network/iproute2/iproute2....
Posted Jun 7, 2020 0:37 UTC (Sun)
by darwi (subscriber, #131202)
[Link] (7 responses)
This is amazing for debuggability on x86 laptops, which typically lack standardized ways of saving the kernel log on panic. I hope more and more block drivers implement the necessary hooks for this.
Posted Jun 7, 2020 1:59 UTC (Sun)
by flussence (guest, #85566)
[Link] (5 responses)
Posted Jun 7, 2020 13:37 UTC (Sun)
by josh (subscriber, #17465)
[Link] (4 responses)
Posted Jun 8, 2020 5:46 UTC (Mon)
by flussence (guest, #85566)
[Link] (3 responses)
At the time I also had `installkernel` set up to update the efibootmgr entries during make install, which made everything worse - couldn't change the default boot to a known-good entry, it was stuck on the bad one.
Posted Jun 8, 2020 5:55 UTC (Mon)
by amacater (subscriber, #790)
[Link] (1 responses)
Posted Jun 21, 2020 22:18 UTC (Sun)
by nix (subscriber, #2304)
[Link]
I suspect this is too paranoid of me -- but it still feels nicer to be able to use a blockdev. (Or it would if all my disks weren't fully partitioned already -- I don't imagine it would work on USB mass storage... expecting *that* to work in panic state seems like asking a lot.)
Posted Jun 8, 2020 8:04 UTC (Mon)
by mjg59 (subscriber, #23239)
[Link]
There's not actually a good answer here. EFI firmware updates need to be in SMM because that's the only mechanism Intel provide to allow authenticated writes to flash, and SMM can't run without all cores being in SMM, so if you do a variable update and need to rewrite an entire block of flash you're going to halt the OS for long enough that things will be unhappy. Getting into a situation where you allow the OS to make the machine unbootable obviously isn't a great answer to that (https://lore.kernel.org/patchwork/patch/300747/ is arguably more egregious in this respect), but the singular bug that actually led us to this point is an understandable one.
(The -ENOSPC behaviour is accompanied by the kernel then attempting to create a variable it knows is too big in order to force the firmware to do a garbage collection run on next boot. If you boot via the UEFI boot stub then the kernel will do this while still in UEFI boot services, which should trigger the garbage collection before the kernel starts. As a result, this *should* now be largely invisible to users without putting systems at risk, but obviously we have no way whatsoever of knowing how firmware actually works before we try it)
Posted Jun 7, 2020 16:35 UTC (Sun)
by ebiederm (subscriber, #35028)
[Link]
I am just a little astonished. The last time this was tested (admittedly with more data, because people wanted a kernel core dump), using in-kernel drivers to write to disk on a kernel panic only worked on developers' machines. Under actual real-world failure conditions the kernel was always too compromised to successfully (safely?) write to a block device.
The listed driver restrictions might be enough to make the code reliable in a crash scenario. I would love to see a report on their testing.
Posted Jun 9, 2020 19:59 UTC (Tue)
by hailfinger (subscriber, #76962)
[Link] (2 responses)
- RFC 8229 in IPv4 is already supported since Linux 5.6 (see https://lwn.net/Articles/811198/ ). It's a great way to establish an IPsec tunnel even if the network only allows ports 80/TCP and 443/TCP like some public hotspots do.
- RFC 8229 in IPv6 support (ESP/IKE in TCP in IPv6) is new. It helps establishing IPsec tunnels to IPv6 servers even in restricted networks.
- You now can embed ESP in UDP in IPv6 packets. Once this feature lands in an Android kernel, you can finally use IPsec apps to connect to IPv6-only VPN gateways (direct use of ESP is unavailable to apps). This feature also enables users behind crappy ISP-provided DSL/cable routers to connect to dual-stack or IPv6-only IPsec gateways (quite a lot of those routers drop ESP).
I have been eagerly waiting for these changes for years. Cisco was able to run IPsec over TCP (admittedly nonstandard, but it worked) for over a decade before Linux got the feature, and they have been marketing it ever since as one of the unique selling points of their IPsec solution. Yes, running e.g. TCP over IPsec over TCP has funny side effects, but it beats not being able to connect at all. Some other VPN solutions (WireGuard) require third-party tooling to run over TCP, but with RFC 8229 support IPsec finally surpasses that.
Posted Jun 10, 2020 2:28 UTC (Wed)
by wahern (subscriber, #37304)
[Link] (1 responses)
I want to switch over to IKEv2 completely as basic road warrior VPN configurations are simpler--no L2TP, no duplicative auth, etc. It's supported natively by all the common user platforms, including Windows, iOS, and macOS, except Android.
Posted Jun 11, 2020 10:48 UTC (Thu)
by grawity (subscriber, #80596)
[Link]
Native IKEv2 on Windows in contrast is always troublesome for me, to the point I wish it wasn't there and I wouldn't feel bad about installing a standalone client.
