Leading items
Welcome to the LWN.net Weekly Edition for December 12, 2019
This edition contains the following feature content:
- OpenBSD system-call-origin verification: a new anti-ROP approach from OpenBSD.
- Debian votes on init systems: the voting to determine the Debian project's position on init systems (and systemd in particular) has begun.
- The end of the 5.5 merge window: another 6,000 patches worth of new stuff goes into the kernel.
- Developers split over split-lock detection: should this feature be on by default, even if it might break some applications?
- Working toward securing PyPI downloads: the process of securing the Python Package Index is slow and halting.
- New features for the Kubernetes scheduler: what the core Kubernetes developers are up to.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OpenBSD system-call-origin verification
A new mechanism to help thwart return-oriented programming (ROP) and similar attacks has recently been added to the OpenBSD kernel. It will block system calls that are not made via the C library (libc) system-call wrappers. Instead of being able to string together some "gadgets" that make a system call directly, an attacker would need to be able to call the wrapper, which is normally at a randomized location.
Theo de Raadt introduced the feature in a late-November posting to the OpenBSD tech mailing list. The idea is to restrict where system calls can come from to the parts of a process's address space where they are expected to come from. OpenBSD already disallows system calls (and any other code execution) from writable pages with its W^X implementation. Since OpenBSD is largely in control of its user-space applications, however, it can enforce restrictions that would be difficult to enact in a more loosely coupled system like Linux. Even so, the restrictions that have been implemented at this point are less strict than what De Raadt would like to see.
The eventual goal would be to disallow system calls from anywhere but the region mapped for libc, but the Go language port for OpenBSD currently makes system calls directly. That means the main program text segment needs to be in the list of memory regions where system calls are allowed to come from. Static binaries will also need to have their text segment included, since libc will inhabit part of that address space. In his message, De Raadt described the full list of allowed memory regions:
For dynamic binaries, valid regions are ld.so's text segment, the signal trampoline, and libc.so's text segment... AND the main program's text.
Switching Go to use the libc wrappers (as is already done on Solaris and macOS) is something that De Raadt would like to see; that may allow dropping the main program text segment from the list of valid regions in the future.
The kernel can mark most of the regions valid as it starts up a process, but it will not know what address space holds libc.so for a dynamic binary. Marking that region as valid is left to the ld.so dynamic linker. It has been changed to execute a new system call, msyscall(), to mark the region occupied by libc.so as one that is valid for making system calls. msyscall() can only be called once by a process, so changes or additions to the valid regions are not possible once ld.so has done so. Any process that makes a system call from a non-approved region will be killed.
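To illustrate, here is a minimal sketch of the ld.so side of this scheme. The msyscall() prototype follows the description above, but the helper function, its name, and the error handling are hypothetical simplifications, not OpenBSD's actual linker code:

```c
#include <stddef.h>
#include <unistd.h>

/* The new OpenBSD system call: mark [addr, addr + len) as a region
 * from which system calls may legitimately be made. It may be
 * called only once per process. */
int msyscall(void *addr, size_t len);

/* Hypothetical ld.so-side helper: after mapping libc.so, register
 * its text segment as a valid system-call region. */
static void register_libc_text(void *text_start, size_t text_len)
{
	if (msyscall(text_start, text_len) == -1) {
		/* A second call (or a bogus region) fails; after the
		 * first call the set of valid regions is locked down. */
		_exit(1);
	}
}
```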
Another OpenBSD security measure plays a role here as well: at boot time, libc is relinked in a random order, so that the locations of the system-call wrappers within libc differ on every running system. The new restriction forces attackers to go through those wrappers, and the randomized relinking makes the wrappers difficult to locate reliably in a remote attack.
Things move quickly in the OpenBSD world (as we saw with the kernel address-space randomized link (KARL) feature in 2017), so it was no surprise that the code for this new feature was committed to the OpenBSD tree just two days after it was first posted. It is an ABI break for the operating system, but that is no big deal for OpenBSD. De Raadt said: "we here at OpenBSD are the kings of ABI-instability". He suggested that relying on the OpenBSD ABI is fraught with peril:

I have altered the ABI. Pray I do not alter it further.
This is an incremental security improvement; it is a hardening measure that makes it more difficult for attackers to reliably exploit a weakness that they find. There was no real dissent seen in the thread, so it would seem to be a relatively non-controversial change. But, once Go is changed and the main program text is not allowed to make system calls, that might change if there are other applications that need the raw system-call capability. For Linux, though, that kind of ABI change would never get very far.
Debian votes on init systems
In November, the topic of init systems and, in particular, support for systems other than systemd reappeared on the Debian mailing lists. After one month of sometimes fraught discussion, this issue has been brought to the project's developers to decide in the form of a general resolution (GR) — the first such since the project voted on the status of debian-private discussions in 2016. The issues under discussion are complex, so the result is one of the most complex ballots seen for some time in Debian, with seven options to choose from.

Debian being what it is, the actual voting period for a contentious issue can be rather anticlimactic; the real debate happens during the framing of the ballot to be voted on. This time around was no exception, with extensive discussions on how to best represent various proposed policies, and even a debate over the use of the word "diversity" to describe the issue at hand (it was eventually taken out). Debian practice dictates that ballots work best if each option is written by its proponents, so there are many hands involved in creating the final product.
That process also takes some time. Debian project leader Sam Hartman has seemed somewhat impatient throughout, trying to keep the discussion period to the minimum required by the Debian constitution. By his count, the minimum period ended on November 30; he issued his call for votes three days later. That, however, was too soon for Ian Jackson, who posted a proposal to override Hartman's call for votes so that further work could be done on the framing of the issues on the ballot. While some participants agreed with this idea, the project as a whole appears to be somewhat tired of this discussion and ready to make a decision, as Russ Allbery argued in the thread.
It is also not clear that overriding Hartman in this way is something allowed by the Debian constitution. Overriding the Debian project leader can be done, but Hartman made it clear that he posted the call for votes as a Debian developer, not as the leader. So the delay in the vote seems unlikely to happen; project secretary Kurt Roeckx has duly posted a draft ballot that sets the beginning of the voting for December 7.
Many developers are undoubtedly glad to get on with this issue and put it behind them. One should realize, though, that the last big decision made regarding systemd (by the technical committee) involved a ballot pushed over Jackson's objections. That whole process left scars that took a long time to heal — if, indeed, they have fully healed.
Be that as it may, the ballot to be voted on has seven options:
1. Focus on systemd. This option makes the project's policy read that systemd is "the only officially supported init system", and allows packages to make use of systemd-only features. Some lip service is given to supporting other systems, though; they are still welcome in Debian but would not be allowed to hold up distribution releases.
2. Systemd but we support exploring alternatives. This is a weaker option, calling systemd the preferred alternative, but encouraging developers to work on alternatives as long as they take on all of the effort involved. Use of systemd-specific features would be allowed.
3. Support for multiple init systems is Important. Under this option, every Debian package would be required to work on systems where the init system is something other than systemd. A failure to work on non-systemd installations would be considered an "important" bug, and non-maintainer uploads to fix such bugs would be allowed.
4. Support non-systemd systems, without blocking progress. Packages are expected to work on non-systemd installations, but a failure to work is not considered a release-critical bug — unless the necessary support exists but has not been enabled by the package maintainer. Use of systemd-specific features is only allowed if those features are documented and alternative implementations are feasible.
5. Support for multiple init systems is Required. This short option states that: "Every package MUST work with pid1 != systemd, unless it was designed by upstream to work exclusively with systemd and no support for running without systemd is available".
6. Support portability and multiple implementations. This is the vaguest and most hand-wavy of the proposals, stating that hardware and software portability are important, but giving nothing in the way of specific guidance about what that would mean for project policy. Making this proposal more concrete is one of the things Jackson wanted to do before the ballot went to a vote.
7. Further discussion. At the moment, it would appear that the project has little appetite for any further talk on this issue, but one never knows.
That may seem like a lot of choices, but the final ballot will contain one more. Jackson has been pushing an option called "Support portability, without blocking progress". It is a lengthy and detailed proposal that is essentially a combination of options 4 and 6 above; it's an attempt to fill in some of the details that option 6 currently lacks. It was not clear whether this late-arriving option could be added to the ballot without resetting the discussion clock, but there was no real opposition to doing so. Hartman remains firm that he wants the vote to proceed, but has declared himself "neutral" on whether that vote should have one more choice to consider. So Roeckx has indicated that he will be adding this choice to the ballot.
The available options would thus appear to fall in numerous places along a spectrum of opinions on how central systemd should be to the Debian distribution. In many voting systems, a ballot like this would be certain to split the votes of like-minded people across many options, leading to an outcome that is not supported by the majority. Debian's Condorcet voting system, though, requires developers to rank the options according to their preferences; a complex algorithm then tries to determine which option has the most support overall.
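To see why ranking helps, consider a minimal sketch of the pairwise-counting step at the heart of any Condorcet method. This is an illustration only: the ballots are invented, and Debian's actual procedure adds quorum and supermajority requirements, and uses cloneproof Schwartz sequential dropping to resolve cycles.

```c
#include <stdio.h>

#define NOPT 3
#define NBALLOT 5

int main(void)
{
	/* rank[b][o] = rank given to option o on ballot b
	 * (lower number = more preferred). */
	int rank[NBALLOT][NOPT] = {
		{1, 2, 3}, {1, 3, 2}, {2, 1, 3}, {3, 1, 2}, {2, 3, 1},
	};
	/* beats[a][b]: number of ballots preferring a over b */
	int beats[NOPT][NOPT] = {0};

	for (int b = 0; b < NBALLOT; b++)
		for (int i = 0; i < NOPT; i++)
			for (int j = 0; j < NOPT; j++)
				if (rank[b][i] < rank[b][j])
					beats[i][j]++;

	/* An option that pairwise-beats every other option wins. */
	for (int i = 0; i < NOPT; i++) {
		int wins = 0;
		for (int j = 0; j < NOPT; j++)
			if (j != i && beats[i][j] > beats[j][i])
				wins++;
		if (wins == NOPT - 1)
			printf("option %d beats all others pairwise\n",
			       i + 1);
	}
	return 0;
}
```

In this made-up election, option 1 wins every pairwise contest and so is the Condorcet winner, even though no option holds a majority of first preferences; that is how ranked ballots avoid splitting the votes of like-minded people.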
The hope is that this resolution will succeed in determining which policy is the most acceptable to the largest subset of Debian developers. If it works, the project should not have to argue about init systems for a long time. The voting period runs for three weeks, so if the schedule remains unchanged we'll have to wait until December 27 to get the answer.
The end of the 5.5 merge window
By the end of the merge window, 12,632 non-merge changesets had been pulled into the mainline repository for the 5.5 release. This is thus a busy development cycle — just like the cycles that preceded it. Just over half of those changesets were pulled after the writing of our first 5.5 merge-window summary. As is often the case later in the merge window, many of those changes were relatively boring fixes. There were still a number of interesting changes, though; read on for a summary of what happened in the second half of this merge window.
Architecture-specific
- The RISC-V architecture has gained support for the seccomp() system call (including filtering with BPF); a filter sketch appears after this list.
- RISC-V systems without a memory-management unit are now supported.
- The xtensa architecture can now boot from execute-in-place kernels.
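As a reminder of what seccomp() filtering involves, here is a minimal classic-BPF filter in C. It is a generic sketch rather than anything RISC-V-specific; a production filter must also verify seccomp_data.arch (e.g. AUDIT_ARCH_RISCV64) before trusting the system-call number.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	/* Deny kill() with EPERM, allow everything else. */
	struct sock_filter filter[] = {
		/* Load the system-call number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* If it is __NR_kill, fall through to EPERM;
		 * otherwise skip to the allow rule. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_kill, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Required in order to install a filter without privilege. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return 1;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return 1;
	printf("filter installed\n");
	return 0;
}
```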
Core kernel
- The new IORING_OP_CONNECT command for io_uring allows connect() calls to be performed asynchronously (see the sketch following this list).
- After years of deprecation, the sysctl() system call has been removed.
- Synthetic trace events can be created with the new injection mechanism. The use case appears to be testing of software that reacts to trace events.
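As a sketch of how the new connect command might be used, assuming a liburing version recent enough to provide the io_uring_prep_connect() helper (the target address here is just a placeholder):

```c
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct sockaddr_in addr;
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(80);
	inet_pton(AF_INET, "93.184.216.34", &addr.sin_addr);

	/* Queue the connect; the kernel performs it asynchronously. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_connect(sqe, fd, (struct sockaddr *)&addr,
			      sizeof(addr));
	io_uring_submit(&ring);

	/* Wait for completion; res holds 0 or a negative errno. */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("connect result: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```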
Filesystems and block I/O
- The XFS "iomap" code has been moved into the virtual filesystem layer, making this infrastructure available to other filesystems. The ext4 filesystem has been modified to use this code. The end result is simpler, more consistent, and hopefully less buggy direct I/O in a number of filesystems.
- The CIFS filesystem now supports the flock() system call. CIFS has also gained multichannel support, which should improve performance.
- The hugetlbfs filesystem now supports creating files with the O_TMPFILE option; an example follows this list.
- The NFS client has gained support for cross-device offloaded copy operations — copying a file directly from one remote server to another.
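A minimal sketch of the hugetlbfs O_TMPFILE usage mentioned above, assuming the usual /dev/hugepages mount point and 2MB huge pages:

```c
#define _GNU_SOURCE   /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Create an unnamed file backed by huge pages; it disappears
	 * when the descriptor is closed. */
	int fd = open("/dev/hugepages", O_TMPFILE | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Size the file to one 2MB huge page; it can then be
	 * mapped like any other hugetlbfs file. */
	if (ftruncate(fd, 2 * 1024 * 1024))
		perror("ftruncate");
	close(fd);
	return 0;
}
```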
Hardware support
- Clock: Qualcomm QCS404 Q6SSTOP clock controllers, Qualcomm SC7180 global clock controllers, Qualcomm MSM8998 graphics clock controllers, Ingenic X1000 clock generators, and Bitmain BM1880 clock controllers.
- DMA: NXP DPAA2 QDMA controllers, Milbeaut AHB and AXI DMA controllers, and SiFive PDMA controllers.
- Miscellaneous: Crane EL15203000 LED controllers, RDA Micro GPIO controllers, Broadcom XGS iProc GPIO controllers, Mellanox BlueField firmware boot control units, Amlogic G12 thermal sensors, Samsung EXYNOS5422 dynamic memory controllers, Qualcomm on-chip memory controllers, Marvell MMP3 USB PHYs, and NVIDIA Tegra30 external memory controllers.
Security-related
- The seccomp() user-space notification mechanism has gained a new return code, SECCOMP_USER_NOTIF_FLAG_CONTINUE, which instructs the kernel to allow the system call in question to continue executing.
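A hedged sketch of the supervisor side: the notification ioctl() interface itself has been in place since 5.0, and only the new flag is 5.5-specific. Here notif_fd is assumed to be the listener descriptor obtained when the filter was installed with SECCOMP_FILTER_FLAG_NEW_LISTENER.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Receive one notification and tell the kernel to let the
 * intercepted system call proceed as if never intercepted. */
int continue_syscall(int notif_fd)
{
	struct seccomp_notif req = { 0 };
	struct seccomp_notif_resp resp = { 0 };

	if (ioctl(notif_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -errno;

	resp.id = req.id;	/* must match the request */
	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
	/* With this flag set, val and error must remain zero. */

	if (ioctl(notif_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
		return -errno;
	return 0;
}
```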
Internal kernel changes
- It is now possible for kernel subsystems to set up their own tracing instances without worrying about interfering with any tracing done from user space.
- The DMA-BUF heaps subsystem — meant to serve as a replacement for the Android-specific ION allocator — was merged but then reverted. It seems that it lacks a demonstrated open-source user-space user as required by the DRM subsystem. This feature will probably have to wait for 5.6.
- Most of the ioctl() compatibility code has been pushed out into the drivers that need it; this code should disappear entirely in the relatively near future.
- Code-testing coverage with kcov can now monitor execution by background kernel threads. Some special annotation is required; see this commit for details.
One item that (as expected) did not make it into 5.5 is the WireGuard virtual private network system. That long-awaited feature is coming soon, though: it has already been merged into the networking tree for the 5.6 release. Meanwhile, the 5.5 kernel is now in the stabilization period where, with luck, all of the new bugs will be fixed. The final 5.5 release can be expected around the beginning of February 2020.
Developers split over split-lock detection
A "split lock" is a low-level memory-bus lock taken by the processor for a memory range that crosses a cache line. Most processors disallow split locks, but x86 implements them. Split locking may be convenient for developers, but it comes at a cost: a single split-locked instruction can occupy the memory bus for around 1,000 clock cycles. It is thus understandable that interest in eliminating split-lock operations is high. What is perhaps less understandable is that a patch set intended to detect split locks has been pending since (at least) May 2018, and it still is not poised to enter the mainline.Split locks are, in essence, a type of misaligned memory access — something that the x86 architecture has always been relatively willing to forgive. But there is a difference: while a normal unaligned operation will slow down the process performing that operation, split locks will slow down the entire system. The 1,000-cycle delay may be particularly painful on realtime systems, but split locks can be used as an effective denial-of-service attack on any system. There is little disagreement among developers that their use should not be allowed on most systems.
Recent Intel processors can be configured to raise a trap when a split lock is attempted, allowing the operating system to decide whether to allow the operation to continue. Fenghua Yu first posted a patch set enabling this trap in May 2018; LWN covered this work one year later. While many things have changed in this patch set, the basic idea has remained constant: when this feature is enabled, user-space processes that attempt a split lock will receive a SIGBUS signal. What happens when split locks are created in other contexts has varied over time; in the version-10 patch set posted in late November, split locks in the kernel or system firmware will cause a kernel panic.
This severe response should result in problems being fixed quickly; Yu cites a couple of kernel fixes for split locks detected by this work in the patch posting. That will only happen, though, if the feature is enabled on systems where the code tries to create a split lock, but the patches leave detection disabled by default. Disabling the feature by default guarantees that it will not be widely used; that has led to some complaints. Ingo Molnar asserted that "for this feature to be useful it must be default-enabled", and Peter Zijlstra agreed.
Zijlstra is particularly concerned about split locks created by code at the firmware level, something he sees as being likely: "from long and painful experience we all know that if a BIOS can be wrong, it will be". Enabling split-lock detection by default will hopefully cause firmware-created split locks to be fixed. Otherwise, he fears, those split locks will lurk in the firmware indefinitely, emerging occasionally to burn users who enable split-lock detection in the hope of finding problems in their own code.
Forcing problems to be fixed by enabling split-lock detection by default has some obvious appeal. But the reasoning behind leaving it disabled also makes some sense: killing processes that create split locks is an ABI change that may create problems for users. As Tony Luck put it, the likely result of enabling it by default would be:

#include <linus/all-caps-rant-about-backwards-compatability.h>

and the patches being reverted.
Zijlstra is not worried, though; he feels that the kernel issues have mostly been fixed and that problems in user-space code will be rare because other architectures have never allowed split locks. For those who are worried, though, he posted a follow-up patch allowing split-lock detection to be controlled at boot time and adding a "warn-only" mode that doesn't actually kill any processes.
In that patch set, he noted that "it requires we get the kernel and firmware clean", and said that fixing up the kernel should be "pretty simple". But it turns out to be perhaps not quite that simple after all. In particular, there is the problem pointed out by David Laight: the kernel's atomic bitmask functions can easily create split-lock operations. The core problem here is that the type of a bitmask is defined as unsigned long, but it's natural for developers to use a simple int instead. In such cases, the creation of misaligned accesses is easy, and those accesses may occasionally span a cache-line boundary and lead to split locks.
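A simplified user-space model of the problem may help. This is not the kernel's actual implementation; it just shows how a set_bit()-style operation on an int-sized field touches a full unsigned long and can cross a cache line (the structure layout is illustrative):

```c
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified model of the kernel's atomic set_bit(): it operates
 * on the whole unsigned long word containing the requested bit. */
static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
{
	volatile unsigned long *word = addr + nr / BITS_PER_LONG;

	__atomic_fetch_or((unsigned long *)word,
			  1UL << (nr % BITS_PER_LONG), __ATOMIC_SEQ_CST);
}

struct example {
	char other[60];
	unsigned int flags;	/* an int used as a bitmask */
};

/* Given a 64-byte-aligned struct example *e, a call like:
 *
 *	set_bit(0, (unsigned long *)&e->flags);
 *
 * performs an eight-byte atomic access at offset 60. That access
 * both overruns the four-byte field and straddles the cache-line
 * boundary at offset 64 — a split lock. Declaring the bitmask as
 * unsigned long avoids both problems. */
```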
Opinions on how to solve this problem globally vary. Yu posted a complex series of cases meant to make atomic bit operations work for almost all usage patterns, but that strikes some as too much complexity; Zijlstra said simply that this solution is "never going to happen". An alternative, suggested by Andy Lutomirski, is to change the atomic bit operations to work on 32-bit values. That would, obviously, limit the number of bits that could be manipulated to 32. Zijlstra noted that some architectures (alpha, ia64) already implement atomic bit operations that way, so it may well be that 32 bits is all that the kernel needs.
There is one other "wrinkle" getting in the way of merging split-lock detection, according to Sean Christopherson: the processor bit that controls split-lock detection affects the entire core on which it is set, not just the specific CPU that sets it. So toggling split-lock detection will affect all hyperthreaded siblings together. This particular wrinkle was only discovered in September, after the split-lock patch set had already been through nine revisions, leaving the development community less than fully impressed. But now that this behavior is known, it must be dealt with in the kernel.
If split-lock detection is enabled globally, there is no problem. But if there is a desire to enable and disable it at specific times, things may well not work as expected. Things get particularly difficult when virtualization is involved; guest systems may disagree with each other — or with the host — about whether split-lock detection should be enabled. Potential solutions to this problem include disallowing control of split-lock detection in guests (the current solution) or only allowing it when hyperthreading is disabled. Nobody has yet suggested using core scheduling to ensure that all processes running on a given core are in agreement about split-lock detection, but it's only a matter of time.
All of this adds up to a useful feature that is apparently not yet ready for prime time. The question now is whether it should be merged soon and improved in place, or whether it needs more work out of tree. Perhaps both things could happen, since 5.6 is the earliest time it could be pulled into the mainline in any case. Split-lock detection exists to eliminate unwanted delays, but it still seems subject to some delays itself.
Working toward securing PyPI downloads
An effort to protect package downloads from the Python Package Index (PyPI) has resulted in a Python Enhancement Proposal (PEP) and, perhaps belatedly, some discussion in the wider community. The basic idea is to use The Update Framework (TUF) to protect PyPI users from some malicious actors who are aiming to interfere with the installation and update of Python modules. But the name of the PEP and its wording, coupled with some recent typosquatting problems on PyPI, caused some confusion along the way. There are some competing interests and different cultures coming together over this PEP; the process has not run as smoothly as anyone might want, though that seems to be resolving itself at this point.
PEP 458 ("Surviving a Compromise of PyPI") has been around since 2013 or so; LWN looked at the PEP and the related PEP 480 ("Surviving a Compromise of PyPI: The Maximum Security Model") in 2015. PEP 458 proposes a mechanism to provide cryptographic-signature protection for PyPI packages using TUF. No changes would be required for package authors or those downloading the packages, but client-side programs like pip would be able to ensure that they are getting the latest version of the code — and that the code itself is what is stored on PyPI.
A message was posted to the Python discussion forum in mid-November about the PEP. One of the PEP's authors, Trishank Kuppusamy, wondered about its status, but the message mostly seemed to be aimed at Donald Stufft, who is the "BDFL-Delegate" for the PEP. That means that the steering council has deferred the decision on the PEP to Stufft; a week or so later, he seemingly replied to the request, though via some other channel.
Neither message was a general call for discussion of the PEP, however, which might be expected, so it would seem that many may have skipped over the message entirely. In early December, Guido van Rossum noted in the thread that he had merged a pull request (PR) for a bunch of updates to the PEP, but cautioned that discussion of the PEP should be done on the forum rather than in a PR. That is the normal course of action on a PEP; clearly Van Rossum was a bit surprised that it wasn't being followed in this case.
In a reply to Van Rossum, Sumana Harihareswara said that the PEP was ready for community review; "[...] the plan is for contractors to start work on PyPI this month (on implementing the foundations for cryptographic signing (and malware detection, which is not relevant to this PEP))". Van Rossum and Paul Moore were both rather puzzled about the process being followed, some of the terminology, and, to a certain extent, the intent of the PEP itself. Moore pointed out that the PEP is confusing even for some who are familiar with Python packaging.
Culture clash
A reply from Kuppusamy did not really address the concerns, however. It summarized the issues that had been raised, but indicated an intent to address them via PR, rather than through a discussion as is typical in the PEP process. That was not what Moore and Van Rossum were after.
Part of the difficulty is that the PEP is being written by TUF researchers, so it reads to some extent like the use of TUF is a foregone conclusion. As Moore put it: "The abstract does actually give a reasonable overview of what's being proposed, as long as I take 'implement TUF' as a goal in itself [...]" Van Rossum suggested that there may be a culture clash present. Steve Dower agreed with the culture-clash assessment, but noted that it was not really the packaging community behind it, as Van Rossum proposed; "The clash is with the people behind TUF, not the packaging community in general."
Harihareswara said: "There have been some miscommunications here but no one has meant to bypass the community." She went on to outline the history of the PEP, along with a number of links to discussions and the like in various places, including in person with members of the packaging community and in multiple threads on the distutils-sig mailing list. She also noted that there was a 2018 gift from Facebook of $100,000 "to be used for PyPI security work, specifically on cryptographic signing and malware detection".
That money is part of what revived PEP 458, which had been languishing for a few years. The Python Software Foundation (PSF), which administers the gift money, put out a Request for Information (RFI) in late August with a rather ambitious schedule, given that getting the PEP accepted was part of it. The intent was to start work on an implementation on December 2. So, in part, the culture clash may also have been between the needs of the procurement process and those of the wider Python community.
PEP 458 had been marked as "deferred" in March, but was revived in some in-person discussions at PyCon shortly thereafter. Because the gift was partly targeting cryptographic signing, PEP 458 seemed like it might fit the bill.
But, as Christian Heimes said, the PEP is not really living up to its title. Furthermore, he doesn't see the compromise of PyPI as the most important problem the index is facing: there have been no reports of PyPI corruption along the way, but there are an ever-increasing number of other problems that cause bad code to end up on users' systems.
Moving forward
In a lengthy post covering replies to multiple messages along the way, Stufft pointed out that the proposal is coming from volunteers, so what they want to focus on is not really in the purview of the Python community per se:
We do have some funds that we plan on using to implement this PEP if it is accepted, and perhaps @sumanah or @EWDurbin [Ernest W. Durbin III] could better answer this question, those funds were given to us with the understanding we'd use them to implement, among other things, "cryptographic signing" (though it was left open ended what exactly that entailed) so even in that case we have limitations on how we're able to direct work to be done since part of it needs to be implementing a cryptographic signature scheme for PyPI (part of it is also implementing malicious package detection, but that doesn't have a PEP because it's just a new feature of the PyPI code base and doesn't have ramifications for projects beyond PyPI really).
He said that there were three paths forward that he could see: PEP 458 could be discussed and refined until it is ready to be accepted, someone could write a competing PEP that would offer a choice, or the whole thing could just be dropped, which would leave the status of the gift somewhat up in the air. No one argued in favor of dropping the idea, nor has anyone stepped up with an alternative. The conversation generally turned toward making the PEP better with an eye toward getting it accepted. To that end, Moore offered some concrete suggestions on improvements, including revising the title; he also noted that the situation is something new for the project.
Another of the PEP co-authors, Marina Moore, posted a list of items that needed to be addressed for the PEP, which received a number of "like" votes. As noted in the thread, though, "like" has no clear semantics; Paul Moore said that his "like" meant that he was in favor of addressing those items and was awaiting further proposed text, which did not seem to be forthcoming.
Part of the problem is that the PEP authors are not up to speed on how the Python community works, as Harihareswara pointed out. She suggested that they be provided with concrete guidance on how to proceed with a discussion of this sort. Normally, a core developer will shepherd a PEP, which would have helped here. Stufft, as BDFL-Delegate, could also have guided the process, but he has been distracted with some real-life issues of late. Given that, Harihareswara thought it would make sense to look for a sponsor; Paul Moore would seem an obvious choice, but he is also busy with other things right now.
In the meantime, the process does seem to be getting back on track. There have been some answers and proposed wording posts from Moore, along with comments on those, revisions, and so on. All of that is ongoing as of this writing. One gets the sense that it is moving in a good direction, though perhaps not at the speed some would hope for. While it gets sorted out, Durbin said that the PSF will use a subset of the funding to add automated malware detection capabilities to the upload portion of PyPI.
No one really seemed opposed to using TUF to try to ensure the integrity of packages that users get from PyPI (and local mirrors of PyPI), but the process to get there has been a bit haphazard, perhaps. As enumerated in the PEP, TUF does handle an impressive array of attacks against a repository like PyPI. Thwarting those attacks certainly seems worth doing even if there are other more common and prominent threats to the integrity of the PyPI ecosystem that also need work.
New features for the Kubernetes scheduler
The Kubernetes scheduler is being overhauled with a series of improvements that will introduce a new framework and enhanced capabilities that could help cluster administrators to optimize performance and utilization. Abdullah Gharaibeh, co-chair of the Kubernetes scheduling special interest group (SIG Scheduling), detailed what has been happening with the scheduler in recent releases and what's on the roadmap in a session at KubeCon + CloudNativeCon North America 2019.
The scheduler component of Kubernetes is a controller that assigns pods to nodes. A Kubernetes pod is a group of containers that is scheduled together, while a node is a worker machine (real or virtual) within the cluster that has all the services needed to run a pod.
Gharaibeh explained that, as pods are created, they are added to a scheduling queue that is sorted by priority and then processed in two phases. In the first, the pod is run through a filter that determines which nodes are feasible for that pod to run on; for example, the filter checks to make sure a node has enough resources (CPU and memory) to run the pod. The pod then goes through the scoring phase, where the feasible nodes are ranked according to additional criteria, such as node affinity, and the best-scoring node is chosen. If for some reason a pod ends up with no feasible nodes, it will be placed back into the queue so it can be scheduled when that becomes possible.
Scheduling framework
There are several enhancements that are being worked on for the scheduler, with the most important from Gharaibeh's perspective being the new scheduling framework. The goal of the framework is to turn the scheduler into an engine that executes callbacks from different extension points. "We wanted to define a number of extension points that we've identified from our past experiences with the scheduler, and with those extension points you can register callback functions," Gharaibeh explained. "A collection of callback functions that define a specific behavior, we're going to call it a plugin."
![Abdullah Gharaibeh](https://static.lwn.net/images/2019/kcna-gharaibeh-sm.jpg)
The Kubernetes Enhancement Proposal (KEP) for the scheduling framework outlines all of the different extension points for plugins. One of the key extension points is called "queue sort"; it allows an administrator to register a single plugin that controls how the scheduling queue is sorted. That is different from the current default behavior, where pods are sorted by priority.
Other extension points include "pre-filter", which checks to make sure that certain pod conditions are met, and "filter", which is used to rule out nodes that can't run a given pod. The "reserve" extension point provides an opportunity to help better schedule pods that are running applications or services that need to maintain state, according to the KEP: "Plugins which maintain runtime state (aka 'stateful plugins') should use this extension point to be notified by the scheduler when resources on a node are being reserved for a given Pod."
The framework also includes what are known as permit plugins, which allow other scheduling plugins to delay binding a pod to a node until a specific condition has been satisfied. Gharaibeh noted that the permit plugins can be useful to enable gang scheduling of a group of pods that need to all be deployed at the same time. "One interesting use case we had was trying to make the scheduler friendlier for batch workloads," he said.
Overall, Gharaibeh emphasized that the scheduling framework will make it easier for Kubernetes developers to extend and add new features into the scheduler. The scheduling framework is currently targeted for the Kubernetes 1.19 milestone, which will be out in mid-2020.
Observability improvements
Another area that developers have been working on is observability for scheduler performance and traffic. For example, a new metric now tracks the amount of time it takes to schedule a pod, from the time it was picked up from the queue until it is bound to a node. The ability to track the number of incoming pods per second has also been added so it's possible to see how fast the scheduler is able to drain the scheduling queue. Additionally, the scheduler now provides visibility on the number of scheduling attempts made per second. "We just want more insights into the scheduler," he said.
Gharaibeh also noted that, in Kubernetes 1.17, improvements have been made to better account for pod overhead in order to more accurately determine resource usage. He explained that when pods are started on a node, there is typically some overhead; the kubelet that runs on each node has some state associated with running pods that consumes cluster resources. Sandbox pods, which provide isolation for workloads using technologies such as gVisor or Kata Containers, also needed better resource tracking. Both of those approaches have their own agents running beside a pod that end up consuming resources that were not being accounted for.
The way Kubernetes has dealt with pod overhead in the past is to reserve a predefined amount of resources on the node for system components, though it's an approach that Gharaibeh said doesn't fully account for all of the actual pod overhead. The new "pod overhead" feature that is coming to Kubernetes can be used to define the amount of resources that should be allocated per pod. In Kubernetes 1.17, the scheduler will be aware of pod overhead and when pods are scheduled the required amount of overhead will be added to the resources that are requested by the pod, he said.
Looking forward to the Kubernetes 1.18 release, there is development in progress to enable in-place updates of pod resource requirements. Gharaibeh explained that changing the resource allocation for a pod currently requires that the pod be recreated, since PodSpec, which defines the container resources that are required for the pod, is immutable. With the changes set to come in Kubernetes 1.18, PodSpec will become mutable with regard to resources.
It's hard to overstate the critical role that the scheduler plays in Kubernetes, and the changes coming to its capabilities will have a real impact for users. The new framework approach offers the promise of improved extensibility and customizability that could lead to better overall resource utilization. Combined with the improved visibility into the scheduler, there is a lot for Kubernetes administrators to look forward to in the coming releases.
Page editor: Jonathan Corbet