Leading items
Welcome to the LWN.net Weekly Edition for December 12, 2019
This edition contains the following feature content:
- OpenBSD system-call-origin verification: a new anti-ROP approach from OpenBSD.
- Debian votes on init systems: the voting to determine the Debian project's position on init systems (and systemd in particular) has begun.
- The end of the 5.5 merge window: another 6,000 patches worth of new stuff goes into the kernel.
- Developers split over split-lock detection: should this feature be on by default, even if it might break some applications?
- Working toward securing PyPI downloads: the process of securing the Python Package Index is slow and halting.
- New features for the Kubernetes scheduler: what the core Kubernetes developers are up to.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OpenBSD system-call-origin verification
A new mechanism to help thwart return-oriented programming (ROP) and similar attacks has recently been added to the OpenBSD kernel. It will block system calls that are not made via the C library (libc) system-call wrappers. Instead of being able to string together some "gadgets" that make a system call directly, an attacker would need to be able to call the wrapper, which is normally at a randomized location.
Theo de Raadt introduced the feature in a late-November posting to the OpenBSD tech mailing list. The idea is to restrict where system calls can come from to the parts of a process's address space where they are expected to come from. OpenBSD already disallows system calls (and any other code execution) from writable pages with its W^X implementation. Since OpenBSD is largely in control of its user-space applications, however, it can enforce restrictions that would be difficult to enact in a more loosely coupled system like Linux. Even so, the restrictions that have been implemented at this point are less strict than what De Raadt would like to see.
The eventual goal would be to disallow system calls from anywhere but the region mapped for libc, but the Go language port for OpenBSD currently makes system calls directly. That means the main program text segment needs to be in the list of memory regions where system calls are allowed to come from. Static binaries will also need to have their text segment included, since libc will inhabit part of that address space. In his message, De Raadt described the full list of allowed memory regions:
For dynamic binaries, valid regions are ld.so's text segment, the signal trampoline, and libc.so's text segment... AND the main program's text.
Switching Go to use the libc wrappers (as is already done on Solaris and macOS) is something that De Raadt would like to see; that may allow dropping the main program text segment from the list of valid regions in the future.
The kernel can mark most of the regions valid as it starts up a process, but it will not know what address space holds libc.so for a dynamic binary. Marking that region as valid is left to the ld.so dynamic linker. It has been changed to execute a new system call, msyscall(), to mark the region occupied by libc.so as one that is valid for making system calls. msyscall() can only be called once by a process, so changes or additions to the valid regions are not possible once ld.so has done so. Any process that makes a system call from a non-approved region will be killed.
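To illustrate, here is a minimal sketch of the ld.so side of this scheme. The msyscall() prototype follows the description above, but the helper function, its name, and the error handling are hypothetical simplifications, not OpenBSD's actual linker code:

```c
#include <stddef.h>
#include <unistd.h>

/* The new OpenBSD system call: mark [addr, addr + len) as a region
 * from which system calls may legitimately be made. It may be
 * called only once per process. */
int msyscall(void *addr, size_t len);

/* Hypothetical ld.so-side helper: after mapping libc.so, register
 * its text segment as a valid system-call region. */
static void register_libc_text(void *text_start, size_t text_len)
{
	if (msyscall(text_start, text_len) == -1) {
		/* A second call (or a bogus region) fails; after the
		 * first call the set of valid regions is locked down. */
		_exit(1);
	}
}
```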
Another OpenBSD security measure plays a role here as well: at boot time, libc is relinked in a random order, so that the locations of the system-call wrappers within libc differ on every running system. The new restriction forces attackers to go through those wrappers, and the randomized relinking makes the wrappers difficult to locate reliably in a remote attack.
Things move quickly in the OpenBSD world (as we saw with the kernel address-space randomized link (KARL) feature in 2017), so it was no surprise that the code for this new feature was committed to the OpenBSD tree just two days after it was first posted. It is an ABI break for the operating system, but that is no big deal for OpenBSD. De Raadt said: "we here at OpenBSD are the kings of ABI-instability". He suggested that relying on the OpenBSD ABI is fraught with peril:

I have altered the ABI. Pray I do not alter it further.
This is an incremental security improvement; it is a hardening measure that makes it more difficult for attackers to reliably exploit a weakness that they find. There was no real dissent seen in the thread, so it would seem to be a relatively non-controversial change. But, once Go is changed and the main program text is not allowed to make system calls, that might change if there are other applications that need the raw system-call capability. For Linux, though, that kind of ABI change would never get very far.
Debian votes on init systems
In November, the topic of init systems and, in particular, support for systems other than systemd reappeared on the Debian mailing lists. After one month of sometimes fraught discussion, this issue has been brought to the project's developers to decide in the form of a general resolution (GR) — the first such since the project voted on the status of debian-private discussions in 2016. The issues under discussion are complex, so the result is one of the most complex ballots seen for some time in Debian, with seven options to choose from.

Debian being what it is, the actual voting period for a contentious issue can be rather anticlimactic; the real debate happens during the framing of the ballot to be voted on. This time around was no exception, with extensive discussions on how to best represent various proposed policies, and even a debate over the use of the word "diversity" to describe the issue at hand (it was eventually taken out). Debian practice dictates that ballots work best if each option is written by its proponents, so there are many hands involved in creating the final product.
That process also takes some time. Debian project leader Sam Hartman has seemed somewhat impatient throughout, trying to keep the discussion period to the minimum required by the Debian constitution. By his count, the minimum period ended on November 30; he issued his call for votes three days later. That, however, was too soon for Ian Jackson, who posted a proposal to override Hartman's call for votes so that further work could be done on the framing of the issues on the ballot. While some participants agreed with this idea, the project as a whole appears to be somewhat tired of this discussion and ready to make a decision, as Russ Allbery argued in the thread.
It is also not clear that overriding Hartman in this way is something allowed by the Debian constitution. Overriding the Debian project leader can be done, but Hartman made it clear that he posted the call for votes as a Debian developer, not as the leader. So the delay in the vote seems unlikely to happen; project secretary Kurt Roeckx has duly posted a draft ballot that sets the beginning of the voting for December 7.
Many developers are undoubtedly glad to get on with this issue and put it behind them. One should realize, though, that the last big decision made regarding systemd (by the technical committee) involved a ballot pushed over Jackson's objections. That whole process left scars that took a long time to heal — if, indeed, they have fully healed.
Be that as it may, the ballot to be voted on has seven options:
1. Focus on systemd. This option makes the project's policy read that systemd is "the only officially supported init system", and allows packages to make use of systemd-only features. Some lip service is given to supporting other systems, though; they are still welcome in Debian but would not be allowed to hold up distribution releases.
2. Systemd but we support exploring alternatives. This is a weaker option, calling systemd the preferred alternative, but encouraging developers to work on alternatives as long as they take on all of the effort involved. Use of systemd-specific features would be allowed.
3. Support for multiple init systems is Important. Under this option, every Debian package would be required to work on systems where the init system is something other than systemd. A failure to work on non-systemd installations would be considered an "important" bug, and non-maintainer uploads to fix such bugs would be allowed.
4. Support non-systemd systems, without blocking progress. Packages are expected to work on non-systemd installations, but a failure to work is not considered a release-critical bug — unless the necessary support exists but has not been enabled by the package maintainer. Use of systemd-specific features is only allowed if those features are documented and alternative implementations are feasible.
5. Support for multiple init systems is Required. This short option states that: "Every package MUST work with pid1 != systemd, unless it was designed by upstream to work exclusively with systemd and no support for running without systemd is available".
6. Support portability and multiple implementations. This is the vaguest and most hand-wavy of the proposals, stating that hardware and software portability are important, but giving nothing in the way of specific guidance about what that would mean for project policy. Making this proposal more concrete is one of the things Jackson wanted to do before the ballot went to a vote.
7. Further discussion. At the moment, it would appear that the project has little appetite for any further talk on this issue, but one never knows.
That may seem like a lot of choices, but the final ballot will contain one more. Jackson has been pushing an option called "Support portability, without blocking progress". It is a lengthy and detailed proposal that is essentially a combination of options 4 and 6 above; it's an attempt to fill in some of the details that option 6 currently lacks. It was not clear whether this late-arriving option could be added to the ballot without resetting the discussion clock, but there was no real opposition to doing so. Hartman remains firm that he wants the vote to proceed, but has declared himself "neutral" on whether that vote should have one more choice to consider. So Roeckx has indicated that he will be adding this choice to the ballot.
The available options would thus appear to fall in numerous places along a spectrum of opinions on how central systemd should be to the Debian distribution. In many voting systems, a ballot like this would be certain to split the votes of like-minded people across many options, leading to an outcome that is not supported by the majority. Debian's Condorcet voting system, though, requires developers to rank the options according to their preferences; a complex algorithm then tries to determine which option has the most support overall.
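To see why ranking helps, consider a minimal sketch of the pairwise-counting step at the heart of any Condorcet method. This is an illustration only: the ballots are invented, and Debian's actual procedure adds quorum and supermajority requirements, and uses cloneproof Schwartz sequential dropping to resolve cycles.

```c
#include <stdio.h>

#define NOPT 3
#define NBALLOT 5

int main(void)
{
	/* rank[b][o] = rank given to option o on ballot b
	 * (lower number = more preferred). */
	int rank[NBALLOT][NOPT] = {
		{1, 2, 3}, {1, 3, 2}, {2, 1, 3}, {3, 1, 2}, {2, 3, 1},
	};
	/* beats[a][b]: number of ballots preferring a over b */
	int beats[NOPT][NOPT] = {0};

	for (int b = 0; b < NBALLOT; b++)
		for (int i = 0; i < NOPT; i++)
			for (int j = 0; j < NOPT; j++)
				if (rank[b][i] < rank[b][j])
					beats[i][j]++;

	/* An option that pairwise-beats every other option wins. */
	for (int i = 0; i < NOPT; i++) {
		int wins = 0;
		for (int j = 0; j < NOPT; j++)
			if (j != i && beats[i][j] > beats[j][i])
				wins++;
		if (wins == NOPT - 1)
			printf("option %d beats all others pairwise\n",
			       i + 1);
	}
	return 0;
}
```

In this made-up election, option 1 wins every pairwise contest and so is the Condorcet winner, even though no option holds a majority of first preferences; that is how ranked ballots avoid splitting the votes of like-minded people.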
The hope is that this resolution will succeed in determining which policy is the most acceptable to the largest subset of Debian developers. If it works, the project should not have to argue about init systems for a long time. The voting period runs for three weeks, so if the schedule remains unchanged we'll have to wait until December 27 to get the answer.
The end of the 5.5 merge window
By the end of the merge window, 12,632 non-merge changesets had been pulled into the mainline repository for the 5.5 release. This is thus a busy development cycle — just like the cycles that preceded it. Just over half of those changesets were pulled after the writing of our first 5.5 merge-window summary. As is often the case later in the merge window, many of those changes were relatively boring fixes. There were still a number of interesting changes, though; read on for a summary of what happened in the second half of this merge window.
Architecture-specific
- The RISC-V architecture has gained support for the seccomp() system call (including filtering with BPF); a filter sketch appears after this list.
- RISC-V systems without a memory-management unit are now supported.
- The xtensa architecture can now boot from execute-in-place kernels.
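As a reminder of what seccomp() filtering involves, here is a minimal classic-BPF filter in C. It is a generic sketch rather than anything RISC-V-specific; a production filter must also verify seccomp_data.arch (e.g. AUDIT_ARCH_RISCV64) before trusting the system-call number.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
	/* Deny kill() with EPERM, allow everything else. */
	struct sock_filter filter[] = {
		/* Load the system-call number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* If it is __NR_kill, fall through to EPERM;
		 * otherwise skip to the allow rule. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_kill, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Required in order to install a filter without privilege. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return 1;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return 1;
	printf("filter installed\n");
	return 0;
}
```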
Core kernel
- The new IORING_OP_CONNECT command for io_uring allows connect() calls to be performed asynchronously (see the sketch following this list).
- After years of deprecation, the sysctl() system call has been removed.
- Synthetic trace events can be created with the new injection mechanism. The use case appears to be testing of software that reacts to trace events.
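As a sketch of how the new connect command might be used, assuming a liburing version recent enough to provide the io_uring_prep_connect() helper (the target address here is just a placeholder):

```c
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct sockaddr_in addr;
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(80);
	inet_pton(AF_INET, "93.184.216.34", &addr.sin_addr);

	/* Queue the connect; the kernel performs it asynchronously. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_connect(sqe, fd, (struct sockaddr *)&addr,
			      sizeof(addr));
	io_uring_submit(&ring);

	/* Wait for completion; res holds 0 or a negative errno. */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("connect result: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```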
Filesystems and block I/O
- The XFS "iomap" code has been moved into the virtual filesystem layer, making this infrastructure available to other filesystems. The ext4 filesystem has been modified to use this code. The end result is simpler, more consistent, and hopefully less buggy direct I/O in a number of filesystems.
- The CIFS filesystem now supports the flock() system call. CIFS has also gained multichannel support, which should improve performance.
- The hugetlbfs filesystem now supports creating files with the O_TMPFILE option; an example follows this list.
- The NFS client has gained support for cross-device offloaded copy operations — copying a file directly from one remote server to another.
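A minimal sketch of the hugetlbfs O_TMPFILE usage mentioned above, assuming the usual /dev/hugepages mount point and 2MB huge pages:

```c
#define _GNU_SOURCE   /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Create an unnamed file backed by huge pages; it disappears
	 * when the descriptor is closed. */
	int fd = open("/dev/hugepages", O_TMPFILE | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Size the file to one 2MB huge page; it can then be
	 * mapped like any other hugetlbfs file. */
	if (ftruncate(fd, 2 * 1024 * 1024))
		perror("ftruncate");
	close(fd);
	return 0;
}
```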
Hardware support
- Clock: Qualcomm QCS404 Q6SSTOP clock controllers, Qualcomm SC7180 global clock controllers, Qualcomm MSM8998 graphics clock controllers, Ingenic X1000 clock generators, and Bitmain BM1880 clock controllers.
- DMA: NXP DPAA2 QDMA controllers, Milbeaut AHB and AXI DMA controllers, and SiFive PDMA controllers.
- Miscellaneous: Crane EL15203000 LED controllers, RDA Micro GPIO controllers, Broadcom XGS iProc GPIO controllers, Mellanox BlueField firmware boot control units, Amlogic G12 thermal sensors, Samsung EXYNOS5422 dynamic memory controllers, Qualcomm on-chip memory controllers, Marvell MMP3 USB PHYs, and NVIDIA Tegra30 external memory controllers.
Security-related
- The seccomp() user-space notification mechanism has gained a new return code, SECCOMP_USER_NOTIF_FLAG_CONTINUE, which instructs the kernel to allow the system call in question to continue executing.
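A hedged sketch of the supervisor side: the notification ioctl() interface itself has been in place since 5.0, and only the new flag is 5.5-specific. Here notif_fd is assumed to be the listener descriptor obtained when the filter was installed with SECCOMP_FILTER_FLAG_NEW_LISTENER.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Receive one notification and tell the kernel to let the
 * intercepted system call proceed as if never intercepted. */
int continue_syscall(int notif_fd)
{
	struct seccomp_notif req = { 0 };
	struct seccomp_notif_resp resp = { 0 };

	if (ioctl(notif_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -errno;

	resp.id = req.id;	/* must match the request */
	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
	/* With this flag set, val and error must remain zero. */

	if (ioctl(notif_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
		return -errno;
	return 0;
}
```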
Internal kernel changes
- It is now possible for kernel subsystems to set up their own tracing instances without worrying about interfering with any tracing done from user space.
- The DMA-BUF heaps subsystem — meant to serve as a replacement for the Android-specific ION allocator — was merged but then reverted. It seems that it lacks a demonstrated open-source user-space user as required by the DRM subsystem. This feature will probably have to wait for 5.6.
- Most of the ioctl() compatibility code has been pushed out into the drivers that need it; this code should disappear entirely in the relatively near future.
- Code-testing coverage with kcov can now monitor execution by background kernel threads. Some special annotation is required; see this commit for details.
One item that (as expected) did not make it into 5.5 is the WireGuard virtual private network system. That long-awaited feature is coming soon, though: it has already been merged into the networking tree for the 5.6 release. Meanwhile, the 5.5 kernel is now in the stabilization period where, with luck, all of the new bugs will be fixed. The final 5.5 release can be expected around the beginning of February 2020.
Developers split over split-lock detection
A "split lock" is a low-level memory-bus lock taken by the processor for a memory range that crosses a cache line. Most processors disallow split locks, but x86 implements them. Split locking may be convenient for developers, but it comes at a cost: a single split-locked instruction can occupy the memory bus for around 1,000 clock cycles. It is thus understandable that interest in eliminating split-lock operations is high. What is perhaps less understandable is that a patch set intended to detect split locks has been pending since (at least) May 2018, and it still is not poised to enter the mainline.Split locks are, in essence, a type of misaligned memory access — something that the x86 architecture has always been relatively willing to forgive. But there is a difference: while a normal unaligned operation will slow down the process performing that operation, split locks will slow down the entire system. The 1,000-cycle delay may be particularly painful on realtime systems, but split locks can be used as an effective denial-of-service attack on any system. There is little disagreement among developers that their use should not be allowed on most systems.
Recent Intel processors can be configured to raise a trap when a split lock is attempted, allowing the operating system to decide whether to allow the operation to continue. Fenghua Yu first posted a patch set enabling this trap in May 2018; LWN covered this work one year later. While many things have changed in this patch set, the basic idea has remained constant: when this feature is enabled, user-space processes that attempt a split lock will receive a SIGBUS signal. What happens when split locks are created in other contexts has varied over time; in the version-10 patch set posted in late November, split locks in the kernel or system firmware will cause a kernel panic.
This severe response should result in problems being fixed quickly; Yu cites a couple of kernel fixes for split locks detected by this work in the patch posting. That will only happen, though, if the feature is enabled on systems where the code tries to create a split lock, but the patches leave detection disabled by default. Disabling the feature by default guarantees that it will not be widely used; that has led to some complaints. Ingo Molnar asserted that "for this feature to be useful it must be default-enabled", and Peter Zijlstra agreed.
Zijlstra is particularly concerned about split locks created by code at the firmware level, something he sees as being likely: "from long and painful experience we all know that if a BIOS can be wrong, it will be". Enabling split-lock detection by default will hopefully cause firmware-created split locks to be fixed. Otherwise, he fears, those split locks will lurk in the firmware indefinitely, emerging occasionally to burn users who enable split-lock detection in the hope of finding problems in their own code.
Forcing problems to be fixed by enabling split-lock detection by default has some obvious appeal. But the reasoning behind leaving it disabled also makes some sense: killing processes that create split locks is an ABI change that may create problems for users. As Tony Luck put it, the likely result of enabling it by default would be:

#include <linus/all-caps-rant-about-backwards-compatability.h>

and the patches being reverted.
Zijlstra is not worried, though; he feels that the kernel issues have mostly been fixed and that problems in user-space code will be rare because other architectures have never allowed split locks. For those who are worried, though, he posted a follow-up patch allowing split-lock detection to be controlled at boot time and adding a "warn-only" mode that doesn't actually kill any processes.
In that patch set, he noted that "it requires we get the kernel and firmware clean", and said that fixing up the kernel should be "pretty simple". But it turns out to be perhaps not quite that simple after all. In particular, there is the problem pointed out by David Laight: the kernel's atomic bitmask functions can easily create split-lock operations. The core problem here is that the type of a bitmask is defined as unsigned long, but it's natural for developers to use a simple int instead. In such cases, the creation of misaligned accesses is easy, and those accesses may occasionally span a cache-line boundary and lead to split locks.
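A simplified user-space model of the problem may help. This is not the kernel's actual implementation; it just shows how a set_bit()-style operation on an int-sized field touches a full unsigned long and can cross a cache line (the structure layout is illustrative):

```c
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified model of the kernel's atomic set_bit(): it operates
 * on the whole unsigned long word containing the requested bit. */
static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
{
	volatile unsigned long *word = addr + nr / BITS_PER_LONG;

	__atomic_fetch_or((unsigned long *)word,
			  1UL << (nr % BITS_PER_LONG), __ATOMIC_SEQ_CST);
}

struct example {
	char other[60];
	unsigned int flags;	/* an int used as a bitmask */
};

/* Given a 64-byte-aligned struct example *e, a call like:
 *
 *	set_bit(0, (unsigned long *)&e->flags);
 *
 * performs an eight-byte atomic access at offset 60. That access
 * both overruns the four-byte field and straddles the cache-line
 * boundary at offset 64 — a split lock. Declaring the bitmask as
 * unsigned long avoids both problems. */
```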
Opinions on how to solve this problem globally vary. Yu posted a complex series of cases meant to make atomic bit operations work for almost all usage patterns, but that strikes some as too much complexity; Zijlstra said simply that this solution is "never going to happen". An alternative, suggested by Andy Lutomirski, is to change the atomic bit operations to work on 32-bit values. That would, obviously, limit the number of bits that could be manipulated to 32. Zijlstra noted that some architectures (alpha, ia64) already implement atomic bit operations that way, so it may well be that 32 bits is all that the kernel needs.
There is one other "wrinkle" getting in the way of merging split-lock detection, according to Sean Christopherson: the processor bit that controls split-lock detection affects the entire core on which it is set, not just the specific CPU that sets it. So toggling split-lock detection will affect all hyperthreaded siblings together. This particular wrinkle was only discovered in September, after the split-lock patch set had already been through nine revisions, leaving the development community less than fully impressed. But now that this behavior is known, it must be dealt with in the kernel.
If split-lock detection is enabled globally, there is no problem. But if there is a desire to enable and disable it at specific times, things may well not work as expected. Things get particularly difficult when virtualization is involved; guest systems may disagree with each other — or with the host — about whether split-lock detection should be enabled. Potential solutions to this problem include disallowing control of split-lock detection in guests (the current solution) or only allowing it when hyperthreading is disabled. Nobody has yet suggested using core scheduling to ensure that all processes running on a given core are in agreement about split-lock detection, but it's only a matter of time.
All of this adds up to a useful feature that is apparently not yet ready for prime time. The question now is whether it should be merged soon and improved in place, or whether it needs more work out of tree. Perhaps both things could happen, since 5.6 is the earliest time it could be pulled into the mainline in any case. Split-lock detection exists to eliminate unwanted delays, but it still seems subject to some delays itself.
Working toward securing PyPI downloads
An effort to protect package downloads from the Python Package Index (PyPI) has resulted in a Python Enhancement Proposal (PEP) and, perhaps belatedly, some discussion in the wider community. The basic idea is to use The Update Framework (TUF) to protect PyPI users from some malicious actors who are aiming to interfere with the installation and update of Python modules. But the name of the PEP and its wording, coupled with some recent typosquatting problems on PyPI, caused some confusion along the way. There are some competing interests and different cultures coming together over this PEP; the process has not run as smoothly as anyone might want, though that seems to be resolving itself at this point.
PEP 458 ("Surviving a Compromise of PyPI") has been around since 2013 or so; LWN looked at the PEP and the related PEP 480 ("Surviving a Compromise of PyPI: The Maximum Security Model") in 2015. PEP 458 proposes a mechanism to provide cryptographic-signature protection for PyPI packages using TUF. No changes would be required for package authors or those downloading the packages, but client-side programs like pip would be able to ensure that they are getting the latest version of the code — and that the code itself is what is stored on PyPI.
A message was posted to the Python discussion forum in mid-November about the PEP. One of the PEP's authors, Trishank Kuppusamy, wondered about its status, but the message mostly seemed to be aimed at Donald Stufft, who is the "BDFL-Delegate" for the PEP. That means that the steering council has deferred the decision on the PEP to Stufft; a week or so later, he seemingly replied to the request, though via some other channel.
Neither message was a general call for discussion of the PEP, however, which might be expected, so it would seem that many may have skipped over the message entirely. In early December, Guido van Rossum noted in the thread that he had merged a pull request (PR) for a bunch of updates to the PEP, but cautioned that discussion of the PEP should be done on the forum rather than in a PR. That is the normal course of action on a PEP; clearly Van Rossum was a bit surprised that it wasn't being followed in this case.
In a reply to Van Rossum, Sumana Harihareswara said that the PEP was ready for community review; "[...] the plan is for contractors to start work on PyPI this month (on implementing the foundations for cryptographic signing (and malware detection, which is not relevant to this PEP))". Van Rossum and Paul Moore were both rather puzzled about the process being followed, some of the terminology, and, to a certain extent, the intent of the PEP itself. Moore pointed out that the PEP is confusing even for some who are familiar with Python packaging.
Culture clash
A reply from Kuppusamy did not really address the concerns, however. It summarized the issues that had been raised, but indicated an intent to address them via PR, rather than through a discussion as is typical in the PEP process. That was not what Moore and Van Rossum were after.
Part of the difficulty is that the PEP is being written by TUF researchers, so it reads to some extent like the use of TUF is a foregone conclusion. As Moore put it: "The abstract does actually give a reasonable overview of what's being proposed, as long as I take 'implement TUF' as a goal in itself [...]" Van Rossum suggested that there may be a culture clash present. Steve Dower agreed with the culture-clash assessment, but noted that it was not really the packaging community behind it, as Van Rossum proposed; "The clash is with the people behind TUF, not the packaging community in general."
Harihareswara said: "There have been some miscommunications here but no one has meant to bypass the community." She went on to outline the history of the PEP, along with a number of links to discussions and the like in various places, including in person with members of the packaging community and in multiple threads on the distutils-sig mailing list. She also noted that there was a 2018 gift from Facebook of $100,000 "to be used for PyPI security work, specifically on cryptographic signing and malware detection".
That money is part of what revived PEP 458, which had been languishing for a few years. The Python Software Foundation (PSF), which administers the gift money, put out a Request for Information (RFI) in late August with a rather ambitious schedule, given that getting the PEP accepted was part of it. The intent was to start work on an implementation on December 2. So, in part, the culture clash may also have been between the needs of the procurement process and those of the wider Python community.
PEP 458 had been marked as "deferred" in March, but was revived in some in-person discussions at PyCon shortly thereafter. Because the gift was partly targeting cryptographic signing, PEP 458 seemed like it might fit the bill.
But, as Christian Heimes said, the PEP is not really living up to its title. Furthermore, he doesn't see the compromise of PyPI as the most important problem the index is facing: there have been no reports of PyPI corruption along the way, but there are an ever-increasing number of other problems that cause bad code to end up on users' systems.
Moving forward
In a lengthy post covering replies to multiple messages along the way, Stufft pointed out that the proposal is coming from volunteers, so what they want to focus on is not really in the purview of the Python community per se:
We do have some funds that we plan on using to implement this PEP if it is accepted, and perhaps @sumanah or @EWDurbin [Ernest W. Durbin III] could better answer this question, those funds were given to us with the understanding we'd use them to implement, among other things, "cryptographic signing" (though it was left open ended what exactly that entailed) so even in that case we have limitations on how we're able to direct work to be done since part of it needs to be implementing a cryptographic signature scheme for PyPI (part of it is also implementing malicious package detection, but that doesn't have a PEP because it's just a new feature of the PyPI code base and doesn't have ramifications for projects beyond PyPI really).
He said that there were three paths forward that he could see: PEP 458 could be discussed and refined until it is ready to be accepted, someone could write a competing PEP that would offer a choice, or the whole thing could just be dropped, which would leave the status of the gift somewhat up in the air. No one argued in favor of dropping the idea, nor has anyone stepped up with an alternative. The conversation generally turned toward making the PEP better with an eye toward getting it accepted. To that end, Moore offered some concrete suggestions on improvements, including revising the title; he also noted that the situation is something new for the project.
Another of the PEP co-authors, Marina Moore, posted a list of items that needed to be addressed for the PEP, which received a number of "like" votes. As noted in the thread, though, "like" has no clear semantics; Paul Moore said that his "like" meant that he was in favor of addressing those items and was awaiting further proposed text, which did not seem to be forthcoming.
Part of the problem is that the PEP authors are not up to speed on how the Python community works, as Harihareswara pointed out. She suggested that they be provided with concrete guidance on how to proceed with a discussion of this sort. Normally, a core developer will shepherd a PEP, which would have helped here. Stufft, as BDFL-Delegate, could also have guided the process, but he has been distracted with some real-life issues of late. Given that, Harihareswara thought it would make sense to look for a sponsor; Paul Moore would seem an obvious choice, but he is also busy with other things right now.
In the meantime, the process does seem to be getting back on track. There have been some answers and proposed wording posts from Moore, along with comments on those, revisions, and so on. All of that is ongoing as of this writing. One gets the sense that it is moving in a good direction, though perhaps not at the speed some would hope for. While it gets sorted out, Durbin said that the PSF will use a subset of the funding to add automated malware detection capabilities to the upload portion of PyPI.
No one really seemed opposed to using TUF to try to ensure the integrity of packages that users get from PyPI (and local mirrors of PyPI), but the process to get there has been a bit haphazard, perhaps. As enumerated in the PEP, TUF does handle an impressive array of attacks against a repository like PyPI. Thwarting those attacks certainly seems worth doing even if there are other more common and prominent threats to the integrity of the PyPI ecosystem that also need work.
New features for the Kubernetes scheduler
The Kubernetes scheduler is being overhauled with a series of improvements that will introduce a new framework and enhanced capabilities that could help cluster administrators to optimize performance and utilization. Abdullah Gharaibeh, co-chair of the Kubernetes scheduling special interest group (SIG Scheduling), detailed what has been happening with the scheduler in recent releases and what's on the roadmap in a session at KubeCon + CloudNativeCon North America 2019.
The scheduler component of Kubernetes is a controller that assigns pods to nodes. A Kubernetes pod is a group of containers that is scheduled together, while a node is a worker machine (real or virtual) within the cluster that has all the services needed to run a pod.
Gharaibeh explained that, as pods are created, they are added to a scheduling queue that is sorted by priority and then processed in two phases. In the first, the pod is run through a filter that determines which nodes are feasible for that pod to run on; for example, the filter checks to make sure a node has enough resources (CPU and memory) to run the pod. The pod then goes through the scoring phase, where the feasible nodes are ranked according to additional criteria, such as node affinity, and the best-scoring node is chosen. If for some reason a pod ends up with no feasible nodes, it will be placed back into the queue so it can be scheduled when that becomes possible.
Scheduling framework
There are several enhancements that are being worked on for the scheduler, with the most important from Gharaibeh's perspective being the new scheduling framework. The goal of the framework is to turn the scheduler into an engine that executes callbacks from different extension points. "We wanted to define a number of extension points that we've identified from our past experiences with the scheduler, and with those extension points you can register callback functions," Gharaibeh explained. "A collection of callback functions that define a specific behavior, we're going to call it a plugin."
![Abdullah Gharaibeh](https://static.lwn.net/images/2019/kcna-gharaibeh-sm.jpg)
The Kubernetes Enhancement Proposal (KEP) for the scheduling framework outlines all of the different extension points for plugins. One of the key extension points is called "queue sort"; it allows an administrator to register a single plugin that controls how the scheduling queue is sorted. That is different from the current default behavior, where pods are sorted by priority.
Other extension points include "pre-filter", which checks to make sure that certain pod conditions are met, and "filter", which is used to rule out nodes that can't run a given pod. The "reserve" extension point provides an opportunity to help better schedule pods that are running applications or services that need to maintain state, according to the KEP: "Plugins which maintain runtime state (aka 'stateful plugins') should use this extension point to be notified by the scheduler when resources on a node are being reserved for a given Pod."
The framework also includes what are known as permit plugins, which allow other scheduling plugins to delay binding a pod to a node until a specific condition has been satisfied. Gharaibeh noted that the permit plugins can be useful to enable gang scheduling of a group of pods that need to all be deployed at the same time. "One interesting use case we had was trying to make the scheduler friendlier for batch workloads," he said.
Overall, Gharaibeh emphasized that the scheduling framework will make it easier for Kubernetes developers to extend and add new features into the scheduler. The scheduling framework is currently targeted for the Kubernetes 1.19 milestone, which will be out in mid-2020.
Observability improvements
Another area that developers have been working on is observability for scheduler performance and traffic. For example, a new metric now tracks the amount of time it takes to schedule a pod, from the time it was picked up from the queue until it is bound to a node. The ability to track the number of incoming pods per second has also been added so it's possible to see how fast the scheduler is able to drain the scheduling queue. Additionally, the scheduler now provides visibility on the number of scheduling attempts made per second. "We just want more insights into the scheduler," he said.
Gharaibeh also noted that, in Kubernetes 1.17, improvements have been made to better account for pod overhead in order to more accurately determine resource usage. He explained that when pods are started on a node, there is typically some overhead; the kubelet that runs on each node has some state associated with running pods that consumes cluster resources. Sandbox pods, which provide isolation for workloads using technologies such as gVisor or Kata Containers, also needed better resource tracking. Both of those approaches have their own agents running beside a pod that end up consuming resources that were not being accounted for.
The way Kubernetes has dealt with pod overhead in the past is to reserve a predefined amount of resources on the node for system components, though it's an approach that Gharaibeh said doesn't fully account for all of the actual pod overhead. The new "pod overhead" feature that is coming to Kubernetes can be used to define the amount of resources that should be allocated per pod. In Kubernetes 1.17, the scheduler will be aware of pod overhead and when pods are scheduled the required amount of overhead will be added to the resources that are requested by the pod, he said.
Looking forward to the Kubernetes 1.18 release, there is development in progress to enable in-place updates of pod resource requirements. Gharaibeh explained that changing the resource allocation for a pod currently requires that the pod be recreated, since PodSpec, which defines the container resources that are required for the pod, is immutable. With the changes set to come in Kubernetes 1.18, PodSpec will become mutable with regard to resources.
It's hard to overstate the critical role that the scheduler plays in Kubernetes, and the changes coming to its capabilities will have a real impact for users. The new framework approach offers the promise of improved extensibility and customizability that could lead to better overall resource utilization. Combined with the improved visibility into the scheduler, there is a lot for Kubernetes administrators to look forward to in the coming releases.
Page editor: Jonathan Corbet