Leading items
Welcome to the LWN.net Weekly Edition for December 5, 2019
This edition contains the following feature content:
- A static-analysis framework for GCC: bringing a much-needed diagnostic feature to the GCC compiler.
- 5.5 Merge window, part 1: what the first 6,300 changesets brought into the mainline for 5.5.
- Virtio without the "virt": the virtio specification isn't just for software anymore.
- Fixing SCHED_IDLE: a longstanding but little-used scheduler feature is finally being worked into shape.
- Fedora's modularity mess: the Fedora modularity initiative has run into a number of obstacles without clear solutions.
- Creating Kubernetes distributions: can a Kubernetes distribution become more like a Linux distribution?
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A static-analysis framework for GCC
One of the features of the Clang/LLVM compiler that has been rather lacking for GCC may finally be getting filled in. In a mid-November post to the gcc-patches mailing list, David Malcolm described a new static-analysis framework for GCC that he wrote. It could be the starting point for a whole range of code analysis for the compiler.
According to the lengthy cover letter for the patch series, the analysis runs as an interprocedural analysis (IPA) pass on the GIMPLE static single assignment (SSA) intermediate representation. State machines are used to represent the code parsed and the analysis looks for places where bad state transitions occur. Those state transitions represent constructs where warnings can be emitted to alert the user to potential problems in the code.
There are two separate checkers that are included with the patch set: malloc() pointer tracking and checking for problems in using the FILE * API from stdio. There are also some other proof-of-concept state machines included: one to track sensitive data, such as passwords, that might be leaked into log files and another to follow potentially tainted input data that is being used for array indexes and the like.
The malloc() state machine, found in sm-malloc.cc (which is added by this patch), looks for typical problems that can occur with pointers returned from malloc(): double free, null dereference, passing a non-heap pointer to free(), and so on. Similarly, one of the patches adds sm-file.c for the FILE * checking; it looks for double calls to fclose() and for the failure to close a file.
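As a rough illustration (not taken from the patch set), code like the following contains the sorts of bugs those two state machines are meant to flag; the file and function names are invented for the example:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative buggy code: the kinds of problems the malloc() and
     * FILE * state machines are designed to catch.  Compiling with
     * something like "gcc -fanalyzer -c example.c" should produce
     * diagnostics describing the offending code paths. */
    int copy_line(const char *path)
    {
            FILE *f = fopen(path, "r");
            char *buf = malloc(128);

            if (f == NULL)
                    return -1;          /* buf leaks on this path */

            if (fgets(buf, 128, f) == NULL) {   /* buf may be NULL here */
                    free(buf);
                    fclose(f);
                    return -1;
            }

            printf("%s", buf);
            free(buf);
            free(buf);                  /* double free */
            return 0;                   /* f is never closed on this path */
    }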
The handling of diagnostic output required additional features to support the new types of warnings. For example, in order to provide more easily interpreted warnings, the code path leading to a detected problem needs to be determined and stored. That information, including the warning triggered and the locations in the code (both line number and position in the line) that trigger the warning, will be recorded so that it can be displayed by the compiler. There are examples in the cover letter as well as links to some colorized output such as this one (also seen below).
Beyond that, Malcolm also extended the diagnostic facility to allow more metadata to be added to the warnings. In particular, he linked them to entries in the Common Weakness Enumeration (CWE) list, but other kinds of metadata could also be associated with the diagnostic message. In a terminal that is capable of it, the CWE number is a clickable link to the entry's web page (e.g. CWE-690).
The analyzer pass is invoked with the -fanalyzer command-line option; there are options to turn the individual warnings on or off as well. It is implemented, currently, as a GCC "in-tree" plugin—one that would be distributed with GCC itself. But Richard Biener suggested that it might be better to simply build the analyzer into GCC—with a configuration option to disable it. He also mentioned rewriting the GCC plugin API, perhaps with an eye toward plugins that could work with both GCC and LLVM, but that is clearly a much longer-term project.
Malcolm said that he chose to use the plugin API in part as a way to indicate the immaturity of the code:
I want some way to label the code as a "technology preview", that people may want to experiment with, but to set expectations that this is a lot of new code and there will be bugs - but to make it available to make it easier for adventurous users to try it out.
[...] I went down the "in-tree plugin" path by seeing the analogy with frontends, but yes, it would probably be simpler to just build it into GCC, guarded with a configure-time variable. It's many thousand lines of non-trivial C++ code, and associated selftests and DejaGnu tests.
In its current state, the analyzer adds roughly 2.5% to the GCC code base, but that did not deter Jakub Jelinek and Biener from preferring that it simply be built into GCC. Malcolm seems favorably disposed as well, so that switch will be coming. In the meantime, he has posted an update to the patch set to fix some link-time optimization (LTO) compatibility issues.
The "Rationale" section of the cover letter describes the motivation and goals behind the analyzer project:
Overall, the reaction to the idea has been quite positive; there has been some code review going on in the thread as well. As Eric Gallager noted, there have been lots of user requests over the years for warnings of the sort that the analyzer could produce. At this point, it looks like there is a way forward to address that missing feature in GCC. With luck, a few years down the road -fanalyzer will be widely used in the free-software world, which can only help produce better code for our projects.
5.5 Merge window, part 1
The 5.5 merge window got underway immediately after the release of the 5.4 kernel on November 24. The first week has been quite busy despite the US Thanksgiving holiday landing in the middle of it. Read on for a summary of what the first 6,300 changesets brought for the next major kernel release.
Architecture-specific
- The arm64 architecture now supports full ftrace functionality with access to function arguments.
- MIPS now supports code-coverage analysis with kcov.
- The iopl() system call is now emulated on the x86 architecture; as a result, iopl() users are no longer able to disable or enable interrupts.
Core kernel
- A number of enhancements have been made to the io_uring subsystem, including the ability to modify the set of files being operated on without starting over, user-specifiable completion-ring sizes, absolute timeouts, and support for accept() calls.
- The new CLONE_CLEAR_SIGHAND flag to the clone3() system call clears all signal handlers in the newly created process; see the sketch after this list for one way a program might use it.
- Suitably privileged callers of clone3() can now choose which process ID will be assigned to the new process in each namespace that contains it. See this commit for a description of this feature and this one for an example of its use.
- Live-patch state tracking makes it easier to combine multiple live patches on a running system; see this documentation patch for some details.
- BPF programs invoked from tracepoints are now subject to type checking of their pointer arguments, eliminating a whole class of potential errors.
- The new "BPF trampoline" mechanism allows for much quicker calls between the kernel and BPF programs; see this commit for more information.
- The CPU scheduler's load-balancing algorithm has been replaced wholesale. The pull request said: "We hope there are no performance regressions left - but statistically it's highly probable that there *is* going to be some workload that is hurting from these changes. If so then we'd prefer to have a look at that workload and fix its scheduling, instead of reverting the changes".
- The new "hmem" driver allows the kernel to make use of special-purpose memory designated by the system firmware. This memory is intended for specific applications, such as those needing especially high memory bandwidth. The driver can export this memory as a device, or the memory can be added to the system memory pool.
Filesystems and block I/O
- The Btrfs filesystem has gained support for the xxhash64, blake2b, and sha256 checksum algorithms. The Btrfs RAID1 implementation can now replicate data over three or four devices (it was previously limited to two).
- The statx() system call can now indicate whether a given file is protected with fs-verity.
Hardware support
- Industrial I/O: Analog Devices ADUX1020 photometric sensors, Analog Devices AD7292 analog-to-digital converters, Intel Merrifield Basin Cove analog-to-digital converters, Texas Instruments enhanced quadrature encoder pulse counters, NXP FXOS8700 accelerometer/magnetometers, Analog Devices multi-sensor thermometers, and Vishay VEML6030 ambient light sensors.
- Media: Sony IMX290 sensors, Allwinner deinterlace units, and Hynix Hi-556 sensors.
- Miscellaneous: NVMe hardware-monitoring features, Cadence NAND controllers, ST-Ericsson AB8500 general-purpose analog-to-digital converters, Analog Devices LTC2947 power and energy monitors, Texas Instruments TMP513 system monitors, Socionext Milbeaut SDHCI controllers, Actions Semi Owl SD/MMC host controllers, Rockchip OTP controllers, Rockchip Innosilicon MIPI/LVDS/TTL PHYs, Qualcomm MSM8974 interconnect controllers, and Syncoam SEPS525 LCD controllers.
- Networking: NXP pn532 UARTs, Texas Instruments DP83869 Gigabit PHYs, Texas Instruments CPSW switches, Microchip VSC9959 network switches, and Silicon Labs WF200 wireless interfaces.
- Pin control: Qualcomm 8976 pin controllers, Renesas r8a77961 and r8a774b1 pin controllers, Intel Tiger Lake pin controllers, Intel Lightning Mountain SoC pin controllers, and Meson a1 SoC pin controllers.
- Security-related: H1 Secure cr50-based trusted platform modules, Nuvoton NCPM random-number generators, HiSilicon HPRE crypto accelerators, HiSilicon V2 true random-number generators, HiSilicon SEC2 crypto block cipher accelerators, Amlogic cryptographic offloaders, and Allwinner Crypto Engine cryptographic offloaders.
- Sound: Texas Instruments TAS2770 and TAS2562 amplifiers and Analog Devices ADAU7118 PDM-to-I2S/TDM converters.
- USB: TI HD3SS3220 Type-C DRP port controllers, NVIDIA Tegra Superspeed USB 3.0 device controllers, and Allwinner H6 SoC USB3 PHYs.
Miscellaneous
- The KUnit unit-testing framework has been added; see this documentation patch for more information.
Networking
- There is a new mechanism for adding alternative names to network interfaces, which can now have multiple names; alternative names can be longer than the previous limit as well. See this commit message for details and usage information.
- The transparent inter-process communication (TIPC) subsystem can now support encryption and authentication of all messages. The feature is severely undocumented; some information can be found in this commit.
- The VSOCK address family has gained support for multiple simultaneous transports; see this email for a little more information.
- Airtime queue limits, described in this article, have been added to the mac80211 layer. The result should be better queue control for WiFi, leading to better performance.
Security-related
- The crypto layer has gained support for the blake2b digest algorithm.
- Many of the Zinc crypto interfaces needed for the WireGuard virtual private network have been merged. That should clear the path for merging WireGuard itself in the relatively near future.
- There is a new set of security-module hooks controlling access to the perf_event_open() system call; see this commit for some details.
Virtualization and containers
- KVM now has stolen-time support on Arm processors and can handle nested five-level page tables on x86.
Internal kernel changes
- There is a new, simplified workqueue mechanism that was added for the io_uring subsystem.
- The new %pe directive to printk() can be used to print symbolic error names; a brief sketch follows this list.
- The performance of the generic refcount_t code has been improved to the point that there is no real need for architecture-specific versions. Those versions have been removed, and the generic code has been unconditionally enabled for all architectures.
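As a hypothetical kernel-module fragment showing the %pe directive mentioned above (the symbolic name is printed when the kernel is built with symbolic error names enabled; otherwise the numeric value appears):

    #include <linux/err.h>
    #include <linux/printk.h>
    #include <linux/slab.h>

    /* With %pe, passing an ERR_PTR()-encoded value prints the error name
     * (e.g. "-ENOMEM") rather than a hashed pointer value. */
    static void *grab_buffer(size_t size)
    {
            void *buf = kzalloc(size, GFP_KERNEL);

            if (!buf) {
                    pr_err("buffer allocation failed: %pe\n", ERR_PTR(-ENOMEM));
                    return NULL;
            }
            return buf;
    }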
The 5.5 merge window will close on December 8, assuming that the usual schedule holds. That implies that the final 5.5 mainline release will happen on January 26 or February 2. Before the merge window closes, though, there will be several thousand more changesets merged; keep an eye on LWN for a summary of those changes once the merge window ends.
Virtio without the "virt"
When virtio was merged in Linux v2.6.24, its author, Rusty Russell, described the goal as being for "common drivers to be efficiently used across most virtual I/O mechanisms". Today, much progress has been made toward that goal, with virtio supported by multiple hypervisors and guest drivers shipped by many operating systems. But these applications of virtio are implemented in software, whereas Michael Tsirkin's "VirtIO without the Virt" talk at KVM Forum 2019 laid out how to implement virtio in hardware.
Motivation
One might ask why it makes sense to implement virtio devices in hardware. After all, they were originally designed for hypervisors and have been optimized for software rather than hardware implementation. Now that virtio support is widespread, the network effects allow hardware implementations to reuse the guest drivers and infrastructure. The virtio 1.1 specification defines ten device types, among them a network interface, SCSI host bus adapter, and console. Implementing a standards-compliant device interface lets hardware implementers focus on delivering the best device instead of designing a new device interface and writing guest drivers from scratch. Moreover, existing guests will work with the device out of the box, and applications utilizing user-space drivers, such as the DPDK packet processing toolkit, do not need to be relinked with new drivers — this is especially helpful when static linking is utilized.
Implementing virtio in hardware also makes it easy to switch between hardware and software implementations. A software device can be substituted without changing guest drivers if the hardware device is acting up. Similarly, if the driver is acting up, it is possible to substitute a software device to make debugging the driver easier. It is possible to assign hardware devices to performance-critical guests while assigning software devices to the other guests; this decision can be changed in the future to balance resource needs. Finally, implementing virtio in hardware makes it possible to live-migrate virtual machines more easily. The destination host can have either software or hardware virtio devices.
Implementing virtio PCI devices
Virtio has a number of useful properties for hardware implementers. Many device types have optional features so that device implementers can choose a subset that meets their use case. These optional features are negotiated during device initialization for forward and backward compatibility. This means hardware devices will continue working with guest drivers even after new versions of the virtio specification become widespread. Old guest drivers will work with newer devices too.
![Michael Tsirkin](https://static.lwn.net/images/2019/kvmf-tsirkin-sm.jpg)
Historically, virtio was performance-optimized for software implementations. They used guest physical addresses instead of PCI bus addresses that are translated by an IOMMU. Memory coherency was also assumed and DMA memory-ordering primitives were therefore unnecessary. In preparation for hardware virtio implementations, the VIRTIO_F_ORDER_PLATFORM and VIRTIO_F_ACCESS_PLATFORM feature bits were introduced in virtio 1.1. A device that advertises these feature bits requires a driver that uses bus addresses and DMA memory-ordering primitives.
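As a hedged sketch (not from the talk) of what negotiating those bits implies for a driver, consider the following; the bit numbers are as assigned by the virtio 1.1 specification (worth verifying against the spec), and the helper names are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A driver that sees these bits advertised must use IOMMU-translated bus
     * addresses and DMA memory-ordering barriers, rather than the relaxed
     * assumptions that older software-only devices allowed. */
    #define VIRTIO_F_ACCESS_PLATFORM  33
    #define VIRTIO_F_ORDER_PLATFORM   36

    static bool has_feature(uint64_t features, unsigned int bit)
    {
            return features & (1ULL << bit);
    }

    int main(void)
    {
            /* Pretend the device advertised both bits. */
            uint64_t device_features = (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
                                       (1ULL << VIRTIO_F_ORDER_PLATFORM);

            bool use_bus_addresses = has_feature(device_features,
                                                 VIRTIO_F_ACCESS_PLATFORM);
            bool use_dma_barriers  = has_feature(device_features,
                                                 VIRTIO_F_ORDER_PLATFORM);

            printf("bus addresses: %d, DMA barriers: %d\n",
                   use_bus_addresses, use_dma_barriers);
            return 0;
    }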
At least three approaches exist for hardware virtio PCI devices: full offloading, virtual data path acceleration (vDPA), and vDPA partitioning. Full offloading passes the entire device or a PCI SR-IOV virtual function (VF), which is a sub-device available on PCI adapters designed for virtualization, to the guest. All device accesses are handled in hardware — both those related to device initialization and to the data-path device operation. In this setup, all software is completely vendor-independent.
By comparison, vDPA is a hybrid software/hardware approach where a vendor-specific driver intercepts control path (discovery and initialization) accesses from the virtio driver and handles them in software, while the data path is implemented in hardware in a way compliant with the virtio specification. Performance is still good since the data path is handled directly in hardware.
The final approach is vDPA partitioning, based on fine-grained memory protection between guests such as the PCI process address space ID (PASID), which allows multiple virtual address spaces for device accesses instead of just one. This allows flexible resource allocation because users can configure the host driver to pass resources to guests as they wish. PASID support is not yet widespread, so this approach has not been explored as much as the alternatives.
Hardware bugs in fully offloaded devices that are not fixable in a firmware update can be assigned new virtio feature bits. Workarounds can be added to generic virtio drivers when these feature bits are seen. Hardware vendors can make the device's feature bits programmable, for example via a firmware update, so that the device refuses to start if the driver does not support a workaround for a critical bug. Bugs in vDPA devices can be worked around in the vendor's driver.
Live migration
Users often wish to move a running guest to another host with minimal downtime. When hardware devices are passed through to the guest, this becomes challenging because saving and restoring device state is not yet widely implemented for hardware devices. The details of representing device state are not covered by the virtio 1.1 specification, so hardware implementers must tackle this issue themselves.
QEMU can help with live-migration compatibility by locking down the virtio feature bits that were negotiated on the source host and enforcing them on the destination host. This way, live migration ensures the availability of features that the guest is using. If the hardware device on the destination does not support the feature bits currently enabled on the source host, live migration is not possible.
During live migration, it is necessary to track writes to guest RAM because RAM is migrated incrementally in the background while the guest continues to run on the old host for a period of time. If writes are missed, the destination host receives an outdated and incomplete copy of guest RAM. Hardware devices must participate in this process of logging writes. Infrastructure for this is expected to land in the VFIO mediated device (mdev) driver subsystem in the future.
Looking further into the future, both vendor-independent support for live migration and the elimination of memory pinning should become possible as IOMMU capabilities grow. The new shared virtual addressing (SVA) support in Linux and associated IOMMU hardware allows devices to access a process address space instead of using a dedicated IOMMU page table. Using unpinned memory would be attractive because it enables swappable pages and memory overcommit. In addition, this will make write logging for live migration simpler because device writes into memory can cause faults and be tracked in a vendor-independent way.
PCI page request interface (PRI) is the mechanism that allows IOMMU fault handling, but it might not be sufficient to support post-copy live migration, where the guest immediately runs on the destination host without prior migration of guest RAM. In post-copy live migration, guest RAM is faulted in from the source host on demand with unpredictable latencies, something that might not be appropriate for PRI. Virtio might be able to help by standardizing a way for a device to request a page and to pause and resume request processing. The out-of-order properties of virtio queues mean that the device can proceed even as a specific request is blocked waiting for a page to be faulted.
Future optimizations
Finally, changes can be made to how virtio works to make hardware implementations faster. The amount of outstanding work available in a queue needs to be retrieved by the device from memory. Pushing this information to the device from the driver might help devices avoid memory accesses.
Today there exists an interrupt-suppression mechanism called "event index" that stores the associated state in guest RAM. Guest RAM accesses require hardware devices to perform DMA transfers, which can be expensive and waste PCI bus bandwidth if there have been no changes to RAM. A more hardware-friendly mechanism would be welcome here. In a similar vein, interrupt coalescing is a common technique to reduce CPU consumption due to interrupts being raised frequently. In hardware implementations it is easy to take advantage of this.
Participation in the virtio Technical Committee standardization process is easy and open to anyone. Hardware vendors are welcome to participate in order to improve support for their hardware.
Conclusion
The virtio specification was originally intended for software device implementations but is now being implemented in hardware devices as well. Tsirkin's presentation outlined how virtio 1.1 enables hardware implementations but also identified areas where further work is necessary, for example for live migration. Although hardware virtio devices are not common yet, the interest in hardware implementation from silicon vendors and cloud providers suggests the day is not far off when these "virtual" devices become physical.
Fixing SCHED_IDLE
The Linux kernel scheduler is a complicated beast and a lot of effort goes into improving it during every kernel release cycle. The 5.4 kernel release includes a few improvements to the existing SCHED_IDLE scheduling policy that can help users improve the scheduling latency of their high-priority (interactive) tasks if they use the SCHED_IDLE policy for the lowest-priority (background) tasks.
Scheduling classes and policies
The scheduler implements many "scheduling classes", an extensible hierarchy of modules, and each class may further encapsulate "scheduling policies" that are handled by the scheduler core in a policy-independent way. The scheduling classes are described below in descending priority order; the Stop class has the highest priority, and the Idle class has the lowest.
The Stop scheduling class is a special class that is used internally by the kernel. It doesn't implement any scheduling policy and no user task ever gets scheduled with it. The Stop class is, instead, a mechanism to force a CPU to stop running everything else and perform a specific task. As this is the highest-priority class, it can preempt everything else and nothing ever preempts it. It is used by one CPU to stop another in order to run a specific function, so it is only available on SMP systems. The Stop class creates a single, per-CPU kernel thread (or kthread) named migration/N, where N is the CPU number. This class is used by the kernel for task migration, CPU hotplug, RCU, ftrace, clock events, and more.
The Deadline scheduling class implements a single scheduling policy, SCHED_DEADLINE, and it handles the highest-priority user tasks in the system. It is used for tasks with hard deadlines, like video encoding and decoding. The task with the earliest deadline is served first under this policy. The policy of a task can be set to SCHED_DEADLINE using the sched_setattr() system call by passing three parameters: the run time, deadline, and period.
To ensure deadline-scheduling guarantees, the kernel must prevent situations where the current set of SCHED_DEADLINE threads is not schedulable within the given constraints. The kernel thus performs an admittance test when setting or changing SCHED_DEADLINE policy and attributes. This admission test calculates whether the change can be successfully scheduled; if not, sched_setattr() fails with the error EBUSY.
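For illustration, here is a minimal sketch of setting the SCHED_DEADLINE policy from user space. glibc provides no sched_setattr() wrapper, so the raw system call is used, and struct sched_attr is declared locally to match the kernel's UAPI definition (an assumption worth checking against your kernel headers); all times are in nanoseconds:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SCHED_DEADLINE  6

    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr = {
                    .size           = sizeof(attr),
                    .sched_policy   = SCHED_DEADLINE,
                    .sched_runtime  =  10 * 1000 * 1000,  /* 10ms of CPU time */
                    .sched_deadline = 100 * 1000 * 1000,  /* due within 100ms */
                    .sched_period   = 100 * 1000 * 1000,  /* every 100ms */
            };

            /* The kernel runs its admission test here; EBUSY means the new
             * reservation cannot be guaranteed alongside the existing ones. */
            if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
                    perror("sched_setattr");
                    return 1;
            }
            /* ... periodic realtime work would go here ... */
            return 0;
    }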
The POSIX realtime (or RT) scheduling class comes after the deadline class and is used for short, latency-sensitive tasks, like IRQ threads. This is a fixed-priority class that schedules higher-priority tasks before lower-priority tasks. It implements two scheduling policies: SCHED_FIFO and SCHED_RR. In SCHED_FIFO, a task runs until it relinquishes the CPU, either because it blocks for a resource or it has completed its execution. In SCHED_RR (round-robin), a task will run for the maximum time slice; if the task doesn't block before the end of its time slice, the scheduler will put it at the end of the round-robin queue of tasks with the same priority and select the next task to run. The priorities of tasks under the realtime policies range from 1 (low) to 99 (high).
The CFS (completely fair scheduling) class hosts most of the user tasks; it implements three scheduling policies: SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE. A task under any of these policies gets a chance to run only if no other tasks are enqueued in the deadline or realtime classes (though by default the scheduler reserves 5% of the CPU for CFS tasks regardless). The scheduler tracks the virtual runtime (vruntime) for all tasks, runnable and blocked. The lower a task's vruntime, the more deserving the task is for time on the processor. CFS accordingly moves low-vruntime tasks toward the front of the scheduling queue.
The priority of a task is calculated by adding 120 to its nice value, which ranges from -20 to +19. The priority of the task is used to set the weight of the task, which in turn affects the vruntime of the task; the lower the nice value, the higher the priority. The task's weight will thus be higher, and its vruntime will increase more slowly as the task runs.
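As a rough sketch of that relationship (the exact weight table lives in the scheduler source; the 1024 base weight and the roughly 1.25 ratio per nice level are well-known approximations, not values quoted here from the article):

    \[
      w(\mathrm{nice}) \approx \frac{1024}{1.25^{\,\mathrm{nice}}},
      \qquad
      \Delta \mathrm{vruntime} = \Delta t_{\mathrm{exec}} \cdot \frac{1024}{w(\mathrm{task})}
    \]

So a nice -5 task accumulates vruntime roughly three times more slowly than a nice 0 task, and thus ends up with correspondingly more CPU time.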
The SCHED_NORMAL policy (called SCHED_OTHER in user space) is used for most of the tasks that run in a Linux environment, like the shell. The SCHED_BATCH policy is used for batch processing by non-interactive tasks — tasks that should run uninterrupted for a period of time and hence are normally scheduled only after finishing all the SCHED_NORMAL activity. The SCHED_IDLE policy is designed for the lowest-priority tasks in the system; these tasks get a chance to run only if there is nothing else to run. Though, in practice, even in the presence of other SCHED_NORMAL tasks a SCHED_IDLE task will get some time to run (around 1.4% for a task with a nice value of zero). This policy isn't widely used currently and efforts are being made to improve how it works.
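For comparison, putting a background task under SCHED_IDLE from user space is straightforward; a minimal sketch (assuming a glibc that exposes SCHED_IDLE with _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Mark the calling process as a background task; the priority must be
     * zero for this policy. */
    int main(void)
    {
            struct sched_param param = { .sched_priority = 0 };

            if (sched_setscheduler(0, SCHED_IDLE, &param)) {
                    perror("sched_setscheduler");
                    return 1;
            }
            /* ... low-priority background work runs here ... */
            return 0;
    }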
Last is the Idle scheduling class (which should not be confused with the SCHED_IDLE scheduling policy). This is the lowest-priority scheduling class; like the Stop class, it doesn't manage any user tasks and so doesn't implement a policy. It only keeps a single per-CPU kthread which is named swapper/N, where N is the CPU number. These kthreads are also called the "idle threads" and aren't visible to user space. These threads are responsible for saving system power by putting the CPUs into deep idle states when there is no work to do.
Scheduling classes in the kernel
The scheduling classes are represented by struct sched_class in the kernel source code:
    struct sched_class {
            const struct sched_class *next;

            void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
            void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
            struct task_struct *(*pick_next_task) (struct rq *rq,
                                                   struct task_struct *prev,
                                                   struct rq_flags *rf);

            /* many fields omitted */
    };
This structure mostly consists of function pointers (callbacks) to class-specific implementations that are called by the scheduler core in a class-independent manner. The classes are kept in a singly linked list in descending order of their priorities; the head node points to the Stop scheduling class (highest priority) and the last node in the list points to the Idle class (lowest priority).
The Linux kernel calls schedule() when it needs to pick a new task to run on the local CPU, which further calls pick_next_task() to find the next task. pick_next_task() traverses the list of scheduling classes, with the help of the for_each_class() macro, to find the highest-priority scheduling class that has a task available to run. Once a task is found, it is returned to the caller, which then runs it on the local CPU. There should always be a task available to run in the Idle class, which will run only if there is nothing else to run.
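A simplified sketch of that traversal (hypothetical code, loosely modeled on the generic path in kernel/sched/core.c; the real function adds fast paths and retry handling) might look like this:

    /* Walk the class list from highest to lowest priority and return the
     * first runnable task found; the idle class guarantees termination. */
    static struct task_struct *pick_next_task_sketch(struct rq *rq,
                                                     struct task_struct *prev,
                                                     struct rq_flags *rf)
    {
            const struct sched_class *class;
            struct task_struct *p;

            for_each_class(class) {
                    p = class->pick_next_task(rq, prev, rf);
                    if (p)
                            return p;
            }
            BUG();  /* unreachable: the idle class always returns a task */
    }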
SCHED_IDLE improvements
The CFS scheduler tries to be fair to all tasks by giving more CPU time to the higher-priority tasks as compared to the lower-priority tasks. It normally doesn't provide special treatment to tasks based on their scheduling policy, for example tasks running under the SCHED_NORMAL and SCHED_IDLE policies are managed in the same way. They are all kept in the same CFS run queues, the load and utilization of the CPUs change in the same way for all the tasks, and the PELT signal and CPU-frequency changes are impacted similarly by all tasks. The only differentiating factor is the priority (derived from the nice value) of the tasks, which affects the weight of the tasks.
The weight of a task defines how the load and utilization of the CPU will change because of that task. For this reason, we don't see a lot of SCHED_IDLE policy-related code in the CFS scheduler. As the SCHED_IDLE policy tasks have the lowest priority, they automatically get scheduled for the least amount of time. Also, since there aren't many known users of the SCHED_IDLE policy in the Linux community, no one has attempted to improve it since it was first introduced in Linux 2.6.23.
When a newly woken-up task is available to run, the scheduler core finds the target run queue (i.e. a CPU to run it on) by calling the select_task_rq() callback of the respective scheduling class. This callback returns the CPU where the task should be enqueued. Once the task is enqueued there, the scheduler checks if that task should preempt the currently running task on that CPU by calling the check_preempt_curr() callback of the respective scheduling class.
Until now, the SCHED_IDLE policy was getting special treatment only in the check_preempt_curr() callback, where a currently running SCHED_IDLE task will be immediately preempted by a newly woken-up SCHED_NORMAL task. But this preemption will only happen if the newly woken-up task is enqueued on a CPU that is running a SCHED_IDLE task currently. As there was no special handling of the SCHED_IDLE policy in the select_task_rq() callback, there was no specific effort made to enqueue the newly woken-up SCHED_NORMAL task on a CPU running a SCHED_IDLE task.
Normally, the scheduler tries to spread tasks across the available CPUs by searching for an idle CPU for newly woken-up tasks. The 5.4 kernel contains a patch set that makes the necessary changes to the CFS scheduler's select_task_rq() callback to queue tasks more aggressively on CPUs that are running only SCHED_IDLE tasks, even if a few CPUs are currently idle. There are two separate code paths in the CFS select_task_rq() callback: the slow path and the fast path. The slow path is mostly executed for newly forked tasks, where it tries to find the optimal CPU to run the task on. The fast path, instead, is taken for existing tasks that have become runnable again; it tries to find a target CPU (an idle CPU if possible) as soon as possible even if it is not the optimal one.
Both these code paths were updated by the new patch set to consider a CPU that is running only SCHED_IDLE tasks as equivalent to an idle CPU. The scheduler now prefers to queue the newly woken-up tasks on CPUs with only SCHED_IDLE activity; the newly queued task will immediately preempt the currently running SCHED_IDLE task when check_preempt_curr() is called. This reduces the scheduling latency for the newly queued task as compared to selecting a fully idle CPU, as we don't need to bring an idle CPU out of its deep idle state, which normally takes a few milliseconds to complete.
The results of this change
This patch set was initially tested with rt-app on an arm64 octa-core HiKey platform, where all the CPUs change frequency together. Rt-app is a test application that starts multiple periodic threads in order to simulate a realtime periodic load. For this test, eight SCHED_OTHER tasks and five SCHED_IDLE tasks were created. The tasks weren't bound to any particular CPU and could be queued anywhere by the scheduler. The SCHED_NORMAL tasks executed (busy loops) for 5333µs out of a period of 7777µs, while the SCHED_IDLE tasks kept on running forever. The idea was to check whether the SCHED_NORMAL tasks were being scheduled together (thus preempting each other) or if they were able to preempt SCHED_IDLE tasks instead. The results showed that the average scheduling latency (wu_lat field in rt-app results) for the SCHED_NORMAL tasks dropped to 102µs after the patch set was applied, down from 1116µs without the patch set; that is a reduction of 90% in scheduling latency for the SCHED_NORMAL tasks, which looks quite encouraging.
Further testing showed that the average scheduling latency of a SCHED_NORMAL task on the above-mentioned arm64 platform is 64µs when it preempts a SCHED_IDLE task, 177µs when it runs on a shallow-idle (no cluster idle) CPU, and 307µs when it runs on a deep-idle (cluster idle) CPU. The same behavior can be observed with the kernel function tracer; the traces are shown below with the help of the KernelShark tool. First, the output from the 5.3 kernel:
If you look closely at the above figure, you can see that occasionally, for long periods of time, a few CPUs were running a single task (solid single-color lines) without being preempted by another task. The long-running tasks are the SCHED_IDLE tasks which should ideally be preempted by the SCHED_NORMAL tasks, but that wasn't happening then.
The results from the 5.4 kernel are different:
If you look closely at the above figure, you can see that the pattern is quite consistent now. The SCHED_IDLE tasks are preempted by the SCHED_NORMAL tasks as soon as one is available to run, which then runs for 5333µs and then gives the CPU back to a SCHED_IDLE task. This is exactly the behavior this patch set was meant to create.
Other applications
Recently, Song Liu was trying to solve a problem seen on servers at Facebook. The servers running latency-sensitive workloads usually weren't fully loaded for various reasons, including disaster readiness. The machines running Facebook's interactive workloads (referred to as the main workload) have a lot of spare CPU cycles that they would like to use for opportunistic side jobs like video encoding. However, Liu's experiments showed that the side workload has a strong impact on the latency of the main workload. Liu was asked to try the SCHED_IDLE patch set and he found that it solved the problems he was facing to a great extent, though he tested an earlier version of the patch set where only the fast path was updated.
Another potential user of this work is the Android operating system, which has knowledge about the importance of a task for the current user's experience ranging from "background" (not important) to "top-app" (most important). The SCHED_IDLE policy can potentially be used for all the background tasks as that would increase the probability of finding an idle CPU for top-app tasks by preempting the background tasks.
Clearly this work has a lot of potential. More mainstream products should be using the SCHED_IDLE policy, though there may be a need for more SCHED_IDLE policy-specific optimizations in the CFS scheduler for that. One such optimization is under discussion right now on the kernel mailing list, where I am trying to be more aggressive in selecting a SCHED_IDLE CPU in both the slow and fast paths of the CFS scheduler. Also, improvements can be made to the CFS load balancer, which doesn't give any special treatment to the SCHED_IDLE CPUs currently and rather attempts to spread the tasks to all the CPUs; that is future work, though.
Fedora's modularity mess
Fedora's Modularity initiative has been no stranger to controversy since its inception in 2016. Among other things, there were enough problems with the original design that Modularity went back to the drawing board in early 2018. Modularity has since been integrated with both the Fedora and Red Hat Enterprise Linux (RHEL) distributions, but the controversy continues, with some developers asking whether it's time for yet another redesign — or to abandon the idea altogether. Over the last month or so, several lengthy, detailed, and heated threads have explored this issue; read on for your editor's attempt to integrate what was said.

The core idea behind Modularity is to split the distribution into multiple "streams", each of which allows a user to follow a specific project (or set of projects) at a pace that suits them. A Fedora user might appreciate getting toolchain updates as soon as they are released upstream while sticking with a long-term stable release of LibreOffice, for example. By installing the appropriate streams, this sort of behavior should be achievable, allowing a fair degree of customization.
Much of the impetus — and development resources — behind Modularity come from the RHEL side of Red Hat, which has integrated Modularity into the RHEL 8 release as "Application Streams". This feature makes some sense in that setting; RHEL is famously slow-moving, to the point that RHEL 7 did not even support useful features like Python 3. Application Streams allow Red Hat (or others) to make additional options available with support periods that differ from that of the underlying distribution, making RHEL a bit less musty and old, but only for the applications a specific user cares about.
The use case for Modularity in Fedora is arguably less clear. A given Fedora release has a support lifetime of 13 months, so there are limits to the level of stability that it can provide. But there is still clearly an appetite for Modularity; Fedora leader Matthew Miller articulated three goals during one of the threads:
2. Those alternate streams should be able to have different lifecycles.
3. Packaging an individual stream for multiple outputs should be easier than before.
Thus far, it is far from clear that any of those goals have been met. In particular, there are few modules in Fedora with multiple streams, and the lifecycles of modules are required, by policy, to line up with a Fedora release. But that is just the beginning of the problems for Modularity in Fedora.
The trouble with Modularity
The list of complaints being raised against Modularity is long; some of them were described by Modularity developer Stephen Gallagher in a thread dedicated to that purpose. Covering them all would create a long article indeed, so only a few of them will be discussed here.
The first of those, and the immediate cause of at least one of the longer mailing list threads, has to do with how modules interact with upgrades of the distribution itself. One of the policy decisions behind Modularity states that once users pick a particular stream, they will not be moved to a different one. That plays poorly with upgrades, though, where some streams are discontinued and others pick up; the result may be blocked upgrade operations and irritated users.
Consider the case of libgit2, which is packaged as a module in Fedora 30. Some module maintainers also make non-modular packages available (in an unbearable bit of jargon, these are sometimes called "ursine packages" because they are "bare"), but that is not the case for libgit2. When the DNF tool is given a request to install one of these module-only packages (perhaps as a dependency for the package the user really wants), it will silently install the module, giving users a system with Modularity without their explicit wish and, probably, knowledge.
In this case, a Fedora 30 user installing a package with a dependency on libgit2 would end up getting the libgit2 module, on the 0.28 stream. That stream, however, does not exist in Fedora 31, so any dependencies on libgit2 will not be satisfiable and the upgrade fails. There is an ursine libgit2 package in Fedora 31, but once a module is installed, there is no provision for moving a system back to a non-modular version, so it could not be used.
This was a rather late and unpleasant surprise as the project was trying to get the Fedora 31 release together. The initial solution that was arrived at was to have the upgrade process print out a URL at the beginning to indicate where affected users could find information on how to fix their systems. This struck some developers as being less than fully user-friendly so, after further discussion, another solution was adopted: a rather inelegant hack to DNF that would forcibly reset the stream for libgit2 when upgrading to Fedora 31. This violates the "no stream changes" policy and it clearly is not a sustainable solution going forward, but it got the project past the problem for now.
One of the promises that came with Modularity was that users would not have to deal with it if they didn't want to. It has become increasingly clear, though, that this promise cannot be kept, for a number of reasons. The existence of module-only packages is one of those, along with the inability to revert back to a non-modular version once a given module is installed. Once a package goes module-only, any other package that depends on it must also make that transition, forcing some developers to create and maintain modules whether they want to or not. These problems have led some community members, such as Kevin Kofler, to suggest that module-only packages should be banned, as should modules for anything but leaf packages that no others depend on.
Then, there is the issue of buildroot-only modules. Fedora's use of "buildroot" does not refer to the Buildroot buildsystem; instead, it describes the environment in which packages and modules are built. A buildroot-only module is one that is created solely for the purpose of building another module in the buildroot environment; it is not shipped as part of the distribution itself. The idea behind these modules is to make life easier for packagers to deal with dependencies that are not available in Fedora; they can use a buildroot-only module to embed that dependency in their module without having to support it for other users of the distribution.
The problem with buildroot-only modules is that they make it difficult (if not impossible) for others to rebuild a module themselves. The process of rebuilding a module in general is rather more involved than building from a source RPM package. Buildroot-only modules make things far worse by hiding away the dependencies that were used to build the original module. The result, according to Kofler, "contradicts both the transparency expected from a community-developed project and the self-hosting expectations". Stephen Smoogen made a similar point.
For an extreme example, see this post from Neal Gompa, describing why it is essentially impossible to rebuild Rust modules in Fedora 30. Among other things, those modules build with a private module containing a special version of RPM, so the source packages they create don't work on the target distribution.
Is it all worth it?
Given the difficulties, it is not entirely surprising that there have been some calls for Fedora to drop the Modularity initiative entirely. Simo Sorce, for example, asked whether it would be better just to use containers for the use cases targeted by Modularity. Gallagher responded that "one of the main points of Modularity is to provide a trusted source of software to install into containers". Daniel Mach, one of the Modularity developers, argued that containers can't solve all of the problems, and that Modularity is needed for the reliable provisioning of containers in the first place. He also worried that some of the problems with Modularity cannot be fixed without massive changes, though.
While others might like to see the end of Modularity, its total removal is not something many developers are actively arguing for. That may be because they realize that it is unlikely to go away regardless of what they think. As Gompa put it: "because it's in RHEL now, no one can afford to let it fail". Gompa also said, though, that "while it is a hard problem to solve, it's a worthy one" and lamented that the Fedora project lacks the infrastructure resources it needs to implement this idea properly. Miller said that the problem is "a fundamental one we need to solve in order to continue to be relevant not just as an upstream for RHEL but in general". So the project is unlikely to walk away from Modularity at this time.
Modularity 3.0?
Over the course of these discussions, a number of approaches for addressing at least some of these problems have been raised. One of those is to do what the project has already done once: drop the current Modularity implementation and design a new one. Adam Williamson, for example, stated that it was a mistake to deploy this version of Modularity into RHEL 8, but that "inventing new stuff is hard" and that is how we learn. Robbie Harwood asserted that "starting from scratch should be an option".
Gallagher replied that there would be no fundamental redesign of Modularity, though. He pointed out that development on Modularity is funded by Red Hat, and the developers so funded are committed to supporting the current Modularity implementation in RHEL 8. "A full redesign in Fedora is not realistically possible with the people and resources we have available to us while also maintaining the current implementation for ten years". Instead, he said, the focus will be on fixing problems in the current implementation.
Miller also stated his support for the current Modularity effort.
He suggested that if others want to see a fundamental redesign of Modularity, they should work on creating it and the results could be evaluated once a prototype exists.
Fixing Modularity 2.0
With regard to the upgrade issue, there are a few ideas in circulation. One of those is, as mentioned before, to disallow module-only packages from Fedora. That seems unlikely to fly, though, for a couple of reasons. Gallagher pointed out that converting module-only packages back would be difficult at best, especially for those that rely on buildroot-only modules. Forcing packagers to create both modules and ursine packages would add to their workload, which could cause some of them to leave the project. Smoogen noted that the number of packagers is already in decline; developers will be leery of changes that could accelerate that trend.
Miro Hrončok suggested that the solution to upgrades is to make the default stream for modules behave like an ordinary package. He quickly followed up with a grumpy note after the Modularity maintainers voted not to pursue that idea, saying instead that they would "implement a mechanism of following default streams to give people the experience they want". In other words, users will be left dealing with modules whether they want them or not, and the proposed solution leaves many feeling less than entirely happy.
The current round of discussions was actually touched off by this proposal from Gallagher on how to handle the update problem. It involved a complex set of module states meant to record the user's "intention" for a module, along with rules for state transitions. In short, it would allow a module that was installed as a dependency to switch to a new stream if the depending module changed.
Later, Gallagher posted an alternative proposal involving a new "upgrades:" attribute on modules that would specify a stream name. A module tagged in this way would, during a system upgrade, replace the current stream if it matches the given name (and if a few other conditions are met). Neither proposal was received all that well; to quote Zbigniew Jędrzejewski-Szmek:
But the amended proposal actually makes things *worse*, even more complex. We would have two parallel sets of dependency specifications: on the rpms level and on the module level. The interactions between them would be hard to understand for users.
In short, the project does not yet have any sort of convincing answer for the upgrade problem. That, in turn, suggests that these discussions are far from done.
With regard to the problem of the one-way transition to Modularity, Gallagher said that the problem is being looked into. Reverting to an ursine package is something that would be useful in a number of situations, he said. "We haven't figured this one out yet, but it's on the queue."
With regard to modular dependencies forcing other packages to turn into modules: the current proposal is to change the rules so that the non-modular buildroot can contain module streams. This plan, dubbed "Ursa Prime", would make dependencies available for other packages to build on without those packages having to be modules themselves. The November 11 meeting of the Fedora Engineering Steering Committee approved Ursa Prime for the Fedora 32 release, though it will start with only two modules. Not everybody is at ease with this plan, but this test will at least show how it will work in practice.
Buildroot-only modules are another outstanding problem. They are included in a set of requirements for Modularity posted by Gallagher: "Build-time only dependencies for an alternative version may be excluded from the installable output artifacts". Many developers would like to ban them as being fundamentally incompatible with how Fedora is supposed to work, but there is a tradeoff: as Smoogen pointed out, an alternative to modules using buildroot-only dependencies may be the removal of those modules altogether. Still, that may well be a price that many in the Fedora project are willing to pay.
Inventing new stuff is hard
Established Linux distributions tend to have a following of longtime users who are often hostile to fundamental changes. Those who have adopted a solution because it works, and who have since found ways of dealing with the parts that don't work as well, tend not to react well if others want to come in and shake things up. That is one reason why any big change in the free-software world tends to be accompanied by interminable mailing-list threads.
Distributors, though, have reason to worry about their relevance in a changing world. Software distribution has changed considerably since the 1990s, when many of the organizing principles behind most Linux distributions were laid down. A failure to change along with the wider industry and provide features that new users want is not likely to lead to good results in the long run. So it is not surprising that distributions like Fedora are experimenting with ideas like Modularity; they have to do that to have a chance of remaining relevant in the future.
Incorporating such changes can create pain for users, but that does not always mean that the changes are fundamentally bad. Those of us who lived through the ELF transition might have been happy to live with a.out forever, but that would not have been good for Linux as a whole. One of the truths of software development is that it is often impossible to see all the consequences of a change before implementing it. The value of free software is that we can implement those changes, see where things don't work, and fix them. The downside is that we have to live through the "see where things don't work" phase of the process.
Fedora is deeply within that stage when it comes to Modularity. Since it's a free-software project, all of the difficulties with Modularity are being exposed in a highly public way. But, for the same reasons, those problems are being noted and effort is going into proposing solutions. Eventually, the Fedora project will figure out how Modularity fits into its distribution and how to make it all work well. Users may wonder someday how they ever did without it. But there will be a lot of emails produced between now and then.
Creating Kubernetes distributions
Comparing Linux and Kubernetes is often an apples-to-oranges exercise. There are, however, some similarities, and there is an effort within the Kubernetes community to make Kubernetes more like a Linux distribution. The idea was outlined in a session about Kubernetes release engineering at KubeCon + CloudNativeCon North America 2019. "You might have heard that Kubernetes is the Linux of the cloud and that's like super easy to say, but what does it mean? Cloud is pretty fuzzy on its own," said Tim Pepper, the Kubernetes release special interest group (SIG Release) co-chair. He proceeded to provide some clarity on how the two projects are similar.
![Tim Pepper & Stephen Augustus](https://static.lwn.net/images/2019/kcna-pepper-augustus-sm.jpg)
Pepper explained that Kubernetes is a large open-source project with lots of development work around a relatively monolithic core. The core of Kubernetes doesn't work entirely on its own and relies on other components around it to enable a workload to run, in a model that isn't all that dissimilar to a Linux distribution. Likewise, Pepper noted that Linux also has a monolithic core, which is the kernel itself. Alongside the Linux kernel is a whole host of other components that are chosen to work together to form a Linux distribution. Much like a Linux distribution, a Kubernetes distribution is a package of core components, configuration, networking, and storage on which application workloads can be deployed.
Linux has community distributions, such as Debian, where there is a group of people that help to build the distribution, as well as a community of users that can install and run the distribution on their own. Pepper argued that there really isn't a community Kubernetes distribution like Debian, one that uses open-source tools to build a full Kubernetes platform that can then be used by anyone to run their workloads. With Linux, community-led distributions have become the foundation for user adoption and participation, whereas with Kubernetes today, distributions are almost all commercially driven.
Why distributions matter
The real value that comes from Kubernetes and from Linux in Pepper's view, is not from the core, but rather from the user applications that a full distribution enables. Distributions are purpose-built, opinionated assemblies of configurations and tools. Distributions also serve to align different versions of tooling and subprojects into a working release that is easier for users to install and maintain. "One of the things in open source that is really amazing is you have this multiplier effect and distributions are a key part of that," Pepper said.
A Kubernetes distribution is a bit different from a Linux distribution in several respects. With Kubernetes, the Cloud Native Computing Foundation (CNCF) has developed a Kubernetes conformance program to certify that a given platform is in fact Kubernetes. Pepper noted that Linux makes use of a reciprocal open-source license, which means that any code that is forked and distributed needs to be shared. Kubernetes uses a permissive license (Apache version 2.0), which Pepper warned comes with the risk of divergent forking. "So where Linux didn't necessarily have conformance testing, we need something like that in Kubernetes to make sure that Kubernetes as a word means something, and that we can understand what that means," he said.
Linux has a large stable of community distributions, such as Debian, Arch, and Fedora, as well as commercial enterprise distributions. "Where are our Kubernetes community distributions?" Pepper asked. "Of the hundred conformant offerings, most of them are commercial." The full list of conformant Kubernetes offerings is maintained and regularly updated by the CNCF.
Building a community Kubernetes distribution
Pepper outlined several potential reasons why there isn't a community Kubernetes distribution, including the fact that there are some missing technical components. He started by attempting to define what the base of a community distribution could include. There are the raw Go language binaries and some other code artifacts from the Kubernetes release, but those are only parts of a distribution. There are also several tools needed, including kubeadm, which helps to bootstrap a basic Kubernetes cluster, kops for managing Kubernetes operations, and kubespray, which is used to deploy a production-ready Kubernetes cluster. Pepper emphasized that the existing open-source tools are intended to help build a cluster and not a distribution.
The Kubernetes community is currently lacking build tools for distributions as well as more robust dependency management, he said. "One of the really useful benefits you see from distros is that they kind of grok all of the dependencies and give you that coherent opinionated set of things that are going to work together," Pepper said. "Where is our Kubernetes equivalent of koji or Launchpad?" He also wondered why there was no Kubernetes version of Ubuntu's personal package archives (PPAs).
Release engineering
While Kubernetes currently is missing pieces for enabling a true community distribution, work is ongoing in multiple Kubernetes Special Interest Groups (SIGs), including SIG Release and SIG Testing that could point the way forward to a future community distribution.
Stephen Augustus, another SIG Release co-chair, explained that a release-managers group that deals with the build process as well as patch and branch management has started to take shape. The idea behind the group is to codify the process by which Kubernetes releases are produced. "There are scripts that you can check out that have copyright dates of 2016 and they are actually the ones that are responsible for releasing Kubernetes," Augustus said. "We want to get to the point where we can start tearing down some of the technical debt that we've built up in the project over time."
Among the Kubernetes release scripts that date back to 2016 is anago, which is an 1,800-line bash script for releasing Kubernetes. Anago imports three separate libraries, each with another 500 lines of shell code. "It's time to not do that anymore," Augustus said.
The group has started rewriting some of the release scripts; one of the first targets is branchff, a utility that fast-forwards a branch to the master branch. Another tool that is being rewritten is push-build, which is responsible for pushing all of the Kubernetes builds up to the Google Cloud.
As part of the overall effort to improve release engineering, there is also the new Kubernetes release toolbox project known as "krel" that Augustus noted is just getting started. The goal is to take all of the various release shell scripts and move them into the toolbox as a set of commands. Another new effort that is getting underway is the kubepkg tool that will enable developers to create deb and RPM packages based on Kubernetes project binaries. "We want there to be a dead simple way to produce debs and RPMs for Kubernetes."
Augustus commented that many companies have built their own tools for Kubernetes releases because there have not been any great tools in the upstream project, but that's now changing. "We're trying to kind of flip that story, change the narrative, and build tools that are actually useful for not just the community, but for vendors, and for hobbyists to consume as well."
Whether or not a real Kubernetes community distribution will emerge remains to be seen. What is clear is that, as Augustus said, there is a need to remove the technical debt in release engineering, replacing complex shell scripts with more modern tools that can help both the project and the broader community to build Kubernetes distributions.
Page editor: Jonathan Corbet