|
|
Log in / Subscribe / Register

Kernel development

Brief items

Kernel release status

The 4.0 kernel was released on April 12. In the announcement, Linus noted:

Feature-wise, 4.0 doesn't have all that much special. Much have been made of the new kernel patching infrastructure, but realistically, that not only wasn't the reason for the version number change, we've had much bigger changes in other versions. So this is very much a 'solid code progress' release.

Beyond the (incomplete) live-patching mechanism, this release includes the removal of the remap_file_pages() system call, improved persistent memory support, the lazytime mount option, and the kernel address sanitizer.

The 4.1 merge window is now open; see the article below for a summary of what has been merged thus far.

Stable updates: 3.19.4, 3.14.38, and 3.10.74 were released on April 13.

Comments (none posted)

Quotes of the week

Every time a vendor supplies a 32-bit time_t OS to a 32-bit computer user, it creates a potential circumstance someone might be running that close to one of the numerous issues coming close to 2038 (there are a number of them, read the wikipedia page, and pay close attention to NTP as well). The potential is low today, but increases as time_t ticks...
Theo de Raadt

YOU’RE PING. WHY DO YOU EVEN CARE WHAT KERNEL VERSION IS RUNNING.
Dave Jones

Next, the police stops the bus and wants to know who's on it. As the programmers usually don't respond when being spoken to, especially if it is the police, the bus conductor hands out a list of all the passport copies he gathered. That is called introspection that is not backed by cooperative bus members. The conductor makes a copy of each OF THE PASSPORT of the people entering the bus, to help the police (i.e. debuggers) determine who is on the bus.
— Daniel Mack (paraphrased by Greg Kroah-Hartman) explains kdbus

This thread has made me realize that even as I am able to carve out more time to work on things like IB maintaintership, I no longer have the desire to spend my time maintaining the IB subsystem. Since my current level of activity is clearly hurting the community, I've decided to step down as maintainer.
— InfiniBand maintainer Roland Dreier moves on (thanks to Yann Droneaud)

Comments (5 posted)

A mailing list for year-2038 issues

A new mailing list has been set up for developers working on preventing the year-2038 apocalypse. "Please join if you are interested in discussing these or want to send patches. The intention at the moment much to keep this list explicitly open to newbies, so we will get a lot of incorrect patches there along with other patches. Patches can get sent here for an initial review before they get sent to the real maintainers, or you can put the list on Cc when sending a patch that should be applied and you already know what you are doing."

Full Story (comments: none)

Obstacles for kdbus

By Jonathan Corbet
April 15, 2015
The kdbus patch set, which adds a D-Bus-like messaging facility to the kernel, has been through several rounds of review over the course of the last year. The number of comments has been dropping with each review cycle, and the code seemed like it could be on track for a relatively easy merge into the 4.1 kernel. A closer look, though, reveals that there was some residual unhappiness from the last rounds that was always likely to flare up into active opposition when an attempt to merge kdbus was made. And, indeed, that is exactly what happened when Greg Kroah-Hartman sent a pull request to Linus on April 13.

This conversation is in full swing as of this writing, so an attempt to fully summarize it would be futile. In brief, though, the complaints take a number of forms. There is unhappiness with the performance of kdbus — a bit surprising, since performance is one of the motivating factors behind this development. There are a number of security-related concerns, especially around how the bus collects and transmits metadata about connected processes. Kdbus is still said to not play well with containers. Some developers find the complexity daunting. And so on.

The core of the disagreement, arguably, can be found in this message from Greg. There, he agreed that the design was "unfortunate" (though he later retracted that statement), and said that kdbus needed to be taken in its current form even if it is not ideal:

D-Bus is a specification that has been out there for over a decade, and we are not designing anything new here, but rather implementing it as designed. We have to be compatible to the existing users of the DBus system, and don't have the luxury of being able to change core things like this and expect the world to be able to change just because the design is not as clean as it should/could be.

Again, just like getting horrid hardware to work properly, sometimes we have to write odd code. Or having to implement a network protocol that doesn't seem to be designed "perfectly", yet is used by a few hundred million systems so we have to remain compatible. This is all that we are doing here for stuff like this.

Remember, this is called kDBUS, not kGENERICIPC, no matter how much we would have liked that to happen from a kernel standpoint. :)

It is probably fair to say that those who are opposed to kdbus in its current form would rather that it were, indeed, kGENERICIPC. They seem to feel that it should be able to support what is needed to implement D-Bus efficiently, but the D-Bus-specific parts, perhaps, should go into user space. After all, there are only so many interprocess communication mechanisms that can be merged into the kernel; the one that goes in, many developers think, should be free of known flaws and should be able to do more than reimplement the D-Bus protocol.

It is hard to say at this point how this discussion will play out or what Linus will decide to do in the end. The chances are good, though, that enough high-profile developers have expressed opposition to derail the merging of kdbus in this development cycle. Complete consensus is not always required to get code into the kernel, but getting code merged when there is serious opposition is still quite hard. This story, it seems, may go on for a while yet.

Comments (58 posted)

Kernel development news

4.1 Merge window, part 1

By Jonathan Corbet
April 15, 2015
Linus started merging changes for the 4.1 development cycle on April 13; as of this writing, a total of 3,643 non-merge changesets have been pulled into the mainline. In other words, things are just getting started. Still, some interesting changes have found their way in, though many of them will be of interest mainly to kernel developers.

Some of the more interesting, user-visible changes merged so far include:

  • Basic support for live kernel patching has been added to the S/390 architecture. What has been removed from S/390, instead, is support for the 31-bit mode, once needed to get past that pesky 16MB memory limit.

  • KVM virtualization on the MIPS architecture has gained support for the floating-point unit and the SIMD mode. KVM on ARM now supports interrupt injection via irqfd().

  • Load tracking in the CPU scheduler has been reworked to make the calculated process loads be independent of CPU speeds. That will enable better load-balancing decisions in the presence of frequency scaling and improve support for asymmetric systems like big.LITTLE where different types of CPUs are found in the same package.

  • New hardware support includes:

    • I2C: Digicolor I2C controllers, Ingenic JZ4780 I2C controllers, and Broadcom XLP9xx/XLP5xx I2C controllers.

    • IIO: Capella CM3323 color light sensors and Measurement Specialties MS5611 pressure sensors.

    • Input: Broadcom keypad controllers, MAXIM MAX77843 haptic controllers, iPAQ h3100/h3600/h3700 buttons, Semtech SX8654 I2C touchscreens, Qualcomm PM8941 performance management IC (PMIC) power keys, Broadcom IPROC touchscreens, and ChipOne icn8318 I2C touchscreen controllers.

    • Miscellaneous: Nuvoton NCT7904 hardware-monitoring chips, Broadcom IPROC SD/MMC and PCIe controllers, Dialog DA9150 charger and fuel-gauge controllers, X-Powers AXP288 fuel gauges, Nokia modems implementing the CMT speech protocol, Silicon Motion SM750 framebuffers, Ilitek ILI9163 LCD controllers, and Freescale Management Complex buses.

    • Multi-function device: Wolfson Microelectronics WM8280/WM8281 controllers, MediaTek MT6397 PMICs, Maxim Semiconductor MAX77843 PMICs, Intel Quark controllers, and Skyworks Solutions SKY81452 controllers.

    • Pin control: Marvell Armada 39x pin controllers, NVIDIA Tegra210 pinmux controllers, Broadcom Cygnus IOMUX controllers, Mediatek mt8135 pin controllers, AMD platform pin controllers, and Intel Sunrisepoint pin controllers.

    • USB: AltusMetrum ChaosKey random-number generators, TI dm816x USB PHYs, and Allwinner sun9i USB PHYs.

Changes visible to kernel developers include:

  • The kernel self-test code has gained an install target that installs test binaries into a special directory in the kernel tree. There is also a new set of timer self tests in the test suite.

  • The new efi=debug boot option causes extra information to be printed at boot time on systems with EFI firmware.

  • The long-deprecated IRQF_DISABLED interrupt flag has finally been removed from the kernel.

  • The "tracefs" virtual filesystem has been added. Tracefs contains the usual set of directories and files to control tracing, but it has the advantage that it can be mounted independently of debugfs. It thus allows system administrators to enable tracing without bringing in the other, potentially dangerous knobs found in debugfs. By default, tracefs will be mounted in the usual place (/sys/kernel/debug/tracing) when debugfs is mounted.

  • The new TRACE_DEFINE_ENUM() macro can be used to output values from enum types in tracepoints.

  • As usual, the perf tool has seen a long list of additions and improvements; see the top-level merge commit for details. Some of the more significant features include the ability to attach BPF programs to kernel probes, support for Intel's upcoming processor trace functionality ("a hardware tracer on steroids"), support for Intel's upcoming cache quality-of-service monitoring feature, and more.

  • The I2C subsystem can now function in "slave" mode, responding to a master controller elsewhere on the bus; see Documentation/i2c/slave-interface for details. The I2C layer has also gained a new quirk mechanism that can be used to describe the limitations of specific controllers.

Unless something surprising happens, the merge window can be expected to stay open through April 27. There will likely be a lull in the middle while Linus travels, but that has tended to not slow things down too much in the past. As usual, we will continue to track and report on the significant changes merged for the 4.1 development cycle.

Comments (none posted)

Persistent memory support progress

By Jonathan Corbet
April 15, 2015
Persistent memory (or non-volatile memory) has a number of nice features: it doesn't lose its contents when power is cycled, it is fast, and it is expected to be available in large quantities. Enabling proper support for this memory in the kernel has been a topic of discussion and development for some years; it was, predictably, an important topic at this year's Linux Storage, Filesystem, and Memory Management Summit. The 4.1 kernel will contain a new driver intended to improve support for persistent memory, but there is still a fair amount of work to be done.

At a first glance, persistent memory looks like normal RAM to the processor, so it might be tempting to simply use it that way. There are, though, some good reasons for not doing that. The performance characteristics of persistent memory are still not quite the same as RAM; in particular, write operations can be slower. Persistent memory may not wear out as quickly as older flash arrays did, but it is still best to avoid rewriting it many times per second, as could happen if it were used as regular memory. And the persistence of persistent memory is a valuable feature to take advantage of in its own right — but, to do so, the relevant software must know which memory ranges in the system are persistent. So persistent memory needs to be treated a bit differently.

The usual approach, at least for a first step, is to separate persistent memory from normal RAM and treat it as if it were a block device. Various drivers implementing this type of access have been circulating for a while now. It appears that this driver from Ross Zwisler will be merged for the 4.1 release. It makes useful reading as it is something close to the simplest possible example of a working block device driver. It takes a region of memory, registers a block device to represent that memory, and implements block read and write operations with memcpy() calls.

In his pull request to merge this driver, Ingo Molnar noted that a number of features that one might expect, including mmap() and execute-in-place, are not supported yet, and that persistent-memory contents would be copied in the page cache. What Ingo had missed is that the DAX patch set providing direct filesystem access to persistent memory was merged for the 4.0 release. If a DAX-supporting filesystem (ext4 now, XFS soon) is built in a persistent memory region, file I/O will avoid the page cache and operations like mmap() will be properly supported.

That said, there are a few things that still will not work quite as expected. One of those is mlock(), which, as Yigal Korman pointed out, may seem a bit strange: data stored in persistent memory is almost by definition locked in memory. As noted by Kirill Shutemov, though, supporting mlock() is not a simple no-op; the required behavior depends on how the memory mapping was set up in the first place. Private mappings still need copy-on-write semantics, for example. A perhaps weirder case is direct I/O: if a region of persistent memory is mapped into a process's address space, the process cannot perform direct I/O between that region and an ordinary file. There may also be problems with direct memory access (DMA) I/O operations, some network transfers, and the vmsplice() system call, among others.

Whither struct page?

In almost all cases, the restrictions with persistent memory come down to the lack of page structures for that memory. A page structure represents a page of physical memory in the system memory map; it contains just about everything the kernel knows about that page and how it is being used. See this article for the gory details of what can be found there. These structures are used with many internal kernel APIs that deal with memory. Persistent memory, lacking corresponding page structures, cannot be used with those APIs; as a result, various things don't work with persistent memory.

Kernel developers have hesitated to add persistent memory to the system memory map because persistent-memory arrays are expected to be large — in the terabyte range. With the usual 4KB page size, 1TB of persistent memory would need 256 million page structures which would occupy several gigabytes of RAM. And they do need to be stored in RAM, rather than in the persistent memory itself; page structures can change frequently, so storing them in memory that is subject to wear is not advisable. Rather than dedicate a large chunk of RAM to the tracking of persistent memory, the development community has, so far, chosen to treat that memory as a separate type of device.

At some point, though, a way to lift the limitations around persistent memory will need to be found. There appear to be two points of view on how that might be done. One says that page structures should never be used with persistent memory. The logical consequence of this view is that the kernel interfaces that currently use page structures need to be changed to use something else — page-frame numbers, for example — that works with both RAM and persistent memory. Dan Williams posted a patch removing struct page usage from the block layer in March. It is not for the faint of heart: just over 100 files are touched to make this change. That led to complaints from some developers that getting rid of struct page usage in APIs would involve a lot of high-risk code churn and remove a useful abstraction while not necessarily providing a lot of benefit.

The alternative would be to bite the bullet and add struct page entries for persistent memory regions. Boaz Harrosh posted a patch to that end in August 2014; it works by treating persistent memory as a range of hot-pluggable memory and allocating the memory-map entries at initialization time. The patch is relatively simple, but it does nothing to address the memory-consumption issue.

In the long run, the solution may take the form of something like a page structure that represents a larger chunk of memory. One obvious possibility is to make a version of struct page that refers to a huge page; that has the advantage of using a size that is understood by the processor's memory-management unit and would integrate well with the transparent huge page mechanism. An alternative would be a variable-size extent structure as is used by more recent filesystems. Either way, the changes required would be huge, so this is not something that is going to happen in the near future.

What will happen is that persistent memory devices will work on Linux as a storage medium for the major filesystems, providing good performance. There will be some rough edges with specific features that do not work, but most users are unlikely to run into them. With 4.1, the kernel will have a level of support for persistent-memory devices to allow that hardware to be put to good use, and to allow users to start figuring out what they actually want to do with that much fast, persistent storage.

Comments (3 posted)

Load tracking in the scheduler

April 15, 2015

This article was contributed by Preeti U Murthy.

The scheduler is an essential part of an operating system, tasked with, among other things, ensuring that processes get their fair share of CPU time. This is not as easy as it may seem initially. While some processes perform critical operations and have to be completed at high priority, others are not time-constrained. The former category of processes expect a bigger share of CPU time than the latter so as to finish as quickly as possible. But how big a share should the scheduler allocate to them?

Another factor that adds to the complexity in scheduling is the CPU topology. Scheduling on uniprocessor systems is simpler than scheduling on the multiprocessor systems that are more commonly found today. The topology of CPUs is only getting more complex with hyperthreads and heterogeneous processors like the big.LITTLE taking the place of symmetric processors. Scheduling a process on the wrong processor can adversely affect its performance. Thus, designing a scheduling algorithm that can keep all processes happy with the computing time allocated to them can be a formidable challenge.

The Linux kernel scheduler has addressed many of these challenges and matured over the years. Today there are different scheduling algorithms (or "scheduling classes") in the kernel to suit processes having different requirements. The Completely Fair Scheduling (CFS) class is designed to suit a majority of today's workloads. The realtime and deadline scheduling classes are designed for latency-sensitive and deadline-driven processes respectively. So we see that the scheduler developers have answered a range of requirements.

The Completely Fair Scheduling class

The CFS class is the class to which most tasks belong. In spite of the robustness of this algorithm, an area that has always had scope for improvement is process-load estimation.

If a CPU is associated with a number C that represents its ability to process tasks (let's call it "capacity"), then the load of a process is a metric that is expressed in units of C, indicating the number of such CPUs required to make satisfactory progress on its job. This number could also be a fraction of C, in which case it indicates that a single such CPU is good enough. The load of a process is important in scheduling because, besides influencing the time that a task spends running on the CPU, it helps to estimate overall CPU load, which is required during load balancing.

The question is how to estimate the load of a process. Should it be set statically or should it be set dynamically at run time based on the behavior of the process? Either way, how should it be calculated? There have been significant efforts at answering these questions in the recent past. As a consequence, the number of load-tracking metrics has grown significantly and load estimation itself has gotten quite complex. This landscape appears quite formidable to reviewers and developers of CFS. The aim of this article is to bring about clarification on this front.

Before proceeding, it is helpful to point out that the granularity of scheduling in Linux is at a thread level and not at a process level. However, the scheduling jargon for thread is "task." Hence throughout this article the term "task" has been used to mean a thread.

Scheduling entities and task groups

The CFS algorithm defines a time duration called the "scheduling period," during which every runnable task on the CPU should run at least once. This way no task gets starved for longer than a scheduling period. The scheduling period is divided among the tasks into time slices, which are the maximum amount of time that a task runs within a scheduling period before it gets preempted. This approach may seem to avoid task starvation at first. However it can lead to an undesirable consequence.

Linux is a multi-user operating system. Consider a scenario where user A spawns ten tasks and user B spawns five. Using the above approach, every task would get ~7% of the available CPU time within a scheduling period. So user A gets 67% and user B gets 33% of the CPU time during their runs. Clearly, if user A continues to spawn more tasks, he can starve user B of even more CPU time. To address this problem, the concept of "group scheduling" was introduced in the scheduler, where, instead of dividing the CPU time among tasks, it is divided among groups of tasks.

In the above example, the tasks spawned by user A belong to one group and those spawned by user B belong to another. The granularity of scheduling is at a group level; when a group is picked to run, its time slice is further divided between its tasks. In the above example, each group gets 50% of the CPU's time and tasks within each group divide this share further among themselves. As a consequence, each task in group A gets 5% of the CPU and each task in group B gets 10% of the CPU. So the group that has more tasks to run gets penalized with less CPU time per task and, more importantly, it is not allowed to starve sibling groups.

Group scheduling is enabled only if CONFIG_FAIR_GROUP_SCHED is set in the kernel configuration. A group of tasks is called a "scheduling entity" in the kernel and is represented by the sched_entity data structure:

    struct sched_entity { 
	struct load_weight load;
	struct sched_entity *parent;
	struct cfs_rq *cfs_rq;
	struct cfs_rq *my_rq;
	struct sched_avg avg;
	/* ... */
    };

Before getting into the details of how this structure is used, it is worth considering how and when groups of tasks are created. This happens under two scenarios:

  1. Users may use the control group ("cgroup") infrastructure to partition system resources between tasks. Tasks belonging to a cgroup are associated with a group in the scheduler (if the scheduler controller is attached to the group).

  2. When a new session is created through the set_sid() system call. All tasks belonging to a specific session also belong to the same scheduling group. This feature is enabled when CONFIG_SCHED_AUTOGROUP is set in the kernel configuration.

Outside of these scenarios, a single task becomes a scheduling entity on its own. A task is represented by the task_struct data structure:

    struct task_struct {
	struct sched_entity se;
	/* ... */
    };

Scheduling is always at the granularity of a sched_entity. That is why every task_struct is associated with a sched_entity data structure. CFS also accommodates nested groups of tasks. Each scheduling entity contains a run queue represented by:

    struct cfs_rq {
	struct load_weight load;
	unsigned long runnable_load_avg;
	unsigned long blocked_load_avg;
	unsigned long tg_load_contrib;
	/* ... */
    };

Each scheduling entity may, in turn, be queued on a parent scheduling entity's run queue. At the lowest level of this hierarchy, the scheduling entity is a task; the scheduler traverses this hierarchy until the end when it has to pick a task to run on the CPU.

The parent run queue on which a scheduling entity is queued is represented by cfs_rq, while the run queue that it owns is represented by my_rq in the sched_entity data structure. The scheduling entity gets picked from the cfs_rq when its turn arrives, and its time slice gets divided among the tasks on my_rq.

Let us now extend the concept of group scheduling to multiprocessor systems. Tasks belonging to a group can be scheduled on any CPU. Therefore it is not sufficient for a group to have a single scheduling entity; instead, every group must have one scheduling entity for each CPU. Tasks belonging to a group must move between the run queues in these per-CPU scheduling entities only, so that the footprint of the task is associated with the group even during task migrations. The data structure that represents scheduling entities of a group across CPUs is:

    struct task_group {
	struct sched_entity **se;
	struct cfs_rq **cfs_rq;
	unsigned long shares;
	atomic_long_t load_avg;
	/* ... */
    };

For every CPU c, a given task_group tg has a sched_entity called se and a run queue cfs_rq associated with it. These are related as follows:

    tg->se[c] = &se;
    tg->cfs_rq[c] = &se->my_rq;

So when a task belonging to tg migrates from CPUx to CPUy, it will be dequeued from tg->cfs_rq[x] and enqueued on tg->cfs_rq[y].

Time slice and task load

The concept of a time slice was introduced above as the amount of time that a task is allowed to run on a CPU within a scheduling period. Any given task's time slice is dependent on its priority and the number of tasks on the run queue. The priority of a task is a number that represents its importance; it is represented in the kernel by a number between zero and 139. The lower the value, the higher the priority. A task that has a stricter time requirement needs to have higher priority than others.

But the priority value by itself is not helpful to the scheduler, which also needs to know the load of the task to estimate its time slice. As mentioned above, the load must be the multiple of the capacity of a standard CPU that is required to make satisfactory progress on the task. Hence this priority number must be mapped to such a value; this is done in the array prio_to_weight[].

A priority number of 120, which is the priority of a normal task, is mapped to a load of 1024, which is the value that the kernel uses to represent the capacity of a single standard CPU. The remaining values in the array are arranged such that the multiplier between two successive entries is ~1.25. This number is chosen such that if the priority number of a task is reduced by one level, its gets 10% higher share of CPU time than otherwise. Similarly if the priority number is increased by one level, the task will get a 10% lower share of the available CPU time.

Let us consider an example to illustrate this. If there are two tasks, A and B, running at a priority of 120, the portion of available CPU time given to each task is calculated as:

    1024/(1024*2) = 0.5

However if the priority of task A is increased by one level to 121, its load becomes:

    (1024/1.25) = ~820

(Recall that higher the number, lesser is the load). Then, task A's portion of the CPU becomes:

    820/(1024+820)) = ~0.45

while task B will get:

    (1024/(1024+820)) = ~0.55

This is a 10% decrease in the CPU time share for Task A.

The load value of a process is stored in the weight field of the load_weight structure (which is, in turn, found in struct sched_entity):

    struct load_weight {
	unsigned long weight;
    };

A run queue (struct cfs_rq) is also characterized by a "weight" value that is the accumulation of weights of all tasks on its run queue.

The time slice can now be calculated as:

    time_slice = (sched_period() * se.load.weight) / cfs_rq.load.weight;

where sched_period() returns the scheduling period as a factor of the number of running tasks on the CPU. We see that the higher the load, the higher the fraction of the scheduling period that the task gets to run on the CPU.

Runtime and task load

We have seen how long a task runs on a CPU when picked, but how does the scheduler decide which task to pick? The tasks are arranged in a red-black tree in increasing order of the amount of time that they have spent running on the CPU, which is accumulated in a variable called vruntime. The lowest vruntime found in the queue is stored in cfs_rq.min_vruntime. When a new task is picked to run, the leftmost node of the red-black tree is chosen since that task has had the least running time on the CPU. Each time a new task forks or a task wakes up, its vruntime is assigned to a value that is the maximum of its last updated value and cfs_rq.min_vruntime. If not for this, its vruntime would be very small as an effect of not having run for a long time (or at all) and would take an unacceptably long time to catch up to the vruntime of its sibling tasks and hence starve them of CPU time.

Every periodic tick, the vruntime of the currently-running task is updated as follows:

    vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);

where delta_exec is the time spent by the task since the last time vruntime was updated, NICE_0_LOAD is the load of a task with normal priority, and curr is the currently-running task. We see that vruntime progresses slowly for tasks of higher priority. It has to, because the time slice for these tasks is large and they cannot be preempted until the time slice is exhausted.

Per-entity load-tracking metrics

The load of a CPU could have simply been the sum of the load of all the scheduling entities running on its run queue. In fact, that was once all there was to it. This approach has a disadvantage, though, in that tasks are associated with load values based only on their priorities. This approach does not take into account the nature of a task, such as whether it is a bursty or a steady task, or whether it is a CPU-intensive or an I/O-bound task. While this does not matter for scheduling within a CPU, it does matter when load balancing across CPUs because it helps estimate the CPU load more accurately. Therefore the per-entity load tracking metric was introduced to estimate the nature of a task numerically. This metric calculates task load as the amount of time that the task was runnable during the time that it was alive. This is kept track of in the sched_avg data structure (stored in the sched_entity structure):

    struct sched_avg {
	u32 runnable_sum, runnable_avg_period;
	unsigned long load_avg_contrib;
    };

Given a task p, if the sched_entity associated with it is se and the sched_avg of se is sa, then:

    sa.load_avg_contrib = (sa.runnable_sum * se.load.weight) / sa.runnable_period;

where runnable_sum is the amount of time that the task was runnable, runnable_period is the period during which the task could have been runnable.

Therefore load_avg_contrib is the fraction of the time that the task was ready to run. Again, the higher the priority, the higher the load.

So tasks showing peaks of activity after long periods of inactivity and tasks that are blocked on disk access (and thus non-runnable) most of the time have a smaller load_avg_contrib than CPU-intensive tasks such as code doing matrix multiplication. In the former case, runnable_sum would be a fraction of the runnable_period. In the latter, both these numbers would be equal (i.e. the task was runnable throughout the time that it was alive), identifying it as a high-load task.

The load on a CPU is the sum of the load_avg_contrib of all the scheduling entities on its run queue; it is accumulated in a field called runnable_load_avg in the cfs_rq data structure. This is roughly a measure of how heavily contended the CPU is. The kernel also tracks the load associated with blocked tasks. When a task gets blocked, its load is accumulated in the blocked_load_avg metric of the cfs_rq structure.

Per-entity load tracking in presence of task groups

Now what about the load_avg_contrib of a scheduling entity, se, when it is a group of tasks? The cfs_rq that the scheduling entity owns accumulates the load of its children in runnable_load_avg as explained above. From there, the parent task group of cfs_rq is first retrieved:

    tg = cfs_rq->tg;

The load contributed by this cfs_rq is added to the load of the task group tg:

    cfs_rq->tg_load_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
    tg->load_avg += cfs_rq->tg_load_contrib;

The load_avg_contrib of the scheduling entity se is now calculated as:

    se->avg.load_avg_contrib =
	  (cfs_rq->tg_load_contrib * tg->shares / tg->load_avg);

Where tg->shares is the maximum allowed load for the task group. This means that the load of a sched_entity should be a fraction of the shares of its parent task group, which is in proportion to the load of its children.

tg->shares can be set by users to indicate the importance of a task group. As is clear now, both the runnable_load_avg and and blocked_load_avg are required to estimate the load contributed by the task group.

There are still drawbacks in load tracking. The load metrics that are currently used are not CPU-frequency invariant. So if the CPU frequency increases, the load of the currently running task may appear smaller than otherwise. This may upset load-balancing decisions. The current load-tracking algorithm also falls apart in a few places when run on big.LITTLE processors. It either underestimates or overestimates the capacity of these processors. There are efforts ongoing to fix these problems. So there is good scope for improving the load-tracking heuristics in the scheduler. Hopefully this writeup has laid out the basics to help ease understanding and reviewing of the ongoing improvements on this front.

Comments (3 posted)

Patches and updates

Kernel trees

Ima Sheep Linux 4.0 released ?
Greg KH Linux 3.19.4 ?
Greg KH Linux 3.14.38 ?
Kamal Mostafa Linux 3.13.11-ckt19 ?
Jiri Slaby Linux 3.12.40 ?
Greg KH Linux 3.10.74 ?

Architecture-specific

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Filesystems and block I/O

Janitorial

Richard Weinberger Remove execution domain support ?

Memory management

Networking

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds