Kernel development
Brief items
Kernel release status
The 4.0 kernel was released on April 12. In the announcement, Linus noted:
Beyond the (incomplete) live-patching mechanism, this release includes the removal of the remap_file_pages() system call, improved persistent memory support, the lazytime mount option, and the kernel address sanitizer.
The 4.1 merge window is now open; see the article below for a summary of what has been merged thus far.
Stable updates: 3.19.4, 3.14.38, and 3.10.74 were released on April 13.
Quotes of the week
A mailing list for year-2038 issues
A new mailing list has been set up for developers working on preventing the year-2038 apocalypse. "Please join if you are interested in discussing these or want to send patches. The intention at the moment much to keep this list explicitly open to newbies, so we will get a lot of incorrect patches there along with other patches. Patches can get sent here for an initial review before they get sent to the real maintainers, or you can put the list on Cc when sending a patch that should be applied and you already know what you are doing."
Obstacles for kdbus
The kdbus patch set, which adds a D-Bus-like messaging facility to the kernel, has been through several rounds of review over the course of the last year. The number of comments has been dropping with each review cycle, and the code seemed like it could be on track for a relatively easy merge into the 4.1 kernel. A closer look, though, reveals that there was some residual unhappiness from the last rounds that was always likely to flare up into active opposition when an attempt to merge kdbus was made. And, indeed, that is exactly what happened when Greg Kroah-Hartman sent a pull request to Linus on April 13.This conversation is in full swing as of this writing, so an attempt to fully summarize it would be futile. In brief, though, the complaints take a number of forms. There is unhappiness with the performance of kdbus — a bit surprising, since performance is one of the motivating factors behind this development. There are a number of security-related concerns, especially around how the bus collects and transmits metadata about connected processes. Kdbus is still said to not play well with containers. Some developers find the complexity daunting. And so on.
The core of the disagreement, arguably, can be found in this message from Greg. There, he agreed that
the design was "unfortunate
" (though he later retracted that
statement), and said that kdbus needed to be taken in its current form even
if it is not ideal:
Again, just like getting horrid hardware to work properly, sometimes we have to write odd code. Or having to implement a network protocol that doesn't seem to be designed "perfectly", yet is used by a few hundred million systems so we have to remain compatible. This is all that we are doing here for stuff like this.
Remember, this is called kDBUS, not kGENERICIPC, no matter how much we would have liked that to happen from a kernel standpoint. :)
It is probably fair to say that those who are opposed to kdbus in its current form would rather that it were, indeed, kGENERICIPC. They seem to feel that it should be able to support what is needed to implement D-Bus efficiently, but the D-Bus-specific parts, perhaps, should go into user space. After all, there are only so many interprocess communication mechanisms that can be merged into the kernel; the one that goes in, many developers think, should be free of known flaws and should be able to do more than reimplement the D-Bus protocol.
It is hard to say at this point how this discussion will play out or what Linus will decide to do in the end. The chances are good, though, that enough high-profile developers have expressed opposition to derail the merging of kdbus in this development cycle. Complete consensus is not always required to get code into the kernel, but getting code merged when there is serious opposition is still quite hard. This story, it seems, may go on for a while yet.
Kernel development news
4.1 Merge window, part 1
Linus started merging changes for the 4.1 development cycle on April 13; as of this writing, a total of 3,643 non-merge changesets have been pulled into the mainline. In other words, things are just getting started. Still, some interesting changes have found their way in, though many of them will be of interest mainly to kernel developers.Some of the more interesting, user-visible changes merged so far include:
- Basic support for live kernel patching has been added to the S/390
architecture. What has been removed from S/390, instead, is support
for the 31-bit mode,
once needed to get past that pesky 16MB memory limit.
- KVM virtualization on the MIPS architecture has gained support for the
floating-point unit and the SIMD mode. KVM on ARM now
supports interrupt injection via irqfd().
- Load tracking in the CPU scheduler has been reworked to make the
calculated process loads be independent of CPU speeds. That will enable
better load-balancing decisions in the presence of frequency scaling
and improve support for asymmetric systems like big.LITTLE where
different types of CPUs are found in the same package.
- New hardware support includes:
- I2C:
Digicolor I2C controllers,
Ingenic JZ4780 I2C controllers, and
Broadcom XLP9xx/XLP5xx I2C controllers.
- IIO:
Capella CM3323 color light sensors and
Measurement Specialties MS5611 pressure sensors.
- Input:
Broadcom keypad controllers,
MAXIM MAX77843 haptic controllers,
iPAQ h3100/h3600/h3700 buttons,
Semtech SX8654 I2C touchscreens,
Qualcomm PM8941 performance management IC (PMIC) power keys,
Broadcom IPROC touchscreens, and
ChipOne icn8318 I2C touchscreen controllers.
- Miscellaneous:
Nuvoton NCT7904 hardware-monitoring chips,
Broadcom IPROC SD/MMC and PCIe controllers,
Dialog DA9150 charger and fuel-gauge controllers,
X-Powers AXP288 fuel gauges,
Nokia modems implementing the CMT speech protocol,
Silicon Motion SM750 framebuffers,
Ilitek ILI9163 LCD controllers, and
Freescale Management Complex buses.
- Multi-function device:
Wolfson Microelectronics WM8280/WM8281 controllers,
MediaTek MT6397 PMICs,
Maxim Semiconductor MAX77843 PMICs,
Intel Quark controllers, and
Skyworks Solutions SKY81452 controllers.
- Pin control:
Marvell Armada 39x pin controllers,
NVIDIA Tegra210 pinmux controllers,
Broadcom Cygnus IOMUX controllers,
Mediatek mt8135 pin controllers,
AMD platform pin controllers, and
Intel Sunrisepoint pin controllers.
- USB: AltusMetrum ChaosKey random-number generators, TI dm816x USB PHYs, and Allwinner sun9i USB PHYs.
- I2C:
Digicolor I2C controllers,
Ingenic JZ4780 I2C controllers, and
Broadcom XLP9xx/XLP5xx I2C controllers.
Changes visible to kernel developers include:
- The kernel self-test code has gained an install target that
installs test binaries into a special directory in the kernel tree.
There is also a new set of timer self tests in the test suite.
- The new efi=debug boot option causes extra information to be
printed at boot time on systems with EFI firmware.
- The long-deprecated IRQF_DISABLED interrupt flag has finally
been removed from the kernel.
- The "tracefs" virtual filesystem has been added. Tracefs contains the
usual set of directories and files to control tracing, but it has the
advantage that it can be mounted independently of debugfs. It thus
allows system administrators to enable tracing without bringing in the
other, potentially dangerous knobs found in debugfs. By default,
tracefs will be mounted in the usual place
(/sys/kernel/debug/tracing) when debugfs is mounted.
- The new TRACE_DEFINE_ENUM() macro can be used to output
values from enum types in tracepoints.
- As usual, the perf tool has seen a long list of additions and
improvements; see the
top-level merge commit for details. Some of the more significant
features include the ability to attach BPF programs to kernel probes,
support for Intel's upcoming processor
trace functionality ("
a hardware tracer on steroids
"), support for Intel's upcoming cache quality-of-service monitoring feature, and more. - The I2C subsystem can now function in "slave" mode, responding to a master controller elsewhere on the bus; see Documentation/i2c/slave-interface for details. The I2C layer has also gained a new quirk mechanism that can be used to describe the limitations of specific controllers.
Unless something surprising happens, the merge window can be expected to stay open through April 27. There will likely be a lull in the middle while Linus travels, but that has tended to not slow things down too much in the past. As usual, we will continue to track and report on the significant changes merged for the 4.1 development cycle.
Persistent memory support progress
Persistent memory (or non-volatile memory) has a number of nice features: it doesn't lose its contents when power is cycled, it is fast, and it is expected to be available in large quantities. Enabling proper support for this memory in the kernel has been a topic of discussion and development for some years; it was, predictably, an important topic at this year's Linux Storage, Filesystem, and Memory Management Summit. The 4.1 kernel will contain a new driver intended to improve support for persistent memory, but there is still a fair amount of work to be done.At a first glance, persistent memory looks like normal RAM to the processor, so it might be tempting to simply use it that way. There are, though, some good reasons for not doing that. The performance characteristics of persistent memory are still not quite the same as RAM; in particular, write operations can be slower. Persistent memory may not wear out as quickly as older flash arrays did, but it is still best to avoid rewriting it many times per second, as could happen if it were used as regular memory. And the persistence of persistent memory is a valuable feature to take advantage of in its own right — but, to do so, the relevant software must know which memory ranges in the system are persistent. So persistent memory needs to be treated a bit differently.
The usual approach, at least for a first step, is to separate persistent memory from normal RAM and treat it as if it were a block device. Various drivers implementing this type of access have been circulating for a while now. It appears that this driver from Ross Zwisler will be merged for the 4.1 release. It makes useful reading as it is something close to the simplest possible example of a working block device driver. It takes a region of memory, registers a block device to represent that memory, and implements block read and write operations with memcpy() calls.
In his pull request to merge this driver, Ingo Molnar noted that a number of features that one might expect, including mmap() and execute-in-place, are not supported yet, and that persistent-memory contents would be copied in the page cache. What Ingo had missed is that the DAX patch set providing direct filesystem access to persistent memory was merged for the 4.0 release. If a DAX-supporting filesystem (ext4 now, XFS soon) is built in a persistent memory region, file I/O will avoid the page cache and operations like mmap() will be properly supported.
That said, there are a few things that still will not work quite as expected. One of those is mlock(), which, as Yigal Korman pointed out, may seem a bit strange: data stored in persistent memory is almost by definition locked in memory. As noted by Kirill Shutemov, though, supporting mlock() is not a simple no-op; the required behavior depends on how the memory mapping was set up in the first place. Private mappings still need copy-on-write semantics, for example. A perhaps weirder case is direct I/O: if a region of persistent memory is mapped into a process's address space, the process cannot perform direct I/O between that region and an ordinary file. There may also be problems with direct memory access (DMA) I/O operations, some network transfers, and the vmsplice() system call, among others.
Whither struct page?
In almost all cases, the restrictions with persistent memory come down to the lack of page structures for that memory. A page structure represents a page of physical memory in the system memory map; it contains just about everything the kernel knows about that page and how it is being used. See this article for the gory details of what can be found there. These structures are used with many internal kernel APIs that deal with memory. Persistent memory, lacking corresponding page structures, cannot be used with those APIs; as a result, various things don't work with persistent memory.
Kernel developers have hesitated to add persistent memory to the system memory map because persistent-memory arrays are expected to be large — in the terabyte range. With the usual 4KB page size, 1TB of persistent memory would need 256 million page structures which would occupy several gigabytes of RAM. And they do need to be stored in RAM, rather than in the persistent memory itself; page structures can change frequently, so storing them in memory that is subject to wear is not advisable. Rather than dedicate a large chunk of RAM to the tracking of persistent memory, the development community has, so far, chosen to treat that memory as a separate type of device.
At some point, though, a way to lift the limitations around persistent memory will need to be found. There appear to be two points of view on how that might be done. One says that page structures should never be used with persistent memory. The logical consequence of this view is that the kernel interfaces that currently use page structures need to be changed to use something else — page-frame numbers, for example — that works with both RAM and persistent memory. Dan Williams posted a patch removing struct page usage from the block layer in March. It is not for the faint of heart: just over 100 files are touched to make this change. That led to complaints from some developers that getting rid of struct page usage in APIs would involve a lot of high-risk code churn and remove a useful abstraction while not necessarily providing a lot of benefit.
The alternative would be to bite the bullet and add struct page entries for persistent memory regions. Boaz Harrosh posted a patch to that end in August 2014; it works by treating persistent memory as a range of hot-pluggable memory and allocating the memory-map entries at initialization time. The patch is relatively simple, but it does nothing to address the memory-consumption issue.
In the long run, the solution may take the form of something like a page structure that represents a larger chunk of memory. One obvious possibility is to make a version of struct page that refers to a huge page; that has the advantage of using a size that is understood by the processor's memory-management unit and would integrate well with the transparent huge page mechanism. An alternative would be a variable-size extent structure as is used by more recent filesystems. Either way, the changes required would be huge, so this is not something that is going to happen in the near future.
What will happen is that persistent memory devices will work on Linux as a storage medium for the major filesystems, providing good performance. There will be some rough edges with specific features that do not work, but most users are unlikely to run into them. With 4.1, the kernel will have a level of support for persistent-memory devices to allow that hardware to be put to good use, and to allow users to start figuring out what they actually want to do with that much fast, persistent storage.
Load tracking in the scheduler
The scheduler is an essential part of an operating system, tasked with, among other things, ensuring that processes get their fair share of CPU time. This is not as easy as it may seem initially. While some processes perform critical operations and have to be completed at high priority, others are not time-constrained. The former category of processes expect a bigger share of CPU time than the latter so as to finish as quickly as possible. But how big a share should the scheduler allocate to them?Another factor that adds to the complexity in scheduling is the CPU topology. Scheduling on uniprocessor systems is simpler than scheduling on the multiprocessor systems that are more commonly found today. The topology of CPUs is only getting more complex with hyperthreads and heterogeneous processors like the big.LITTLE taking the place of symmetric processors. Scheduling a process on the wrong processor can adversely affect its performance. Thus, designing a scheduling algorithm that can keep all processes happy with the computing time allocated to them can be a formidable challenge.
The Linux kernel scheduler has addressed many of these challenges and matured over the years. Today there are different scheduling algorithms (or "scheduling classes") in the kernel to suit processes having different requirements. The Completely Fair Scheduling (CFS) class is designed to suit a majority of today's workloads. The realtime and deadline scheduling classes are designed for latency-sensitive and deadline-driven processes respectively. So we see that the scheduler developers have answered a range of requirements.
The Completely Fair Scheduling class
The CFS class is the class to which most tasks belong. In spite of the robustness of this algorithm, an area that has always had scope for improvement is process-load estimation.
If a CPU is associated with a number C that represents its ability to process tasks (let's call it "capacity"), then the load of a process is a metric that is expressed in units of C, indicating the number of such CPUs required to make satisfactory progress on its job. This number could also be a fraction of C, in which case it indicates that a single such CPU is good enough. The load of a process is important in scheduling because, besides influencing the time that a task spends running on the CPU, it helps to estimate overall CPU load, which is required during load balancing.
The question is how to estimate the load of a process. Should it be set statically or should it be set dynamically at run time based on the behavior of the process? Either way, how should it be calculated? There have been significant efforts at answering these questions in the recent past. As a consequence, the number of load-tracking metrics has grown significantly and load estimation itself has gotten quite complex. This landscape appears quite formidable to reviewers and developers of CFS. The aim of this article is to bring about clarification on this front.
Before proceeding, it is helpful to point out that the granularity of scheduling in Linux is at a thread level and not at a process level. However, the scheduling jargon for thread is "task." Hence throughout this article the term "task" has been used to mean a thread.
Scheduling entities and task groups
The CFS algorithm defines a time duration called the "scheduling period," during which every runnable task on the CPU should run at least once. This way no task gets starved for longer than a scheduling period. The scheduling period is divided among the tasks into time slices, which are the maximum amount of time that a task runs within a scheduling period before it gets preempted. This approach may seem to avoid task starvation at first. However it can lead to an undesirable consequence.
Linux is a multi-user operating system. Consider a scenario where user A spawns ten tasks and user B spawns five. Using the above approach, every task would get ~7% of the available CPU time within a scheduling period. So user A gets 67% and user B gets 33% of the CPU time during their runs. Clearly, if user A continues to spawn more tasks, he can starve user B of even more CPU time. To address this problem, the concept of "group scheduling" was introduced in the scheduler, where, instead of dividing the CPU time among tasks, it is divided among groups of tasks.
In the above example, the tasks spawned by user A belong to one group and those spawned by user B belong to another. The granularity of scheduling is at a group level; when a group is picked to run, its time slice is further divided between its tasks. In the above example, each group gets 50% of the CPU's time and tasks within each group divide this share further among themselves. As a consequence, each task in group A gets 5% of the CPU and each task in group B gets 10% of the CPU. So the group that has more tasks to run gets penalized with less CPU time per task and, more importantly, it is not allowed to starve sibling groups.
Group scheduling is enabled only if CONFIG_FAIR_GROUP_SCHED is set in the kernel configuration. A group of tasks is called a "scheduling entity" in the kernel and is represented by the sched_entity data structure:
struct sched_entity {
struct load_weight load;
struct sched_entity *parent;
struct cfs_rq *cfs_rq;
struct cfs_rq *my_rq;
struct sched_avg avg;
/* ... */
};
Before getting into the details of how this structure is used, it is worth considering how and when groups of tasks are created. This happens under two scenarios:
- Users may use the control group ("cgroup") infrastructure to partition
system resources between tasks. Tasks belonging to a cgroup are
associated with a group in the scheduler (if the scheduler controller
is attached to the group).
- When a new session is created through the set_sid() system call. All tasks belonging to a specific session also belong to the same scheduling group. This feature is enabled when CONFIG_SCHED_AUTOGROUP is set in the kernel configuration.
Outside of these scenarios, a single task becomes a scheduling entity on its own. A task is represented by the task_struct data structure:
struct task_struct {
struct sched_entity se;
/* ... */
};
Scheduling is always at the granularity of a sched_entity. That is why every task_struct is associated with a sched_entity data structure. CFS also accommodates nested groups of tasks. Each scheduling entity contains a run queue represented by:
struct cfs_rq {
struct load_weight load;
unsigned long runnable_load_avg;
unsigned long blocked_load_avg;
unsigned long tg_load_contrib;
/* ... */
};
Each scheduling entity may, in turn, be queued on a parent scheduling entity's run queue. At the lowest level of this hierarchy, the scheduling entity is a task; the scheduler traverses this hierarchy until the end when it has to pick a task to run on the CPU.
The parent run queue on which a scheduling entity is queued is represented by cfs_rq, while the run queue that it owns is represented by my_rq in the sched_entity data structure. The scheduling entity gets picked from the cfs_rq when its turn arrives, and its time slice gets divided among the tasks on my_rq.
Let us now extend the concept of group scheduling to multiprocessor systems. Tasks belonging to a group can be scheduled on any CPU. Therefore it is not sufficient for a group to have a single scheduling entity; instead, every group must have one scheduling entity for each CPU. Tasks belonging to a group must move between the run queues in these per-CPU scheduling entities only, so that the footprint of the task is associated with the group even during task migrations. The data structure that represents scheduling entities of a group across CPUs is:
struct task_group {
struct sched_entity **se;
struct cfs_rq **cfs_rq;
unsigned long shares;
atomic_long_t load_avg;
/* ... */
};
For every CPU c, a given task_group tg has a sched_entity called se and a run queue cfs_rq associated with it. These are related as follows:
tg->se[c] = &se;
tg->cfs_rq[c] = &se->my_rq;
So when a task belonging to tg migrates from CPUx to CPUy, it will be dequeued from tg->cfs_rq[x] and enqueued on tg->cfs_rq[y].
Time slice and task load
The concept of a time slice was introduced above as the amount of time that a task is allowed to run on a CPU within a scheduling period. Any given task's time slice is dependent on its priority and the number of tasks on the run queue. The priority of a task is a number that represents its importance; it is represented in the kernel by a number between zero and 139. The lower the value, the higher the priority. A task that has a stricter time requirement needs to have higher priority than others.
But the priority value by itself is not helpful to the scheduler, which also needs to know the load of the task to estimate its time slice. As mentioned above, the load must be the multiple of the capacity of a standard CPU that is required to make satisfactory progress on the task. Hence this priority number must be mapped to such a value; this is done in the array prio_to_weight[].
A priority number of 120, which is the priority of a normal task, is mapped to a load of 1024, which is the value that the kernel uses to represent the capacity of a single standard CPU. The remaining values in the array are arranged such that the multiplier between two successive entries is ~1.25. This number is chosen such that if the priority number of a task is reduced by one level, its gets 10% higher share of CPU time than otherwise. Similarly if the priority number is increased by one level, the task will get a 10% lower share of the available CPU time.
Let us consider an example to illustrate this. If there are two tasks, A and B, running at a priority of 120, the portion of available CPU time given to each task is calculated as:
1024/(1024*2) = 0.5
However if the priority of task A is increased by one level to 121, its load becomes:
(1024/1.25) = ~820
(Recall that higher the number, lesser is the load). Then, task A's portion of the CPU becomes:
820/(1024+820)) = ~0.45
while task B will get:
(1024/(1024+820)) = ~0.55
This is a 10% decrease in the CPU time share for Task A.
The load value of a process is stored in the weight field of the load_weight structure (which is, in turn, found in struct sched_entity):
struct load_weight {
unsigned long weight;
};
A run queue (struct cfs_rq) is also characterized by a "weight" value that is the accumulation of weights of all tasks on its run queue.
The time slice can now be calculated as:
time_slice = (sched_period() * se.load.weight) / cfs_rq.load.weight;
where sched_period() returns the scheduling period as a factor of the number of running tasks on the CPU. We see that the higher the load, the higher the fraction of the scheduling period that the task gets to run on the CPU.
Runtime and task load
We have seen how long a task runs on a CPU when picked, but how does the scheduler decide which task to pick? The tasks are arranged in a red-black tree in increasing order of the amount of time that they have spent running on the CPU, which is accumulated in a variable called vruntime. The lowest vruntime found in the queue is stored in cfs_rq.min_vruntime. When a new task is picked to run, the leftmost node of the red-black tree is chosen since that task has had the least running time on the CPU. Each time a new task forks or a task wakes up, its vruntime is assigned to a value that is the maximum of its last updated value and cfs_rq.min_vruntime. If not for this, its vruntime would be very small as an effect of not having run for a long time (or at all) and would take an unacceptably long time to catch up to the vruntime of its sibling tasks and hence starve them of CPU time.
Every periodic tick, the vruntime of the currently-running task is updated as follows:
vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
where delta_exec is the time spent by the task since the last time vruntime was updated, NICE_0_LOAD is the load of a task with normal priority, and curr is the currently-running task. We see that vruntime progresses slowly for tasks of higher priority. It has to, because the time slice for these tasks is large and they cannot be preempted until the time slice is exhausted.
Per-entity load-tracking metrics
The load of a CPU could have simply been the sum of the load of all the scheduling entities running on its run queue. In fact, that was once all there was to it. This approach has a disadvantage, though, in that tasks are associated with load values based only on their priorities. This approach does not take into account the nature of a task, such as whether it is a bursty or a steady task, or whether it is a CPU-intensive or an I/O-bound task. While this does not matter for scheduling within a CPU, it does matter when load balancing across CPUs because it helps estimate the CPU load more accurately. Therefore the per-entity load tracking metric was introduced to estimate the nature of a task numerically. This metric calculates task load as the amount of time that the task was runnable during the time that it was alive. This is kept track of in the sched_avg data structure (stored in the sched_entity structure):
struct sched_avg {
u32 runnable_sum, runnable_avg_period;
unsigned long load_avg_contrib;
};
Given a task p, if the sched_entity associated with it is se and the sched_avg of se is sa, then:
sa.load_avg_contrib = (sa.runnable_sum * se.load.weight) / sa.runnable_period;
where runnable_sum is the amount of time that the task was runnable, runnable_period is the period during which the task could have been runnable.
Therefore load_avg_contrib is the fraction of the time that the task was ready to run. Again, the higher the priority, the higher the load.
So tasks showing peaks of activity after long periods of inactivity and tasks that are blocked on disk access (and thus non-runnable) most of the time have a smaller load_avg_contrib than CPU-intensive tasks such as code doing matrix multiplication. In the former case, runnable_sum would be a fraction of the runnable_period. In the latter, both these numbers would be equal (i.e. the task was runnable throughout the time that it was alive), identifying it as a high-load task.
The load on a CPU is the sum of the load_avg_contrib of all the scheduling entities on its run queue; it is accumulated in a field called runnable_load_avg in the cfs_rq data structure. This is roughly a measure of how heavily contended the CPU is. The kernel also tracks the load associated with blocked tasks. When a task gets blocked, its load is accumulated in the blocked_load_avg metric of the cfs_rq structure.
Per-entity load tracking in presence of task groups
Now what about the load_avg_contrib of a scheduling entity, se, when it is a group of tasks? The cfs_rq that the scheduling entity owns accumulates the load of its children in runnable_load_avg as explained above. From there, the parent task group of cfs_rq is first retrieved:
tg = cfs_rq->tg;
The load contributed by this cfs_rq is added to the load of the task group tg:
cfs_rq->tg_load_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
tg->load_avg += cfs_rq->tg_load_contrib;
The load_avg_contrib of the scheduling entity se is now calculated as:
se->avg.load_avg_contrib =
(cfs_rq->tg_load_contrib * tg->shares / tg->load_avg);
Where tg->shares is the maximum allowed load for the task group. This means that the load of a sched_entity should be a fraction of the shares of its parent task group, which is in proportion to the load of its children.
tg->shares can be set by users to indicate the importance of a task group. As is clear now, both the runnable_load_avg and and blocked_load_avg are required to estimate the load contributed by the task group.
There are still drawbacks in load tracking. The load metrics that are currently used are not CPU-frequency invariant. So if the CPU frequency increases, the load of the currently running task may appear smaller than otherwise. This may upset load-balancing decisions. The current load-tracking algorithm also falls apart in a few places when run on big.LITTLE processors. It either underestimates or overestimates the capacity of these processors. There are efforts ongoing to fix these problems. So there is good scope for improving the load-tracking heuristics in the scheduler. Hopefully this writeup has laid out the basics to help ease understanding and reviewing of the ongoing improvements on this front.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Filesystems and block I/O
Janitorial
Memory management
Networking
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page:
Distributions>>
