LWN.net Logo

Kernel development

Brief items

Kernel release status

The 2.6.35 kernel was released on August 2. The relatively long announcement includes some thoughts on this development cycle (he's happy with how it went) and some concerns about the state of linux-next heading into 2.6.36. Some headline features in 2.6.35 include receive packet steering and receive flow steering, memory compaction, direct I/O support for Btrfs, and the usual pile of new drivers. As always, lots of details can be found on the excellent KernelNewbies 2.6.35 page.

Stable updates: the 2.6.27.49, 2.6.32.17, 2.6.33.7, and 2.6.34.2 stable updates are out; they all have the usual pile of important fixes. 2.6.33.7 is the final 2.6.33 update, so users should be thinking of moving on; Greg suggests 2.6.35, since 2.6.34 "is not going to be maintained for much longer."

Comments (none posted)

Quotes of the week

The Linux 2.6.35 release is also noteworthy in that it is the first Linux kernel release for which Torvalds specifically attempted to limit the number of changes made during its development to help limit the growing size and complexity of kernel updates.
-- Sean Michael Kerner, LinuxPlanet

So, in between snapshotting the image and actually hibernating, we have two parallel universes, one frozen in the image, the other writing that out to swap: with the danger that the latter (which is about to die) will introduce fatal inconsistencies in the former by placing pages in swap locations innocently reallocated from it. (Excuse me while I go write the movie script.)
-- Hugh Dickins

I regret putting the ordering into the original barrier code...it definitely did help reiserfs back in the day but it stinks of magic and voodoo. When it goes wrong, we'll only notice .000000001% of the time, and even then it'll only be when people report some random corruption which we'll blindly blame on either axboe or the drive.
-- Chris Mason

Comments (none posted)

AppArmor set to be merged for 2.6.36

It has been more than four years since LWN first reported on the AppArmor security module and the opposition to its addition to the mainline. Over that time, there has been much discussion of pathname-based security, the value of having multiple security modules, and more; meanwhile, AppArmor has mostly faded from view. Canonical developer John Johansen has picked up this module, though, and has been working toward its inclusion. The latest "what's coming" post from security maintainer James Morris (click below) now shows that AppArmor has been queued for the next merge window (the "Yama" security module from Canonical is also queued). Unless some last-minute opposition turns up, this should be the end of a long-running story.

Full Story (comments: 33)

Yama: not so fast

By Jonathan Corbet
August 3, 2010
James Morris's 2.6.36 security subsystem preview included, among other things, the Yama security module, which contains a number of security-related changes from Canonical. James later updated his posting, saying:

I'm going to revert the Yama stuff for 2.6.36 -- Christoph has nacked it to me off-list.

An off-list shootdown was always going to raise eyebrows, but Christoph (Hellwig) was quick to make his concerns public. He said:

As mentioned a few times during the past discussion moving broken code into a LSM doesn't magically fix it. In fact YAMA is not any kind of (semi-)coherent security policy like Selinux, smack or similar but just a random set of hacks that you didn't get past the subsystem maintainers.

Christoph, it seems, would rather that these changes went directly into the subsystems affected, rather than being swept into a separate security module. The problem, of course, is that's just how Yama author Kees Cook had started; he was told in no uncertain terms that putting his security-related changes directly into the VFS and ptrace() code was unwelcome. The advice at that time was that his changes should be put into a security module where the rest of the world could ignore them. Even Christoph suggested that approach back in June.

The "not a coherent security model" objection was heard from some other directions as well. According to Valdis Kletnieks:

In other words - if you want to be an LSM, you need to be full-featured enough to cover all the bases, not just a few cherry-picked ones.

Some developers, it seems, would rather not see a set of security-related tweaks gathered together into a module without an overall policy behind it. There have also been the usual claims that everything done by Yama can also be accomplished in SELinux, though Kees seems to disagree.

This rejection leaves Kees in the difficult position of trying to upstream his changes (something his employer has been criticized for not doing) but having no apparent way to actually get them merged. But it may be that all that's really required is a bit of patience. New security modules always seem to bring opposition out of the woodwork, but, with some persistence, they tend to get merged in the end.

Comments (31 posted)

Whack-a-droid

By Jonathan Corbet
August 3, 2010
Paul McKenney, it seems, is now working with the Linaro project, an assignment which has given him a new interest in power management. He has decided to start off with a bang by attempting to summarize the suspend blocker discussion with the goal of really understanding what Android's requirements are. Needless to say, he has kicked off a new, lengthy discussion which has cast the players' positions in a new light.

To oversimplify: one side seems to believe in addressing power management (and poorly-behaving applications in particular) by shining a light on the problems and fixing them (or applying social pressure to get them fixed) one at a time. This is the approach taken by developers like Arjan van de Ven, who have developed and used tools like PowerTop to great effect. The other side pushes for a more general solution; Paul describes the difference in view this way:

I agree that much progress has been made over the past four years. My laptop's battery life has increased substantially, with roughly half of the increase due to software changes. Taken over all the laptops and PCs out there, this indeed adds up to substantial and valuable savings.

So, yes, you have done quite well.

However, your reliance on application-by-application fixes, while good up to a point, is worrisome longer term. The reason for my concern is that computing history has not been kind to those who fail to vigorously automate. The Android guys have offered up one way of automating power efficiency. There might be better ways, but their approach does at the very least deserve a fair hearing -- and no one who read the recent LKML discussion of their approach could possibly mistake it for anything resembling a fair hearing.

So far, the conversation has not yet really returned to the Android approach; it has stayed more focused on the requirements and whether the "whack-a-mole" approach to power management is sufficient in the long term. Chances are good that Paul will be sending out an updated version of his requirements description sometime in the near future. Then, perhaps, there can be a calm discussion of how those requirements might best be satisfied.

Comments (16 posted)

Kernel development news

2.6.36 merge window part 1

By Jonathan Corbet
August 4, 2010
The 2.6.36 merge window got off to a rather slow start; Linus, perhaps, has been spending too much time with his coffee maker and not enough at the keyboard. Things got rolling, though, on the afternoon of August 4; as of this writing, about 2600 patches have been merged into the mainline. Here is a summary of what has been seen so far.

User-visible changes include:

  • The 9p filesystem has gained support for extended attributes and a new, Linux-specific variant of the 9p2000 protocol called 9p2000.L.

  • The CIFS filesystem can now make use of FS-Cache to keep local copies of files for performance.

  • The TOMOYO Linux security module has a new "interactive enforcing mode," allowing an administrator to permit policy violations at run time. It is intended to help when installing application updates (which require policy changes) on running production systems.

  • At long last, the AppArmor security module has been merged.

  • Rafael Wysocki's wakeup_count mechanism has been merged. This feature is intended to make it possible to suspend the system without having to worry about races with wakeup events; it thus hopes to solve part of the problem addressed by suspend blockers.

  • Support for the LIRC infrared controller API has been merged, along with a long list of LIRC drivers. LIRC is one of the larger pieces of out-of-tree code which is still shipped by many distributors, so this merge should help bring distributor kernels that much closer to the mainline.

  • New drivers:

    • Boards and systems: Bluewater Systems Snapper 9260/9G20 modules, HP t5325 thin client systems, NXP Semiconductor LPC32xx-based systems, and Eukrea CPUIMX51 and CPUIMX35 modules.

    • Input: Atmel QT602240 I2C touchscreens, Analog Devices ADXL34x three-axis digital accelerometers, and Cypress cy8ctmg110 touchscreens.

    • Miscellaneous: ARM Ltd. character LCD displays, HTC "Dream" (G1 handset) GPIO lines, and Intel "intelligent power sharing" controllers.

    • Networking: Freescale Flexcan CAN controllers, ESD USB/2 CAN/USB interfaces, Chelsio T4-based gigabit and 10Gb Ethernet adapters with PCI-E SR-IOV virtual functions, and CAIF protocol drivers on slave SPI interfaces

    • Video4Linux: i.MX27/i.MX25 camera sensor interfaces, SunPlus SPCA1528-based USB cameras, SQ Technologies SQ930X-based USB cameras, Windows Media Center Edition eHome infrared transceivers, and Freescale VIU video engines.

Changes visible to kernel developers include:

  • The ARM architecture has lost support for the "discontigmem" memory model; it is expected that everybody is using sparsemem at this point. ARM has also switched from the old bootmem allocator to memblock (formerly LMB) and added support for the -fstack-protector GCC feature.

  • The DMAPI hooks have been dropped from the XFS filesystem, indicating that the XFS developers do not ever expect to get hierarchical storage management at this level merged.

  • The PM_QOS API has changed again; quality-of-service requests are now added with:

        void pm_qos_add_request(struct pm_qos_request_list *request,
    			    int pm_qos_class, s32 value);
    

    The biggest change is that the request structure must now be allocated by the caller; this shifts a bit of work but, importantly, allows this function to be called in atomic context.

The merge window can be expected to remain open until around August 15, unless Linus decides to surprise developers by making it shorter.

Comments (8 posted)

Data temperature in Btrfs

By Jonathan Corbet
August 3, 2010
Linux, like most other operating systems, has long tried to keep frequently-accessed data in main memory. The cost of fetching a page from disk is high, so every I/O operation which can be eliminated by keeping data in a faster location yields a significant performance improvement. Recently, there has been an increasing level of interest in adding more levels of cache; the result has been patches like bcache, Cleancache/Frontswap, zcache, and more. The latest contribution in this area is a set of patches aimed at enabling multi-level caching within the Btrfs filesystem.

The patches, posted by Ben Chociej, are not a complete solution at this time. This code, instead, is meant to add the infrastructure needed to determine which data within a filesystem is "hot"; other work, to be done in the near future, will then be able to make use of this information to determine which data would benefit from being hosted on faster media - on a solid-state storage device, perhaps. The copy-on-write nature of Btrfs, along with its built-in volume management code, should make the implementation of this functionality relatively easy. We should find out in "a few weeks," when the first of these patches is promised; meanwhile, there is some interesting instrumentation work to look at.

These patches work by hooking into the small number of places in Btrfs where new I/O operations are initiated. Each of these places gets a call to:

    void btrfs_update_freqs(struct inode *inode, u64 start, u64 len, 
			    int create);

Where inode is the inode for the file being operated on, start is the beginning offset (in bytes), len is the number of bytes being transferred, and the mildly confusing create parameter is nonzero iff the operation is a write. This function maintains two red-black trees; the first, which is filesystem-wide, tracks the "hottest" inodes. For each inode, there is another tree tracking the hottest parts of the file. For each tree, the btrfs_update_freqs() call will update the stored parameters with the passed-in values.

The code tracks six independent parameters: the number of reads, a running average of the time between reads, and the time since the last read - along with the same information for writes. In the end, that information gets passed to a piece of deep magic called btrfs_get_temp() which boils those numbers down to a single "temperature" value. Your editor would love to simply provide the formula which is used, but it's not that simple - there's a lot of trickery with magic constants and various provisions against integer overflow problems. For those who would like to figure it out for themselves, here's the source for btrfs_get_temp().

There are three new ioctl() operations added by the patch set. To get the heat information for a specific file, BTRFS_IOC_GET_HEAT_INFO may be used. There are also BTRFS_IOC_GET_HEAT_OPTS and BTRFS_IOC_SET_HEAT_OPTS for querying and setting the state of heat tracking and (someday) migration of data based on the measured temperature data. A debugfs interface is also provided for those who would like to look at all of the data collected by this instrumentation.

There has not been a huge response to this patch set so far. The biggest complaint should be somewhat predictable: this capability looks like something which would be useful for many filesystems, so implementing it just for Btrfs looks like working at the wrong level. The virtual filesystem (VFS) layer is well placed to track I/O operations and could manage this kind of data collection. The VFS could also, perhaps, use this data to make better decisions on which pages to keep in the page cache. But, as long as the data is locked up within Btrfs, the VFS layer cannot use it, and it cannot be used to benefit any other filesystems.

The response to this complaint is that only Btrfs has the multiple device support needed to make use of this data. Dave Chinner finds that justification unconvincing, saying:

Why does it even need multiple devices in the filesystem? All the filesystem needs to know is the relative speed of regions of it's block address space and to be provided allocation hints.

There is often a degree of tension between those who would add features to specific filesystems and those who would rather see that functionality done at the VFS level. As a general rule, widely-useful features benefit from being done in the VFS, where they are more widely used and more closely scrutinized. But, often, an individual filesystem implementation can serve as a useful proof of concept and a place where important lessons are learned. All of which is to say that "hot data tracking" will likely make it into the kernel at some point, but it's not clear whether what is merged will resemble the current patches or not.

Comments (9 posted)

The IRMOS realtime scheduler

August 3, 2010

By T. Cucinotta and F. Checconi

In the context of the IRMOS European Project (Interactive Real-Time Applications on Service-Oriented Infrastructures), a new realtime scheduler for Linux has been developed by the Real-Time Systems Laboratory of Scuola Superiore Sant'Anna in Pisa. The purpose of this article is to provide a general overview of this new scheduler, describe its features and how it can be practically used, provide a few details about the implemented algorithms, and gathering feedback by the community about possible improvements.

The IRMOS realtime scheduler (a.k.a., EDF throttling or realtime throttling), allows the administrator to reserve a "slice" of the processing capability of a system for a group of Linux tasks. It is based on a direct modification of the POSIX realtime scheduling class within the Linux kernel, and in particular, to the throttling mechanism already built into the kernel for realtime tasks. Basically, the realtime throttling mechanism is changed from a mechanism that exclusively limits the computation power granted to groups of realtime tasks, to one that provides them with both a limit and precise scheduling guarantees (in terms of a guaranteed runtime every period, on each of the available CPUs). Also, it has been designed from scratch with SMP support in mind, and it implements a hierarchical scheduling policy based on both deadlines and priorities. Specifically, POSIX fixed priority (FP) realtime scheduling is nested inside EDF (Earliest Deadline First) scheduling.

The IRMOS realtime scheduler allows for the provisioning of scheduling guarantees to individual task groups. This provisioning is done by specifying two scheduling parameters: a budget Q and a period P. The tasks in the group are entitled to run on each of the CPUs (processor, or cores when present) available on the platform, for Q time units every period of P time units. This constitutes a scheduling guarantee and a limitation at the same time.

[Partitions]

For example, on a single-CPU system, a single task attached to a reservation of 10ms every 100ms is guaranteed to be scheduled on the CPU for 10ms every 100ms. If the task tries to execute for more than 10ms, then the scheduler removes it from the run queue until the next period, at which point its budget is refilled. So if the system has no other ready tasks to schedule, then the CPU goes idle in this time.

Note that periods of different reservations may be specified independently from each other, and the above guarantee is still valid.

The EDF-based scheduler applies a simple admission control rule that decides whether a new reservation may be accepted; it works by ensuring that the sum of the utilizations (budget over period) of all the reservations is less than or equal to the maximum configured share assigned to realtime tasks. This limit may be configured through the cpu.rt_runtime_us and cpu.rt_period_us entries of the root-level cgroup filesystem (see the tutorial below). Theoretically, the EDF-based scheduling algorithm allows for full utilization of each CPU by realtime tasks, provided that those tasks can be properly partitioned across the CPUs. However, from a practical perspective, this is far from being a desirable working condition.

The CBS: EDF-based Scheduling and Temporal Isolation

The deadline-based part of the IRMOS scheduler is an implementation of a hard-reservation variant of the Constant Bandwidth Server (CBS) algorithm, described in "Integrating Multimedia Applications in Hard Realtime Systems" by Abeni and Buttazzo. Let us take a peek at how this works, focusing on a single-CPU system, where independent reservations are scheduled.

For each reservation, in addition to the configured budget and period values, the kernel manages a current budget and a current deadline. Reservations are scheduled on each CPU depending on their current deadlines, using the earliest deadline first algorithm. The first time a reservation is activated, the current deadline is initialized to the activation time plus the configured period, and the current budget is set equal to the configured budget. Each time any task in the reservation is scheduled for some time on the CPU, the current budget is decreased by the same time value. Once the current budget goes to zero (it may also become negative due to non-interruptible kernel sections -- see below), the reservation is suspended (throttled) till the next activation period, when the current budget is refilled again to the configured value, and the deadline is moved forward by a value equal to the period.

A sample schedule is shown in the diagram below, for two tasks with reservations of 5ms every 9ms and 2ms every 6ms, respectively, for an overall utilization of about 88.9%.

[Scheduler diagram]

However, this is not enough yet, in order to guarantee temporal isolation among independent reservations. If one of the reserved tasks tried to consume more CPU than allocated, then it could potentially cause a deadline miss for another task which is, instead, behaving according to the declared parameters:

[Deadline miss]

To avoid this problem, the offending task is suspended by the kernel until the next period:

[Period enforcement]

Furthermore, whenever the reservations become non-runnable (e.g., all of the attached tasks block, then someone wakes up later) in a way that does not fit into the classical periodic activation pattern, we have another potential problem. For example, if a reservation becomes runnable too close to its current deadline, and the current deadline is not changed, then it will be selected by the EDF scheduler as the most urgent one to schedule, causing a potentially arbitrarily long delay to any other reservation on the same CPU:

[Blocking tasks]

To mitigate this problem, when a reservation wakes up as a consequence of a task unblocking itself, the scheduler may behave in one of two ways: if a relatively small time has elapsed since the process blocked, then the kernel keeps the same deadline and budget for the reservation. However, if an excessive amount of time has elapsed, then the kernel "resets" the deadline to the current time plus the reservation period, and the current budget to the allocated reservation budget:

[Shifting deadlines]

More specifically, if the remaining budget divided by the time left until the current deadline does not exceed the bandwidth allocated to the task (equal to the configured reservation budget over the reservation period), then the current deadline and budget are preserved, otherwise they are reset. See the paper on the CBS algorithm for details and a formal proof that this rule ensures temporal isolation among reserved independent task groups, regardless of their actual temporal behavior.

The above described mechanism has also the desirable property of self-synchronizing the scheduler with the temporal behavior of realtime tasks. In fact, when a reservation is attached to a single classical periodic realtime task, as soon as it wakes up in response to some (almost) periodic event, the scheduler will probably move its current deadline back to the wake-up time plus the period. On the other hand, such action is not usually done for very short sleeps of the task during its main execution body, e.g., in case it blocks on short critical sections for sharing data with other tasks of the same application.

Hierarchical Scheduling

[Hierarchical scheduling] The IRMOS scheduler features hierarchical scheduling, mixing both deadline-based and priority-based scheduling. Specifically, POSIX priority-based realtime scheduling is nested inside EDF-based scheduling. The situation is depicted in the figure to the right.

When a reservation is selected to run by the partitioned EDF-based scheduler, a global POSIX priority-based scheduling policy decides what tasks belonging to that reservation will actually run on each CPU. If there are M CPUs, at most the M tasks with the highest priority (among the ones belonging to the reservation group) are the ones which actually run. The system performs admission control over admitted reserved groups, so that the overall system capacity may be properly partitioned among concurrently running activities in the system, without overloading it. Also, the scheduler has a hierarchical configuration capability, by which it is possible to define groups and nested subgroups of realtime tasks with given scheduling parameters.

Further details about the IRMOS realtime scheduler are omitted for the sake of brevity, however the interested reader can refer to the paper "Hierarchical Multiprocessor CPU Reservations for the Linux Kernel" describing the scheduler which appeared at OSPERT 2009. Any comments and feedback on the project by Linux users and developers is more than welcome. Authors can be contacted by using the AQuoSA mailing lists.

Implementation Details

The IRMOS scheduler is implemented as a partitioned scheduling strategy: each reserved task group corresponds to a set of CBS reservations allocated (with identical parameters) on the CBS schedulers running independently on all of the available CPUs. The overall design of the current sched_rt implementation does not change; it still keeps one private run queue per CPU, and, thus, each CPU is scheduled independently of the others.

That said, using EDF to schedule groups and a fixed priority (FP) scheme among the tasks of each group requires using a different representation for groups and tasks within the run queues, so a sched_entity represents only tasks within the group they belong to, and the EDF-related parameters (deadline, budget, period) are kept inside the rt_rq describing the actual run queue associated to each cgroup (note that the rt_rq is, at the same time, the data structure enqueued with EDF parameters into the run queue it belongs to, and the fixed-priority queue responsible, once selected, for the priority-based scheduling of its own tasks).

The existing code represents groups of tasks using struct task_group objects; tasks are grouped on the basis of the cgroup they belong to, and each task group contains an array of per-processor run queues (rt_rqs). Tasks are represented by their own scheduling entity. The full hierarchy seen by the user is used internally only for admission control, as the scheduler itself enqueues all the rt_rqs based on their deadlines in the first-level run queue, and all the tasks are enqueued into the priority queue of the rt_rq they belong to. On each scheduling decision the (unthrottled) rt_rq with the smallest deadline is selected, and its highest-priority task is executed. When the tasks inside an rt_rq consume their assigned budget they will be throttled; their rt_rq is dequeued from the EDF run queue, and a timer is posted to recharge the run time, update the deadline, and requeue the rt_rq for the next period.

Using the full hierarchical setup for admission control introduces an extra element of complexity in the interface, because, in general, for each group, the user needs to specify the overall bandwidth assigned to the group and to its child groups, as well as the bandwidth assigned to the tasks belonging to the group itself. This increases the number of parameters for each group from two to four.

To avoid priority inversion problems, the scheduler uses, as the old throttling mechanism does, boosting; it lets groups with tasks inside critical sections run even if they should be throttled, charging them with the extra CPU time consumed only after they exit the critical section. From the implementation perspective this means that the EDF run queue also may contain boosted groups, which are scheduled only according to the highest priority among those of the tasks they contain; boosted rt_rqs take precedence over the other ones.

Admission control and deadline guarantees

When considering realtime systems, we might be concerned about how, exactly, to exploit the described realtime scheduling policy in order to provide proper realtime guarantees to applications. In relatively simple cases, the answer is straightforward. For example, a classical periodic realtime task with a known worst case execution time (WCET) of C and minimum inter-arrival period of T, can be scheduled within a reservation with budget equal to C and reservation period equal to T and will not miss any deadlines. Also, the admission test in this case is the well-known Liu and Layland test for EDF realtime tasks (sum of utilizations must be less than or equal to 1).

However, looking at realtime theory, one easily finds much more complex realtime task models, which include activation offsets, maximum blocking times, durations of critical sections accessing shared resources, etc. Also, real-world realtime applications are often complex multi-threaded applications (think of vlc) which are very far from behaving like foreseen by the "ideal" periodic or sporadic task model, and whose activation times are driven by disk I/O and networking instead of (or in addition to) timers. Furthermore, if the application is distributed, one has usually a distributed end-to-end deadline constraint to deal with, something out of reach for a kernel-level task scheduler.

Under such a challenging scenario, it is still possible to schedule realtime applications with a simple policy based on the fundamental principle of temporal isolation, like the one being presented in this article, and provide the necessary guarantees. However, the admission test becomes complex, involving long and involved computations, thus prohibitive for the kernel. For a list of possible admission control tests for realtime applications scheduled with various policies (including EDF and FP), the reader can have a look at the proceedings of conferences dedicated to realtime scheduling, such as the Real-Time Systems Symposium (RTSS), the EUROMICRO Conference on Real-Time Systems (ECRTS), the IEEE Real-Time and Embedded Technology and Applications Symposium, or others.

This complexity is why, in the EDF-based scheduler described above, the realtime scheduling parameters communicated to the kernel are kept at the bare minimum and are used in a very simple admission control test. This approach does not try to guarantee that all admitted applications will meet their deadlines, but rather it aims to provide to each application a guaranteed share of the available underlying computing power, with a precise timing granularity. Whether this is sufficient or not for guaranteeing the performance of specific applications must be confirmed by other means, involving a proper design methodology and benchmarking process, possibly with the help of user-space middleware.

A short tutorial

In order to try the IRMOS realtime scheduler, you can get the latest changes pushed on the on-line git repository (currently corresponding to the 2.6.34-rc5 series) with:

    git clone git://aquosa.git.sourceforge.net/gitroot/aquosa/linux-irmos

or, for the PREEMPT_RT port:

    git clone git://aquosa.git.sourceforge.net/gitroot/aquosa/linux-rt-irmos

Alternatively, you can download one of the supported kernel releases from http://www.kernel.org (currently, we have a patch for the 2.6.30.10 series), and the corresponding IRMOS patch from the AQuoSA web site. Also, you need to properly configure the kernel, ensuring the following options are enabled (most of them are already enabled by default):

    RT_GROUP_SCHED
    GROUP_SCHED
    CGROUPS
    CGROUP_SCHED
    EXPERIMENTAL
    PREEMPT (recommended)
    CGROUP_CPUACCT (recommended)
    SCHED_DEBUG (recommended)
    HIGH_RES_TIMERS
    HZ_1000 (suggested)

If preferred, a few binary RPM/DEB kernel packages can also be conveniently downloaded from the AQuoSA web site.

Usage

In order to use the realtime scheduler's capabilities, you need to mount the cgroup filesystem with something like:

    mkdir /cg
    mount -t cgroup -o cpu,cpuacct cgroup /cg

By default, up to 95% of the CPU power is allocated to standard POSIX realtime tasks in the root group, which doesn't leave much left over for reservations. So, before we can create a new group, we need to reduce the runtime for root-level tasks, e.g., lowering it to 200ms every 1s:

    echo 200000 > /cg/cpu.rt_rt_task_runtime_us

Now we can create a new group, with a reservation of 10ms every 100ms:

    mkdir /cg/g1
    echo 100000 > /cg/g1/cpu.rt_period_us
    echo 10000 > /cg/g1/cpu.rt_runtime_us
    echo 100000 > /cg/g1/cpu.rt_task_period_us
    echo 10000 > /cg/g1/cpu.rt_task_runtime_us

At this point, the new group has no associated tasks. We can attach a task by writing its Linux thread id (tid) to the tasks special file entry available in the group folder:

    echo 1421 > /cg/g1/tasks

Now the attached task has only been added to the group, but it still has its own scheduling class, defaulting to SCHED_OTHER. In order to exploit realtime scheduling, we need to assign to the task one of the realtime classes and a realtime priority:

    chrt -r -p 20 1421

At this point, the task is running with the configured scheduling guarantee (and limitation) of 10ms every 100ms.

Usability and AQuoSA integration

As shown above, the interface towards the new realtime scheduling functionality is based on the cgroup filesystem. While constituting a perfectly usable interface for scripting languages and system administrators, this kind of interface makes programming realtime applications which exploit the new scheduler functionality quite cumbersome: in order to create new reservations, folders need to be created in the cgroup filesystem; for setting scheduling parameters, numbers need to be formatted and written to cgroup entries; for reading them, cgroup entries need to be read back and parsed; etc. The libcgroup library may be of some help for such issues, but it carries non-negligible overhead into the applications. This may be especially troublesome for adaptive applications, e.g., multimedia ones, that might need to change dynamically the reservation parameters following the dynamic workload.

Furthermore, when changing both scheduling parameters (runtime and period), operations need to be carried out in a proper order which depends on the previous values of the parameters themselves, otherwise the admission control logic may reject one of the intermediate steps. Also, while playing with the scheduling parameters (e.g., while tuning the application's performance), one is forced to use intermediate configurations which are highly undesirable. For example, reducing both the budget and the period by an order of magnitude, such as from (100,200) to (10,20), one needs to reduce the runtime first, obtaining (10,200), then the period. However, in the intermediate configuration the realtime task is likely to fail due to the insufficient resources being granted.

Also, in the future, the number of parameters needed to configure the realtime scheduler's behavior is expected to (slightly) grow. What is needed from an application development perspective, is an atomic way of setting and changing them, possibly in the form of a user-space library (or system call) interface.

However, the new scheduler is being integrated into the AQuoSA open-source project (Adaptive Quality of Service Architecture), which makes a well-designed user-space API and adaptive reservations available to application developers as a dynamically linkable library. AQuoSA provides a user-space library which implements the the AQuoSA API on top of the cgroup-based operations needed to deal with the IRMOS scheduler, easing the task of coding applications that want to use it. More details on the AQuoSA integration can be found here.

Relationship with SCHED_DEADLINE

In addition to the IRMOS realtime scheduler, the Real-Time Systems Laboratory of Scuola Superiore Sant'Anna also collaborated with Evidence in the implementation of another EDF-based realtime scheduler for Linux, SCHED_DEADLINE. It is natural to wonder how these two schedulers differ:

  • SCHED_DEADLINE allows for having one single task attached to an EDF reservation; this raises such issues as what to exactly do if the EDF task forks. Options range from setting the policy of the child to SCHED_OTHER, to setting it to SCHED_DEADLINE with the original bandwidth split in half between the father and child, to providing the child with an initial bandwidth (budget, or runtime) of zero, as happens by default in the latest implementation.

    The IRMOS scheduler has hierarchical scheduling capabilities, in that it allows for having a full-fledged POSIX realtime (sub-)scheduler nested inside each EDF-based reservation: when an EDF thread forks, the child keeps the same class (_FIFO or _RR) and priority as the parent and they both keep sharing the same EDF reservation. This allows for easy encapsulation of entire complex multi-threaded software components (e.g., an Apache web server, a KVM instance, etc.), but it also works with "traditional" realtime software components, such as a set of a few realtime threads with realtime priorities which constitute a single realtime component on the system (e.g., a control thread along with the IRQ threads related to its own I/O towards the controlled peripherals). If a strong assessment of the realtime schedulability of the system is needed, it is possible to use hierarchical realtime scheduling theory in order to analyze whether or not the individual RT threads will meet their deadlines.

  • SCHED_DEADLINE uses a new system call interface, sched_setscheduler_ex(), which only allows for creating a reservation attached to a task.

    The IRMOS scheduler exploits the realtime throttling cgroup-based interface which was already there in the kernel, thus a new empty reservation is created by creating a folder in the cgroup filesystem, tasks are attached by adding their TIDs to the tasks file, the runtime and period are set by echoing their values into the corresponding file entries in the group folder, etc.

  • SCHED_DEADLINE supports partitioned EDF, and we have a draft global EDF implementation [PDF] made as part of the Master Degree Thesis in Computer Engineering of Juri Lelli.

    In the IRMOS scheduler, reservations apply to all CPUs, and it supports Global Fixed Priority tasks nested inside partitioned EDF-based reservations. If a RT task exhausts the budget (runtime) of the EDF scheduler over a CPU, it can still run exploiting the budget available in the same reservation over another CPU (migrating the task or the bandwidth are both available options). However, affinity masks can be used in order to better control over which CPUs realtime tasks will be able to migrate.

  • SCHED_DEADLINE is implemented as a completely new scheduling class, while the IRMOS scheduler is a modification to the already available POSIX realtime scheduling classes.

  • SCHED_DEADLINE also supports soft (work-conserving) resource reservations, while the IRMOS scheduler does not (however, it is planned as future work).

Even if the two projects are currently completely separated, there is a good basis for having a common EDF-based realtime scheduling infrastructure, that might be used by using different user-space APIs in the two cases.

Conclusions

In the future, we plan to improve the scheduler on various sides. Concerning the user-space API, the cgroup-based interaction model already proved to suffer from major limitations. For example, the absence of an atomic way to set at the same time multiple scheduling parameters constitutes a major limitation of the current interface. Also, we plan to develop more options in the realtime scheduling model, such as:

  • Adding soft resource reservations, allowing for work-conserving reservations coexisting with non-work-conserving ones.

  • Improving the access-control model, making the realtime scheduling capabilities more easily accessible to unprivileged applications.

  • Adding the possibility to specify a desired budget (run time), in addition to the minimum guaranteed one subject to admission control, which could be used for implementing adaptive reservations, which, in turn, could be useful for applications showing significant workload fluctuations at run time.

  • Adding some form of deadline inheritance for better addressing the well-known priority inheritance problem, e.g., by means of the Multi-processor BandWidth Inheritance (MBWI) protocol, or some variation on the concept.

All of the above improvements would go into the direction of enhancing usability of the realtime scheduler by common multimedia applications. These would be the applications taking most of the benefits from the exploitation of realtime scheduling capabilities of the Linux kernel, such as the ones described above.

Comments (4 posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Device drivers

Filesystems and block I/O

Memory management

Architecture-specific

Security-related

  • Mimi Zohar: EVM . (July 30, 2010)

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds