The 2.6.35 kernel was released on August 2. The relatively long announcement
includes some thoughts on this
development cycle (Linus is happy with how it went) and some concerns about the
state of linux-next heading into 2.6.36. Some headline features in 2.6.35
include receive packet steering and receive flow steering, direct I/O
support for Btrfs, and the usual pile of new drivers. As always, lots of
details can be found on the excellent KernelNewbies 2.6.35 page.
Stable updates: a new round of stable kernel updates is out; they all
have the usual pile of important fixes. The final 2.6.33 update is among
them, so users of that series should be thinking of moving on; Greg
suggests 2.6.35, since 2.6.34 "is not going to be maintained for much longer."
Comments (none posted)
The Linux 2.6.35 release is also noteworthy in that it is the first
Linux kernel release for which Torvalds specifically attempted to
limit the number of changes made during its development, in order to
rein in the growing size and complexity of kernel updates.
So, in between snapshotting the image and actually hibernating, we
have two parallel universes, one frozen in the image, the other
writing that out to swap: with the danger that the latter (which is
about to die) will introduce fatal inconsistencies in the former by
placing pages in swap locations innocently reallocated from it.
(Excuse me while I go write the movie script.)
-- Hugh Dickins
I regret putting the ordering into the original barrier code...it
definitely did help reiserfs back in the day but it stinks of magic.
When it goes wrong, we'll only notice .000000001% of the time, and
even then it'll only be when people report some random corruption
which we'll blindly blame on either axboe or the drive.
-- Chris Mason
Comments (none posted)
It has been more than four years since LWN first reported on the AppArmor
security module and the opposition to its addition to the mainline. Over that
time, there has been much discussion of pathname-based
security, the value of having multiple security modules, and more;
meanwhile, AppArmor has mostly faded from view. Canonical developer John
Johansen has picked up this module, though, and has been working toward its
inclusion. The latest "what's coming" post from security maintainer James
Morris (click below) now shows that AppArmor has been queued for the next
merge window (the "Yama" security module from Canonical is also queued).
Unless some last-minute opposition turns up, this should be the end of a very long story.
Full Story (comments: 33)
James Morris's 2.6.36 security update included, among other things,
the Yama security module, which contains a number of security-related
changes from Canonical. James later updated his posting:
I'm going to revert the Yama stuff for 2.6.36 -- Christoph has
nacked it to me off-list.
An off-list shootdown was always going to raise eyebrows, but Christoph
(Hellwig) was quick to make his concerns
public. He said:
As mentioned a few times during the past discussion moving broken
code into a LSM doesn't magically fix it. In fact YAMA is not any
kind of (semi-)coherent security policy like Selinux, smack or
similar but just a random set of hacks that you didn't get past the
relevant subsystem maintainers.
Christoph, it seems, would rather that these changes went directly into the
subsystems affected, rather than being swept into a separate security
module. The problem, of course, is that this is just how Yama author Kees Cook
had started; he was told in no uncertain terms that putting his
security-related changes directly into the VFS and ptrace() code
was unwelcome. The advice at that time was that his changes should be put
into a security module where the rest of the world could ignore them. Even
Christoph suggested that
approach back in June.
The "not a coherent security model" objection was heard from some other
directions as well. According to Valdis Kletnieks:
In other words - if you want to be an LSM, you need to be
full-featured enough to cover all the bases, not just a few of them.
Some developers, it seems, would rather not see a set
of security-related tweaks gathered together into a module without an
overall policy behind it. There have also been the usual claims that
everything done by Yama can also be accomplished in SELinux, though Kees
seems to disagree.
This rejection leaves Kees in the difficult position of trying to upstream
his changes (something his employer has been criticized for not doing) but
having no apparent way to actually get them merged. But it may be that all
that's really required is a bit of patience. New security modules always
seem to bring opposition out of the woodwork, but, with some persistence,
they tend to get merged in the end.
Comments (31 posted)
Paul McKenney, it seems, is now working with the Linaro project, an
assignment which has given him a new interest in power management. He has
decided to start off with a bang by attempting
to summarize the suspend blocker discussion
with the goal of really
understanding what Android's requirements are. Needless to say, he has
kicked off a new, lengthy discussion which has cast the players' positions
in a new light.
To oversimplify: one side seems to believe in addressing power management
(and poorly-behaving applications in particular) by shining a light on the
problems and fixing them (or applying social pressure to get them fixed)
one at a time. This is the approach taken by developers like Arjan van de
Ven, who have developed and used tools like PowerTop to great effect. The
other side pushes for a more general solution; Paul describes the difference in view this way:
I agree that much progress has been made over the past four years.
My laptop's battery life has increased substantially, with roughly
half of the increase due to software changes. Taken over all the
laptops and PCs out there, this indeed adds up to substantial
savings. So, yes, you have done quite well.
However, your reliance on application-by-application fixes, while
good up to a point, is worrisome longer term. The reason for my
concern is that computing history has not been kind to those who
fail to vigorously automate. The Android guys have offered up one
way of automating power efficiency. There might be better ways,
but their approach does at the very least deserve a fair hearing --
and no one who read the recent LKML discussion of their approach
could possibly mistake it for anything resembling a fair hearing.
So far, the conversation has not yet really returned to the Android
approach; it has stayed more focused on the requirements and whether the
"whack-a-mole" approach to power management is sufficient in the long
term. Chances are good that Paul will be sending out an updated version of
his requirements description sometime in the near future. Then, perhaps,
there can be a calm discussion of how those requirements might best be met.
Comments (16 posted)
Kernel development news
The 2.6.36 merge window got off to a rather slow start; Linus, perhaps, has
been spending too much time away from the keyboard. Things got rolling,
though, on the afternoon of August 4; as of this writing, about 2600
patches have been merged into the mainline. Here is a summary of what has
been seen so far.
User-visible changes include:
- The 9p filesystem has gained support for extended attributes and
a new, Linux-specific variant of the 9p2000 protocol called 9p2000.L.
- The CIFS filesystem can now make use of FS-Cache to keep local
copies of files for performance.
- The TOMOYO Linux security module has a new "interactive enforcing
mode," allowing an administrator to permit policy violations at run
time. It is intended to help when installing application updates
(which require policy changes) on running production systems.
- At long last, the AppArmor security module has been merged.
- Rafael Wysocki's wakeup_count mechanism has
been merged. This feature is intended to make it possible to suspend
the system without having to worry about races with wakeup events; it
thus hopes to solve part of the problem addressed by suspend blockers.
- Support for the LIRC infrared controller API has been merged, along
with a long list of LIRC drivers. LIRC is one of the larger pieces of
out-of-tree code which is still shipped by many distributors, so this
merge should help bring distributor kernels that much closer to the mainline.
- New drivers:
- Boards and systems:
Bluewater Systems Snapper 9260/9G20 modules,
HP t5325 thin client systems,
NXP Semiconductor LPC32xx-based systems, and
Eukrea CPUIMX51 and CPUIMX35 modules.
- Input: Atmel QT602240 I2C touchscreens,
Analog Devices ADXL34x three-axis digital accelerometers, and
Cypress cy8ctmg110 touchscreens.
- Miscellaneous: ARM Ltd. character LCD displays,
HTC "Dream" (G1 handset) GPIO lines, and
Intel "intelligent power sharing" controllers.
- Networking: Freescale Flexcan CAN controllers,
ESD USB/2 CAN/USB interfaces,
Chelsio T4-based gigabit and 10Gb Ethernet
adapters with PCI-E SR-IOV virtual functions, and
CAIF protocol drivers on slave SPI interfaces.
- Video4Linux: i.MX27/i.MX25 camera sensor interfaces,
SunPlus SPCA1528-based USB cameras,
SQ Technologies SQ930X-based USB cameras,
Windows Media Center Edition eHome infrared transceivers, and
Freescale VIU video engines.
Changes visible to kernel developers include:
- The ARM architecture has lost support for the "discontigmem"
memory model; it is expected that everybody is using sparsemem at this
point. ARM has also switched from the old bootmem allocator to
memblock (formerly LMB) and added support for the -fstack-protector GCC option.
- The DMAPI hooks have been dropped from the XFS filesystem, indicating
that the XFS developers do not ever expect to get hierarchical storage
management at this level merged.
- The PM_QOS API has changed again; quality-of-service requests are now added with:
void pm_qos_add_request(struct pm_qos_request_list *request,
int pm_qos_class, s32 value);
The biggest change is that the request structure must now
be allocated by the caller; this shifts a bit of work but,
importantly, allows this function to be called in atomic context.
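The significance of the caller-allocated request structure can be illustrated with a small userspace mock; the structures and behavior here are simplified stand-ins for the kernel's definitions, not the real PM_QOS implementation:

```c
/* Userspace mock of the new PM_QOS calling pattern: the
 * pm_qos_request_list structure is allocated by the caller (statically,
 * or embedded in a driver's own state), so adding a request performs no
 * allocation and can therefore run in atomic context.
 * Simplified stand-ins, not the real kernel definitions. */
#include <assert.h>

struct pm_qos_request_list {
    int pm_qos_class;   /* which QoS parameter is being constrained */
    int value;          /* the requested constraint value           */
    int active;
};

#define PM_QOS_CPU_DMA_LATENCY 1

/* In the kernel this would link the request into a per-class list;
 * here we just record the parameters in the caller's structure. */
static void pm_qos_add_request(struct pm_qos_request_list *request,
                               int pm_qos_class, int value)
{
    request->pm_qos_class = pm_qos_class;
    request->value = value;
    request->active = 1;
}

static void pm_qos_remove_request(struct pm_qos_request_list *request)
{
    request->active = 0;
}
```

A driver would typically embed the request structure in its device state, so the add and remove calls touch only memory the driver already owns.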
The merge window can be expected to remain open until around
August 15, unless Linus decides to surprise developers by making it shorter.
Comments (8 posted)
Linux, like most other operating systems, has long tried to keep
frequently-accessed data in main memory. The cost of fetching a page from
disk is high, so every I/O operation which can be eliminated by keeping
data in a faster location yields a significant performance improvement.
Recently, there has been an increasing level of interest in adding more
levels of cache; the result has been a number of caching patch sets. The latest
contribution in this area is a set of patches aimed at enabling multi-level
caching within the Btrfs filesystem.
The patches, posted by Ben
Chociej, are not a complete solution at this time. This code, instead, is
meant to add the infrastructure needed to determine which data within a
filesystem is "hot"; other work, to be done in the near future, will then
be able to make use of this information to determine which data would
benefit from being hosted on faster media - on a solid-state storage device, perhaps.
The copy-on-write nature of Btrfs, along with its built-in volume
management code, should make the implementation of this functionality
relatively easy. We should find out in "a few weeks," when the first of
these patches is promised; meanwhile, there is some interesting
instrumentation work to look at.
These patches work by hooking into the small number of places in Btrfs
where new I/O operations are initiated. Each of these places gets a call to:
void btrfs_update_freqs(struct inode *inode, u64 start, u64 len,
                        int create);
Where inode is the inode for the file being operated on,
start is the beginning offset (in bytes), len is the
number of bytes being transferred, and the mildly confusing create
parameter is nonzero iff the operation is a write.
This function maintains two red-black trees; the first, which is
filesystem-wide, tracks the "hottest" inodes. For each inode, there is
another tree tracking the hottest parts of the file. For each tree, the
btrfs_update_freqs() call will update the stored parameters with
the passed-in values.
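To give a feel for the kind of bookkeeping involved, here is a toy userspace sketch of per-inode access tracking; the field names and the averaging scheme are invented for the example, not taken from the Btrfs patches:

```c
/* Toy sketch of the per-inode frequency data maintained on each I/O:
 * read/write counts, a running average of inter-access time, and the
 * time of the last access.  Illustrative only. */
#include <assert.h>
#include <stdint.h>

struct freq_data {
    uint64_t nr_reads, nr_writes;
    uint64_t last_read_ns, last_write_ns;     /* time of last access */
    uint64_t avg_read_delta, avg_write_delta; /* running averages    */
};

/* Fold one access into the record; "create" is nonzero for writes,
 * mirroring the btrfs_update_freqs() parameter. */
static void freq_update(struct freq_data *fd, uint64_t now_ns, int create)
{
    uint64_t *nr   = create ? &fd->nr_writes       : &fd->nr_reads;
    uint64_t *last = create ? &fd->last_write_ns   : &fd->last_read_ns;
    uint64_t *avg  = create ? &fd->avg_write_delta : &fd->avg_read_delta;

    if (*nr > 0) {
        uint64_t delta = now_ns - *last;
        /* simple exponential moving average; integer-only to sidestep
         * the overflow concerns mentioned in the text */
        *avg = *avg - *avg / 4 + delta / 4;
    }
    *last = now_ns;
    (*nr)++;
}
```

In the real patches, two red-black trees (one per filesystem, one per inode) keep these records sorted so the hottest items can be found quickly.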
The code tracks six independent parameters: the number of reads, a running
average of the time between reads, and the time since the last read - along
with the same information for writes. In the end, that information gets
passed to a piece of deep magic called btrfs_get_temp() which
boils those numbers down to a single "temperature" value. Your editor
would love to simply provide the formula which is used, but it's not that
simple - there's a lot of trickery with magic constants and various
provisions against integer overflow problems. Those who would like to
figure it out for themselves can read the source.
There are three new ioctl() operations added by the patch set. To
get the heat information for a specific file,
BTRFS_IOC_GET_HEAT_INFO may be used. There are also
BTRFS_IOC_GET_HEAT_OPTS and BTRFS_IOC_SET_HEAT_OPTS for
querying and setting the state of heat tracking and (someday) migration of
data based on the measured temperature data. A debugfs interface is also
provided for those who would like to look at all of the data collected by the tracking code.
There has not been a huge response to this patch set so far. The biggest
complaint should be somewhat predictable: this capability looks like
something which would be useful for many filesystems, so implementing it
just for Btrfs looks like working at the wrong level. The virtual
filesystem (VFS) layer is well placed to track I/O operations and could
manage this kind of data collection. The VFS could also, perhaps, use this
data to make better decisions on which pages to keep in the page cache.
But, as long as the data is locked up within Btrfs, the VFS layer cannot
use it, and it cannot be used to benefit any other filesystems.
The response to this complaint is that only Btrfs has the multiple device
support needed to make use of this data. Dave Chinner finds that justification unconvincing, saying:
Why does it even need multiple devices in the filesystem? All the
filesystem needs to know is the relative speed of regions of its
block address space and to be provided allocation hints.
There is often a degree of tension between those who would add features to
specific filesystems and those who would rather see that functionality done
at the VFS level. As a general rule, widely-useful features benefit from
being done in the VFS, where they are more widely used and more closely
scrutinized. But, often, an individual filesystem implementation can serve
as a useful proof of concept and a place where important lessons are
learned. All of which is to say that "hot data tracking" will likely make
it into the kernel at some point, but it's not clear whether what is merged
will resemble the current patches or not.
Comments (9 posted)
In the context of the IRMOS project
(Interactive Real-Time Applications on
Service-Oriented Infrastructures), a new realtime scheduler for Linux
has been developed by the Real-Time Systems Laboratory of Scuola Superiore Sant'Anna.
The purpose of this article is to provide a general overview of this new
scheduler, describe its features and how it can be practically used,
provide a few details about the implemented algorithms, and gather
feedback from the community about possible improvements.
The IRMOS realtime scheduler (a.k.a. EDF throttling or realtime
throttling) allows the administrator to reserve a "slice" of the processing
capability of a system for a group of Linux tasks. It is based on a
direct modification of the POSIX realtime scheduling class within the
Linux kernel and, in particular, of the throttling mechanism already built
into the kernel for realtime tasks. Basically, the realtime
throttling mechanism is changed from a mechanism that exclusively
limits the computation power granted to groups of realtime tasks, to
one that provides them with both a limit and precise
scheduling guarantees (in terms of a guaranteed runtime every
period, on each of the available CPUs). Also,
it has been designed from scratch with SMP support in mind, and it
implements a hierarchical scheduling policy based on both deadlines
and priorities. Specifically, POSIX fixed priority (FP) realtime
scheduling is nested inside EDF (Earliest Deadline First) scheduling.
The IRMOS realtime scheduler allows for the provisioning of
scheduling guarantees to individual task groups.
This provisioning is done by specifying two scheduling parameters:
a budget Q and a period P. The tasks in the group
are entitled to run on each of the CPUs (processors, or cores where
present) available on the platform, for Q time units every period of P
time units. This constitutes a scheduling guarantee and a limitation
at the same time.
For example, on a single-CPU system, a single task attached to a
reservation of 10ms every 100ms is guaranteed to be scheduled on the
CPU for 10ms every 100ms. If the task tries to execute for more than
10ms, then the scheduler removes it from the run queue until the
next period, at which point its budget is refilled. If the system has no
other ready tasks to schedule, the CPU simply goes idle during that time.
Note that periods of different reservations may be specified
independently from each other, and the above guarantee is still valid.
The EDF-based scheduler applies a simple admission control
rule that decides whether a new reservation may be accepted; it works by
ensuring that the sum of the utilizations (budget over period) of all
the reservations is less than or equal to the maximum configured share
assigned to realtime tasks. This limit may be configured through
the cpu.rt_runtime_us and cpu.rt_period_us entries
of the root-level cgroup filesystem (see the tutorial below).
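That admission rule amounts to a few lines of arithmetic. The sketch below is a userspace illustration of the utilization check (the kernel's actual bookkeeping is more involved; the function and structure names are invented):

```c
/* Utilization-based admission test: a new reservation (budget Q,
 * period P) is accepted only if total utilization, including the
 * newcomer, stays within the share configured for realtime tasks
 * (rt_runtime_us / rt_period_us).  Illustrative sketch. */
#include <assert.h>

struct reservation { long budget_us, period_us; };

static int admit(const struct reservation *existing, int n,
                 struct reservation newres,
                 long rt_runtime_us, long rt_period_us)
{
    /* work in parts-per-million to avoid floating point */
    long long util = (long long)newres.budget_us * 1000000
                     / newres.period_us;
    for (int i = 0; i < n; i++)
        util += (long long)existing[i].budget_us * 1000000
                / existing[i].period_us;
    long long limit = (long long)rt_runtime_us * 1000000 / rt_period_us;
    return util <= limit;
}
```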
Theoretically, the EDF-based scheduling algorithm allows for
full utilization of each CPU by realtime tasks, provided that those tasks
can be properly partitioned across the CPUs. However, from a practical
perspective, this is far from being a desirable working condition.
The CBS: EDF-based Scheduling and Temporal Isolation
The deadline-based part of the IRMOS scheduler is an
implementation of a hard-reservation variant of the Constant Bandwidth
Server (CBS) algorithm, described in the paper "Integrating
Multimedia Applications in Hard Real-Time Systems" by Abeni and
Buttazzo. Let us
take a peek at how this works, focusing on a single-CPU system, where
independent reservations are scheduled.
For each reservation, in addition to the configured budget and period
values, the kernel manages a current budget and a current
deadline. Reservations are scheduled on each CPU depending on
their current deadlines, using the earliest deadline first algorithm.
The first time a reservation is activated, the current deadline is
initialized to the activation time plus the configured period,
and the current budget is set equal to the configured budget. Each time
any task in the reservation is scheduled for some time on the CPU, the
current budget is decreased by the same time value. Once the current
budget goes to zero (it may also become negative due to
non-interruptible kernel sections -- see below), the reservation is
suspended (throttled) till the next activation period, when the current budget is
refilled again to the configured value, and the deadline is moved forward
by a value equal to the period.
A sample schedule is
shown in the diagram below, for two tasks with reservations of 5ms
every 9ms and 2ms every 6ms,
respectively, for an overall utilization of about 88.9%.
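The budget accounting described above can be modeled with a toy structure; this is an illustration of the mechanism only, not the scheduler's real code:

```c
/* Minimal CBS bookkeeping model: the current budget is consumed while
 * a reservation's tasks run; when it reaches zero the reservation is
 * throttled until the next period, when the budget is refilled and the
 * deadline pushed forward by one period.  Illustrative only. */
#include <assert.h>

struct cbs {
    long budget, period;            /* configured Q and P */
    long cur_budget, cur_deadline;  /* current values     */
    int throttled;
};

/* first activation: deadline = now + P, budget = Q */
static void cbs_activate(struct cbs *c, long now)
{
    c->cur_deadline = now + c->period;
    c->cur_budget = c->budget;
    c->throttled = 0;
}

/* account for 'ran' time units of execution on the CPU */
static void cbs_run(struct cbs *c, long ran)
{
    c->cur_budget -= ran;
    if (c->cur_budget <= 0)
        c->throttled = 1;   /* off the run queue until replenishment */
}

/* replenishment at the period boundary */
static void cbs_replenish(struct cbs *c)
{
    c->cur_budget = c->budget;
    c->cur_deadline += c->period;
    c->throttled = 0;
}
```

Note that cur_budget can go negative before throttling kicks in, matching the non-interruptible-section caveat mentioned above.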
However, this is not yet enough to guarantee temporal
isolation among independent reservations. If one of the
reserved tasks tried to consume more CPU than allocated, then it could
potentially cause a deadline miss for another task which is, instead,
behaving according to the declared parameters:
To avoid this problem, the offending task is suspended by the kernel
until the next period:
Furthermore, whenever a reservation becomes non-runnable (e.g., all
of its attached tasks block, then one wakes up later) in a way
that does not fit into the classical periodic activation pattern, we
have another potential problem. For example, if a reservation becomes runnable
too close to its current deadline, and the current deadline is not
changed, then it will be selected by the EDF scheduler as the most
urgent one to schedule, causing a potentially arbitrarily long delay
to any other reservation on the same CPU:
To mitigate this problem, when a reservation wakes up as a consequence of a task
unblocking itself, the scheduler may behave in one of two ways: if a
relatively small time has elapsed since the process blocked, then the kernel
keeps the same deadline and budget for the reservation. However, if
an excessive amount of time has elapsed, then the
kernel "resets" the deadline to the current time plus the reservation
period, and the current budget to the allocated reservation budget:
More specifically, if the remaining budget
divided by the time left until the current deadline does not exceed the
bandwidth allocated to the task (equal to the configured reservation
budget over the reservation period), then the current deadline and
budget are preserved; otherwise they are reset. See the paper on the CBS
algorithm for details and a formal proof that this rule ensures
temporal isolation among reserved independent task groups, regardless of
their actual temporal behavior.
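The deadline-reset rule is compact enough to express directly in code. This is an illustrative userspace version (structure and function names invented), with cross-multiplication standing in for the division:

```c
/* The CBS wakeup rule: keep the current (deadline, budget) pair only
 * if the remaining budget, spread over the time left until the
 * deadline, does not exceed the reserved bandwidth Q/P; otherwise
 * reset deadline and budget.  Illustrative sketch. */
#include <assert.h>

struct resv {
    long Q, P;                    /* configured budget and period */
    long cur_budget, cur_deadline;
};

static void cbs_wakeup(struct resv *r, long now)
{
    long left = r->cur_deadline - now;
    /* keep iff cur_budget/left <= Q/P, i.e. cur_budget*P <= Q*left */
    if (left > 0 &&
        (long long)r->cur_budget * r->P <= (long long)r->Q * left)
        return;                   /* short sleep: parameters preserved */
    r->cur_deadline = now + r->P; /* long sleep: reset                 */
    r->cur_budget = r->Q;
}
```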
The mechanism described above also has the desirable
property of self-synchronizing the scheduler with the
temporal behavior of realtime tasks. In fact, when a reservation is
attached to a single classical periodic realtime task, as soon as it
wakes up in response to some (almost) periodic event, the scheduler
will probably move its current deadline back to the wake-up time plus the
period. On the other hand, such action is not usually done for very
short sleeps of the task during its main execution body, e.g., in case
it blocks on short critical sections for sharing data with other tasks
of the same application.
The IRMOS scheduler features hierarchical scheduling,
mixing both deadline-based and priority-based
scheduling. Specifically, POSIX priority-based realtime scheduling is
nested inside EDF-based scheduling. The situation is depicted in the
figure to the right.
When a reservation is selected to run by the partitioned
EDF-based scheduler, a global POSIX priority-based scheduling policy
decides what tasks belonging to that reservation will actually run on
each CPU. If there are M CPUs, at most the M tasks with the
highest priority (among the ones belonging to the reservation group)
are the ones which actually run. The system performs admission control
over admitted reserved groups, so that the overall system capacity may
be properly partitioned among concurrently running activities in the
system, without overloading it. Also, the scheduler has a hierarchical
configuration capability, by which it is possible to define groups and
nested subgroups of realtime tasks with given scheduling parameters.
Further details about the IRMOS realtime scheduler are omitted for
the sake of brevity; the interested reader can refer to the paper
"Hierarchical Multiprocessor CPU Reservations for the Linux Kernel,"
which describes the scheduler in detail.
Any comments and feedback on the project from Linux users and developers
are more than welcome; the authors can be contacted by using the
AQuoSA mailing lists.
The IRMOS scheduler is implemented as a partitioned scheduling
strategy: each reserved task group corresponds to a set of CBS
reservations allocated (with identical parameters) on the CBS
schedulers running independently on all of the available CPUs. The
overall design of the current sched_rt implementation does not change;
it still keeps one private run queue per CPU, and, thus, each CPU is
scheduled independently of the others.
That said, using EDF to schedule groups
and a fixed priority (FP) scheme among the tasks of each group
requires using a different representation for groups and tasks within
the run queues, so a sched_entity represents only tasks within
the group they belong to, and the EDF-related parameters (deadline,
budget, period) are kept inside the rt_rq describing the
actual run queue associated to each cgroup (note that
the rt_rq is, at the same time, the data structure enqueued
with EDF parameters into the run queue it belongs to, and the
fixed-priority queue responsible, once selected, for the
priority-based scheduling of its own tasks).
The existing code represents groups of tasks using struct
task_group objects; tasks are grouped on the basis of the cgroup
they belong to, and each task group contains an array of per-processor
run queues (rt_rqs).
Tasks are represented by their own scheduling entity. The full hierarchy
seen by the user is used internally only for admission control, as the
scheduler itself enqueues all the rt_rqs based on their deadlines
in the first-level run queue, and all the tasks are enqueued into the
priority queue of the rt_rq they belong to. On each scheduling
decision the (unthrottled) rt_rq with the smallest deadline is
selected, and its highest-priority task is executed. When the tasks
inside an rt_rq consume their assigned budget they will be throttled;
their rt_rq is dequeued from the EDF run queue, and a
timer is posted to recharge the run time, update the deadline, and
requeue the rt_rq for the next period.
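The two-level pick-next decision can be sketched as follows; the structures here are toy stand-ins for the kernel's rt_rq and its priority queue, not the actual implementation:

```c
/* Two-level scheduling decision: among the unthrottled groups, choose
 * the one with the earliest deadline (EDF), then run its
 * highest-priority task (fixed priority).  Toy structures only. */
#include <assert.h>

#define MAX_T 8

struct toy_rt_rq {
    long deadline;          /* current EDF deadline of the group    */
    int throttled;          /* dequeued after exhausting its budget */
    int nr_tasks;
    int task_prio[MAX_T];   /* lower value = higher priority        */
};

/* returns the priority of the chosen task, or -1 if nothing runs;
 * *group_out receives the index of the chosen group */
static int pick_next(struct toy_rt_rq *rqs, int n, int *group_out)
{
    int g = -1;
    for (int i = 0; i < n; i++)
        if (!rqs[i].throttled && rqs[i].nr_tasks &&
            (g < 0 || rqs[i].deadline < rqs[g].deadline))
            g = i;          /* earliest-deadline unthrottled group */
    if (g < 0)
        return -1;
    int best = 0;
    for (int t = 1; t < rqs[g].nr_tasks; t++)
        if (rqs[g].task_prio[t] < rqs[g].task_prio[best])
            best = t;       /* highest-priority task in that group */
    *group_out = g;
    return rqs[g].task_prio[best];
}
```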
Using the full hierarchical setup for admission control introduces an
extra element of complexity in the interface, because, in general, for
each group, the user needs to specify the overall bandwidth assigned
to the group and to its child groups, as well as the bandwidth assigned to
the tasks belonging to the group itself. This increases the number
of parameters for each group from two to four.
To avoid priority inversion problems, the scheduler uses boosting, as
the old throttling mechanism does:
it lets groups with tasks inside critical sections run
even if they should be throttled, charging them with the extra CPU
time consumed only after they exit the critical section. From the
implementation perspective this means that the EDF run queue also
may contain boosted groups, which are scheduled only according to
the highest priority among those of the tasks they contain; boosted
rt_rqs take precedence over the other ones.
Admission control and deadline guarantees
When considering realtime systems, we might be concerned about
how, exactly, to exploit the described realtime scheduling
policy in order to provide proper realtime guarantees to
applications. In relatively simple cases, the answer is straightforward.
For example, a classical periodic realtime task with a known worst
case execution time (WCET) of C and minimum inter-arrival period of T,
can be scheduled within a reservation with budget equal to C and reservation
period equal to T and will not miss any deadlines. Also, the admission
test in this case is the well-known Liu and Layland test for EDF
realtime tasks (sum of utilizations must be less than or equal to 1).
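The Liu and Layland test mentioned above is easily expressed in code; here is a straightforward sketch, using the 5ms/9ms and 2ms/6ms reservations from the earlier schedule example:

```c
/* Liu and Layland admission test for EDF: a task set with worst-case
 * execution time C_i and period T_i is schedulable iff the sum of
 * C_i/T_i is at most 1.  Illustrative sketch. */
#include <assert.h>

struct task { long C, T; };   /* WCET and (minimum inter-arrival) period */

static int edf_schedulable(const struct task *ts, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += (double)ts[i].C / ts[i].T;   /* accumulate utilization */
    return u <= 1.0;
}
```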
However, looking at realtime theory, one easily finds much
more complex realtime task models, which include activation
offsets, maximum blocking times, durations of
critical sections accessing shared resources,
etc. Also, real-world realtime applications are often complex
multi-threaded applications (think of vlc) which are very far
from behaving as foreseen by the "ideal" periodic or sporadic task
model, and whose activation times are driven by disk I/O and networking
instead of (or in addition to) timers. Furthermore, if the application
is distributed, one has usually a distributed end-to-end deadline
constraint to deal with, something out of reach for a kernel-level scheduler.
Under such a challenging scenario, it is still possible to schedule
realtime applications with a simple policy based on the fundamental
principle of temporal isolation, like the one being presented in this
article, and provide the necessary guarantees. However, the admission
test becomes complex, involving long computations that would be
prohibitive in the kernel. For a list of possible admission
control tests for realtime applications scheduled with various
policies (including EDF and FP), the reader can have a
look at the proceedings of conferences dedicated to realtime scheduling, such as
the Real-Time Systems Symposium
(RTSS), the EUROMICRO
Conference on Real-Time Systems (ECRTS),
the IEEE Real-Time and Embedded
Technology and Applications Symposium, or others.
This complexity is why, in the EDF-based scheduler described above, the realtime
scheduling parameters communicated to the kernel are kept at the bare
minimum and are used in a very simple admission control test. This approach
does not try to guarantee that all admitted applications will meet
their deadlines, but rather it aims to provide to each application a
guaranteed share of the available underlying computing power,
with a precise timing granularity. Whether this is sufficient
or not for guaranteeing the performance of specific applications must be
confirmed by other means, involving a proper design methodology and
benchmarking process, possibly with the help of user-space middleware.
A short tutorial
In order to try the IRMOS realtime scheduler, you can get the latest
changes pushed to the project's git repository (currently corresponding
to a 2.6.34-rc5 base):
git clone git://aquosa.git.sourceforge.net/gitroot/aquosa/linux-irmos
or, for the PREEMPT_RT port:
git clone git://aquosa.git.sourceforge.net/gitroot/aquosa/linux-rt-irmos
Alternatively, you can download one of the supported kernel releases
and the corresponding IRMOS patch from the AQuoSA web site.
Also, you need to properly configure the kernel, ensuring that the
required options are enabled (most of them are already enabled by
default). If preferred, a few binary RPM/DEB kernel packages can also
be conveniently downloaded from the AQuoSA web site.
In order to use the realtime scheduler's capabilities, you need to mount
the cgroup filesystem with something like:
mount -t cgroup -o cpu,cpuacct cgroup /cg
By default, up to 95% of the CPU power is allocated to standard POSIX
realtime tasks in the root group, which doesn't leave much left over for
reservations. So, before we can create a new group,
we need to reduce the runtime for root-level tasks, e.g., lowering it
to 200ms every 1s:
echo 200000 > /cg/cpu.rt_task_runtime_us
Now we can create a new group, with a reservation of 10ms every 100ms:
mkdir /cg/g1
echo 100000 > /cg/g1/cpu.rt_period_us
echo 10000 > /cg/g1/cpu.rt_runtime_us
echo 100000 > /cg/g1/cpu.rt_task_period_us
echo 10000 > /cg/g1/cpu.rt_task_runtime_us
At this point, the new group has no associated tasks. We can attach a
task by writing its Linux thread id (tid) to the tasks
special file entry available in the group folder:
echo 1421 > /cg/g1/tasks
At this point the task has merely been added to the group; it still
has its own scheduling class, defaulting to SCHED_OTHER. In
order to exploit realtime scheduling, we need to assign to the task
one of the realtime classes and a realtime priority:
chrt -r -p 20 1421
At this point, the task is running with the configured scheduling
guarantee (and limitation) of 10ms every 100ms.
Usability and AQuoSA integration
As shown above, the interface towards the new realtime scheduling
functionality is based on the cgroup filesystem. While constituting a
perfectly usable interface for scripting languages and system
administrators, this kind of interface makes programming realtime
applications which exploit the new scheduler functionality quite
cumbersome: in order to create new reservations, folders need to be
created in the cgroup filesystem; for setting scheduling parameters,
numbers need to be formatted and written to cgroup entries; for
reading them, cgroup entries need to be read back and parsed; etc.
The libcgroup library may be of
some help for such issues, but it
carries non-negligible overhead into the applications. This may be
especially troublesome for adaptive applications, e.g., multimedia
ones, that might need to dynamically change the reservation parameters
to follow a varying workload.
Furthermore, when changing both scheduling parameters
(runtime and period), the operations must be carried out in an
order which depends on the previous values of the parameters
themselves, otherwise the admission-control logic may reject one of
the intermediate steps. Also, while playing with the scheduling
parameters (e.g., while tuning an application's performance), one is
forced through intermediate configurations which are highly
undesirable. For example, when reducing both the
budget and the period by an order of magnitude, say from (100,200)
to (10,20), one needs to reduce the runtime first,
obtaining (10,200), then the period. In the
intermediate configuration, however, the realtime task is likely to fail
due to the insufficient resources granted to it.
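The ordering rule can be captured in a small hypothetical helper: write
first whichever parameter does not increase the group's bandwidth request,
so that no intermediate configuration asks for more than both the old and
the new settings combined (file names as in the example above):

```shell
# set_reservation <group-dir> <runtime_us> <period_us>
# Applies both parameters in an admission-control-friendly order:
# a shrinking runtime is written before the period; otherwise the
# period is written before the runtime.
set_reservation() {
    dir=$1; runtime=$2; period=$3
    old_runtime=$(cat "$dir/cpu.rt_runtime_us")
    if [ "$runtime" -lt "$old_runtime" ]; then
        echo "$runtime" > "$dir/cpu.rt_runtime_us"
        echo "$period"  > "$dir/cpu.rt_period_us"
    else
        echo "$period"  > "$dir/cpu.rt_period_us"
        echo "$runtime" > "$dir/cpu.rt_runtime_us"
    fi
}
```

Going from (100,200) to (10,20) this writes the runtime first, passing
through the harmless (10,200) configuration rather than the over-committed
(100,20) one.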
Also, in the future, the number of parameters needed to configure the
realtime scheduler's behavior is expected to grow (slightly). What is
needed from an application-development perspective is an atomic way
of setting and changing them, possibly in the form of a user-space
library (or system call) interface.
Toward this end, the new scheduler is being integrated into the
AQuoSA open-source project
(Adaptive Quality of Service Architecture), which makes an
API for adaptive reservations available to application developers
as a dynamically linkable library. AQuoSA provides a user-space
library which implements that
API on top of the cgroup-based operations needed to deal with the
IRMOS scheduler, easing the task of coding applications that want to
use it. More details on the AQuoSA integration are available from the project.
Relationship with SCHED_DEADLINE
In addition to the IRMOS realtime scheduler, the Real-Time Systems
Laboratory of Scuola Superiore Sant'Anna also collaborated
with Evidence in the
implementation of another EDF-based realtime scheduler for Linux:
SCHED_DEADLINE. It is natural to wonder how these two schedulers differ:
SCHED_DEADLINE allows one single task to be attached to an EDF
reservation; this raises such issues as what exactly to do if the EDF
task forks. Options range from setting the policy of the child to
SCHED_OTHER, to setting it to SCHED_DEADLINE with the original
bandwidth split in half between the parent and child, to providing the
child with an initial bandwidth (budget, or runtime) of zero, as
happens by default in the latest implementation.
The IRMOS scheduler has hierarchical scheduling capabilities, in that
it allows for having a full-fledged POSIX realtime (sub-)scheduler
nested inside each EDF-based reservation: when an EDF thread
forks, the child keeps the same class (_FIFO or _RR)
and priority as the parent and they both keep sharing the same EDF
reservation. This allows for easy encapsulation of entire complex
multi-threaded software components (e.g., an Apache web server, a KVM
instance, etc.), but it also works with "traditional" realtime software
components, such as a set of a few realtime threads with realtime priorities
a single realtime component on the system (e.g., a control thread
along with the IRQ threads related to its own I/O towards the
controlled peripherals). If a strong assessment of
the realtime schedulability of the system is needed, it is possible
to use hierarchical realtime scheduling theory in order to analyze
whether or not the individual RT threads will meet their deadlines.
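Encapsulating a whole multi-threaded component then amounts to attaching
each of its thread ids to the reservation. A hypothetical helper, relying
only on the tasks file described above and the standard /proc/<pid>/task/
layout:

```shell
# attach_process <group-dir> <pid>: attach every thread of <pid> to the
# reservation. In a real cgroup hierarchy, each write to the tasks file
# attaches one more tid.
attach_process() {
    dir=$1; pid=$2
    for t in /proc/"$pid"/task/*; do
        echo "${t##*/}" > "$dir/tasks"
    done
}
```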
SCHED_DEADLINE uses a new system call interface,
sched_setscheduler_ex(), which only allows for creating a
reservation attached to a task.
The IRMOS scheduler exploits the realtime throttling cgroup-based
interface which was already there in the kernel, thus a new empty
reservation is created by creating a folder in the cgroup filesystem,
tasks are attached by adding their TIDs to the tasks file,
the runtime and period are set by echoing their values into the
corresponding file entries in the group folder, etc.
SCHED_DEADLINE supports partitioned EDF, and a global
EDF implementation [PDF] was made as part of the Master's degree thesis in
Computer Engineering of Juri Lelli.
In the IRMOS scheduler, reservations apply to all CPUs, and it
supports global fixed-priority tasks nested inside partitioned EDF-based
reservations. If an RT task exhausts the budget (runtime) of the EDF
scheduler on one CPU, it can still run by exploiting the budget available
to the same reservation on another CPU (migrating the task or the
bandwidth are both available options). However, affinity masks can
be used to gain better control over which CPUs realtime tasks
may migrate to.
SCHED_DEADLINE is implemented as a completely new scheduling class,
while the IRMOS scheduler is a modification to the already available POSIX
realtime scheduling classes.
SCHED_DEADLINE also supports soft (work-conserving) resource reservations,
while the IRMOS scheduler does not (however, it is planned as future work).
Even if the two projects are currently completely separate, there is
a good basis for a common EDF-based realtime scheduling
infrastructure, which might then be accessed through different user-space
APIs in the two cases.
In the future, we plan to improve the scheduler in various ways.
Concerning the user-space API, the cgroup-based
interaction model has already proved to suffer from major limitations. For
example, the absence of an atomic way to set multiple
scheduling parameters at the same time constitutes a major limitation of
the current interface.
Also, we plan to develop more options in the realtime scheduling
model, such as:
- Adding soft resource reservations, allowing
work-conserving reservations to coexist with non-work-conserving ones.
- Improving the access-control model, making the
realtime scheduling capabilities more easily accessible to
unprivileged applications.
- Adding the possibility to specify a desired budget
(runtime), in addition to the minimum guaranteed one subject to
admission control, which could be used for implementing adaptive
reservations; these, in turn, could be useful for applications showing
significant workload fluctuations at run time.
- Adding some form of deadline inheritance to better address
the well-known priority-inversion problem, e.g., by means of the
Multiprocessor BandWidth Inheritance (M-BWI) protocol, or some variation
on it.
All of the above improvements would go in the direction of making the
realtime scheduler, such as described above, more usable by common
multimedia applications; these are the applications that stand to benefit
the most from the realtime scheduling capabilities of the Linux kernel.
Page editor: Jonathan Corbet