Kernel development
Brief items
Kernel release status
The current development kernel is 4.10-rc8, released on February 12; Linus decided to wait another week before putting out the final 4.10 release. "But I decided that there's also no huge overriding reason to do so (other than getting back to the usual "rc7 is the last rc" schedule, which would have been nice), and with travel coming up, I decided that I didn't really need to open the merge window. I've done merge windows during travel before, but I just prefer not to."
Stable updates: 4.9.9 and 4.4.48 were released on February 9, followed by 4.9.10 and 4.4.49 on February 15.
Kernel development news
Inter-event tracing
The kernel's tracing infrastructure is primarily concerned with events, which are usually tied to the execution of specific blocks of code. But the interesting information is often in what happens between events. The most obvious variable of this type that one might want to monitor is timing — how much time elapses between one event and another that follows from it? — but others exist as well. Options for the calculation of inter-event values include the use of BPF programs or postprocessing in user space; a new patch set may soon add the ability to perform these calculations directly in the kernel instead.

Tom Zanussi's inter-event support patch set is clearly focused on timing measurements. It is an extension of his histogram triggers work that was merged in the 4.6 development cycle. That work provides a mechanism for the storage of data from events, but it can only do one thing with the result: generate a histogram. This storage capability has the potential for other uses and can be employed for inter-event tracing, but there are a few problems that need to be solved for that to be possible.
The first of those is to arrange for the storage of data from one event to be used later when another event fires. An example provided with the patch set involves the sched_wakeup event, which fires when the kernel decides that a sleeping process is now runnable and should wake up. Consider the following command:
echo hist:keys=pid:ts0=common_timestamp.usecs \
>> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
This command establishes a special sort of "histogram" (really just using the histogram mechanism's data-storage capability) on the sched_wakeup event. The keys=pid directive uses the ID of the process to be awakened as the key with which the data is stored. The actual data associated with that key is specified with ts0=common_timestamp.usecs. It creates a new variable, ts0, that remembers the current time when the event was fired. The common_timestamp field is also new; it makes timestamp information available on any event.
The above records when the kernel decided to wake a process; now it is time to compute how long it takes until that process actually runs on a CPU. That time is the wakeup latency experienced by the process; one generally wants it to be as low as possible and, in a realtime setting, it must not exceed the maximum the system can tolerate. The wakeup latency can be calculated using the sched_switch event, which fires when a new process is given access to the processor. That is done with a command like:
echo 'hist:keys=woken_pid=next_pid:woken_prio=next_prio:\
wakeup_lat=common_timestamp.usecs-ts0:\
onmatch().trace(wakeup_latency)' \
>> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
There are a few things happening here, needless to say. The keys=woken_pid=next_pid fragment takes the next_pid event variable (identifying the process to be switched to in the processor), assigns it to a new variable called woken_pid, and uses it as the key into the histogram data. The next fragment, woken_prio=next_prio, stores the priority of the new process in the new woken_prio variable. Things get a bit more complicated with:
wakeup_lat=common_timestamp.usecs-ts0
Here, the ts0 timestamp value that was saved in the sched_wakeup event is recalled and subtracted from the current time, yielding the latency; that value is then stored in yet another new variable called wakeup_lat.
The onmatch() directive in the above command relates to how the computed latency is reported. That value is computed from two separate events and, thus, is not really associated with either of the events described above, so it should not be reported with them. Instead, the patch set creates a new abstraction called a "synthetic event" that is used to report calculated, inter-event values. In this case, such an event can be created with a command like:
echo 'wakeup_latency lat=sched_switch:wakeup_lat \
pid=sched_switch:woken_pid \
prio=sched_switch:woken_prio' \
>> /sys/kernel/debug/tracing/synthetic_events
This command creates a new event called wakeup_latency; three variables are created, essentially as pointers to variables in the sched_switch event. One of those, wakeup_lat (exposed as lat here), is the calculated latency.
With that event in place, we can look at the final part of the previous command:
onmatch().trace(wakeup_latency)
The onmatch() pseudo-function acts if there is a match on the histogram key (the process ID in this case); when it acts, it will cause the synthetic event named in the trace() "call" to be fired. That event behaves like any other event in all respects; it can be read out to user space or used to create a histogram, for example.
With the above commands, it is possible to monitor wakeup latencies in the system. The command set has been simplified a bit from the original commands in the patch set, which include extra filtering to limit latency tracking to processes running the cyclictest testing tool. For more details, see the patch set announcement linked above or the documentation that is part of the patch set itself.
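The combined flow of the two triggers can be sketched as a user-space analog. This is purely illustrative Python, not kernel code; the names (ts0, wakeup_lat, the synthetic wakeup_latency event) mirror the commands above, but the handler functions and event stream are hypothetical.

```python
# User-space analog of the two histogram triggers described above;
# a sketch of the matching logic only, not the kernel implementation.

ts0 = {}               # sched_wakeup trigger: keys=pid, ts0=common_timestamp
synthetic_events = []  # where trace(wakeup_latency) records its output

def on_sched_wakeup(pid, common_timestamp):
    # keys=pid:ts0=common_timestamp.usecs — save the wakeup time, keyed by PID
    ts0[pid] = common_timestamp

def on_sched_switch(next_pid, next_prio, common_timestamp):
    # onmatch(): act only if the key (the PID) was stored at wakeup time
    if next_pid in ts0:
        # wakeup_lat=common_timestamp.usecs-ts0
        wakeup_lat = common_timestamp - ts0.pop(next_pid)
        # trace(wakeup_latency): fire the synthetic event
        synthetic_events.append(
            {"lat": wakeup_lat, "pid": next_pid, "prio": next_prio})

# A short, made-up event stream (timestamps in microseconds):
on_sched_wakeup(pid=42, common_timestamp=1000)
on_sched_switch(next_pid=42, next_prio=120, common_timestamp=1350)
print(synthetic_events)  # [{'lat': 350, 'pid': 42, 'prio': 120}]
```

The dictionary keyed by PID plays the role of the histogram mechanism's data-storage capability; the kernel, of course, does all of this in the tracing fast path rather than in Python.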
The patch set has seen a fair amount of active review, resulting in generally positive comments. It does seem likely, though, that the details of the syntax as described above will change before this work is considered ready for merging. Nobody has, so far, suggested that attached BPF programs should be used to perform these calculations, though the problem could conceivably be solved that way. In any case, for now, the patch set is being reworked to reflect the review comments received so far; a new version should be expected before too long.
A resolution on control-group network filters
The 4.10 merge window included the addition of the ability to attach a BPF program to a control group; that program could then filter network packets for all processes within the group. In January, concerns were raised about several aspects of the API for this feature. As the final 4.10 release approaches, it would seem that a last-minute solution for at least one of these concerns has been reached.
One of the strongest worries raised in January had to do with how filter programs at multiple levels of the control-group hierarchy interact. Consider a simple two-level hierarchy: group A is at the top level, while groups B and C are contained within it. What happens to a process contained within B if filter programs are attached at both A and B? In the original implementation, only the program attached at the lower level (B) would be run. As a result, any process that had the ability to attach a program to B would be able to override any restrictions imposed at higher levels of the hierarchy. In some settings, that may be desirable behavior but, in others, it could be a security issue.
This is an important detail to get right from the outset; if programs come to rely on the behavior described above, it may prove impossible to change in the future without breaking working systems. So there was a certain amount of back-and-forth over whether this behavior was problematic or not and how urgently it needed to be changed. Then things went quiet for a while, and it appeared that 4.10 would ship with this behavior. And, indeed, that might have happened if 4.10 had, as most had expected, been released on February 12.
A few days before that date, Alexei Starovoitov posted a patch aimed at addressing the objections to the 4.10 behavior. It modifies the bpf() system call used to attach a filter program to a control group, adding a new flag called BPF_F_ALLOW_OVERRIDE. If the flag is present when a filter program is installed, a process with access to a lower-level control group will be able to override that program by attaching a new program there. So, if a filter program is attached to group A above with this flag set, it will be possible to override that program in groups B and C by attaching new filter programs there. If the flag is absent, it will not be possible to attach programs to those lower groups at all.
The BPF_F_ALLOW_OVERRIDE flag, in other words, implements the current 4.10 semantics for the groups below the group where a program is attached. The flag is not set by default, though, so the default behavior has changed to prevent a program from being overridden in this way. That gives system administrators control over how this behavior is handled while defaulting to a more secure mode and preventing code from relying on the unconditional ability to override packet-filter programs on control groups.
In response to the original patch, Andy Lutomirski, who raised most of the concerns about the original interface, suggested one change: if the filter program installed at A sets BPF_F_ALLOW_OVERRIDE, any programs installed lower in the hierarchy should be required to have that flag set as well. That restriction will avoid potential confusion if, at some future date, the ability to stack BPF programs at multiple levels (so that the filters at all levels of the hierarchy run, rather than just the lowest one) is added. Starovoitov agreed, and promptly posted a new version with that additional restriction added.
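The resulting attach rules can be modeled in a few lines. This is an illustrative sketch of the semantics described above (including Lutomirski's refinement), not the kernel's implementation; the Cgroup class and its attach() method are hypothetical, though the BPF_F_ALLOW_OVERRIDE flag itself is real.

```python
# Model of the agreed-upon attach semantics; hypothetical code,
# not the kernel's actual cgroup-bpf implementation.

BPF_F_ALLOW_OVERRIDE = 1 << 0

class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.prog_flags = None  # None means no program is attached here

    def attach(self, flags):
        # Walk up the hierarchy: every ancestor with a program must have
        # allowed overriding and, per Lutomirski's refinement, the new
        # program must then set the flag as well.
        node = self.parent
        while node is not None:
            if node.prog_flags is not None:
                if not (node.prog_flags & BPF_F_ALLOW_OVERRIDE):
                    raise PermissionError("ancestor forbids override")
                if not (flags & BPF_F_ALLOW_OVERRIDE):
                    raise PermissionError("override requires the flag too")
            node = node.parent
        self.prog_flags = flags

# The hierarchy from the example: A contains B and C.
A = Cgroup(); B = Cgroup(parent=A); C = Cgroup(parent=A)
A.attach(BPF_F_ALLOW_OVERRIDE)
B.attach(BPF_F_ALLOW_OVERRIDE)  # allowed: A permits it and B sets the flag
try:
    C.attach(0)                  # rejected: C omits the flag
except PermissionError as e:
    print(e)                     # prints "override requires the flag too"
```

Had A attached its program without the flag, any attach in B or C would fail, which is the new, more secure default described above.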
At this point, everybody involved appears to be happy with the patch, and there is general agreement that it should be merged before 4.10 is released. That has not happened as of this writing, but there would appear to be no fundamental roadblock that would prevent it from happening before February 19, when the final 4.10 release will almost certainly happen. So it would appear that this story, which included some moderately acrimonious debate, has come to a reasonably happy conclusion.
Page editor: Jonathan Corbet
