
Followups: performance counters, ksplice, and fsnotify

By Jonathan Corbet
December 17, 2008
There's been progress in a few areas which LWN has covered in the past. Here's a quick followup on where things stand now.

Performance monitors

In last week's episode, a new, out-of-the-blue performance monitoring patch had stirred up discussion and a certain amount of opposition. The simplicity of the new approach by Ingo Molnar and Thomas Gleixner had some appeal, but it is far from clear that this approach is sufficiently powerful to meet the needs of the wider performance monitoring community.

Since then, version 3 and version 4 of the patch have been posted. A look at the changelogs shows that work on this code is progressing quickly. A number of changes have been made, including:

  • The addition of virtual performance counters for tracking clock time, page faults, context switches, and CPU migrations.

  • A new "performance counter group" functionality. This feature is meant to address criticism that the original interface would not allow multiple counters to be read simultaneously, making it hard to correlate different counter values. Counters can now be associated into multiple groups which allow them to be manipulated as a unit. There's also a new mechanism allowing all counters to be turned on or off with a single system call.

  • The system call interface has been reworked; see the version 3 announcement for a description of the new API.

  • The kerneltop utility has been enhanced to work with performance counter groups.

  • "Performance counter inheritance" is now supported; essentially, this allows a performance monitoring utility to follow a process through a fork() and monitor the child process(es) as well.

  • The new "timec" utility runs a process under performance monitoring, outputting a whole set of statistics on how the process ran.

There are still concerns about this new approach to performance monitoring, naturally. Developers worry that users may not be able to get the information they need, and it still seems like it may be necessary to put a huge amount of hardware-specific programming information into the kernel. But, to your editor's eye, this patch set also seems to be gaining a bit of the sense of inevitability which usually attaches itself to patches from Ingo and company. It will probably be some time, though, before a decision is made here.

Ksplice

In November, we looked at a new version of the Ksplice code, which allows patches to be put into a running kernel. The Ksplice developers would like to see their work go into the mainline, so they recently poked Andrew Morton to see what the status was. His response was:

It's quite a lot of tricky code, and fairly high maintenance, I expect.

I'd have _thought_ that distros and their high-end customers would be interested in it, but I haven't noticed anything from them. Not that this means much - our processes for gathering this sort of information are rudimentary at best.

The response on the list, such as it was, indicated that the distributors are, in fact, not greatly interested in this feature. Dave Jones commented:

It's a neat hack, but the idea of it being used by even a small percentage of our users gives me the creeps....

If distros can't get security updates out in a reasonable time, fix the process instead of adding mechanism that does an end-run around it. Which just leaves the "we can't afford downtime" argument, which leads me to question how well reviewed runtime patches are. Having seen some of the non-ksplice runtime patches that appear in the wake of a new security hole, I can't say I have a lot of faith.

The Ksplice developers agree that the writing of custom code to fit patches into a running kernel is a scary proposition; that is why, they say, they've gone out of their way to make such code unnecessary most of the time.

This discussion leaves Ksplice in a bit of a difficult position; in the absence of clear demand, the kernel developers are unlikely to be willing to merge a patch of this nature. If this is a feature that users really want, they should probably be communicating that fact to their distributors, who can then consider supporting it and working to get it into the mainline.

fsnotify

The file scanning mechanism known as TALPA got off to a rough start with the kernel development community. Many developers have a dim view of the malware scanning industry in general, and they did not like the implementation that was posted. It is clear, though, that the desire for this kind of functionality is not going away. So developer Eric Paris has been working toward an implementation which will pass review.

His latest attempt can be seen in the form of the fsnotify patch set. This code does not, itself, support the malware scanning functionality, but, says Eric, "you better know it's coming." What it does, instead, is to create a new, low-level notification mechanism for filesystem events.

At first glance, that may seem like an even more problematic approach than was taken before. Linux already has two separate file event notifiers: dnotify and inotify. Kernel developers tend to express their dissatisfaction with those interfaces, but there has not been a whole lot of outcry for somebody to add a third alternative. So why would fsnotify make sense?

Eric's idea seems to be to make something that so clearly improves the kernel that people will lose the will to complain about the malware scanning functionality. So fsnotify has been written - employing a lot of input from filesystem developers - to be a better-thought-out, more supportable notification subsystem. Then the existing dnotify and inotify code is ripped out and reimplemented on top of fsnotify. The end result is that the impact on the rest of the VFS code is actually reduced; there is now only one set of notifier calls where, previously, there were two. And, despite that, the notification mechanism has become more general, being able to support functionality which was not there in the past.

And, to top it off, Eric has managed to make the in-core inode structure smaller. Given that there can be thousands of those structures in a running system, even a small reduction in their size can make a big difference. So, claims Eric, "That's right, my code is smaller and faster. Eat that."

What this code needs now is detailed review from the core VFS developers. Those developers tend to be a highly-contended resource, so it's not clear when they will be able to take a close look at fsnotify. But, sooner or later, it seems likely that this feature will find its way into the mainline.



Followups: performance counters, ksplice, and fsnotify

Posted Dec 18, 2008 4:08 UTC (Thu) by kev009 (subscriber, #43906) [Link]

I was a big fan of Ingo and Thomas, buying into the CFS, hrtimer, and dyntick hype, but in retrospect (tbench regressions, a general lack of quality in kernels 2.6.23+) they seem to have an easier time than others getting big/dangerous changes in.

I hope that Linus is more careful with crack projects from this duo in the future. CFS seemed to go in about 4 kernels too early judging by the shotgun patches applied since.

Followups: performance counters, ksplice, and fsnotify

Posted Dec 18, 2008 5:18 UTC (Thu) by deater (subscriber, #11746) [Link]

Ingo's performance counter infrastructure is a bit pointless. All it does is distract from the actual performance counter implementations that are trying to be merged (admittedly those partially sat outside of the kernel a bit too long).

If Ingo ever addresses the shortcomings brought up on the linux-kernel list (he hasn't) or ever tries to implement things on a machine that isn't a Core 2 Duo machine (most notably, Pentium 4 or PowerPC) he'll find out that things get complicated quickly. And the kernel is not the place for these complications.

It's a big mess, and a big frustration to those of us who use performance counters regularly and have to look forward to the prospect of patching our kernels by hand for years to come because an inferior infrastructure useful more-or-less only to Ingo gets merged.

Followups: performance counters, ksplice, and fsnotify

Posted Dec 18, 2008 15:05 UTC (Thu) by niner (subscriber, #26151) [Link]

I wonder where this sentiment that hardware-specific stuff doesn't belong in the kernel comes from. I thought one of the kernel's main purposes was to abstract the hardware and hide it from user space. Why then put hardware-specific stuff into user-space libraries instead of the kernel?

Followups: performance counters, ksplice, and fsnotify

Posted Dec 19, 2008 5:02 UTC (Fri) by deater (subscriber, #11746) [Link]

I wonder where this sentiment that hardware-specific stuff doesn't belong in the kernel comes from. I thought one of the kernel's main purposes was to abstract the hardware and hide it from user space. Why then put hardware-specific stuff into user-space libraries instead of the kernel?

The kernel should abstract the hardware, but in as minimal a way as possible. With performance counters, this means that the kernel should enable starting and stopping of monitoring, enforce some sanity checks, and provide user space with a common way to set up events.

What it does *not* mean is including 200k of library code that maps meaningful textual names to the numeric counter identifiers, or including all the subtle limitations of the counters (not all counters can count all events, not all counters are available on all steppings of a CPU, etc).

Putting that all into the kernel would definitely be a losing proposition. Perfmon does it from userspace. Ingo's method would have it all in the kernel.

This is similar to the argument about whether video4linux should include format conversions in the kernel or not.

It's important to know what the correct level of abstraction for your interface is.

Followups: performance counters, ksplice, and fsnotify

Posted Dec 21, 2008 12:05 UTC (Sun) by Ze (guest, #54182) [Link]

>Putting that all into the kernel would definitely be a losing proposition. Perfmon does it from userspace. Ingo's method would have it all in the kernel.
>This is a similar argument about whether video4linux should include format conversions into the kernel or not.
It seems to me that this is an argument in favour of a microkernel approach.

I mean, we've already got loadable kernel modules, FUSE, and a move to push USB drivers out of kernel space into libusb.

Personally I can see the day when someone leverages the kernel driver model code but puts it in a kernel based around a microkernel.

Patching runtime kernel

Posted Dec 18, 2008 13:20 UTC (Thu) by NAR (subscriber, #1313) [Link]

If this is a feature that users really want, they should probably be communicating that fact to their distributors, who can then consider supporting it and working to get it into the mainline.

I guess most people who can't afford downtime due to installing a (possibly security) patch already have some kind of HA system where they can patch the system without downtime...

Patching runtime kernel

Posted Dec 19, 2008 14:48 UTC (Fri) by ballombe (subscriber, #9523) [Link]

Not necessarily... I maintain servers which offer ssh access to researchers to run long-running computations. They perform their computations any way they want (using interactive tools under screen, by running their own code, etc.), and there is no way I can checkpoint their computations, so I cannot reboot the system without killing them, which is precisely what I am supposed to prevent. Planned power outages and the like can be handled through suspend-to-disk, but updating the kernel requires a reboot.

Since we provide ssh access to a number of users, local privilege escalations are a problem, but I cannot just reboot the system whenever I want.

Of course I would need a high level of trust in ksplice before using it.

Patching runtime kernel

Posted Dec 20, 2008 1:35 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

What you have isn't actually an aversion to downtime, it's an aversion to reboots. Which is another reason to like patching a kernel on the fly.

People who are averse to downtime (i.e. they can't afford to be offline for three minutes) do usually have some kind of redundant system so they can reboot one system, then the other, and thereby install a kernel patch without ever being offline.

People who are averse to a reboot (i.e. they don't want to lose the state gathered by the past four days of calculation) might use checkpointing in order to tolerate reboots, but you're at least one example of someone who doesn't. Because your users use generic tools, the only way I know of that checkpointing could eliminate the pain of a reboot is the new virtual machine-based approaches.

Patching runtime kernel

Posted Jun 25, 2009 12:08 UTC (Thu) by epa (subscriber, #39769) [Link]

I think in ten years' time we will look back and see how silly it is to require a reboot every time the kernel is patched. Nowadays it's obvious that waiting for fsck after an unclean shutdown is unacceptable, even though that's the way it was for many years. Anything which can cut the number of reboots is a step forward for desktop usability. We don't want Linux to be that annoying system that wants to restart itself all the time, a title currently held by Windows, but by a thin margin given the frequency of kernel updates by many distros.

So yes, ksplice is wanted; for the remaining 30% of kernel updates that can't be spliced into a running system, working suspend/resume should help to keep downtime to a minimum.

Followups: performance counters, ksplice, and fsnotify

Posted Dec 19, 2008 2:54 UTC (Fri) by stevef (subscriber, #7712) [Link]

Changing the notify mechanism again may be a good idea, but I don't know whether it will map easily to the network protocol, so this needs more analysis, since notify is most useful over a network filesystem (on a local filesystem, notify is useful, but polling is not as expensive as it is over a network, and there are other ways that applications can detect new files). CIFS and SMB2 have a notify mechanism (and Samba server's need to call this led to the initial Linux kernel implementation, which matched fairly closely with the CIFS wire protocol), but I don't know whether the new mechanism will make it harder or easier to finish the notify support in the CIFS client (which is currently turned off in mainline).

Followups: performance counters, ksplice, and fsnotify

Posted Dec 19, 2008 22:50 UTC (Fri) by oak (guest, #2786) [Link]

> if you are running on a local file system, notify is useful but polling is not as expensive as it is over a network,

Maybe you were thinking of something tethered to a power cord? For battery-powered devices, polling is evil.

(Before dynticks got into the mainline, the desktop crowd using laptops might not have noticed / cared, but with dynticks you suddenly start to see how much less power your laptop uses, and how much longer it can run, when polling wakeups are reduced radically.)

KSplice: Yes, please

Posted Dec 19, 2008 13:54 UTC (Fri) by walles (guest, #954) [Link]

I think there are lots of people who
a) sometimes upgrade their kernels
b) don't like rebooting their systems

So why are people worried that patching of a running kernel wouldn't be used by anybody?

Personally I can't see who *wouldn't* use it, and I'd love to see these patches go in!

//Johan

KSplice: Yes, please

Posted Dec 20, 2008 22:18 UTC (Sat) by man_ls (guest, #15091) [Link]

I think there are two problems here. For desktop systems the biggest issue is stability: if a live-patched kernel is going to be less stable, then people are not going to like it; distributions will probably disable ksplice just in case and it will slowly rot.

Meanwhile, for servers there is the added issue of predictability. To diagnose any problems you want to know for sure the exact state of a system. After a few months of live patching it is hard to know which code is running, so it will add noise to any problem-solving efforts on these machines. The safest course of action is again to disable it, and that is what server distros will probably do.

The inherent coolness of ksplice is big, but before it hits big time it has to meet at least three requirements:

  • stability: a live-patched kernel must be rock solid,
  • predictability: the state of a patched kernel must be exactly like that of a fresh one, or at least as well defined,
  • and accountability: there have to be tools to audit the exact state of the kernel.
Until then it is IMHO best left alone.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds