
In brief

By Jonathan Corbet
May 13, 2009
Editor's note: it's no secret that far more happens on the kernel mailing lists than can ever be reported on this page. As a result, interesting discussions and developments often slip by without a mention here. This article is the beginning of an experimental attempt to improve that situation. The idea is to briefly mention important topics which have not, yet, been developed into a full Kernel Page article. Some items will be followups from previous discussions; others may foreshadow full articles to come.

The "In brief" article will probably not appear every week. But, if it works out, it should become a semi-regular feature filling out LWN's kernel coverage. Comments are welcome.

reflink(): the proposed reflink() system call was covered last week. Since then, there have been some followup postings. reflink() v2, posted on May 7, maintained the reflink-as-snapshot semantics. When asked about that decision, Joel Becker responded "reflink() is a snapshotting call, not a kitchen sink." It seemed like there was to be no comfort for those wanting reflink-as-copy semantics.

reflink() v4, posted on the 11th, changed that tune somewhat. In this version, a process which either (1) owns the target file, or (2) has sufficient capabilities will create a link which copies the original security information - reflink-as-snapshot, essentially. A process lacking ownership and privilege, but having read access to the target file, will get a reflink with "new file" security information - reflink-as-copy. The idea is to do the right thing in all situations, but some developers are now concerned about a system call which has different semantics for processes running as root. This conversation has a while to go yet.
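
The v4 rule described above can be written down as a small decision function. This is only a sketch in Python of the behavior as summarized here, not the patch itself; the function name, its parameters, and the assumption that a caller with no read access is simply refused are all illustrative.

```python
def reflink_mode(is_owner: bool, has_capability: bool, can_read: bool) -> str:
    """Return the semantics a reflink() v4 caller would get (illustrative only)."""
    if is_owner or has_capability:
        # Owner or sufficiently privileged caller: the original security
        # information is copied, giving reflink-as-snapshot.
        return "snapshot"
    if can_read:
        # Unprivileged reader: the link gets "new file" security
        # information, giving reflink-as-copy.
        return "copy"
    # Assumption (not stated in the article): without read access the call fails.
    raise PermissionError("no read access to the target file")


if __name__ == "__main__":
    print(reflink_mode(is_owner=True, has_capability=False, can_read=True))
    print(reflink_mode(is_owner=False, has_capability=False, can_read=True))
```

The same call yields two different results depending only on who makes it, and that is the property some developers are uneasy about.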

devtmpfs was also covered last week. This patch, too, has been reposted; the resulting conversation, again, looks to go on for a while. The return of devfs was always going to be controversial; the first version, after all, inspired flame wars for years before being merged. The devtmpfs developers feel that they need this feature to provide distributions which boot quickly and reliably in a number of situations; others think that there are better solutions to the problem. There is no consensus on merging this code at this time, but it is worth noting that the discussion has slowly shifted away from general opposition and toward fixing problems with the code.

Wakelocks are back, but now the facility has been rebranded suspend block. The core idea is the same: it allows code in kernel or user space to keep the system from suspending for a brief period of time. The user-space API has changed; there is now a /dev/suspend_blocker device which provides a couple of ioctl() calls. Closing the device releases the block, eliminating a potential problem with the wakelock API where a failed process could leave a block in place indefinitely.
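
The reason that tying a block to an open file descriptor helps can be seen in a toy model. The Python class below is purely hypothetical: it does not use the real device or its ioctl() calls, and every name in it is invented for illustration. It only models the rule that a block lives no longer than the handle which holds it.

```python
class SuspendBlockerModel:
    """Toy model: each open handle may hold at most one suspend block."""

    def __init__(self):
        self._next_handle = 0
        self._blocking = {}  # handle -> True while that handle blocks suspend

    def open(self):
        # Stands in for opening the device; returns a new handle.
        handle = self._next_handle
        self._next_handle += 1
        self._blocking[handle] = False
        return handle

    def block(self, handle):
        self._blocking[handle] = True

    def unblock(self, handle):
        self._blocking[handle] = False

    def close(self, handle):
        # Closing drops the handle together with any block it still held.
        del self._blocking[handle]

    def can_suspend(self):
        return not any(self._blocking.values())


if __name__ == "__main__":
    model = SuspendBlockerModel()
    h = model.open()
    model.block(h)
    print(model.can_suspend())  # False while the block is held
    model.close(h)              # e.g. the process died and its descriptors were closed
    print(model.can_suspend())  # True again
```

Since the kernel closes a process's file descriptors when it exits, a crashed process cannot leave a block behind under this scheme.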

There has been relatively little discussion of the new code; either everybody is happy with it now, or nobody has really noticed the new posting yet.

Doctor, it HZ. Much of the kernel is now tickless and equipped with high-resolution timers. So, says Alok Kataria, there is really no need to run x86 systems with a 1ms clock tick anymore. Running with HZ=1000 measurably slows the execution of a CPU-bound loop. So why not lower it?

There are problems with a lower HZ value, though, many of which have, at their source, the same problem which makes HZ=1000 more expensive: the kernel is still not truly tickless. Yes, the periodic clock interrupt is turned off when the processor is idle. But, when the CPU is busy, the clock ticks away as usual. Making the system fully tickless is a harder job than just making the idle state tickless; among other things, it pretty much requires doing away with the jiffies variable and all that depends on it. But, until that happens, lowering HZ will have costs of its own.
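
A bit of arithmetic shows why the tick rate matters on a busy CPU. The sketch below is in Python; the per-tick cost of 5 microseconds is an invented figure chosen only to make the numbers concrete, not a measurement from the discussion.

```python
def tick_period_ms(hz):
    """Length of one clock tick in milliseconds for a given HZ value."""
    return 1000.0 / hz


def tick_overhead_fraction(hz, cost_per_tick_s):
    """Fraction of a busy CPU's time spent servicing the periodic tick."""
    return hz * cost_per_tick_s


if __name__ == "__main__":
    ASSUMED_COST_S = 5e-6  # hypothetical cost of handling one tick
    for hz in (100, 250, 1000):
        period = tick_period_ms(hz)
        overhead = tick_overhead_fraction(hz, ASSUMED_COST_S)
        print(f"HZ={hz}: tick every {period} ms, overhead {overhead:.3%}")
```

Whatever the real per-tick cost is, the overhead scales linearly with HZ for as long as the tick keeps running on a busy processor.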

Wu Fengguang has been trying for a while to extend /proc/kpageflags; his patch adds a great deal of information about the usage of memory in the system. One might think that adding more useful information would be uncontroversial, but Ingo Molnar continues to oppose its inclusion. Ingo does not like the interface or the fact that it lives in /proc; his preferred solution looks more like an extension to ftrace. More thought toward the creation of uniform instrumentation interfaces is probably a good idea, but the current /proc/kpageflags interface has proved useful. It's also an established kernel ABI, so it's not going away anytime soon. But whether /proc/kpageflags will be extended further remains to be seen.



In brief -- very useful

Posted May 14, 2009 3:53 UTC (Thu) by knobunc (subscriber, #4678) [Link]

Or slightly more verbose:
Please continue this feature if you can!! It is great.
Thanks, -ben

In brief -- very useful

Posted May 14, 2009 19:06 UTC (Thu) by Azazel (subscriber, #3724) [Link]

Definite thumbs-up.

In brief -- very useful

Posted May 18, 2009 1:07 UTC (Mon) by vonbrand (subscriber, #4458) [Link]

Seconded. Thanks!

In brief

Posted May 14, 2009 4:05 UTC (Thu) by neilbrown (subscriber, #359) [Link]

This is a wonderful idea. I don't find nearly enough time for lkml - I barely skim read the subject lines. If you keep up with this summary, it will help me know which threads to bother reading for myself :-)

In brief

Posted May 14, 2009 6:13 UTC (Thu) by jengelh (subscriber, #33263) [Link]

Seems that my comment that LWN should provide something more kerneltraffic-like was heard. Yay!

In brief

Posted May 14, 2009 6:21 UTC (Thu) by jcm (subscriber, #18262) [Link]

Yeah, it's a great idea :)

In brief

Posted May 14, 2009 9:43 UTC (Thu) by cmot (subscriber, #53097) [Link]

AOL!!!!1!

I really, really missed kerneltraffic when it went offline, and have
searched for a similar source of news that close to where "stuff" is
happening ever since.

Zack's Kernel News

Posted May 14, 2009 15:54 UTC (Thu) by southey (subscriber, #9466) [Link]

Linux Magazine Pro (http://www.linux-magazine.com/) publishes this column by Zack Brown.

Zack's Kernel News

Posted May 14, 2009 16:24 UTC (Thu) by jengelh (subscriber, #33263) [Link]

I have seen some of it, but its breadth comes nowhere near what the website used to offer. Subjectively, it covers even less than LWN.

Zack's Kernel News

Posted May 15, 2009 6:00 UTC (Fri) by cmot (subscriber, #53097) [Link]

Thanks, haven't seen it before. Hmm, is there an easy way to subscribe to
just this column?

Zack's Kernel News

Posted May 15, 2009 12:38 UTC (Fri) by sjlyall (subscriber, #4151) [Link]

I always find it quite weird how the intro to his column implies that Kernel traffic is still being published when he stopped 4.5 years ago.

In brief

Posted May 14, 2009 9:32 UTC (Thu) by mces (subscriber, #27668) [Link]

"Yes, the periodic clock interrupt is turned off when the processor is idle. But, when the CPU is busy, the clock ticks away as usual. Making the system fully tickless is a harder job than just making the idle state tickless; among other things, it pretty much requires doing away with the jiffies variable and all that depends on it."

I doubt that a tickless system can be built on top of modern hardware. In fact, IMHO, in a truly tickless system the kernel must rely either on

  1. an accurate and fast Real-Time Clock (and current RTCs are slow and inaccurate), or
  2. an incrementer/decrementer device working at a reasonably high fixed frequency and having a large enough counter to avoid wraparounds

Line of reasoning:

  • Assume that the kernel wants to keep track of time elapsing with reasonable accuracy and cannot rely on an external time source
  • Thus, it has to rely on an internal hardware "clock" device
  • In a tick-based system, time is accounted for by counting the occurrences of a periodic timer interrupt
  • In a tickless system, time is accounted for by measuring how much time has elapsed since the last time-related event (for instance, since the last gettimeofday() system call)
  • Any existing "clock" device is based on an incrementer or decrementer that keeps running at a fixed frequency. The higher the frequency, the smaller the time interval that can be measured by the clock device
  • There could be several ways to fix the wraparound problem for the clock device's counter, but when the system is idle all of them basically require raising an interrupt at (or right near) the wraparound event
  • How is that "wraparound" interrupt essentially different from a "tick" event?
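
How often such a "wraparound" interrupt would be needed depends only on the counter's width and frequency. The Python sketch below works that out; the 1 GHz frequency and the two counter widths are example values chosen for illustration, not figures for any particular device.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def wrap_interval_s(counter_bits, frequency_hz):
    """Seconds until a free-running counter of the given width wraps around."""
    return (2 ** counter_bits) / frequency_hz


if __name__ == "__main__":
    # Illustrative values only: counters clocked at 1 GHz.
    print(wrap_interval_s(32, 1e9))                      # a little over four seconds
    print(wrap_interval_s(64, 1e9) / SECONDS_PER_YEAR)   # several centuries
```

With a narrow counter the wraparound interrupt is frequent enough to resemble a slow tick; with a 64-bit counter it effectively never fires.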

Marco

In brief

Posted May 14, 2009 13:58 UTC (Thu) by nix (subscriber, #2304) [Link]

Nehalem has a constant-frequency TSC which does not stop unless suspended. That looks like what you're looking for (and finally! a useful TSC, after how many years?)

In brief

Posted May 14, 2009 16:18 UTC (Thu) by mces (subscriber, #27668) [Link]

A 64-bit counter ticking at modern bus frequencies would work well.
However, it should keep counting even when the CPU is halted, and I'm not sure that Nehalem's TSC does this. If TSC may halt, when the CPU awakes the kernel must access another device like the old RTC to recover wall clock time, and accuracy is lost.

IIRC, currently the kernel avoids accessing the RTC by programming a hardware interval timer so as to raise an interrupt far in the future, just for waking up and updating the wall clock time. How far in the future depends on both the frequency and the size of the counter in the hardware device. Anyway, you still have periodic timer interrupts, although at a rather low frequency.

In brief

Posted May 14, 2009 20:52 UTC (Thu) by nix (subscriber, #2304) [Link]

"If TSC may halt, when the CPU awakes the kernel must access another device like the old RTC to recover wall clock time, and accuracy is lost."
Yeah, sure, but if the CPU has just awoken from a very deep sleep state, precise accuracy isn't that important anyway. It's not as if it's going to be doing that very frequently.

(In any case, my understanding is that the Nehalem keeps its TSC running even in deep C-states. Obviously when actually powered off the TSC is halted, but *that* we can handle. You always have to get the time from a CPU-external source when you turn the power on.)

In brief

Posted May 14, 2009 17:03 UTC (Thu) by jstultz (subscriber, #212) [Link]

I suspect we really can go tickless, but that doesn't avoid the fact that we will always have to have periodic events fire, which, you're right, really amounts to about the same thing as a tick.

However, being able to stretch the interval between those events helps reduce overhead. So even if we never get to a point where we don't get events for days at a time, it's still a win.

The clocksource infrastructure in the kernel already provides much of what you describe. Part of what's needed now is a way to formalize the clocksource's limits so the dynamic tick infrastructure can maximize the tickless/eventless intervals safely. See Jon Hunter's recent dyntick patch for some interesting work there.
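
The arithmetic involved can be sketched in a few lines. This is a Python illustration of the general idea (a masked counter delta, a multiply-and-shift conversion to nanoseconds, and a bound on how long the counter may go unread), not kernel code; the function names and the 50% safety margin are assumptions made for the example.

```python
def cycles_delta(now, last, mask):
    """Cycles elapsed between two counter reads, tolerating one wraparound."""
    return (now - last) & mask


def cycles_to_ns(cycles, mult, shift):
    """Convert cycles to nanoseconds using fixed-point multiply and shift."""
    return (cycles * mult) >> shift


def max_idle_ns(mask, mult, shift, margin_percent=50):
    """Longest interval the counter may go unread, with an assumed safety margin."""
    return cycles_to_ns(mask, mult, shift) * margin_percent // 100


if __name__ == "__main__":
    # Example device: a 24-bit counter at 1 MHz, so one cycle is 1000 ns.
    MASK = 0xFFFFFF
    SHIFT = 10
    MULT = 1000 << SHIFT
    delta = cycles_delta(0x10, 0xFFFFF0, MASK)  # the counter wrapped between reads
    print(delta, cycles_to_ns(delta, MULT, SHIFT))
    print(max_idle_ns(MASK, MULT, SHIFT))
```

Python integers never overflow, so this sketch ignores a second limit that fixed-width arithmetic would impose: the product of cycles and the multiplier must itself fit in the word size.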

In brief

Posted May 15, 2009 15:25 UTC (Fri) by rgoates (guest, #3280) [Link]

Great idea! I hope you can make this at least a semiregular feature.

In brief

Posted May 16, 2009 0:30 UTC (Sat) by MisterIO (guest, #36192) [Link]

Very interesting and useful.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds