Brief items
The current development kernel is 3.5-rc5, released on June 30. Linus says:
"So nothing really worrisome in here. Despite the networking merge
(which tends to be fairly big), -rc5 is a smaller patch than -rc4 was,
even if there are a couple more commits in there. So things seem to be
going in the right direction."
Stable updates: no stable updates have been released in the last
week. The 3.2.22 update is in the review
process as of this writing; it can be expected at any time.
Please find large crayon and write on forehead "when fixing a bug,
be sure to describe the end-user impact of that bug".
—
Andrew Morton
Fundamentally, 8k stacks on x86-64 are too small for our
increasingly complex storage layers and the 100+ function deep call
chains that occur.
—
Dave Chinner
Kernel.org administrator John Hawley has announced that the
kernel patchwork system is finally back on the air.
"All the old user account still exist, though it is *HIGHLY* recommended
that once you log in you change your password."
James Bottomley has distilled his hard-earned knowledge of how to set up
UEFI secure boot with QEMU and the TianoCore system and placed it into
a web page.
It has a lot of information for anybody needing to work in this area.
"Intel has produced a project called TianoCore as an open firmware
reference implementation of UEFI. One of the sub projects within TianoCore
is OVMF which stands for Open Virtual Machine Firmware. It is OVMF that we
are using to produce the virtual machine image for qemu that will run the
UEFI secure boot environment. TianoCore secure boot is only really working
as of version r13466 of the svn repository. This version has not yet been
released as a downloadable zip file."
Kernel development news
By Jonathan Corbet
July 3, 2012
The D-Bus interprocess communication mechanism has, over the years, become a standard
component of the Linux desktop. For almost as long, developers have been
trying to find ways to make D-Bus faster. The latest attempt comes in the
form of a kernel patch set adding a new socket address family (called
AF_BUS) to the networking layer. Significant performance
improvements are claimed, but, like previous attempts, this one may have a
hard time getting into the mainline kernel.
D-Bus implements a mechanism by which processes can send messages to each
other. Multicast functionality is inherently a part of the protocol; one message can be
sent to multiple recipients. D-Bus promises reliable delivery, where
"reliable" means that
messages arrive in the order in which they were sent and multicast messages
will either be delivered to all recipients or, if that is not possible, to
none. There is a security model built into the protocol whereby messages
can be limited to specific recipients. All of these features are used by
contemporary systems, which expect the messaging system to be robust and
secure while adding as little latency and overhead as possible.
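The all-or-nothing multicast rule described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `recipient` structure, the fixed `QUEUE_CAP` queue depth, and `bus_multicast()` are hypothetical names invented for illustration; they are not taken from D-Bus or from the AF_BUS patches.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 4     /* hypothetical per-recipient queue depth */

struct recipient {
    int    queue[QUEUE_CAP];    /* message ids, oldest first */
    size_t count;               /* number of queued messages */
};

/*
 * Deliver msg to every recipient, or to none of them if any
 * recipient's queue is full.  Returns true on delivery.
 */
static bool bus_multicast(struct recipient *r, size_t n, int msg)
{
    size_t i;

    /* Pass 1: check that every recipient can accept the message. */
    for (i = 0; i < n; i++)
        if (r[i].count == QUEUE_CAP)
            return false;

    /* Pass 2: append in send order, so per-recipient ordering holds. */
    for (i = 0; i < n; i++)
        r[i].queue[r[i].count++] = msg;

    return true;
}
```

A caller that gets `false` back knows that no recipient saw the message, which is the property the "reliable" label refers to here; a real implementation would also need locking and a policy for blocked senders, both omitted from this model.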
The current D-Bus implementation uses Unix-domain sockets and a central
routing daemon. It works, but the routing daemon adds context switches,
overhead, and latency to each message it handles. The kernel is unable to
help get high-priority messages delivered first, so all messages cause
wakeups that slow down the processing of the most important ones; see this message for a description of how these
problems can affect a running system. It has
been evident for some time to the developers involved that a better
solution must be found.
There have been a number of attempts in that direction. The previous time
this topic came up, it was around a set of
patches adding multicast capabilities to Unix-domain sockets. This
idea was rejected with the claim that the Unix-domain socket code is
already too complicated and there was not enough justification to make
things worse by adding multicast capabilities. The D-Bus developers were
told to simply use IPv4 sockets, which already have multicast support,
instead.
What those developers actually did was to implement AF_BUS, a new address family designed to meet
the needs of D-Bus. It provides the reliable delivery that D-Bus requires;
it also has the ability to pass file descriptors and credentials from one
process to another. The security mechanism is built in, with the netfilter
code (augmented with a new D-Bus message parser) used to control which
messages can actually be delivered to any
specific process. The end result, it is claimed, is a significant
reduction in D-Bus overhead due to reduced system calls; submitter Vincent
Sanders claims "a doubling in throughput and better than halving of
latency." See the associated
documentation for details on how this address family works.
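For readers unfamiliar with file-descriptor passing, the following sketch shows how it works today over an ordinary Unix-domain socket, using the standard `SCM_RIGHTS` control message. It does not use AF_BUS; the article does not show that interface, so whether AF_BUS exposes descriptor passing in exactly this way is an assumption. The helper names `send_fd()`, `recv_fd()`, and `fd_passing_demo()` are invented for the example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for one file descriptor. */
union fd_ctrl {
    char           buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a Unix-domain socket, with one byte of payload. */
static int send_fd(int sock, int fd)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(); returns it, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Pass the write end of a pipe across a socketpair, then write through
 * the received copy.  Returns 0 on success.  Error paths leak
 * descriptors for brevity; real code would close them.
 */
static int fd_passing_demo(void)
{
    int sv[2], pfd[2], received;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(pfd) < 0)
        return -1;
    if (send_fd(sv[0], pfd[1]) < 0)
        return -1;
    received = recv_fd(sv[1]);
    if (received < 0)
        return -1;
    if (write(received, "x", 1) != 1 || read(pfd[0], &c, 1) != 1)
        return -1;

    close(received);
    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return c == 'x' ? 0 : -1;
}
```

IPv4 sockets offer no equivalent of this control message, which is one of the limitations raised later in the discussion.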
A factor-of-two improvement in a component that is widely used in Linux
systems would certainly be welcome. The patch set, however, was not;
networking maintainer David Miller immediately stated his intention to simply ignore the
patch set entirely. His objections seem to be that IPv4 sockets are
sufficient for the task and that reliable delivery of multicast messages
cannot be done, even in the limited manner needed by D-Bus. He expressed
doubts that the IPv4 approach had even been tried, and decreed: "We are not creating a full
address family in the kernel which exists for one, and only one, specific
and difficult user."
Vincent responded that a number of
approaches have been tried and found wanting. IPv4 sockets cannot provide
the needed delivery guarantees and do not allow for the passing of file
descriptors and credentials. It is also important, he said, for D-Bus to
be up and running before the networking subsystem has been configured;
setting up IP interfaces on a contemporary system often requires
communication over D-Bus. There really is no better solution, he said.
He found support from a few other developers, including Alan Cox, who pointed out that there is no
shortage of interprocess communication systems out there with requirements
similar to D-Bus:
In fact if you look up the stack you'll find a large number of
multicast messaging systems which do reliable transport built on
top of IP. In fact Red Hat provides a high level messaging cluster
service that does exactly this. (as well as dbus which does it on
the desktop level) plus a ton of stuff on top of that (JGroups etc)
Everybody at the application level has been using these 'receiver
reliable' multicast services for years (Websphere MQ, TIBCO, RTPGM,
OpenPGM, MS-PGM, you name it). There are even accelerators for PGM
based protocols in things like Cisco routers and Solarflare can do
much of it on the card for 10Gbit.
He added that latency concerns are paramount on contemporary systems and
that one of the best ways of reducing latency is to cut back on context
switches and middleman processes. Chris Friesen added that his company uses "an
out-of-tree datagram multicast messaging protocol family based on
AF_UNIX" that could almost certainly be replaced by something like
AF_BUS, were AF_BUS to be added to the mainline kernel.
There have been various other local messaging patch sets posted over the
years. So it seems clear that there is a significant level of interest in
having this sort of capability built into the Linux kernel. But interest
alone is not sufficient justification for the merging of a large patch set;
there must also be agreement from the developers who are charged with
ensuring that Linux has a top-quality networking stack in the long term.
That agreement is not yet there, so there may be a significant amount of
multicast interpersonal messaging required before we have multicast
interprocess messaging in the kernel.
July 3, 2012
This article was contributed by Neil Brown
Sometimes a casual comment can capture your imagination and not let
go until you do something with it. So it was for me with a comment made
by Heikki Orsila on some
observations that Greg Kroah-Hartman made about documentation in the
Linux kernel tree:
Greg, the documentation is very bad.
The specific documentation that he or Greg was thinking of may well
be very different from the specific documentation that my thoughts
turned to, but as both a producer and a consumer of some parts of
linux/Documentation I can at least agree that some of it
isn't very good.
Heikki continued: "Linux is badly documented, but so what?" and often
that would have been the end of it. But we are an open development community
where putting up with mediocrity is neither necessary nor encouraged.
If things are broken then there is always the possibility of fixing
them if only we know how. How, then, can we fix the documentation?
"Documentation" is a broad category and I would like to start by
narrowing our focus a little and
excluding reference documentation from consideration. By this I mean
documents used by a person knowledgeable in the subject who needs to
clarify some detail such as the arguments to some function or the
required ordering between two locks. For these details the source
code is by far the best resource - as it cannot get out-of-date - and,
when the code itself is not sufficient, placing the documentation in
the source code will provide the greatest likelihood of it being
found, read, and kept up-to-date. It doesn't really belong in a
separate Documentation directory.
The class of documentation that is of interest is documentation for
the new developer, not necessarily new to development but new to a
particular project or subsystem. Such a developer combines a lack of
knowledge with a genuine interest and this is a combination that is
not stable: if one component does not disappear soon, the other is
likely to. The task of good documentation is to ensure the lack of
knowledge disappears before the interest.
I was exposed to this instability when
trying to understand some details of power management in Linux. The
documentation simply didn't help and I had to look elsewhere.
However when I went back to assess the documentation while preparing
for this article I discovered that it wasn't as bad as I remembered.
I now had enough experience that it all made sense. The paucity of
the documentation was now only in my memory and I couldn't be sure I
had given that documentation a fair trial. The temptation to just
move on might have won had Heikki Orsila's observation not
encouraged me on.
To understand what makes good documentation we need to mine the
experiences from that short window of naive interest to find out
what works and what doesn't.
A question that seems most suited for digging is
"What were you looking for that you didn't find?"
I'm sure my kind reader will have their own answers
to offer, but here are three that I have found on my travels.
Wire-frame outlines
When I first went to the Linux power management documentation I was
after a "big picture" understanding. I wanted more detail than "this
code manages power" but not quite "these are the entry points that a
driver must provide". I wanted to know what the important parts were
and, significantly, how they connected together and impacted each
other. I picture this as a collection of key concepts together with
the linkage between them. These are nodes and edges in a graph,
entities and relationships, or for the more spatially oriented,
vertices and edges of a wire-frame polyhedron. This gives the shape
of the project without getting bogged down in details.
For me it is vital to have this framework first as I can only take in
and retain new details if I have something to attach them to and a
place to attach them. Without it I'll either attach new ideas to the
wrong place, or forget them completely - which is probably the safer
of the two.
The image of a wire-frame is a little misleading as it presents all
vertices as of equal value and this is rarely the case. Some concepts
are bigger and should be named and described first. Others can come
later. So maybe a ball-and-stick model might be a better picture,
with big and small balls, joined by thin and fat sticks.
In the case of Linux power management, one key concept that gives
shape to the whole is the number of multiplicities: there are multiple
sequencing states when moving away from or towards full functionality,
multiple power saving approaches such as runtime, suspend, and
hibernate, and many different sorts of devices that need to
fit into the frameworks. Another concept, already hinted at but often
recurring, is that there is generally one "fully functional" state but
several "low power" states, where moving between two low power states
involves returning to fully-functional and then reducing power a
different way.
Why, not what.
"Swap over NFS" is a set of functionality that some people find
valuable, but is not at all straightforward to implement. There is a
need to avoid deadlocks in memory management, and to do so without
slowing down either the networking code or the memory allocation code,
both of which are quite performance sensitive. There is a set of
patches which provides this functionality but getting it ready for
mainline inclusion has been a slow process.
Andrew Morton was recently good enough to provide some review of these
patches and, while reading the commit-log entries and code comments is
a little different from seeking out more coherent documentation, it does
provide a good window into the thoughts of someone who, while
generally knowledgeable, is both new to the project specifics and still
interested. It can thus answer the question "what were you looking for
that you didn't find?".
One observation that he made repeatedly is most clearly
embodied in
The comment should explain "why", not "what". Particularly when the
"what" was bleedin obvious ;)
or more humorously in: "s/"what"/"why"/ !".
Documenting what a function does is very important in closed-source
projects, but less so in open source where the code can be directly
read. Of course if the code is long and complex it might be easier to
read some documentation; however, the effort of writing the
documentation might be better spent in breaking up the code and making
it more readable.
Documenting why is much more valuable, whether it is "why do it
this way" or "why even do this at all". The "why" of a project is
rarely explicit in just one place of the code. Rather it permeates
throughout and can touch various fragments in different ways.
Sometimes the "why" is not technical at all but is historical,
cultural, or simply subjective. In these cases it really cannot be
extracted by reading the code and must be documented, or lost.
Were I to properly document the Linux "md" driver, for which I get
occasional requests, I would need to explain its relationship with
"dm" - for it isn't only internal edges of our wire-frame that are
interesting, but also external edges. The "why"s here are mostly
historical accident, though there would be value in observing that
"md" focuses on reliability through redundancy, while "dm" focuses
more on flexibility by hiding all the other restrictions imposed by
storage hardware. This, I think, gives the "why" for continuing to
have two separate frameworks, even if it isn't a strong technical
justification.
To continue with the analogy of the wire-frame model, if the concepts
and relationships provide the shape of the model, then the "Why"s
provide the fabric that they give shape to. They are the substance
that gives purpose and the force that gives direction. They may not
always be visible, especially once we put some skin on our model, but
understanding them is key to understanding the whole.
Examples, examples, examples.
One of the documents that I maintain is the set of manual pages for
mdadm.
I recall some years ago being challenged that there weren't enough
examples in that documentation. At the time I didn't really know what
to do with the challenge as, after all, there was an "Examples" section
at the end of the man page and there was plenty of explanatory
material from which you can deduce your own examples. Though I didn't
give it much attention then, this challenge has clearly stuck in my mind
to this day, and on reflection I now think quite differently to how I
thought then. Examples matter.
For those of us who enjoy binary taxonomies, there are two sorts of reasoning
processes: deductive and inductive. These are described in various ways in the
literature. One that is particularly succinct and helpful is from
Naked Science which describes the distinction as:
Deductive reasoning arrives at a specific conclusion based on
generalizations. Inductive reasoning takes events and makes
generalizations.
In the context of documentation, reasoning is the process of turning the
words in the document into a model in your mind. Different people
appear to vary in which style of reasoning they are most comfortable with,
so good documentation must attempt to play to both styles.
Documentation that plays to deductive reasoning will be filled with
generalizations. This doesn't mean that it avoids details (as
generalities would) but that it attempts to describe exactly - in
complete generality - what each interface does, or how each concept
applies, or what role each interaction plays. Such documentation can
be very useful, but it can also lead to a feeling that you are
drowning in detail. It can be a challenge to extract meaning and
importance from such details. A lot of technical documentation seems
to tend to this extreme.
Documentation that plays to inductive reasoning will be full of
examples of specific cases. It may explain each case very well but
the coverage of the cases can never be perfect and it will
inevitably leave out some information, typically the particular
information that the reader is looking for. "How-tos" are a good
example of this sort of document with maybe the extreme case being
recipe books for cooking - they are full of sample recipes with very
little space dedicated to explaining what makes a good recipe.
These are very good if they have chosen just the right
example, fairly good and quite accessible if they have chosen a good
variety of examples, but usually lacking when you want to get down to
the nitty-gritty.
Documentation that plays to both types of reasoning will mix examples in with the
generalizations, using them to embellish and explain those
generalizations and as an excuse to make diversions into tangentially
related topics. Examples are particularly good at highlighting
contrasts which are themselves an important part of describing key
concepts and clarifying why choices are made. The various
multiplicities noted for Linux power management can doubtlessly
provide lots of contrasts such as that between a "UART" serial driver
that must be ready to receive full-rate data whenever it is not off,
as opposed to a "USB" serial driver which only needs to be able to
respond to a wake-up signal and has plenty of time to prepare itself
for full data-rate messages. These would necessarily make different
decisions about allowable power states.
Returning to our wire-frame model which gives shape
to some substance, it hopefully is not too much of a stretch to see
examples as the skin on the model. These are the bits we can directly
see, they reveal the texture or taste of the whole, and only hint at the
bigger picture behind them. But they are an important part in closing
the gaps that are left out of the big-picture descriptions.
A worked example?
Having all these goals for introductory documentation may be nice, but
are they actually useful? Can they lead to truly "good" documentation?
Clearly they are not enough by themselves, but when combined with
enough knowledge and experience, with some story-telling ability and an
occasional touch of humor I believe that they can. To put this to
the test, I've used them as a guide to producing some introductory
documentation on Linux power management. The results will be presented next
week when you, dear reader, can be the judge of whether the
resulting documentation is actually "good".
By Jonathan Corbet
July 2, 2012
As most of the net is likely to have heard by now, Linux servers displayed
a notable tendency to misbehave during the leap second event at the end of
the day on June 30. The problem often presented itself as abrupt and
sustained load spikes on the affected machines. The bug that caused this
behavior has been tracked down (thanks to a determined effort by John
Stultz); a look at what happened shines an interesting light on the
trickiness of dealing with time in software systems.
The earth's rotation is slowing over time; contrary to some public claims,
this slowing is not caused by Republican administrations, government
spending, or proprietary software. In an attempt to keep the official
Coordinated Universal Time (UTC) in sync with the earth's behavior, the
powers that be occasionally insert an additional second (a "leap second")
into a day; 25 such seconds have been inserted since the practice began in
1972. This habit is not without its detractors, and there are constant
calls for its abolition, but, for now, leap seconds are a reality that the
world (and the kernel) must deal with. For the curious, the Wikipedia leap second
page has more detail than almost anybody could want.
The kernel's core time is kept in a timespec structure:
    struct timespec {
        __kernel_time_t tv_sec;     /* seconds */
        long            tv_nsec;    /* nanoseconds */
    };
It is, in essence, a count of seconds since the beginning of the epoch.
Unfortunately, that count is defined to not include leap seconds. So when
a leap second happens, the system time must be explicitly corrected; that
is done by setting the system clock back one second at the end of that leap
second. The code that handles this change is quite old and works pretty
much as advertised. It is the source of this message that most Linux
systems should have (in some form) in their logs:
Jun 30 19:59:59 dt kernel: Clock: inserting leap second 23:59:60 UTC
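To see why the clock must be set back, it helps to work through the arithmetic. The sketch below evaluates the expression that POSIX gives for "Seconds Since the Epoch", which treats every day as exactly 86,400 seconds long. The `utc_time` structure and the `posix_seconds()` name are made up for this example; the fields follow the conventions of `struct tm`.

```c
#include <stdint.h>

/* Broken-down UTC time, using struct tm conventions. */
struct utc_time {
    int year;   /* years since 1900 */
    int yday;   /* days since January 1 (0-based) */
    int hour;
    int min;
    int sec;    /* 0..60; 60 is only valid during a leap second */
};

/* POSIX "Seconds Since the Epoch" for years from 1970 onward. */
static int64_t posix_seconds(const struct utc_time *t)
{
    return t->sec + t->min * 60LL + t->hour * 3600LL
        + t->yday * 86400LL
        + (t->year - 70) * 31536000LL
        + ((t->year - 69) / 4) * 86400LL
        - ((t->year - 1) / 100) * 86400LL
        + ((t->year + 299) / 400) * 86400LL;
}
```

Fed the leap second itself (23:59:60 on June 30, 2012: `year` 112, `yday` 181) and the following midnight (00:00:00 on July 1: `yday` 182), the function returns 1341100800 for both. Two distinct instants share one count, so a clock that counts this way has to repeat a second.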
The kernel's high-resolution timer (hrtimer) code does not use this version
of the system time, though — at least, not directly. Instead, hrtimers
have a couple of internal time bases that are offset from the system time.
These time bases allow the implementation of different clocks; the
"realtime" clock should adjust with the time, while the "monotonic" clock
must always move forward, for example. Importantly, these timer bases are
CPU-specific, since realtime clocks can differ between one CPU and the next in
the same system. The hrtimer offsets allow the timer subsystem to quickly
turn a system time into a time value appropriate for a specific processor's
realtime clock.
If the system time changes, those offsets must be adjusted accordingly.
There is a function called clock_was_set() that handles this
task. As long as any system time change is followed by a call to
clock_was_set(), all will be well.
The problem, naturally, is that the kernel failed to call
clock_was_set() after the leap second adjustment, which certainly
qualifies as a system time change. So the hrtimer subsystem's idea of the
current time moved forward while the system time was held back for a
second; hrtimers were thereafter operating one second in the future. The
result of that offset is that timers started expiring one second sooner
than they should have; that is not quite what the timer developers had in
mind when they used the term "high resolution."
For many applications, having a timer go off one second early is not a big
problem. But there are plenty of situations where timers are set for less
than one second in the future; all such timers will naturally expire
immediately if the timer subsystem is operating one second ahead of the
system time.
Many of these timers are also recurring timers; they will be re-set
immediately after expiration, at which point they will immediately expire
again — and so on. The resulting loop is the source of the load spikes reported by
victims of this bug across the net.
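The failure mode can be reproduced in a few lines with a toy model. Everything here is a simplification made for illustration: `toy_clock` and the `toy_*` functions are hypothetical and bear no relation to the kernel's actual data structures; the model has a single CPU and only tracks a cached offset that is supposed to follow the system time.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct toy_clock {
    int64_t mono_ns;            /* monotonic time, only moves forward */
    int64_t wall_offset_ns;     /* system time = mono_ns + this */
    int64_t timer_offset_ns;    /* timer code's cached copy of the offset */
};

/* The system time, as user space would read it. */
static int64_t wall_now(const struct toy_clock *c)
{
    return c->mono_ns + c->wall_offset_ns;
}

/* The realtime clock as seen by the timer code. */
static int64_t timer_now(const struct toy_clock *c)
{
    return c->mono_ns + c->timer_offset_ns;
}

/* Stand-in for clock_was_set(): resynchronize the cached offset. */
static void toy_clock_was_set(struct toy_clock *c)
{
    c->timer_offset_ns = c->wall_offset_ns;
}

/* Step the system time back one second; optionally tell the timers. */
static void toy_insert_leap_second(struct toy_clock *c, bool notify)
{
    c->wall_offset_ns -= NSEC_PER_SEC;
    if (notify)
        toy_clock_was_set(c);
}

/*
 * A caller arms a timer for "system time now + delay_ns".  Does the
 * timer code consider that expiry time to have passed already?
 */
static bool timer_expires_immediately(const struct toy_clock *c,
                                      int64_t delay_ns)
{
    int64_t expiry = wall_now(c) + delay_ns;

    return timer_now(c) >= expiry;
}
```

After `toy_insert_leap_second(&c, false)`, `timer_now()` runs one second ahead of `wall_now()`, so any delay shorter than a second counts as already expired; a recurring timer that re-arms itself in that state never sleeps. Calling `toy_clock_was_set()` restores normal behavior.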
The fix is to call clock_was_set()
in the leap second code—a call that had been removed
in 2007. But it's not quite that simple. The work done by
clock_was_set() must happen on every CPU, since each CPU has its
own set of timer bases. That's not something that can be done in atomic
context. So John's patch detects a call in atomic context and defers the
work to a workqueue in that case. With this patch in place, the kernel's
leap second handling should work again.
How could such a bug come about? Time-related code is notoriously tricky
in general; bugs are common. But the situation is far worse when the code
in question is almost never executed. Prior to June 30, 2012, the
last leap second was at the end of 2008. That is 3½ years in which the
leap second code could have been broken without anybody noticing. If the
kernel had a regularly-run regression test that verified the correct
functioning of hrtimers in the presence of leap second adjustments, this
problem might just have been caught before it affected production systems,
but nobody has made a habit of running such tests thus far.
Perhaps that will change in the future; if nothing else, distributors with
support obligations are likely to run some tests ahead of the next
scheduled leap second adjustment. Hopefully, that will catch any problems
in this particular little piece of code, should they happen to slip in
again. Beyond that, one can always hope for an end to leap seconds. The
kernel could also contemplate a switch to international
atomic time (TAI), which does not have leap seconds, for its internal
representation. Using TAI internally has its own challenges, though,
including a need to avoid changing the time representation as seen by user
space—meaning that the kernel would still have to track leap seconds
internally. So it seems that, one way or another, leap seconds are
likely to continue to be a source of irritation and bugs in the future.
Page editor: Jonathan Corbet