Brief items
The current 2.6 development kernel is 2.6.30-rc3,
released on April 21.
"
The diffstat really shows lots of small one-liners and two-liners,
although there are areas that are getting bigger patches (ignoring the
bulky but uninteresting arm defconfig updates): some x86 updates, some
block IO scheduling fixes, splice cleanups and fixes, and a number of
driver changes (sound, networking, staging, usb)." The short-form
changelog is in the announcement, or see the
full changelog for all the details.
The current stable 2.6 release remains 2.6.29.1; there have been no
stable 2.6 updates since April 2.
For the fans of extreme stability, though, 2.4.37.1 was released on
April 19. "Most of these fixes concern minor
security issues which have been backported from 2.6 (mostly local DoSes).
In my opinion, only people with local users might consider upgrading, if
those people still exist!"
Kernel development news
The number of contributors who can write meaningful changelogs or
who can be taught to write really good changelogs is very, very
low. I'd guesstimate somewhere around 5% of all Linux
contributors. (The guesstimation is probably on the more generous
side.)
--
Ingo Molnar
No subject should ever contain the word "trivial". If it's really
trivial, you can sum it up in the subject and we'll know it's
trivial. Plus the diffstat shows it. 'trivial' is propaganda to
sneak a patch into -rc7.
--
Rusty Russell
In the past 15 years of Linux we've invested a lot of time and
effort into working around and dealing with compiler crap. We
wasted a lot of opportunities waiting years for sane compiler
features to show up. We might as well have invested that effort
into building our own compiler and could stop bothering about
externalities.
--
Ingo Molnar
By Jonathan Corbet
April 22, 2009
When kernel developers engage in an extended discussion on the writing of
changelogs for patches, one might well conclude that they have run out of
useful things to do. But arguments over changelogs are not the same as
spelling or grammar flames. In an environment where 10,000 or so changes
are merged in every three-month development cycle, developers need all the
help they can get to understand what is going into the kernel.
Poorly-described patches are harder to understand, and harder to find when
searching the history for something specific. So getting changelogs right
helps the development process - and the kernel - as a whole.
It all started innocently enough; Linus was engaging in a routine patch flaming when he encountered
one of the "Impact:" tags that some developers (especially those working
with Ingo Molnar's trees) have adopted in recent months:
Impact: clarify and extend confusing API
Suffice it to say that he was not much impressed with it:
And what the hell is up with these bogus "Impact:" things? Who
started doing that, and why? If your single-line explanation at the
top is not good enough, and your multi-line explanation isn't clear
enough, then you should fix the OTHER parts, not add that _idiotic_
"Impact" statement.
From there, the extended conversation focused on two related topics: the
value of "impact" tags and how to write better changelogs in general. On
the former, the primary (but not only) proponent of these tags is Ingo
Molnar, who cites several benefits from
their use. Using these tags, he claims, forces developers to write smaller
patches which can be adequately described in a single line. They give
subsystem maintainers an easy way to assess the changes made by a set of
patches and their associated risk; they also make it easier to review a
patch against its declared "impact." These tags are also said to force a
certain clarity of thought, making developers think through the
consequences of a change.
Most of these arguments leave "Impact:" detractors unmoved, though. Rather
than add yet another tag to a patch, they would prefer to see developers
just write better changelogs from the outset. In a properly-documented
patch, the new tag is just irrelevant. Andrew
Morton said:
I'm getting quite a few Impact:s now and I must say that the
Impact: line is always duplicative of the Subject:. Except in a
few cases, and that's because the Subject: sucked.
Ingo disputed that claim at length,
needless to say. But he takes things further by stating that, while better
changelogs would certainly be desirable, they are not a practical goal.
According to Ingo, most
developers are simply not capable of writing good changelogs. Language
barriers and such often are part of this problem, but it goes deeper: most
developers simply lack the writing skills needed to write clear and concise
changelogs. This fact of life, as Ingo sees it, cannot really be changed,
but most developers can, at least, be trained to write a reasonable impact
tag.
It is probably fair to say that most developers do not see themselves as
being disabled in this way. That said, it is also fair to say that a lot
of patches go into the mainline with unhelpful changelogs. That can
probably be changed - to an extent at least - through pressure from
maintainers and a better understanding of what makes a good changelog.
In an attempt to help, your editor has proposed a brief addition to
Documentation/development-process:
Writing good changelogs is a crucial but often-neglected art; it's
worth spending another moment discussing this issue. When writing
a changelog, you should bear in mind that a number of different
people will be reading your words. These include subsystem
maintainers and reviewers who need to decide whether the patch
should be included, distributors and other maintainers trying to
decide whether a patch should be backported to other kernels, bug
hunters wondering whether the patch is responsible for a problem
they are chasing, users who want to know how the kernel has
changed, and more. A good changelog conveys the needed information
to all of these people in the most direct and concise way possible.
To that end, the summary line should describe the effects of and
motivation for the change as well as possible given the one-line
constraint. The detailed description can then amplify on those
topics and provide any needed additional information. If the patch
fixes a bug, cite the commit which introduced the bug if possible.
If a problem is associated with specific log or compiler output,
include that output to help others searching for a solution to the
same problem. If the change is meant to support other changes
coming in later patch, say so. If internal APIs are changed,
detail those changes and how other developers should respond. In
general, the more you can put yourself into the shoes of everybody
who will be reading your changelog, the better that changelog (and
the kernel as a whole) will be.
Other possible additions have been proposed by Ted Ts'o and Paul
Gortmaker. Of course, all of these patches are based on the optimistic
notion that developers will actually read the documentation.
One could argue that the kernel community is rather late in getting around
to this kind of discussion. That could be said to be par for the course;
in the pre-BitKeeper era (i.e. up to February, 2002), there was almost no
tracking of individual changes into the kernel at all. That the fine
points of changelogging are being discussed a mere seven years later
suggests things are going in the right direction. The level of
professionalism in the kernel community has been on the rise for a long
time; this process is likely to continue. Whether or not some variant on
the impact tag is used in the future, one can assume that the quality of
changelogs will, as a whole, be better.
By Jonathan Corbet
April 22, 2009
Many years ago, your editor heard Van Jacobson state that naming an
algorithm "slow start" was one of the biggest mistakes he had ever made.
The name refers to the technique of ramping up transmit rates slowly until
the carrying capacity of the connection is determined. But others just saw
"slow" and complained that they didn't want their connections to be slow.
The fact that "slow start" made the net faster was lost on them. One might
wonder if David Howells's "slow work" mechanism - merged for 2.6.30 - could
run into similar problems; no kernel developer wants things to run slowly.
But, as with slow start, running things slowly is not the point.
Slow work is a thread pool implementation - yet another thread pool, one
might say. The kernel already has workqueues and the asynchronous function call
infrastructure; the distributed storage (DST) module added to the -staging
tree for 2.6.30 also
has a thread pool hidden within it. Each of these pools is aimed at a
different set of uses. Workqueues provide per-CPU threads dedicated to
specific subsystems, while asynchronous function calls are optimized for
specific ordering of tasks. Slow work, instead, looks like a true "batch
job" facility which can be used by kernel subsystems to run tasks which are
expected to take a fair amount of time in their execution.
A kernel subsystem which wants to run slow work jobs must first declare its
intention to the slow work code:
#include <linux/slow-work.h>
int slow_work_register_user(void);
The call to slow_work_register_user() ensures that the thread pool is set
up and ready for work - no threads are created before the first user is
registered. The return value will be either zero (on success) or the usual
negative error code.
Actual slow work jobs require the creation of two structures:
struct slow_work;
struct slow_work_ops {
    int (*get_ref)(struct slow_work *work);
    void (*put_ref)(struct slow_work *work);
    void (*execute)(struct slow_work *work);
};
The slow_work structure is created by the caller, but is otherwise
opaque. The slow_work_ops structure, created separately, is where
the real work gets done. The execute() function will be called by
the slow work code to get the actual job done. But first,
get_ref() will be called to obtain a reference to the
slow_work structure. Once the work is done, put_ref()
will be called to return that reference. Slow work items can hang around
for some time after they have been submitted, so reference counting is
needed to ensure that they are freed at the right time. Implementing the
get_ref() and put_ref() functions is not optional.
In practice, kernel code using slow work will create its own structure
which contains the slow_work structure and some sort of
reference-counting primitive. The slow_work structure must be
initialized with one of:
void slow_work_init(struct slow_work *work, const struct slow_work_ops *ops);
void vslow_work_init(struct slow_work *work, const struct slow_work_ops *ops);
The difference between the two is that vslow_work_init()
identifies the job as "very slow work" which can be expected to run (or
sleep) for a significant period of time. The documentation suggests that
writing to a file might be "slow work," while "very slow work" might be a
sequence of file lookup, creation, and mkdir() operations. The
slow work code actually prioritizes "very slow work" items over the merely
slow ones, but only up to the point where they use 50% (by default) of the
available threads. Once the maximum number of very slow jobs is running,
only "slow work" tasks will be executed.
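The embedding arrangement described above can be sketched in user space. This is only a mock-up: the slow_work and slow_work_ops definitions here are stand-ins for the real ones in <linux/slow-work.h> (the one-member layout of struct slow_work is an assumption), and my_object, to_my_object() and run_once() are hypothetical names. In the kernel, the reference is taken at enqueue time and execute() runs later in a pool thread; run_once() collapses those steps just to show the order of the callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins; the kernel provides these in <linux/slow-work.h>. */
struct slow_work;

struct slow_work_ops {
    int  (*get_ref)(struct slow_work *work);
    void (*put_ref)(struct slow_work *work);
    void (*execute)(struct slow_work *work);
};

struct slow_work {
    const struct slow_work_ops *ops;    /* assumed layout, for the mock only */
};

/* Hypothetical caller-side object: the work item plus a reference count. */
struct my_object {
    struct slow_work work;
    int refcount;       /* real code would use an atomic counter */
    int runs;           /* number of times execute() has been called */
};

/* Recover the containing object, as container_of() would in the kernel. */
static struct my_object *to_my_object(struct slow_work *work)
{
    return (struct my_object *)((char *)work -
                                offsetof(struct my_object, work));
}

static int my_get_ref(struct slow_work *work)
{
    to_my_object(work)->refcount++;
    return 0;                           /* zero: the reference was taken */
}

static void my_put_ref(struct slow_work *work)
{
    struct my_object *obj = to_my_object(work);

    assert(obj->refcount > 0);
    obj->refcount--;                    /* real code would free obj at zero */
}

static void my_execute(struct slow_work *work)
{
    to_my_object(work)->runs++;         /* the long-running job goes here */
}

static const struct slow_work_ops my_ops = {
    .get_ref = my_get_ref,
    .put_ref = my_put_ref,
    .execute = my_execute,
};

/* Walk one item through get_ref(), execute() and put_ref() in order. */
static int run_once(struct my_object *obj)
{
    int ret;

    obj->work.ops = &my_ops;            /* stands in for slow_work_init() */
    ret = obj->work.ops->get_ref(&obj->work);
    if (ret < 0)
        return ret;
    obj->work.ops->execute(&obj->work);
    obj->work.ops->put_ref(&obj->work);
    return 0;
}
```

An object that starts with one reference held by its creator ends run_once() with that same single reference, having been executed once.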
Actually getting a slow work task running is done with:
int slow_work_enqueue(struct slow_work *work);
This function queues the task for running. It will succeed unless the
associated get_ref() function fails, in which case
-EAGAIN will be returned.
Slow work tasks can be enqueued multiple times, but no count is kept, so a
task enqueued several times before it begins to execute will only run
once. A task which is enqueued while it is running is indeed put
back on the queue for a second execution later on.
The same task is guaranteed to not run on multiple CPUs
simultaneously.
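Those queueing rules can be modeled with three flags. What follows is a toy state machine written for this article, not the kernel's implementation; the structure and function names are all hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

struct work_state {
    bool pending;       /* on the queue, waiting for a pool thread */
    bool executing;     /* execute() is currently running */
    bool requeue;       /* enqueued again while executing */
    int  runs;          /* completed executions */
};

static void model_enqueue(struct work_state *w)
{
    if (w->pending)
        return;                 /* already queued: no count is kept */
    if (w->executing)
        w->requeue = true;      /* run again after the current run ends */
    else
        w->pending = true;
}

/* A pool thread tries to pick the item up; false means nothing to do. */
static bool model_start(struct work_state *w)
{
    if (!w->pending || w->executing)
        return false;           /* never two runs at the same time */
    w->pending = false;
    w->executing = true;
    return true;
}

static void model_finish(struct work_state *w)
{
    assert(w->executing);
    w->executing = false;
    w->runs++;
    if (w->requeue) {
        w->requeue = false;
        w->pending = true;      /* back on the queue for a second run */
    }
}
```

In this model, three calls to model_enqueue() before the item starts yield one execution, while a single call made during execution yields one more.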
There is no way to remove tasks which have been queued for execution, and
there is no way (built into the slow work mechanism) to wait for those
tasks to complete. A "wait for completion" functionality can certainly be
created by the caller if need be. The general assumption, though, seems to
be that slow work items can be outstanding for an indefinite period of
time. As long as tasks with a non-zero reference count exist, any
resources they depend on need to remain available.
There are three parameters for controlling slow work which appear under
/proc/sys/kernel/slow-work: min-threads (the minimum size
of the thread pool), max-threads (the maximum size), and
vslow-percentage (the maximum percentage of the available threads
which can be used for "very slow" tasks). The defaults allow for between
two and four threads, 50% of which can run "very slow" tasks.
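As a rough illustration of how those numbers interact, the following sketch computes a limit on "very slow" threads and picks the next queue to serve. Both helpers are hypothetical, and the rounding is an assumption; the kernel's own calculation may round or clamp differently.

```c
#include <assert.h>

/* Hypothetical helper: how many pool threads may run "very slow" items. */
static unsigned int vslow_limit(unsigned int threads,
                                unsigned int vslow_percentage)
{
    assert(vslow_percentage <= 100);
    return threads * vslow_percentage / 100;    /* rounds down */
}

/*
 * Hypothetical helper: choose the queue a free thread should serve.
 * Returns 1 for the very slow queue, 0 for the slow queue, -1 for neither.
 */
static int pick_queue(unsigned int running_vslow, unsigned int limit,
                      int vslow_waiting, int slow_waiting)
{
    if (vslow_waiting && running_vslow < limit)
        return 1;       /* very slow work goes first, up to the limit */
    if (slow_waiting)
        return 0;
    return -1;
}
```

With the default 50%, a pool of two threads would allow one very slow item at a time under this arithmetic, and a pool of four would allow two.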
The only user of slow work in the 2.6.30 kernel is the FS-Cache file
caching subsystem. There is a clear need for thread pool functionality,
though, so it would not be surprising to see other users show up in future
releases. What might be more surprising (though desirable) would be a
consolidation of thread pool implementations in a future development cycle.
April 22, 2009
This article was contributed by Goldwyn Rodrigues
The three R's of high availability are Redundancy, Redundancy and
Redundancy. However, on a typical setup built with commodity hardware,
there is a limit beyond which adding redundancy no longer adds
9's to your uptime percentage (e.g. 99.999%).
Consider a simple example: an iSCSI server with the cluster nodes using
a distributed filesystem such as GFS2 or OCFS2. Even with
redundant power supplies and data channels on the iSCSI storage
server, there still exists a single point of failure: the storage.
The Distributed Replicated Block Device (DRBD) patch, developed by Linbit,
introduces duplicated block storage over the network with synchronous data
replication. If one of the storage nodes in the replicated
environment fails, the system has another block device to rely on, and
can safely fail over. In short, it can be considered an implementation of
RAID1 mirroring using a combination of a local disk and one on a remote node,
but with better integration with cluster software
such as heartbeat and efficient resynchronization with the ability to
exchange dirty bitmaps and data generation identifiers. DRBD currently
works only on 2-node clusters, though you could use a hybrid version to
expand this limit. When both nodes of the cluster are up, writes are
replicated and sent to both the local disk and the other node. For efficiency
reasons, reads are fetched from the local disk.
The level of data coupling used depends on the protocol chosen:
- Protocol A: Writes are considered to complete as soon as the
local disk writes have completed, and the data packet has been placed
in the send queue for the peers. In case of a node failure, data loss
may occur because the data to be written to remote node disk may still
be in the send queue. However, the data on the failover node is
consistent, but not up-to-date. This is usually used for geographically
separated nodes.
- Protocol B: Writes on the primary node are considered to be
complete as soon as the local disk write has completed and the
replication packet has reached the peer node. Data loss may occur in
case of simultaneous failure of both participating nodes, because the
in-flight data may not have been committed to disk.
- Protocol C: Writes are considered complete only after both the
local and the remote node's disks have confirmed the writes are
complete. There is no data loss, so this is a popular scheme for clustered
nodes, but the I/O throughput is dependent on the network bandwidth.
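The three completion rules can be summarized as a single predicate. This is a sketch of the rules as stated in the list above, not DRBD code; the enum, structure and field names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum drbd_protocol { PROTO_A, PROTO_B, PROTO_C };

/* Hypothetical record of how far one write has progressed. */
struct write_progress {
    bool local_disk_done;   /* local disk reported completion */
    bool queued_for_peer;   /* packet placed in the send queue */
    bool peer_received;     /* packet has reached the peer node */
    bool peer_disk_done;    /* peer confirmed its disk write */
};

/* May the write be reported as complete to the upper layers? */
static bool write_complete(enum drbd_protocol proto,
                           const struct write_progress *p)
{
    /* A packet cannot reach the peer without having been queued. */
    assert(!p->peer_received || p->queued_for_peer);

    if (!p->local_disk_done)
        return false;       /* every protocol waits for the local disk */

    switch (proto) {
    case PROTO_A:
        return p->queued_for_peer;
    case PROTO_B:
        return p->peer_received;
    case PROTO_C:
        return p->peer_disk_done;
    }
    return false;
}
```

The gap between "complete" under protocols A or B and "on the peer's disk" is exactly the window in which the data loss described above can occur.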
DRBD classifies the cluster nodes as either "primary" or "secondary."
Primary nodes can initiate modifications or writes whereas secondary
nodes cannot. This means that a secondary DRBD node does not
provide any access and cannot be mounted. Even read-only access is
disallowed for cache coherency reasons. The secondary node is present
mainly to act as the failover device in case of an error. The secondary
node may become primary depending on the network configuration.
Role assignment is performed by the cluster
management software.
There are different ways in which a node may be
designated as primary:
- Single Primary: The primary designation is given to one cluster
member. Since only one cluster member manipulates the data, this mode is
useful with conventional filesystems such as ext3 or XFS.
- Dual Primary: Both cluster nodes can be primary and are
allowed to modify the data. This is typically used with cluster-aware
filesystems such as OCFS2. The current DRBD release supports a
maximum of two primary nodes in a basic cluster.
Worker Threads
A part of the communication between nodes is handled by threads to avoid deadlocks
and complex design issues. The threads used for communication are:
- drbd_receiver: Handles incoming packets. On
the secondary node, it allocates buffers, receives data blocks and
issues write requests to the local disk. If it receives a write
barrier, it sleeps until all pending write requests have been
finished.
- drbd_sender: Sender thread for data blocks in response to a read
request. This is done in a thread other than drbd_receiver,
to avoid distributed deadlocks. If a resynchronization
process is running, its packets are generated by this thread.
- drbd_asender: Acknowledgment sender. Hard drive drivers are informed
of request completions through interrupts. However, sending data over
the network in an interrupt callback routine may block the handler.
So, the interrupt handler places the packet in a queue which is picked up by
this thread and sent over the network.
Failures
DRBD requires a small reserve area for metadata, to handle post-failure
operations (such as synchronization) efficiently.
This area can be configured either on a separate device
(external metadata), or within the DRBD block device (internal
metadata). It holds the metadata with respect to the disk including
the activity log and the dirty bitmap (described below).
Node Failures
If a secondary node dies, it does not affect the system as a whole because writes
are not initiated by the secondary node. If the failed node is primary,
data which has not yet been written to disk, and for which completions have
not been received, may be lost. To avoid this, DRBD maintains an "activity log,"
a reserved area on the local disk which contains
information about write operations which have not completed. The data is stored
in extents and is maintained in a least recently used (LRU) list.
Each change of the activity log causes a metadata update (single
sector write). The size of the activity log is configured by the user;
it is a tradeoff between minimizing updates to the metadata and the
resynchronization time after the crash of a primary node.
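The tradeoff is easier to see with a small model. The sketch below keeps a tiny LRU list of "hot" extents and counts a metadata update whenever the set of extents in the log changes. It is a simplification written for this article; AL_SLOTS, the structure and al_touch() are hypothetical and do not reflect DRBD's on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define AL_SLOTS 4      /* hypothetical, tiny log size for illustration */

struct activity_log {
    unsigned long extent[AL_SLOTS];  /* slot 0 is the most recently used */
    int used;                        /* number of occupied slots */
    unsigned long meta_updates;      /* simulated metadata writes so far */
};

/* Record a write to 'extent'; returns true if the log contents changed. */
static bool al_touch(struct activity_log *al, unsigned long extent)
{
    bool changed = false;
    int i, pos = -1;

    assert(al->used >= 0 && al->used <= AL_SLOTS);

    for (i = 0; i < al->used; i++) {
        if (al->extent[i] == extent) {
            pos = i;
            break;
        }
    }

    if (pos < 0) {
        if (al->used < AL_SLOTS)
            al->used++;             /* take a free slot */
        /* otherwise the least recently used slot is overwritten */
        pos = al->used - 1;
        al->meta_updates++;         /* the change must reach the disk */
        changed = true;
    }

    /* Shift the more recent entries down and put this extent in front. */
    memmove(&al->extent[1], &al->extent[0],
            (size_t)pos * sizeof(al->extent[0]));
    al->extent[0] = extent;
    return changed;
}
```

A larger AL_SLOTS means fewer evictions and thus fewer metadata writes, but more extents that must be treated as possibly stale after a primary crash.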
DRBD maintains a "dirty bitmap" in case it has to run without a peer node or
without a local disk. It describes the pages which have been dirtied by the
local node. Writes to the on-disk dirty bitmap are minimized by the
activity log. Each time an extent is evicted from the activity log, the part of
the bitmap associated with it which is no longer covered by the activity log
is written to disk. The dirty bitmaps are sent over the network to
communicate which pages are dirty should a resynchronization become
necessary. Bitmaps are
compressed (using run-length encoding) before sending on the network to reduce network
overhead. Since most of the bitmaps are sparse, this proves to be
quite effective.
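To see why run-length encoding suits a sparse bitmap, consider this generic encoder. It is not DRBD's wire format, which the text above does not describe; the representation (alternating run lengths, starting with a run of clear bits) and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Return bit 'n' of a bitmap stored least-significant-bit first. */
static int test_bit(const unsigned char *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/*
 * Encode 'nbits' bits as alternating run lengths, starting with a run of
 * clear bits (which may be zero long).  Returns the number of runs written,
 * or 0 if 'max_runs' is too small to hold the result.
 */
static size_t rle_encode(const unsigned char *bitmap, size_t nbits,
                         size_t *runs, size_t max_runs)
{
    size_t n = 0, count = 0, i;
    int current = 0;                /* value of the run being counted */

    assert(bitmap != NULL && runs != NULL);

    for (i = 0; i < nbits; i++) {
        if (test_bit(bitmap, i) != current) {
            if (n == max_runs)
                return 0;
            runs[n++] = count;      /* close the run that just ended */
            current = !current;
            count = 0;
        }
        count++;
    }
    if (n == max_runs)
        return 0;
    runs[n++] = count;              /* final run */
    return n;
}
```

A 64-bit bitmap with only bits 3 through 5 set encodes as the three runs 3, 3, 58; the longer the clean stretches, the greater the savings.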
DRBD synchronizes data once the crashed node comes back up, or in response
to data inconsistencies caused by an interruption in the link.
Synchronization is performed in a linear order, by disk offset, in
the same disk layout as the consistent node. The rate of
synchronization can be configured by the rate parameter in the
DRBD configuration file.
Disk Failures
In case of local disk errors, the system may choose to deal with them
in one of the following ways, depending on the configuration:
- detach: Detach the node from the backing device and continue in
diskless mode. In this situation, the device on the peer node becomes
the main disk. This is the recommended configuration for high availability.
- pass_on: Pass the error to the upper layers on a primary
node. The disk error is ignored, but logged, when the node
is secondary.
- call-local-io-error: Invokes a script. This mode
can be used to perform a failover to a "healthy" node, and
automatically shift the primary designation to another node.
Data Inconsistency Issues
In the dual-primary case, both nodes may write to the same disk sector,
making the data inconsistent. For writes at different offsets, there is
no synchronization required. To avoid inconsistency issues, data
packets over the network are numbered sequentially to identify the
order of writes. However, there are still some corner-case
inconsistency problems the system can suffer from:
- Simultaneous writes by both nodes: in such a situation, one node's
write is discarded. One of
the primary nodes is marked with the "discard-concurrent-writes" flag, which
causes it to discard write requests from the other node when it detects
simultaneous writes. The node with the discard-concurrent-writes flag set
sends a "discard ACK" to the other node, informing it that the write has been
discarded. The other node, on detecting the discard ACK, writes the
data from the first node to keep the drives consistent.
- Local request while remote request in flight: this can happen when the
disk latency exceeds the network latency.
The local node writes to a given block, sending the write operation to the
other node. The remote node then acknowledges the completion of the
request and sends a new write of its own to the same block - all before the
local write has completed. In this case, the local node
keeps the new data write request on hold until the local writes are
complete.
- Remote request while local request is still pending: this situation
comes about if the network reorders packets, causing a remote write to a
given block to arrive before the acknowledgment of a previous,
locally-generated write. Once again, the receiving node will simply hold
the new data until the ACK is received.
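The first of these cases hinges on deciding whether two writes touch the same sectors and, if so, which node gives way. A minimal sketch of that decision, assuming writes are described by a starting sector and a length, might look like the following; all names are hypothetical and the real DRBD logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a write request. */
struct write_req {
    unsigned long long sector;      /* first sector written */
    unsigned int nr_sectors;        /* length of the write in sectors */
};

/* Two writes conflict only if their sector ranges intersect. */
static bool writes_overlap(const struct write_req *a,
                           const struct write_req *b)
{
    assert(a->nr_sectors > 0 && b->nr_sectors > 0);
    return a->sector < b->sector + b->nr_sectors &&
           b->sector < a->sector + a->nr_sectors;
}

enum verdict { APPLY_REMOTE, DISCARD_REMOTE };

/*
 * Decide what to do with a remote write that arrives while a local write
 * is in flight.  'have_discard_flag' is true on the node marked with the
 * discard-concurrent-writes flag.
 */
static enum verdict resolve_concurrent(const struct write_req *local,
                                       const struct write_req *remote,
                                       bool have_discard_flag)
{
    if (!writes_overlap(local, remote))
        return APPLY_REMOTE;        /* different offsets: nothing to do */
    return have_discard_flag ? DISCARD_REMOTE : APPLY_REMOTE;
}
```

Only the flagged node ever discards, so after a conflict both disks end up holding that node's data.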
Conclusion
DRBD is not the only distributed storage implementation under development.
The implementation of Distributed Storage (DST) contributed by Evgeniy Polyakov
and accepted into the staging tree takes a different approach.
DRBD is limited to 2-node active clusters, while DST can have
larger numbers of nodes. DST works on a client-server model, where
the storage is at the server end, whereas DRBD is peer-to-peer based
and designed for high availability rather than for distributing
storage. DST, on the other hand, is designed for accumulative storage,
with storage nodes which can be added as needed. DST has a pluggable
module which accepts different algorithms for mapping the storage
nodes into a cumulative storage. The algorithm chosen can be mirroring,
which would provide the same basic capability of replicated storage as
DRBD.
DRBD code is maintained in the git repository at
git://git.drbd.org/linux-2.6-drbd.git, under the "drbd" branch. It
incorporates the minor review comments posted on LKML
after Philipp Reisner released the patch set.
For further information, see the several PDF documents mentioned in the DRBD patch posting.
Page editor: Jonathan Corbet