Brief items
The current development kernel is 3.4-rc7,
released on May 12. Linus says:
"
This is almost certainly the last -rc in this series - things really
have calmed down, and I even considered just cutting 3.4 this weekend, but
felt that another week wouldn't hurt." Expect a 3.4 final release in
the near future.
Stable updates: 3.3.6 was released
on May 12, as was 3.2.17. The 2.6.34.12 update is in the review process as
of this writing; it can be expected on or after May 17.
Also, when Van [Jacobson] says something, you can be fairly sure
its right, and if it's not, then you didn't understand what Van
said.
—
Eric Dumazet (thanks to Dave Täht)
The thrust of the argument seems to be that by establishing good
habits from the very beginning you can avoid the need for change.
That may well be true, but it isn't particularly "user friendly".
We should make things simple and safe so that people don't *need*
to carefully form good habits.
—
Neil Brown
By Jonathan Corbet
May 16, 2012
Paul Gortmaker was recently doing some cleanup work when he found the token
ring networking code getting in the way. Which led him to wonder: was
anybody still using that code? He concluded that the answer was "no":
A search on the internet for users tends to show that even the die
hard enthusiasts who cared to poke at MCA/TR just for hobby sake
have pretty much all given up somewhere in the 2003-2005 "pre-git"
timeframe, and never really moved off their 2.4.x kernels.
In response, he put together a patch to
remove the token ring subsystem altogether. The patch was presented as a
demonstration, without a lot of hope that it would be applied in the near
future. Paul's real goal was to get comments and see if he could build a
consensus for the removal of the code at some more distant time.
Thus far, there has been one objection. That notwithstanding, David Miller
has accepted the patch and fast-tracked it
directly into the net-next repository. Barring some sort of reversion
prior to the merge window, it looks like the 3.5 kernel will be missing
support for token ring networking.
Kernel development news
By Jake Edge
May 16, 2012
When transporting files between systems using USB sticks or other removable
media, one can run into an annoying problem: the UIDs or GIDs of the files
on the media don't match those on the system. In most situations, those
kinds of devices have a VFAT filesystem that avoids the problem entirely by not
storing UID/GID information. But if a user wants to use a "real"
filesystem on the device, one of the ext* family for example, it might be
useful to specify the local owner of the files. Ludwig Nussel's patch set would do just that for ext2, ext3,
and ext4
filesystems.
The patch comes from some work Nussel did "years ago", he
said, when re-introducing it. It simply
adds two new mount options for ext filesystems. Following in the footsteps
of the VFAT filesystem, the patch would add uid= and gid=
options that would treat all files in the filesystem as being owned by that
UID/GID combination.
When a filesystem is mounted using these options, files retain their
ownership on disk, but they
appear to be owned by the specified user and group. Existing files cannot
have their ownership changed, but new files will be created with the user
and group given at mount time. If a different UID/GID combination is
desired for new files—to match the UID/GID on the device for
example—they can be added to the mount option:
uid=m:n
gid=x:y
which would make the files appear to be owned by m.x and would create new
files as n.y.
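The option semantics described above can be modeled in a few lines. What follows is a minimal user-space sketch in Python, not the code from Nussel's patch; the names IdOverride, parse_mount_options(), visible_id(), and new_file_id() are hypothetical, and the behavior without an override (IDs pass through unchanged) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdOverride:
    shown: int    # ID reported for every file on the filesystem
    on_disk: int  # ID stored on disk for newly created files


def parse_override(value: str) -> IdOverride:
    """Parse "m" or "m:n"; with no ":n" part, new files also use m."""
    shown, sep, new = value.partition(":")
    return IdOverride(int(shown), int(new) if sep else int(shown))


def parse_mount_options(options: str) -> dict:
    """Pick the uid= and gid= overrides out of a comma-separated option string."""
    result = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        if sep and key in ("uid", "gid"):
            result[key] = parse_override(value)
    return result


def visible_id(disk_id: int, override: Optional[IdOverride]) -> int:
    """ID that user space sees; the value on disk is left untouched."""
    return disk_id if override is None else override.shown


def new_file_id(caller_id: int, override: Optional[IdOverride]) -> int:
    """ID written to disk when a file is created (assumed pass-through if unset)."""
    return caller_id if override is None else override.on_disk
```

With the options "uid=1000:500,gid=100", every file would appear to be owned by UID 1000 and GID 100, while new files would be stored on the device with UID 500 and GID 100.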
One of the first questions to greet Nussel's patch was about putting the
code into specific filesystems, rather than the VFS layer. While the VFS
seems like the right place, Ted Ts'o points
out that there is no easy way to do it all there:
The problem is that there will need to be at least some
support in the individual file system, since there isn't a good place
for the VFS to intercept the internal file system iget() function to
patch in the override uid/gid values.
So the question at this point is whether it's cleaner to have the
functionality split between the VFS and the file system layers (i.e.,
with the options parsing and storing the override uid/gid values in
the super_block structure) or keeping it all in the file system layer,
and accepting the duplication of code across multiple file systems.
Ts'o leaned toward the first approach in that message, but later reluctantly accepted the code duplication.
From what he could see, there wasn't enough of a win to put it into the VFS.
There was a little more discussion when Nussel resent the patch on May 10. First off, Jan
Kara and Ts'o both wanted to see the patch split into three parts (one for
each of ext2, ext3, and ext4); Nussel did so and posted the result the next day. But
Roland Eggner and Boaz Harrosh were both concerned about the underlying
idea of the patch. Circumventing the access restrictions on the files via
a mount option is not a sensible way to address what is, really, an
administrative problem, they said.
Eggner described how he "solves" the problem
for systems he administers by essentially creating and using a static list
of UIDs and GIDs. His position is: "If UIDs differ on machines
FORESEEN for file exchange, this is an
administrator error, not a kernel deficit." Furthermore,
exchanging files with unexpected systems requires root privileges, he said,
so there
is no need for the mount option override.
Like Eggner, Harrosh is concerned about
security issues with the proposed change. He also doesn't see anything
particularly special about the ext filesystems in terms of removable media,
noting that VFAT is the dominant choice. Beyond that, he questions the
definition of "removable media", and notes that the problem is common in
the NFS world: "we constantly encounter multiple domain
uid/gid views, and it does not mean we blow a hole in POSIX security
rules."
But Neil Brown sees things a little
differently. He notes that VFAT suffers from limitations including a 4GB
file size limit and an inability to handle some special characters in file
names.
That aside, any device that someone has physical access to is essentially
"removable" in some sense, and that person may want easy access to the
data:
[...] if I "own" a filesystem - whether because I hold the
physical non-encrypted devices or because I know the encryption key - then I
want to be able to leverage that "ownership" to full access rights to
the contents of the filesystem. By typing in a key or plugging in a device I
want to get full "root" access to the filesystem on the device. Not giving
that to me is just getting in my way.
When users insert a VFAT-formatted USB stick or disk, suitably configured
systems will give full access to the user by using the VFAT uid/gid
options. Nussel's patches essentially
just give that same power for ext-formatted devices. While it could
certainly lead to problems, those problems are already latent, as Brown
pointed out:
You cannot prevent data
destruction on such devices if you lose physical control, and the only
workable data privacy option is encryption. Trying to pretend that file
permission bits mean anything is extremely naive.
While Harrosh is concerned that automounters will start using the options,
Brown believes that makes sense for removable devices. In the patch,
Nussel mentions
that it could be done statically in /etc/fstab or be handled
dynamically through udev rules. The alternative
suggested by Harrosh is that root can mount the device and then
chmod (or chown, presumably) the files appropriately.
That seems like a pointless exercise that will just have to be repeated,
potentially every time the device is plugged into a new system. Eggner's
method is certainly workable, at some level, but makes things more difficult
and less "user friendly", Brown said.
In the end, it is a convenience feature. Anyone with physical access to an
unencrypted removable device already has the tools available to read the
data on it or
to put malware onto it. It's a little hard to see how making it easier for
legitimate owners of removable USB storage to access their data somehow
opens the floodgates for attackers of various sorts. Those of a malicious
bent can find any number of ways (live CD, their own Linux system, ...) to
access the device as root if they wish.
It is unclear how prevalent ext-formatted removable devices are, so there
may be an argument against adding the feature on those grounds. On the
other hand, making the ext family work better may encourage people to use those
filesystems more often for removable media. The patches do
duplicate code in the three separate filesystems, but the total number of
lines changed is only around 100 for each. Moving some of that
into the VFS (like parsing the mount options and storing the flags in the
superblock) might reduce that a bit, but
it's not much code overall. Administrators who are worried about the
feature will be able to avoid it entirely, though they may need to keep an
eye on their distribution's udev rules. Given that it brings the same
convenience as VFAT to ext-formatted devices, it seems like a feature worth
having.
By Jonathan Corbet
May 16, 2012
For the most part, the
logging reliability
patches covered here in April have been quietly stabilizing and appear
to be set for merging for 3.5. But
printk() is a heavily-used
function, so there are a lot of people with strong opinions on how it
should work. Thus the discussion on how
printk() can be improved
has stretched out for some time. The result, so far, is a better
understanding of how continuation lines should be handled and, possibly, a
new format for timestamps.
Messages are sent to the system log with printk(), but that
function has an interesting bit of historical behavior: like
printf() in user space, printk() can be used to send
partial lines to the log. Multiple printk() calls can be used to
produce a single line in the log stream, piece by piece.
The patches for 3.5 make printk() much
more record-oriented internally, but the API does not change. So there is
a bit of an impedance mismatch between a record-oriented logging system and
its stream-oriented API. That mismatch has been there since the beginning,
but it has become more clear over time.
The mixed nature of kernel logging leads to a bit of an ambiguity, because
any message can be either of two things: (1) a new message to be logged
or (2) a continuation of a previous log message.
The kernel decides which of the two situations holds by remembering whether
the previous log message ended with a newline or not. If there was no
trailing newline, a new message will be appended to the previous line.
This approach works much of the time, but it is not without its hazards.
In particular, there is nothing that guarantees that two successive
printk() calls will be executed one right after the other. Even on a
uniprocessor system, interrupt handlers can emit messages between two
printk() calls that are supposed to produce a single line of
output. Adding more processors to the system clearly makes the situation
worse; there is only one log buffer containing messages from all
processors, so it is easy for one processor to jump into the middle of a
sequence of printk() calls being executed on another. What
happens then is not especially pretty: messages get mashed together and
corrupted. The result is a log that is harder for humans to read, and
which can totally confuse automated log-processing tools.
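The hazard is easy to reproduce with a toy model. The sketch below is Python rather than kernel code, and the NaiveLog class is hypothetical; it implements only the rule described above, where a message is appended to the previous line unless that line ended with a newline.

```python
class NaiveLog:
    """Toy log buffer: merge into the previous line unless it ended with a newline."""

    def __init__(self):
        self.lines = []
        self.open_line = False  # True if the last message lacked a trailing newline

    def printk(self, text: str) -> None:
        if self.open_line:
            self.lines[-1] += text.rstrip("\n")   # treated as a continuation
        else:
            self.lines.append(text.rstrip("\n"))  # treated as a new message
        self.open_line = not text.endswith("\n")


log = NaiveLog()
log.printk("eth0: link ")          # first context starts a line
log.printk("disk error on sda\n")  # an interrupt or another CPU cuts in
log.printk("up\n")                 # first context finishes its line
```

After these three calls, log.lines holds "eth0: link disk error on sda" followed by "up": two unrelated messages have been glued together and the end of the first one has been orphaned.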
This patch set was supposed to be about increasing logging reliability, so
that sort of message corruption is not welcome. The original plan devised
by developer Kay Sievers was to require an explicit KERN_CONT "log
level" marker for continuations. In this scheme, every printk()
call will generate a new log line unless merging has been explicitly
requested with the KERN_CONT "log level." There is a little
problem in that most
continuation lines are not so marked in current kernels, leading to lines
being split up; Kay's plan was to audit the kernel and fix all of those
calls to work properly in the new scheme.
Linus didn't like that idea, saying that
things work well as they are now; to him, adding all those
KERN_CONT markers just represented unnecessary noise. After some
back-and-forth, Kay came around to Linus's point of view, but he still
wanted to avoid the corruption of messages whenever possible. The result
was a new patch that tries to explicitly
remember partial printk() calls and associate them with a specific
process. Lines passed to printk() will be merged only if they
both come from the same process and only if the second line is clearly not
the start of a new log message. The end result is not perfect: if two
processors try
to output partial lines at the same time, at least one of them will be
split. But there will be no more joining of unrelated messages, and that
seems like a good thing.
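The merging rule can be sketched as follows. This is a simplified Python model, not Kay's patch: the ContLog class is hypothetical, a single pending buffer is assumed, and the new_message flag stands in for whatever test the kernel uses to decide that a line clearly starts a new message.

```python
class ContLog:
    """Toy log that merges partial lines only within one task."""

    def __init__(self):
        self.lines = []
        self.pending = ""   # partial line currently being assembled
        self.owner = None   # task that wrote the partial line, if any

    def _flush(self):
        if self.owner is not None:
            self.lines.append(self.pending)
            self.pending, self.owner = "", None

    def printk(self, task, text, new_message=False):
        # A different task, or an explicit new message, ends the pending line.
        if self.owner is not None and (task != self.owner or new_message):
            self._flush()
        self.pending += text.rstrip("\n")
        self.owner = task
        if text.endswith("\n"):
            self._flush()


log = ContLog()
log.printk("A", "eth0: link ")
log.printk("B", "disk error on sda\n", new_message=True)
log.printk("A", "up\n")
```

Here task A's line is split into "eth0: link " and "up", which is the imperfection mentioned above, but task B's message lands on a line of its own instead of being appended to A's.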
A different branch of the same discussion got into the formatting of
timestamps, which will always be present in the new scheme. In current
kernels, that timestamp comes in the form of seconds and microseconds since
the system booted. But what developers often really want to see is some
combination of the absolute time of an event and the relative time from
previous events. After some discussion with Sasha Levin, Linus requested a format that looks like this:
[May12 11:27] foo
[May12 11:28] bar
[ +5.077527] zoot
[ +10.235225] foo
[ +0.002971] bar
[May12 11:29] zoot
[ +0.003081] foo
In other words, events that are relatively far apart in time would be
marked with the absolute time with one-minute precision. When things
happen more closely in time, the elapsed time between successive events
would be printed instead. For any driver developer trying to figure out
the relative timing of device-related events, this kind of output format
would help to save a lot of mental arithmetic.
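One way to produce output of this shape is sketched below in Python. The rule for switching between the two forms is an assumption: the discussion as reported here does not say how far apart events must be, so the gap parameter (30 seconds by default) and the format_log() name are purely illustrative.

```python
from datetime import datetime, timezone

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_log(events, gap=30.0):
    """Format (unix_seconds, message) pairs, given in time order.

    The first event, and any event more than `gap` seconds after its
    predecessor, gets an absolute stamp with one-minute precision (UTC);
    every other event gets the time elapsed since the previous event.
    """
    out = []
    prev = None
    for ts, msg in events:
        if prev is None or ts - prev > gap:
            t = datetime.fromtimestamp(ts, tz=timezone.utc)
            stamp = f"{MONTHS[t.month - 1]}{t.day:02d} {t.hour:02d}:{t.minute:02d}"
        else:
            stamp = f"{ts - prev:+.6f}"
        out.append(f"[{stamp:>11}] {msg}")  # both forms padded to the same width
        prev = ts
    return out
```

Both kinds of stamp are padded to the same width so that the message text stays aligned in a column.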
The patches to produce this format have not yet been posted, so it is
looking likely that we will not see it in the 3.5 kernel. The rest of the
logging work should be there for 3.5, though, taking Linux one small step
closer to the sort of structured and reliable logging that many users and
developers would like to see.
By Jonathan Corbet
May 14, 2012
Flash-based solid-state storage devices (SSDs) have a lot to recommend
them; in particular, they can be quite fast even when faced with highly
non-sequential I/O patterns. But SSDs are also relatively small and
expensive; for that reason, for all their virtues, they will not be fully
replacing rotating storage devices for a long time. It would be nice to
have a storage device that provided the best features of both SSDs and
rotating devices—the speed of flash combined with the cheap storage
capacity of traditional drives. Such a device could simultaneously reduce
the performance pain that comes with rotating storage and the financial
pain associated with solid-state storage.
The classic computer science response to such a problem is to add another
level of indirection in the form of another layer of caching. In
this case, a large array of drives could be hidden behind a much smaller
SSD-based cache that provides quick access to frequently-accessed data and
turns random access patterns into something closer to sequential access.
Hybrid drives and high-end storage arrays have provided this kind of
feature for some time, but Linux does not currently have the ability to
construct such two-level drives from independent components. That
situation could change, though, if the bcache patch set finds its way into the
mainline.
LWN last looked at bcache almost two years
ago. Since then, the project has been relatively quiet, but development
has continued. With the current v13 patch set, bcache creator Kent Overstreet
says:
Bcache is solid, production ready code. There are still bugs being
found that affect specific configurations, but there haven't been
any major issues found in awhile - it's well past time I started
working on getting it into mainline.
The idea behind bcache is relatively straightforward: given an SSD and one
or more storage devices, bcache will interpose the SSD between the kernel
and those devices, using the SSD to speed I/O operations to and from the
underlying "backing store" devices. If a read request can be satisfied
from the SSD, the backing store need not be involved at all. Depending on
its configuration, bcache can also buffer write operations; in this mode,
it serves as a sort of extended I/O scheduler, reordering operations so
that they can be sent to the backing device in a more seek-friendly manner.
Once one gets into the details, though, the problem starts to become more
complex than one might imagine.
Consider the buffering and reordering of write operations, for example.
Some users may be uncomfortable with anything that delays the arrival of
data on the backing device; for such situations, bcache can be run in a
write-through caching mode. When write-through behavior is selected, no
write operation is considered to be complete until it has made it to the
backing device. Clearly, in this case, the SSD cache is not going to
improve write performance at all, though it may still improve performance
overall if that data is read while it remains in the cache.
If, instead, writeback caching is enabled, bcache will mark the completion of
writes once they make it to the SSD. It can then flush those dirty blocks
out to the backing device at its leisure. Writeback caching can allow the
system to coalesce multiple writes to the same blocks and to achieve better
on-disk locality when the writes are eventually flushed out; both of those
should improve performance. Obviously, writeback caching also carries the
risk of losing data if the system is struck by a meteorite before the
writeback operation is complete. Bcache includes a fair amount of code
meant to address this concern; the SSD contains an index as well as the
cached data, so dirty blocks can be located and written back after the
system comes back up. Providing meteorite-proof drives is beyond the scope
of the bcache patch set, though.
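The difference between the two modes can be illustrated with a small model. This Python sketch is not based on the bcache code; the CachedDevice class and its fields are hypothetical, dictionaries stand in for the SSD and the backing device, and the on-SSD index and crash recovery are left out entirely.

```python
class CachedDevice:
    """Toy model of an SSD cache in front of a slow backing device."""

    def __init__(self, writeback: bool):
        self.writeback = writeback
        self.ssd = {}            # block number -> data held in the cache
        self.dirty = set()       # cached blocks not yet on the backing device
        self.backing = {}        # block number -> data on the backing device
        self.backing_writes = 0  # number of slow writes actually issued

    def _write_backing(self, block, data):
        self.backing[block] = data
        self.backing_writes += 1

    def write(self, block, data):
        self.ssd[block] = data
        if self.writeback:
            self.dirty.add(block)             # complete once the data is on the SSD
        else:
            self._write_backing(block, data)  # write-through: wait for the backing device

    def read(self, block):
        if block not in self.ssd:             # miss: fetch from the backing device
            self.ssd[block] = self.backing[block]
        return self.ssd[block]

    def flush(self):
        # Write dirty blocks in ascending order for better on-disk locality.
        for block in sorted(self.dirty):
            self._write_backing(block, self.ssd[block])
        self.dirty.clear()
```

Writing the same block twice in write-through mode costs two backing-device writes; in writeback mode the two writes are coalesced into one when flush() runs, at the price of a window in which the newest data exists only on the SSD.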
Of course, maintaining this index on the SSD has some performance costs of
its own, especially since bcache takes pains to only write full erase
blocks at a time. One write operation from the kernel can turn into
several operations at the SSD level to ensure that the on-SSD data
structures are consistent at all times. To mitigate this cost, bcache
provides an optional journaling layer that can speed up operations at the
SSD level.
Another interesting problem that comes with writeback caching is the
implementation of barrier operations. Filesystems use barriers
(implemented as synchronous "force to media" operations in contemporary
kernels) to ensure that the on-disk filesystem structure is consistent at
all times. If bcache does not recognize and implement those barriers, it
runs the risk of wrecking the filesystem's careful ordering of operations
and corrupting things on the backing device. Unfortunately,
bcache does indeed lack such support at the moment, leading to a strong
recommendation to mount filesystems with barriers disabled for now.
Multi-layer solutions like bcache must face another hazard: what happens if
somebody accesses the underlying backing device directly, routing around
bcache? Such access could result in filesystem corruption. Bcache handles
this possibility by requiring exclusive access to the backing device. That
device is formatted with a special marker, and its leading blocks are
hidden when accessing the device by way of bcache. Thus, the beginning of
the device under bcache is not the same as the beginning when the device is
accessed directly. That means that a filesystem created through bcache
will not be recognized by the filesystem code if an attempt is made to
mount the backing device directly. Simple attempts to shoot one's own feet
should be defeated by this mechanism; as always, there is little point in
doing more to protect those who are really determined to injure themselves.
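The effect of hiding the leading blocks can be shown with a toy model. In the Python sketch below, the marker string, the size of the hidden region, and the CachedView class are all hypothetical; they do not reflect bcache's actual on-disk format.

```python
MAGIC = b"CACHE-BACKING"  # hypothetical marker, not bcache's real superblock
HIDDEN = 16               # hypothetical size in bytes of the hidden leading region


def format_backing(size: int) -> bytearray:
    """Create a zeroed 'device' and stamp the marker at its start."""
    dev = bytearray(size)
    dev[:len(MAGIC)] = MAGIC
    return dev


class CachedView:
    """View of a backing device with the leading region hidden."""

    def __init__(self, dev: bytearray):
        if bytes(dev[:len(MAGIC)]) != MAGIC:
            raise ValueError("device was not formatted as a backing device")
        self.dev = dev

    def write(self, offset: int, data: bytes) -> None:
        start = HIDDEN + offset  # offset 0 of the view lies past the marker
        self.dev[start:start + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        start = HIDDEN + offset
        return bytes(self.dev[start:start + length])
```

A filesystem signature written at offset 0 through the view ends up at offset 16 of the raw device, so code that looks for it at the start of the raw device will not find it.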
There seems to be a reasonable level of consensus that bcache would be a
useful feature to add to the kernel. There are some obstacles to
overcome before this code can be merged, though. One of those is that
bcache adds its own management interface involving a set of dedicated tools
and a complex sysfs structure. There is resistance to adding another API
for block device management, so Kent has been encouraged to integrate
bcache into the device mapper code. Nobody seems to be working on that
project at the moment, but Dan Williams has posted a set of patches integrating bcache into the
MD RAID layer. With these patches, a simple mdadm command is
sufficient to set up an array with SSD caching added on top. Once that
code gets into shape, presumably the user-space interface concerns will be
somewhat lessened.
A harder problem to get around may be the simple fact that the bcache patch
set is large, adding over 15,000 lines of code to the kernel. Included
therein is a fair amount of tricky data structure work such as a complex
btree implementation and "closures," described as "asynchronous refcounty
things based on workqueues." The complexity of the code will make
it hard to review, but, given the potential for trouble when adding a new
stage to the block I/O path, developers will want this code to be well
reviewed indeed. Getting enough eyeballs directed toward this code could
be a challenge, but the benefit, in the form of faster storage devices,
could well be worth the trouble.
Page editor: Jonathan Corbet