Brief items
The current development kernel is 3.12-rc1,
released on September 16. Linus said:
"I personally particularly like the scalability improvements that got
merged this time around. The tty layer locking got cleaned up and in the
process a lot of locking became per-tty, which actually shows up on some
(admittedly odd) loads. And the dentry refcount scalability work means that
the filename caches now scale very well indeed, even for the case where you
look up the same directory or file (which could historically result in
contention on the per-dentry d_lock)."
Stable updates: 3.0.96, 3.4.62, 3.10.12, and 3.11.1 were all released on
September 14.
Yo Dawg, I heard you like kernel compiles, so I put a kernel
compile in your kernel compile so that you can compile the kernel
while you compile the kernel.
— Linus Torvalds (Thanks to Josh Triplett)
The Linux Foundation has
announced
the release of its roughly annual report on the kernel development
community; this report is written by Greg Kroah-Hartman, Amanda McPherson,
and LWN editor Jonathan Corbet. There won't be much new there for those
who follow the development statistics on LWN, but it does take a somewhat
longer-term perspective.
The OpenZFS project has
announced its
existence. "ZFS is the world's most advanced filesystem, in
active development for over a decade. Recent development has continued in
the open, and OpenZFS is the new formal name for this open community of
developers, users, and companies improving, using, and building on
ZFS. Founded by members of the Linux, FreeBSD, Mac OS X, and illumos
communities, including Matt Ahrens, one of the two original authors of ZFS,
the OpenZFS community brings together over a hundred software developers
from these platforms."
Kernel development news
By Jonathan Corbet
September 17, 2013
Despite
toying with the idea of closing the
merge window rather earlier than expected, Linus did, in the end, keep it
open until September 16. He repeated past grumbles about
maintainers who send their pull requests at the very end of the merge
window, though; increasingly, it seems that wise maintainers should behave
as if the merge window were a single week in length. Pull requests that
are sent too late run a high risk of being deferred until the next
development cycle.
In the end, 9,479 non-merge changesets were pulled into the mainline
repository for the 3.12 merge window; about 1,000 of those came in after
the writing of last week's summary.
Few of the changes merged in the final days of the merge window were hugely
exciting, but there have been a number of new features and improvements.
Some of the more significant, user-visible changes include:
- Unlike its predecessor, the 3.12 kernel will not be known as "Linux
for Workgroups." Instead, for reasons that are not entirely clear,
the new code name was "Suicidal Squirrel" for a few days; it was then
changed to "One giant leap for frogkind."
- It is now possible to provide block device partition tables on the
kernel command line; see Documentation/block/cmdline-partition.txt
for details.
- The memory management subsystem has gained the ability to migrate huge
pages between NUMA nodes.
- The Btrfs filesystem has the beginning of support for offline
deduplication of data blocks. A new ioctl() command
(BTRFS_IOC_FILE_EXTENT_SAME) can be used by a user-space
program to inform the kernel of extents in two different files that
contain the same data. The kernel will, after checking that the
data is indeed the same, cause the two files to share a single copy of
that data.
- The HFS+ filesystem now supports POSIX access control lists.
- The reliable out-of-memory killer
patches have been merged. This work should make OOM handling more
robust, but it could possibly confuse user-space applications by
returning "out of memory" errors in situations where such errors were
not seen before.
- The evdev input layer has gained a new EVIOCREVOKE
ioctl() command that revokes all access to a given file
descriptor. It can be used to ensure that no evil processes lurk on
an input device across sessions. See this
patch for an example of how this functionality can be used.
- New hardware support includes:
- Miscellaneous:
MOXA ART real-time clocks,
Freescale i.MX SoC temperature sensors,
Allwinner A10/A13 watchdog devices,
Freescale PAMU I/O memory management units,
TI LP8501 LED controllers,
Cavium OCTEON GPIO controllers, and
Mediatek/Ralink RT3883 PCI controllers,
- Networking:
Intel i40e Ethernet interfaces.
Changes visible to kernel developers include:
- The seqlock locking primitive has gained a new "locking reader" type.
Normally, seqlocks allow for data structures to be changed while being
accessed by readers; the readers are supposed to detect the change (by
checking the sequence number) and retry if need be. Some types of
readers cannot tolerate changes to the structure, though; in current
kernels, they take an expensive write lock instead. The "locking
reader" lock will block writers and other locking readers, but allow
normal readers through. In principle, locking readers could share access
with each other; that they currently do not is an implementation
limitation. The functions for working with this type of lock are:
void read_seqlock_excl(seqlock_t *sl);
void read_sequnlock_excl(seqlock_t *sl);
There are also the usual variants for blocking hardware and software
interrupts; the full set can be found in
<linux/seqlock.h>.
- The new shrinker API has been merged.
Most code using this API needed to be changed; the result should be
better performance and a better-defined, more robust API. The new
"LRU list" mechanism that was a part of that patch set has also been
merged.
- The per-CPU IDA ID allocator patch set
has been merged.
Now begins the stabilization phase for the 3.12 kernel. If the usual
pattern holds, the final release can be expected on or shortly after
Halloween; whether it turns out to be a "trick" or a "treat" depends on how
well the testing goes between now and then.
By Jonathan Corbet
September 18, 2013
The ongoing disclosures of governmental attempts to weaken communications
security have caused a great deal of concern. Thus far, the evidence would
seem to suggest that the core principles behind cryptography remain sound,
and that properly encrypted communications can be secure. But the
"properly encrypted" part is a place where many things can go wrong. One
of those things is the generation of random numbers; without true
randomness (or, at least, unpredictability), encryption algorithms can be far
easier to break than their users might believe. For this reason,
quite a bit of attention has been paid to the integrity of random number
generation mechanisms, including the random number generator (RNG) in the
kernel.
Random number generation in Linux seems to have been fairly well
thought out with no obvious mistakes. But that does not mean that all is
perfect, or that improvements are not possible. The kernel's random number
generator has been the subject of a few different conversations recently,
some of which will be summarized here.
Hardware random number generators
A program running on a computer is a deterministic state machine that
cannot, on its own, generate truly random numbers. In the absence of a
source of randomness from the outside world, the kernel is reduced to the
use of a pseudo-random number generator (PRNG) algorithm that, in theory,
will produce numbers that could be guessed by an attacker. In practice,
guessing the results of the kernel's PRNG will not be an easy task, but
those who concern themselves with these issues still believe that it
is better to incorporate outside entropy (randomness) whenever it is
possible.
One obvious source of such randomness would be a random number generator
built into the hardware. By sampling quantum noise, such hardware could
create truly random data. So it is not surprising that some processors
come with RNGs built in; the RDRAND instruction provided by some Intel
processors is one example. The problem with hardware RNGs is that they
are almost entirely impossible to audit; should some country's spy agency
manage to compromise a hardware RNG, this tampering would be nearly
impossible to detect. As a result, people who are concerned about
randomness tend to look at the output of hardware RNGs with a certain level
of distrust.
Some recently
posted research [PDF] can only reinforce that distrust. The
researchers (Georg T. Becker, Francesco Regazzoni, Christof Paar, and Wayne
P. Burleson) have documented a way to corrupt a hardware RNG by changing the
dopant polarity in just a few transistors on a chip. The resulting numbers
still pass tests of randomness and, more importantly, the hardware still
looks the same at almost every level, whether one examines the masks used
or inspects the chip directly with an electron microscope.
This type of hardware compromise is thus
nearly impossible to detect; it is also relatively easy to carry out. The
clear conclusion is that hostile hardware is a real possibility; the
corruption of a relatively simple and low-level component like an RNG is
especially so. Thus, distrust of hardware RNGs would appear to be a
healthy tendency.
The kernel's use of data from hardware RNGs has been somewhat controversial
from the beginning, with some developers wanting to avoid such sources of
entropy altogether. The kernel's approach, though, is that using all
available sources of entropy is a good thing, as long as it is properly
done. In the case of a hardware RNG, the random data is carefully mixed
into the buffer known as the "entropy pool" before being used to generate
kernel-level random numbers. In theory, even if the data from the hardware
RNG is entirely hostile, it cannot cause the state of the entropy pool to
become known and, thus, it cannot cause the kernel's random numbers to be
predictable.
Given the importance of this mixing algorithm, it was a little surprising
to see, earlier this month, a patch that
would allow the user to request that the hardware RNG be used exclusively
by the kernel. The argument for the patch was based on performance:
depending entirely on RDRAND is faster than running the kernel's full mixing
algorithm. But the RNG is rarely a performance bottleneck in the kernel,
and the perceived risk of relying entirely on the hardware RNG was seen as
being far too high. So the patch was not received warmly and had no real
chance of being merged; sometimes it is simply better not to tempt users to
compromise their security in the name of performance.
H. Peter Anvin raised a related question:
what about hardware RNGs found in other components, and, in particular, in
trusted platform module (TPM) chips? Some TPMs may have true RNGs in them;
others are known to use a PRNG and, thus, are fully deterministic. What
should the kernel's
policy be with regard to these devices, which, for the most part, are
ignored currently? The consensus seemed to be that no particular trust
should be put into TPM RNGs, but that using some data from the TPM to seed
the kernel's entropy pool at boot could be beneficial. Many systems have
almost no entropy to offer at boot time, so even suspect random data from
the TPM would be helpful early in the system's lifetime.
Overestimated entropy
As noted above, the kernel attempts to pick up entropy from the outside
world whenever possible. One source of entropy is the timing of device
interrupts; that random data is obtained by (among other things) reading
the time stamp counter (TSC) with a call to get_cycles() and using
the least significant bits. In this way, each interrupt adds a little
entropy to the pool. There is just one little problem, pointed out by Stephan Mueller: on a number of
architectures, the TSC does not exist and get_cycles() returns
zero. The amount of entropy found in a constant stream of zeroes is rather
less than one might wish for; the natural consequence is that the kernel's
entropy pool may contain less entropy than had been thought.
The most heavily used architectures do not suffer from this problem; on the
list of those that do, the most significant may be MIPS, which is used in a
wide range of home network routers and other embedded products. As it
turned out, Ted Ts'o had already been working
with the MIPS maintainers to find a solution to this problem. He
didn't like Stephan's proposed solution — reading a hardware clock if
get_cycles() is not available — due to the expense; hardware
clocks can take a surprisingly long time to read. Instead, he is hoping
that each
architecture can, somehow, provide some sort of rapidly increasing counter
that can be used to contribute entropy to the pool. In the case of MIPS,
there is a small counter that is incremented each clock cycle; it doesn't
hold enough bits to work as a TSC, but it's sufficient for entropy
generation.
In the end, a full solution to this issue will take a while, but, Ted said, that is not necessarily a problem:
If we believed that /dev/random was actually returning numbers
which are exploitable, because of this, I might agree with the "we
must do SOMETHING" attitude. But I don't believe this to be the
case. Also note that we're talking about embedded platforms, where
upgrade cycles are measured in years --- if you're lucky. There
are probably home routers still stuck on 2.6 [...]
So, he said, it is better to take some time and solve the problem properly.
Meanwhile, Peter came to another conclusion about the entropy pool: when
the kernel writes to that pool, it doesn't account for the fact that it
will be overwriting some of the entropy that already exists there. Thus,
he said, the kernel's estimate for the amount of entropy in the pool is
almost certainly too high. He put together a
patch set to deal with this problem, but got little response. Perhaps
that's because, as Ted noted in a different
conversation, estimating the amount of entropy in the pool is a hard
problem that cannot be solved without knowing a lot about where the
incoming entropy comes from.
The kernel tries to deal with this problem by being conservative in its
accounting for entropy. Quite a few sources of unpredictable data are
mixed into the pool with no entropy credit at all. So, with luck, the
kernel will have a vague handle on the amount of entropy in the pool, and
its mixing techniques and PRNG should help to make its random numbers
thoroughly
unpredictable. The end result should be that anybody wanting to attack the
communications security of Linux users will not see poor random numbers as
the easiest approach; in this world, one cannot do a whole lot better than
that.
By Jonathan Corbet
September 18, 2013
One of the most common things to do on a computer is to copy a file, but
operating systems have traditionally offered little in the way of
mechanisms to accelerate that task. The
cp program can replicate
a filesystem hierarchy using links — most useful for somebody wanting to
work with multiple kernel trees — but that trick speeds things up by not
actually making copies of the data; the linked files cannot be modified
independently of each other. When it is necessary to make an
independent copy of a file, there is little alternative to reading the
whole thing through the page cache and writing it back out. It often seems
like there should be a better way, and indeed, there might just be.
Contemporary systems often have storage mechanisms that could speed copy
operations. Consider a filesystem mounted over the network using a
protocol like NFS, for example; if a file is to be copied to another
location on the same server, doing the copy on the server would avoid a lot
of work on the client and a fair amount of network traffic as well.
Storage arrays often operate at the file
level and can offload copy operations in a similar way. Filesystems like
Btrfs can "copy" a file by sharing a single copy of the data between the
original and the copy; since that sharing is done in a copy-on-write mode,
there is no way for user space to know that the two files are not
completely independent. In each of these cases, all that is needed is a
way for the kernel to support this kind of accelerated copy operation.
Zach Brown has recently posted a patch
showing how such a mechanism could be added to the splice() system
call. This system call looks like:
    ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
                   size_t len, unsigned int flags);
Its job is to copy len bytes from the open file represented by
fd_in to
fd_out, starting at the given offsets for each. One of the key
restrictions, though, is that one of the two file descriptors must be a
pipe. Thus, splice() works for feeding data into a pipe or for
capturing piped data to a file, but it does not perform the simple task of
copying one file to another.
As it happens, the machinery that implements splice() does not
force that limitation; instead, the "one side must be a pipe" rule comes
from the history of how the splice() system call came about.
Indeed, it already does file-to-file copies when it is invoked behind the
scenes from the sendfile() system call. So there should be no
real reason why splice() would be unable to do accelerated
file-to-file copies. And that is exactly what Zach's patch causes it to
do.
That patch set comes in three parts. The first of those adds a new flag
(SPLICE_F_DIRECT) allowing users to request a direct file-to-file
copy. When this flag is present, it is legal to provide values for both
off_in and off_out (normally, the offset corresponding to
a pipe must be NULL); when an offset is provided, the file will be
positioned to that offset before the copying begins. After this patch, the
file copy will happen without the need to copy any data in memory and
without filling up the page cache, but it will not be optimized in any
other way.
The second patch adds a new entry to the ever-expanding
file_operations structure:
    ssize_t (*splice_direct)(struct file *in, loff_t off_in, struct file *out,
                             loff_t off_out, size_t len, unsigned int flags);
This optional method can be implemented by filesystems to provide an
optimized implementation of SPLICE_F_DIRECT. It is allowed to
fail, in which case the splice() code will fall back to copying
within the kernel in the usual manner.
Here, Zach worries a
bit in the comments about how the SPLICE_F_DIRECT flag works: it
is used to request both
direct file-to-file copying and filesystem-level optimization. He suggests
that the two requests should be separated, though it is hard to imagine a
situation where a developer who went to the effort to use splice()
for a file-copy operation would not want it to be optimized. A
better question, perhaps, is why SPLICE_F_DIRECT is required at
all; a call to splice() with two regular files as arguments would
already appear to be an unambiguous request for a file-to-file copy.
The last patch in the series adds support for optimized copying to the
Btrfs filesystem. In truth, that support already exists in the form of the
BTRFS_IOC_CLONE ioctl() command; Zach's patch simply
extends that support to splice(), allowing it to be used in a
filesystem-independent manner. No other filesystems are supported at this
point; that work can be done once the interfaces have been nailed down and
the core work accepted as the right way forward.
Relatively few comments on this work have been posted as of this writing;
whether that means that nobody objects or nobody cares about this
functionality is not entirely clear. But there is an ongoing level of
interest in the idea of optimized copy operations in general; see the lengthy discussion of the proposed
reflink() system call for an example from past years. So,
sooner or later, one of these mechanisms needs to make it into the
mainline. splice() seems like it could be a natural home for this
type of functionality.
Page editor: Jonathan Corbet