Brief items
The current development kernel is 2.6.31-rc6,
released on August 14.
"
There's nothing hugely interesting there, although I'm personally
gratified by the fact that I have a few more commits than usual. Even if
they are all really small, it makes me feel useful." See the
announcement for the short-form changelog and some comments on the latest
NULL pointer vulnerability fix.
The current stable kernel is 2.6.30.5, released (along with 2.6.27.30) on August 16.
Both contain a long list of fixes for a number of significant problems.
Note that 2.6.27.30 had a compilation problem; 2.6.27.31 was released shortly
thereafter with a fix for that one issue.
The current ancient kernel is 2.4.37.5, released on August 13 with
a fix for the latest null pointer vulnerability (which affects 2.4 kernels
as well as 2.6).
Comments (none posted)
Kernel development news
And I'm tired of being stupid. So this patch makes me smart and
forward-thinking instead.
--
Linus Torvalds
There is also a feeling from some people in this community (they've
told me themselves in private) that certain mainline folk have been
actively trying to undermine the ARM community - given the recent
"discussions" (more like flames) over things like the device tree,
the HTC stuff, etc.
--
Russell King
So System.map is part of the kernel API too? Sigh.
--
Andrew Morton
Comments (none posted)
By Jonathan Corbet
August 19, 2009
Low-memory mapping. The recent set of null-pointer vulnerabilities
has not been helped by the confusion around how the
mmap_min_addr
knob and security modules interact. The 2.6.31-rc7 kernel will see a
couple of changes intended to clarify and rationalize this interaction.
With these patches, SELinux will no longer bypass the
mmap_min_addr check; any process wanting to map memory below that
address will require the
CAP_SYS_RAWIO capability. However,
SELinux will also implement its own low-memory limit, controlled by a
separate knob in the SELinux policy. As a result, it will be possible to
turn off the
mmap_min_addr protection but still only allow
specific programs to do low-memory mapping.
Module loading security. Another SELinux-related change - not yet
merged into the mainline - adds a new hook to request_module().
The idea here is to try to limit the ability of user-space programs to load
arbitrary modules into a running kernel. In future versions of the SELinux
policy, the ability to trigger module loads is likely to be reduced to a
much smaller set of roles.
HugeTLB mappings. The "hugetlb" feature allows processes to create
pseudo-anonymous memory mappings backed by pages which are larger (perhaps much
larger) than the normal system page size. For certain kinds of
applications, these mappings can improve performance by reducing pressure
on the CPU's translation lookaside buffer (TLB). The kernel code resides
within such a mapping for the same reason. Using hugetlb pages in user
space is a bit awkward, though; it requires mounting the special hugetlbfs
filesystem and mapping files from there.
Eric Munson has put together a patch
implementing an easier way. With this patch, the mmap() system
call understands a new MAP_HUGETLB flag; when that flag is
present, the kernel will attempt to create a mapping backed by huge pages.
Underneath it all, the mapping is still implemented as a hugetlbfs file,
but user space need no longer be aware of that fact.
spin_is_locked(). Kernel code can test the current state of a
spinlock with spin_is_locked(). But what should this function
return on a single-processor system, where spinlocks do not exist at all?
Kumar Gala ran into trouble because one
uniprocessor spin_is_locked() implementation returned zero. The
problem was code in this form:
/* Ensure we have the requisite lock */
BUG_ON(!spin_is_locked(&some_lock));
So Kumar thinks that the return value should always be true. But there are
other situations where that is just the wrong thing to do; Linus gave an example where code is waiting for a lock to
become free.
The real problem is that a predicate like spin_is_locked() simply
lacks a well-defined meaning when the spinlock does not exist. So there is
no way to always give the "right" answer in such situations. What may
happen instead is that, in a future kernel, spin_is_locked() will
be deprecated. Instead, there will be new expect_spin_locked()
and expect_spin_unlocked() primitives for testing the state of a
spinlock. When the code is this explicit about what it is looking for, the
default answer can make sense; both would return true on uniprocessor
systems.
localmodconfig. Many kernel testers want to build a kernel which
looks like the kernel shipped with their distribution. But distributor
kernels come with a configuration which builds almost everything. So our
poor tester ends up waiting for a very long time as the system builds a
bunch of modules which will never be used. One could avoid this problem by
creating a new configuration from scratch, but that process can be a little
daunting as well. There are a lot of configuration options in a
contemporary kernel.
Steven Rostedt has posted a build
system change intended to help with this problem. A user who types:
make localmodconfig
will get a configuration which builds all modules currently loaded into the
running system, but no others. That should be a configuration which
supports the system's hardware, but which lacks hundreds of useless
modules. There is also a localyesconfig option which builds the
required drivers directly into the kernel.
Comments (21 posted)
By Jonathan Corbet
August 19, 2009
While a great deal of power management work has been done on Linux systems
in recent years, much of that work has been directed toward the creation of
working suspend and hibernation capabilities. But there is more to power
management than that; there is real value in being able to minimize the
power consumption of a running system. That is as true for large data
center machines as it is for laptops; reduced power usage and lower air
conditioning requirements have both economic and environmental benefits.
Now that the suspend problem is mostly solved, increasing amounts of
attention are being paid to the other aspects of power management; some
recent patches show how the infrastructure for runtime power management is
coming together.
The core of the future power management structure appears certain to be
Rafael Wysocki's runtime power management
patch. This patch set adds structure to the power management code to
facilitate the suspending and resuming of individual system components at
run time. The dev_pm_ops structure is augmented with three new
functions:
int (*runtime_suspend)(struct device *dev);
int (*runtime_resume)(struct device *dev);
int (*runtime_idle)(struct device *dev);
These functions are to be implemented by the core device code for each bus
type; they may then be turned into bus-specific driver callbacks. The
power management code will call runtime_suspend() to prepare a
specific device for a lower-power state. This call does not imply that the
device itself must suspend, but the device does need to prepare for a condition
where it is no longer able to communicate with the CPU or memory. In other
words, even if the device does not suspend, hardware between that device
and the rest of the system might suspend. A return value of -EBUSY or
-EAGAIN will abort the suspend operation.
A call to runtime_resume() should prepare the device to operate
again with the rest of the system; the driver should power up the device,
restore registers, and do anything else needed to get the device
functioning again. The runtime_idle() callback, instead, is
called when the core thinks that the device is idle and might be a good
candidate for suspending. The callback should decide whether the device
can really be suspended (this could include checking the state of any other
devices connected to it) and, if all seems well, initiate a suspend
operation.
Along with these callbacks is, of course, a set of core functions designed
to manage suspend and resume activities, deal with mid-course
cancellations, allow outside code to make power management changes, and so
on. See the associated document file for
more information on how this subsystem works.
The code described above has been through several review iterations and
would appear to be on track for merging in 2.6.32. Rafael's asynchronous suspend and resume
patch, instead, is rather newer and may take a little longer. The idea
behind this patch is to extend the runtime power management code to allow
suspend and resume callbacks to be invoked asynchronously; that, in turn,
would allow them to be run in parallel. As long as there are no
dependencies between a pair of devices, suspending or resuming them in
parallel should make full-system transitions faster.
The problem is in the dependencies; running a bunch of power management
operations in parallel increases the risk of getting the order wrong. To
avoid this outcome, the patch adds a new completion object to each device;
when a device is to be suspended, the completions will be used to ensure
that any dependent devices are suspended first. At resume time the
completions are used in the reverse direction: devices wait for their
parent devices to be resumed before resuming themselves.
As long as the dependency
information is correct, this mechanism should ensure that a set of power
management threads can run in parallel without breaking the system.
Ensuring that the dependencies are correct was one of the reasons for the
creation of the Linux device model years ago. With a properly-constructed
tree, the system can know, for example, that it cannot suspend a USB
controller until all USB devices plugged into it have been suspended. In
turn, the PCI controller to which the USB controller is attached must
remain functional until the USB controller is suspended, and so on. The
problem is that system dependencies are not always that simple. A PCI
device may also require the services of an I2C controller, for example, or
devices can be combined on multi-function chips in surprising ways. So the
device tree has proved unable to represent all of the
power management dependencies in the system.
Rafael has addressed this problem in a later version of the patch, which adds
a new framework for representing power management dependencies. At the
core of it is this structure:
struct pm_link {
struct device *master;
struct list_head master_hook;
struct device *slave;
struct list_head slave_hook;
};
One of these structures exists for each dependency known to the system. It
indicates that the "master" device must always be functional whenever the
"slave" device is; the master must be suspended after the slave and resumed
before it. Many of these links can be created by the power management core
itself; others will have to be generated by the relevant drivers. There
have been some concerns raised about the
memory use of this structure, but a better solution has not yet been
proposed.
Meanwhile, Matthew Garrett has taken the core power management code one
step further with a set of
runtime power management patches of his own. He has pushed the new
power management calls down into the PCI and USB bus layers and used them
to suspend devices opportunistically as the system runs; he reports a power
savings of around 0.2 watts as a result. Review comments resulted in these
patches being withdrawn for now for repairs, but they show the direction
things are heading. With sufficient software support and cooperative
hardware, Linux should be able to further reduce the operating power needed
for whole classes of systems. That cannot fail to be a good thing.
Comments (4 posted)
By Jake Edge
August 19, 2009
The kernel FIFO implementation, kfifo, is not that
widely used and Stefani Seibold would like to see that change. To that
end, she has proposed a new
kfifo API that embodies many of the design patterns for data
structures that were described by Neil Brown. The new interface is
simpler in many ways, so it should be easier to use, which, in turn, could
lead to more use throughout the kernel.
The basic problems with the current kfifo interface stem from the
constraints it places on its users. A spinlock is required, though it is
not needed by the majority of current kfifo users. Also, the
kfifo structure can only be allocated dynamically, so users cannot
put FIFOs inside of other structures—only pointers to FIFOs.
Moreover, according to Seibold, the current API is too simple and doesn't
provide a number of important features.
The new API has 23 separate functions, while the current has 9, but there
are quite a few variants that increase the total. Those variants include
copying from or to user space, supporting fixed-length records, as well as
being able to use spinlocks to synchronize adding or removing data from the
FIFO. It
supports many different use cases in the style of Brown's "Broad
Interfaces" pattern.
A kfifo is declared using the DECLARE_KFIFO() macro which
can be used inside of a
struct or union declaration. FIFOs declared with with
DECLARE_KFIFO() must be initialized using INIT_KFIFO().
Alternatively, DEFINE_KFIFO()
handles both the declaration and initialization for FIFOs outside of
struct/union declarations. The macros take name (to name the
variable or struct/union element) and size (in bytes) parameters:
DECLARE_KFIFO(name, size)
INIT_KFIFO(name)
DEFINE_KFIFO(name, size)
There are two functions to support dynamic buffer allocation:
int kfifo_alloc(struct kfifo *fifo, unsigned int size, gfp_t gfp_mask)
which allocates a buffer of
size bytes using the
gfp_mask
as flags to pass to
kmalloc(). Or, a pre-allocated buffer can be
attached to a
kfifo using:
void kfifo_init(struct kfifo *fifo, unsigned char *buffer, unsigned int size)
Any buffer allocated by
kfifo_alloc() should later be freed by
calling
kfifo_free().
To put data into the FIFO, kfifo_in() is used:
unsigned int kfifo_in(struct kfifo *fifo,
unsigned char *from, unsigned int n)
which returns the number of bytes stored. As mentioned above there are
variants for getting the data from user space, as well as for locking:
unsigned int kfifo_from_user(struct kfifo *fifo,
const void __user *from, unsigned int n)
unsigned int kfifo_in_locked(struct kfifo *fifo,
const unsigned char *from, unsigned int n, spinlock_t *lock)
As one might guess, the calls to remove data from the FIFO look
similar—if reversed:
unsigned int kfifo_out(struct kfifo *fifo,
unsigned char *to, unsigned int n)
unsigned int kfifo_to_user(struct kfifo *fifo,
void __user *to, unsigned int n)
unsigned int kfifo_out_locked(struct kfifo *fifo,
unsigned char *to, unsigned int n, spinlock_t *lock)
In the common case, with only one reader and one writer, extra locks are not
required to add or remove data from the FIFO, but for more complicated
scenarios, the
*_locked() variants allow the caller to use the
appropriate lock.
The expected tests for FIFO full and empty (kfifo_is_full() and
kfifo_is_empty()) are present, as are ways to get FIFO size and
available space (kfifo_size(), kfifo_len(), and
kfifo_avail()). One can also consume some FIFO bytes
without returning them using kfifo_skip() or clear the entire FIFO with
kfifo_reset().
There is also support for fixed-length records, with either 1- or 2-byte
lengths stored with each entry. Each call must pass a recsize
parameter that specifies the size of the length field (i.e. 1 or 2, though
0 is supported for no length) stored with each
entry:
unsigned int kfifo_in_rec(struct kfifo *fifo,
void *from, unsigned int n, unsigned int recsize)
unsigned int kfifo_from_user_rec(struct kfifo *fifo,
const void __user *from, unsigned int n, unsigned int recsize)
unsigned int kfifo_out_rec(struct kfifo *fifo,
void *to, unsigned int n, unsigned int recsize,
unsigned int *total)
unsigned int kfifo_to_user_rec(struct kfifo *fifo,
void __user *to, unsigned int n, unsigned int recsize,
unsigned int *total)
These functions return the number of bytes
not copied, which is a
bit strange. For the functions that remove data from the FIFO,
*total stores the number of bytes actually copied. This part of
the interface seems a little dubious, and may not survive in its present
form, though no comments along those lines have been seen.
Overall, the interface has been fairly well-received. There were a few
comments on an earlier
version of the API, which Seibold has mostly addressed. The only
comments on the most recent version (v0.4) were a disagreement between Alan
Cox and Andrew Morton over naming conventions.
Cox would rather not see the current kfifo_put() and
kfifo_get() calls get removed quite yet, noting
"We are about to set fifo loose through all the USB and some other
char/serial drivers all of which will use the spinlock facility."
The current calls use the spinlock, so the kind of change Seibold is
proposing would need to be reflected in the USB and char/serial driver code. But
Morton thinks that this is the right time
to make those changes, because
"kfifo has no business assuming that the caller wants to use
spin_lock() locking".
The basic problem is that Morton would like to see a convention followed
that things like kfifo_put() (or, kfifo_in() as Seibold's
API defines it) do not assume locking. He agrees with Seibold's decision
to add a separate kfifo_*_locked() variants. Cox points out that the convention is very
inconsistently followed, but Morton is adamant:
Plus, as I've said enty en times and keep getting ignored: the current
naming is wrong. The generic kfifo_get() should not be assuming that
the caller wants spinlock-based locking.
After initially NAK-ing part of the patch, Cox seems to have relented,
so that particular barrier is gone. It would seem far too late in the
2.6.31 process for this kind of change to go in, but unless some other
major issues crop up, it is quite possible that a new kfifo API
could show up in 2.6.32.
Comments (2 posted)
By Jonathan Corbet
August 18, 2009
Traditionally, storage devices have managed the blocks of data given to them
without being concerned about how the system used those blocks.
Increasingly, though, there is value in making more information available
to storage devices; in particular, there can be advantages to telling the
device when specific blocks no longer contain data of interest to the host
system. The "discard" concept was
added to the kernel one year ago
to communicate this information to storage devices. One year later, it seems that the
original discard idea will not survive contact with real hardware -
especially solid-state storage devices.
There are a number of use cases for the discard functionality. Large,
"enterprise-class" storage arrays can implement virtual devices with a much
larger storage capacity than is actually installed in the cabinet; these
arrays can use information about unneeded blocks to reuse the physical
storage for other data. The compcache compressed in-memory
swapping mechanism needs to know when
specific swap slots are no longer needed to be able to free the memory used
for those slots. Arguably, the strongest pressure driving the discard concept
comes from solid-state storage devices (SSDs). These devices must move data
around on the underlying flash storage to implement their wear-leveling
algorithms. In the absence of discard-like functionality, an SSD will end
up shuffling around data that the host system has long since stopped caring
about; telling the device about unneeded blocks should result in better
performance.
The sad truth of the matter, though, is that this improved performance does
not actually happen on SSDs. There are two reasons for this:
- At the ATA protocol level, a discard request is implemented by a
"TRIM" command sent to the device. For reasons unknown to your
editor, the protocol committee designed TRIM as a non-queued command.
That means that, before sending a TRIM command to the device, the
block layer must first wait for all outstanding I/O operations on that
device to complete; no further operations can be started until the
TRIM command completes. So every TRIM operation stalls the request
queue. Even if TRIM were completely free, its non-queued nature would
impose a significant I/O performance cost. (It's worth noting that
the SCSI equivalent to TRIM is a tagged command which doesn't suffer
from this problem).
- With current SSDs, TRIM appears to be anything but free. Mark Lord
has measured regular delays of
hundreds of milliseconds. Delays on that scale would be most
unwelcome on a rotating storage device. On an SSD,
hundred-millisecond latencies are simply intolerable.
One would assume that the second problem will eventually go away as the
firmware running in SSDs gets smarter. But the first problem can only be
fixed by changing the protocol specification, so any possible fix would be
years in the future. It's a fact of life that we will simply have to live
with.
There are a few proposals out there for how we might live with the
performance problems associated with discard operations. Matthew Wilcox
has a plan to
reimplement the whole discard concept using a cache in the block layer.
Rather than sending discard operations directly to the device, the block
layer will remember them in its own cache.
Any new write operations will then be compared against the discard cache;
whenever an operation overwrites a sector marked for discard, the block
layer will know that the discard operation is no longer necessary and can,
itself, be discarded. That, by itself, would reduce the number of TRIM
operations which must be sent to the device. But if the kernel can work to
increase locality on block devices, performance should improve even more.
One relatively easy-to-implement example would be actively reusing
recently-emptied swap slots instead of scattering swapped pages across the
swap device. As Matthew
puts it: "there's a better way for the drive to find out that the
contents of a block no longer matter -- write some new data to it."
In Matthew's scheme, the block layer would occasionally flush the discard
cache, sending the actual operations to the device. The caching should
allow the coalescing of many operations, further improving performance.
Greg Freemyer, instead, suggests that
flushing the discard cache could be done by a user-space process. Greg
says:
Assuming we have a persistent bitmap in place, have a background
scanner that kicks in when the cpu / disk is idle. It just
continuously scans the bitmap looking for contiguous blocks of
unused sectors. Each time it finds one, it sends the largest
possible unmap down the block stack and eventually to the device.
When normal cpu / disk activity kicks in, this process goes to
sleep.
A variant of this approach was posted by Christoph Hellwig, who has
implemented batched discard support in
XFS. Christoph's patch adds a new ioctl() which wanders
through the filesystem's free-space map and issues large discard operations
on each of the free extents. The advantage of doing things at the
filesystem level is that the filesystem already knows which blocks are
uninteresting; there is no additional accounting required to obtain that
information. This approach will also naturally generate large operations;
larger discards tend to suit the needs of the hardware better. On the
other hand, regularly discarding all of the free space in a filesystem makes it
likely that some time will be spent
telling the device to discard sectors which it already knows to be free.
It is far too soon to hazard a guess as to which of these approaches - if
any - will be merged into the mainline. There is a fair amount of coding
and benchmarking work to be done still. But it is clear that the code
which is in current mainline kernels is not up to the task of getting the
best performance out of near-future hardware.
Your editor feels the need to point out one possibly-overlooked aspect of
this problem. An SSD is not just a dumb storage device; it is, instead, a
reasonably powerful computer in its own right, running complex software,
and connected via what is, essentially, a high-speed, point-to-point
network. Some of the more
enterprise-oriented devices are more explicitly organized this way; they
are separate boxes which hook into an IP-based local net. Increasingly,
the value in these devices is not in the relatively mundane pile of flash
storage found inside; it's in the clever firmware which causes the device
to look like a traditional disk and, one hopes, causes
it to perform well. Competition in this area has brought about
some improvements in this firmware, but we should see a modern SSD for what it is: a
computer running proprietary software that we put at the core of our
systems.
It does not have to be that way; Linux does not need to talk to flash
storage through a fancy translation layer. We have our own translation
layer code (UBI), and a few filesystems which can work with bare flash. It
would be most interesting to see what would happen if some manufacturer
were to make competitive, bare-flash devices available as separate
components. The kernel could then take over the flash management task, and
our developers could turn their attention toward solving the problem
correctly instead of working around problems in vendor solutions. Kernel
developers made an explicit choice to avoid offloading much of the network
stack onto interface hardware; it would be nice to have a similar choice
regarding the offloading of low-level storage management.
In the absence of that choice, we'll have no option but to deal with the
translation layers offered by hardware vendors. The results look unlikely
to be optimal in the near future, but they should still end up being better
than what we have today.
Comments (36 posted)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
- Jozsef Kadlecsik: ipset-3.1 .
(August 19, 2009)
Page editor: Jonathan Corbet
Next page: Distributions>>