Brief items
The current development kernel is 3.11-rc7,
released on
August 25 with a special announcement: "
I'm doing a (free)
operating system (just a hobby, even if it's big and professional) for 486+
AT clones and just about anything else out there under the sun. This has
been brewing since april 1991, and is still not ready. I'd like any
feedback on things people like/dislike in Linux 3.11-rc7." He is, of
course, celebrating the 22nd anniversary of
his
original "hello everybody" mail.
Stable updates: no stable updates have been released in the last
week. As of this writing, the
3.10.10,
3.4.60, and
3.0.94 updates are in the review process;
they can be expected on or after August 29.
It can be hard to realize "this piece of your code is utter
*expletive*" actually means "I think the general concept is okay,
but you need to rethink this piece of your code, and I think you
can make it work."
On the other hand, after grad school, this is second nature.
— Andy Lutomirski
Also historically ignoring hardware vendor management chains has
always led to better drivers. Not saying that some vendors don't
have competent engineers (many do, many don't), but as soon as
management gets involved in decisions about drivers all hope is
lost.
— Christoph Hellwig
By Jonathan Corbet
August 28, 2013
A few weeks back, we
noted the merging of a
patch adding support for the
flink() system call, which would
allow a program to link to a file represented by an open file descriptor.
This functionality had long been blocked as the result of security worries;
it went through this time on the argument that the kernel already provided
that feature via the
linkat() call. At that point, it seemed like
this feature was set for the 3.11 kernel.
Not everybody was convinced, though; in particular, Brad Spengler
apparently raised the issue of processes making links to file descriptors
that had been passed in from a more privileged domain. So Andy Lutomirski,
the author of the original patch, posted a
followup to restrict the functionality to files created with the
O_TMPFILE option. The only problem was that nobody much liked the
patch; Linus was reasonably clear about his
feelings.
What followed was a long discussion on how to better solve the problem,
with a number of patches going back and forth. About the only thing that
became clear was that the best solution was not yet clear. So, on
August 28, Linus reverted the original
patch, saying "Clearly we're not done with this discussion, and
the patches flying around aren't appropriate for 3.11." So the
flink() functionality will have to wait until at least 3.12.
Kernel development news
By Jonathan Corbet
August 28, 2013
One might think that, by now, we would have a pretty good idea of how to
optimally manage data streams with the TCP protocol. In truth, there still
seems to be substantial room for improvement. Larger problems like
bufferbloat have received a lot of
attention recently, but there is ongoing work in other aspects of
real-world networking as well. A couple of patches posted recently by Eric
Dumazet show the kind of work that is being done.
TSO sizing
TCP segmentation offload (TSO) is a hardware-assisted technique for
improving the performance of outbound data streams. As a stream of data (a
"flow") is
transmitted, it must be broken up into smaller segments, each of which fits
into one packet. A network interface that supports TSO can accept a large
buffer of data and do this segmentation work within the hardware. That
reduces the number of operations performed and interrupts taken by the host
CPU, making the transmission process more efficient. The use of techniques
like TSO makes it possible for Linux systems to run high-end network
interfaces at their full speed.
A potential problem with TSO is that it can cause the interface to dump a
large number of packets associated with a single stream onto the wire in a
short period of time. In other words, data transmission can be bursty,
especially if the overall flow rate for the connection is not all that
high. All of those packets will just end up sitting in a buffer somewhere,
contributing to bufferbloat and increasing the chances that some of those
packets will be dropped. If those packets were transmitted at a more
steady pace, the stress on the net as a whole would be reduced and
throughput could well increase.
The simple TSO automatic sizing patch
posted by Eric (with a Cc to Van Jacobson at a new google.com address)
tries to spread out transmissions in just that way. There are two changes
involved, the first of which is to make intelligent choices about how much
data should be handed to the interface in a single TSO transmission.
Current kernels will provide a maximally-sized buffer — usually 64KB — to
be transmitted all at once. With the automatic sizing patch, that buffer
size is reduced to an amount that, it is estimated, will take roughly 1ms
to transmit at the current flow rate. In this way, each transmission will
produce a smaller burst of data if the flow rate is not high.
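The sizing logic can be illustrated with a small, self-contained sketch; the function name, constants, and clamping choices below are assumptions made for the example, not the patch's actual code.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Illustrative sketch, not the kernel patch: choose a TSO chunk size
     * that takes roughly 1ms to transmit at the flow's current rate,
     * clamped between two full segments and the usual 64KB TSO maximum,
     * then rounded down to a whole number of segments.
     */
    static uint32_t tso_autosize(uint64_t rate_bytes_per_sec, uint32_t mss)
    {
        uint64_t chunk = rate_bytes_per_sec / 1000;     /* ~1ms of data */

        if (chunk < 2ULL * mss)         /* never less than two segments */
            chunk = 2ULL * mss;
        if (chunk > 65536)              /* never more than the 64KB limit */
            chunk = 65536;

        return (uint32_t)(chunk - (chunk % mss));
    }

    int main(void)
    {
        /* A 100Mb/s flow with a 1448-byte MSS: eight segments (11584 bytes). */
        printf("%u\n", tso_autosize(100000000 / 8, 1448));
        return 0;
    }

At gigabit rates the 1ms budget exceeds 64KB, so fast flows continue to get maximally-sized buffers; only slower flows see smaller, better-spaced chunks.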
The other piece of the puzzle is called "TCP pacing"; it is a TCP
implementation change intended to set
the pace at which packets are transmitted to approximately match the pace
at which they can get through the network. The existing TCP flow control
mechanisms tell each endpoint how much data it is allowed to transmit, but
they don't provide a time period over which that transmission should be
spread, so, once again, the result tends to be bursts of packets.
TCP pacing ensures that packets
are transmitted at a rate that doesn't cause them to pile up in buffers
somewhere between the two endpoints. It can, of course, also be used to
restrict the data rate of a given flow to something lower than what the
network could handle, but that is not the objective of this patch.
Interestingly, the patch does not actually implement pacing, which would
add some significant complexity to the TCP stack — code that does not really
need more complexity. All it does is to calculate the rate that should be
used, in the hope that some other level of the stack can then enforce that
rate. For now, that other part would appear to be the new "fair queue" packet scheduler.
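As a rough illustration of the rate calculation (the formula and names here are assumptions for the example, not the patch itself): the stack already knows how many bytes the congestion window allows per round-trip time, and that ratio is a reasonable estimate of what the path can absorb.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hedged sketch: estimate a pacing rate from the congestion window
     * and the smoothed RTT.  The factor of two leaves headroom for the
     * window to keep growing; a real implementation has to cope with
     * units, scaling, and missing RTT samples far more carefully.
     */
    static uint64_t pacing_rate(uint32_t cwnd_segments, uint32_t mss,
                                uint32_t srtt_usec)
    {
        uint64_t window_bytes = (uint64_t)cwnd_segments * mss;

        if (srtt_usec == 0)
            return 0;       /* no RTT sample yet: leave the flow unpaced */

        return 2 * window_bytes * 1000000ULL / srtt_usec;   /* bytes/second */
    }

    int main(void)
    {
        /* A 10-segment window, 1448-byte MSS, 40ms RTT: roughly 724KB/s. */
        printf("%llu\n",
               (unsigned long long)pacing_rate(10, 1448, 40000));
        return 0;
    }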
The FQ scheduler
A packet scheduler is charged with organizing the flow of packets through
the network stack to meet a set of policy objectives. The kernel has quite
a few of them, including CBQ for fancy class-based queueing, CHOKe for routers, and a couple of variants on
the CoDel queue management algorithm. FQ
joins this list as a relatively simple scheduler designed to implement fair
access across large numbers of flows with local endpoints while keeping
buffer sizes down; it also happens to implement TCP pacing.
FQ keeps track of every flow it sees passing through the system. To do so,
it calculates an eight-bit hash based on the socket associated with the
flow, then uses the result as an index into an array of red-black trees.
The data structure is designed, according to Eric, to scale well up to
millions of concurrent flows. A number of parameters are associated with
each flow, including its current transmission quota and, optionally, the
time at which the next packet can be transmitted.
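The classification step might be modeled as in the following sketch; it is deliberately simplified (a linked list stands in for the per-bucket red-black tree, and all of the names are invented for the example), and is meant only to show how an eight-bit hash of the socket selects where a flow's state lives.

    #include <stdint.h>
    #include <stdlib.h>

    /*
     * Simplified model of FQ's flow classification, not the scheduler's
     * code: an eight-bit hash of the originating socket picks one of 256
     * buckets.  The real scheduler keeps a red-black tree per bucket so
     * that millions of flows stay cheap to search; a singly linked list
     * is used here only to keep the sketch short.
     */
    struct flow {
        const void *sk;                 /* the socket that owns the flow */
        uint64_t time_next_packet;      /* earliest allowed departure, 0 = now */
        unsigned int credit;            /* bytes left in the current quota */
        struct flow *next;              /* bucket chaining */
    };

    static struct flow *buckets[256];

    static unsigned int hash_sk(const void *sk)
    {
        /* Multiplicative hash folded down to eight bits. */
        return (unsigned int)(((uintptr_t)sk * 2654435761u) >> 24) & 0xff;
    }

    struct flow *flow_lookup(const void *sk)
    {
        unsigned int h = hash_sk(sk);
        struct flow *f;

        for (f = buckets[h]; f; f = f->next)
            if (f->sk == sk)
                return f;

        /* First packet from this socket: set up fresh per-flow state. */
        f = calloc(1, sizeof(*f));
        if (f) {
            f->sk = sk;
            f->next = buckets[h];
            buckets[h] = f;
        }
        return f;
    }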
That transmission time is used to implement the TCP pacing support. If a
given socket has a pace specified for it, FQ will calculate how far the
packets should be spaced in time to conform to that pace. If a flow's next
transmission time is in the future, that flow is added to another red-black
tree with the transmission time used as the key; that tree, thus, allows
the kernel to track delayed flows and quickly find the one whose next
packet is due to go out the soonest. A single timer is then used, if
needed, to ensure that said packet is transmitted at the right time.
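A hedged sketch of that dequeue-time check follows; clock_ns(), park_until(), and arm_watchdog() are hypothetical stand-ins for the scheduler's clock, its tree of delayed flows, and its single timer.

    #include <stdbool.h>
    #include <stdint.h>

    struct paced_flow {
        uint64_t time_next_packet;      /* nanoseconds; 0 means "no pacing" */
        uint64_t pacing_rate;           /* bytes per second requested by TCP */
    };

    /* Hypothetical helpers standing in for the scheduler's machinery. */
    extern uint64_t clock_ns(void);
    extern void park_until(struct paced_flow *f, uint64_t when_ns);
    extern void arm_watchdog(uint64_t earliest_deadline_ns);

    static bool flow_may_send(struct paced_flow *f, unsigned int pkt_len)
    {
        uint64_t now = clock_ns();

        if (f->time_next_packet > now) {
            /* Too early: defer this flow and make sure the timer fires. */
            park_until(f, f->time_next_packet);
            arm_watchdog(f->time_next_packet);
            return false;
        }

        /* Sending pkt_len bytes "costs" pkt_len / rate seconds of spacing. */
        if (f->pacing_rate)
            f->time_next_packet = now +
                (uint64_t)pkt_len * 1000000000ULL / f->pacing_rate;

        return true;
    }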
The scheduler maintains two linked lists of active flows, the "new" and
"old" lists. When a flow is first encountered, it is placed on the new
list. The packet dispatcher services flows on the new list first; once a
flow uses up its quota, that flow is moved to the old list. The idea here
appears to be to give preferential treatment to new, short-lived
connections — a DNS lookup or HTTP "GET" command, for example — and not let
those connections be buried underneath larger, longer-lasting flows.
Eventually the scheduler works its way through all active flows, sending a
quota of data from each; then the process starts over.
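The round-robin over the two lists might look roughly like the sketch below; the list helpers, the quantum, and flow_dequeue_packet() are all assumptions made for the example.

    #include <stddef.h>

    /*
     * Hedged sketch of the new/old rotation.  Flows on the "new" list
     * are served first; a flow that uses up its byte quota has the
     * quota refilled and is demoted to the tail of the "old" list, so
     * short-lived flows get out ahead of long-running bulk transfers.
     */
    struct sched_flow {
        int credit;                     /* bytes remaining this round */
    };

    struct flow_list;                   /* opaque for the sketch */

    extern struct flow_list new_flows, old_flows;
    extern const int flow_quantum;      /* per-round byte allowance */

    /* Hypothetical helpers standing in for the scheduler's own lists. */
    extern struct sched_flow *flow_list_first(struct flow_list *l);
    extern void flow_list_move_tail(struct sched_flow *f, struct flow_list *to);
    extern void flow_list_del(struct sched_flow *f);
    extern void *flow_dequeue_packet(struct sched_flow *f, int *len);

    void *dequeue_one(void)
    {
        struct sched_flow *f;
        void *pkt;
        int len;

        for (;;) {
            f = flow_list_first(&new_flows);
            if (!f)
                f = flow_list_first(&old_flows);
            if (!f)
                return NULL;            /* nothing queued at all */

            if (f->credit <= 0) {
                /* Quota exhausted: refill and demote to the old list. */
                f->credit += flow_quantum;
                flow_list_move_tail(f, &old_flows);
                continue;
            }

            pkt = flow_dequeue_packet(f, &len);
            if (!pkt) {
                /* Flow went idle; the real scheduler "detaches" it here. */
                flow_list_del(f);
                continue;
            }

            f->credit -= len;
            return pkt;
        }
    }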
There are a number of additional details, of course. There are limits on
the amount of data queued for each flow, as well as a limit on the amount
of data buffered within the scheduler as a whole; any packet that would
exceed one of those limits is dropped. A special "internal" queue exists
for high-priority traffic, allowing it to reach the wire more quickly. And
so on.
One other detail is garbage collection. One problem with this kind of flow
tracking is that nothing tells the scheduler when a particular flow is shut
down; indeed, for flows without local
endpoints or for connectionless protocols, nothing can. So the scheduler must
figure out on its own when it can stop
tracking any given flow. One way to do that would be to drop the flow as
soon as there are no packets associated with it, but that would cause some
thrashing as the queues empty and refill; it is better to keep flow data
around for a little while in anticipation of more traffic. FQ handles this
by putting idle flows into a special "detached" state, off the lists of
active flows. Whenever a new flow is added, a pass is made over the
associated red-black tree to clean out flows that have been detached for a
sufficiently long time — three seconds in the current patch.
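The aging logic might be sketched as follows; this is illustrative only, the structure and helper names are invented, and the real code walks a red-black tree rather than a list.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /*
     * Hedged sketch: when a new flow is inserted into a bucket, walk
     * that bucket and free any flow that has been "detached" (idle and
     * off both service lists) for longer than a grace period, three
     * seconds in the posted patch.
     */
    #define GC_AGE_NS   3000000000ULL   /* three seconds */

    struct gc_flow {
        bool detached;                  /* no packets queued, off both lists */
        uint64_t detach_time;           /* when the flow went idle (ns) */
        struct gc_flow *next;
    };

    extern uint64_t clock_ns(void);     /* hypothetical clock helper */

    static void bucket_gc(struct gc_flow **bucket)
    {
        uint64_t now = clock_ns();
        struct gc_flow **pp = bucket;

        while (*pp) {
            struct gc_flow *f = *pp;

            if (f->detached && now - f->detach_time > GC_AGE_NS) {
                *pp = f->next;          /* unlink and forget the stale flow */
                free(f);
            } else {
                pp = &f->next;
            }
        }
    }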
The result, says Eric, is fair scheduling of packets from any number of
flows with nice spacing in situations where pacing is being used. Given
that the comments so far have been mostly concerned with issues like white
space, it seems unlikely that anybody is going to disagree with its
merging. So TCP pacing and the FQ scheduler seem like candidates for the
mainline in the near future — though the upcoming 3.12 cycle may be just a
bit too near at this point.
By Jake Edge
August 28, 2013
Lightweight virtualization using containers is a technique that has finally
come together for Linux, though there are still some rough edges that may
need filing down. Containers are created by using two
separate
kernel features: control groups (cgroups) and namespaces. Cgroups are in the
process of being revamped, and more namespaces may still need to be added to
the varieties currently available. For example, there is no way to separate
most devices into their own
namespace. That's a hole that Oren Laadan would like to see filled, so he put
out an RFC
on device namespaces recently.
Namespaces partition global system resources so that different sets of
processes have their own view of those resources. For example, mount
namespaces partition the mounted filesystems into different views, with the
result that processes in one namespace are unable to see or interact with
filesystems that are only mounted in another. Similarly, PID
namespaces give each namespace its own set of process IDs (PIDs).
Certain devices are currently handled by their own namespace or similar
functionality: network namespaces for network devices and the
devpts pseudo-filesystem for pseudo-terminals (i.e., ptys). But there
is no way to partition the view of all devices in the system, which is what
device
namespaces would do.
The motivation for the feature is to allow multiple virtual phones on a
single physical phone. For example, one could have two complete Android
systems running on the phone, with one slated for work purposes, and the
other for personal uses. Each system would run in its own container that
would be isolated from the other. That isolation would allow companies to
control the apps and security of the "company half" of a phone, while
allowing the user to keep their personal information separate. A video gives an overview of the
idea. Much of that separation can be done today, but
there is a missing piece: virtualizing the devices (e.g. frame buffer,
touchscreen, buttons).
The proposal adds the concept of an "active" device namespace,
which is
the one that the user is currently interacting with. The upshot is that a
user could switch between the phone personalities (or personas) as easily
as they switch between apps today. Each personality would have access to
all of the capabilities of the phone while it was the active namespace, but
none while it was the inactive (or background) namespace.
Setting up a device namespace is done in the normal way, using the clone(),
setns(), or unshare() system
calls. One surprise is that there is no new CLONE_* flag added for
device namespaces, and the CLONE_NEWPID flag is overloaded. A
comment in the code explains why:
    /*
     * Couple device namespace semantics with pid-namespace.
     * It's convenient, and we ran out of clone flags anyway.
     */
While coupling PID and device namespaces may work, it does seem like some
kind of long-term solution to the clone flag problem is required. Once a
process has been put into a device namespace, any
open() of a
namespace-aware device will restrict that device to the namespace.
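As a hedged illustration of that setup path: because the RFC reuses CLONE_NEWPID, creating a container with its own device namespace is just ordinary clone() usage, with the device-namespace side effect coming entirely from the out-of-tree patches. Something like the following (which requires CAP_SYS_ADMIN) would do it:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * With the RFC applied, a new PID namespace also gets a new device
     * namespace; the clone() call itself is unchanged from mainline.
     */
    static int child(void *arg)
    {
        (void)arg;
        /* Inside the new PID (and, with the RFC, device) namespace. */
        printf("child sees itself as pid %d\n", getpid());
        return 0;
    }

    int main(void)
    {
        static char stack[64 * 1024] __attribute__((aligned(16)));
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWPID | SIGCHLD, NULL);

        if (pid < 0) {
            perror("clone");
            return EXIT_FAILURE;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }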
At some level, adding device namespaces is simply a matter of virtualizing
the major and minor
device numbers so that each namespace has its own set of them. The
major/minor numbers in a namespace would correspond to the driver loaded
for that
namespace. Drivers that might be available to multiple namespaces would need
to be changed to be namespace-aware. For some kinds of drivers, for
example those
without any real state (e.g. for Android, the LED
subsystem or the backlight/LCD
subsystem), the changes would be minimal—essentially just a test: if the
namespace that contains the device is the active one, proceed; otherwise,
ignore any requested changes.
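For a stateless driver, that test might look like the hedged sketch below; current_dev_ns_active() and led_hw_set_brightness() are hypothetical names standing in for the RFC's API and the driver's existing hardware path.

    #include <linux/leds.h>
    #include <linux/types.h>

    /* Hypothetical helpers; the RFC's actual API may differ. */
    extern bool current_dev_ns_active(struct led_classdev *led);
    extern void led_hw_set_brightness(struct led_classdev *led, int value);

    static void led_ns_set_brightness(struct led_classdev *led, int value)
    {
        if (!current_dev_ns_active(led))
            return;         /* background persona: ignore the write */

        led_hw_set_brightness(led, value);
    }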
Devices, though, are sometimes stateful. One can't
suddenly switch sending frame buffer data mid-stream (or mix two streams)
and expect the screen
contents to stay coherent. So, drivers and subsystems will need to handle
the switching behavior. For example, the
framebuffer device should only reflect changes to the screen from the
active namespace, but it should buffer changes from the background
namespace so that those changes will be reflected in the display after a
switch.
Laadan and his colleagues at Cellrox
have put together a set
of patches based on the 3.4 kernel for the Android emulator (goldfish).
There is also a fairly detailed description of the patches and the changes
made for both stateless and stateful devices. An
Android-based
demo that switches between a running phone and an app that displays a
changing color palette has also been created.
So far, there hasn't been much in the way of discussion of the idea on the
containers and lxc-devel mailing lists that the RFC was posted to. On one
hand, it makes sense to be able to virtualize all of the devices in a
system, but on the other that means there are a lot of drivers that
might need to change. There may be some "routing" issues to resolve, as
well—when the phone rings, which namespace handles it? The existing
proof-of-concept API for switching the
active namespace would also likely need some work.
While it may be a
worthwhile feature, it could also lead to a large ripple effect of driver changes. How device namespaces
fare in terms of moving toward the
mainline may well hinge on others stepping forward with additional use
cases. In the end, though, the core
changes to support the feature are fairly small, so the phone
personality use case might be enough all on its own.
By Jonathan Corbet
August 28, 2013
As a general rule, kernel developers prefer data structures that are
designed for readability and maintainability. When one understands the
data structures used by a piece of code, an understanding of the code
itself is usually not far away. So it might come as a surprise that one of
the kernel's most heavily-used data structures is also among its least
comprehensible. That data structure is
struct page, which
represents a page of physical memory. A recent patch set making
struct page even more complicated provides an excuse for a
quick overview of how this structure is used.
On most Linux systems, a page of physical memory contains 4096 bytes; that
means that a typical system contains millions of pages. Management of
those pages requires the maintenance of a page structure for each
of those physical pages. That puts a lot of pressure on the size of
struct page; expanding it by a single byte will cause the
kernel's memory use to grow by (possibly many) megabytes. That creates a
situation where almost any trick is justified if it can avoid making
struct page bigger.
Enter Joonsoo Kim, who has posted a patch
set aimed at squeezing more information into struct page
without making it any bigger.
In particular, he is concerned about the space occupied by
struct slab, which is used by the slab memory allocator (one
of three allocators that can be configured into the kernel, the others
being called SLUB and SLOB). A slab can
be thought of as one or more contiguous pages containing an array of
structures, each of which can be allocated separately; for example, the
kmalloc-64 slab holds 64-byte chunks used to satisfy
kmalloc() calls requesting between 32 and 64 bytes of space. The
associated slab structures are also used in great quantity;
/proc/slabinfo on your editor's system shows over 28,000 active
slabs for the ext4 inode
cache alone. A reduction in that space use would be welcome; Joonsoo
thinks this can be done — by folding the contents of
struct slab into the page structure representing the
memory containing the slab itself.
What's in struct page
Joonsoo's patch is perhaps best understood by stepping through struct
page and noting the changes that are made to accommodate the extra
data. The full definition of this structure can be found in
<linux/mm_types.h> for the curious. The first field appears
simple enough:
    unsigned long flags;
This field holds flags describing the state of the page: dirty, locked,
under writeback, etc. In truth, though, this field is not as simple as it
seems; even the question of whether the kernel is running out of room for
page flags is hard to answer. See this
article for some details on how the flags field is used.
Following flags is:
    struct address_space *mapping;
For pages that are in the page cache (a large portion of the pages on most
systems), mapping points to the information needed to access the
file that backs up the page. If, however, the page
is an anonymous page (user-space memory backed by swap), then
mapping will point to an anon_vma structure, allowing the
kernel to quickly find the page tables that refer to this page; see this article for a diagram and details. To
avoid confusion between the two types of page, anonymous pages will have
the least-significant bit set in mapping; since the pointer itself
is always aligned to at least a word boundary, that bit would otherwise be
clear.
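The trick can be shown in a few lines; this sketch mirrors the kernel's own PageAnon() logic, though the function names here are invented for the example.

    #include <linux/mm_types.h>

    struct anon_vma;                /* only a pointer to it is needed here */

    /* The kernel's name for this bit is PAGE_MAPPING_ANON. */
    #define ANON_MAPPING_BIT    1UL

    /* Is this page anonymous?  Bit 0 of ->mapping says so. */
    static inline int page_is_anon(struct page *page)
    {
        return ((unsigned long)page->mapping & ANON_MAPPING_BIT) != 0;
    }

    /* Recover the anon_vma pointer by clearing the flag bit again. */
    static inline struct anon_vma *page_to_anon_vma(struct page *page)
    {
        if (!page_is_anon(page))
            return NULL;
        return (struct anon_vma *)((unsigned long)page->mapping &
                                   ~ANON_MAPPING_BIT);
    }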
This is the first place where Joonsoo's patch makes a change. The
mapping field is not currently used for kernel-space memory, so he
is able to use it as a pointer to the
first free object in the slab, eliminating the need to keep it in
struct slab.
Next is where things start to get complicated:
    struct {
        union {
            pgoff_t index;
            void *freelist;
            bool pfmemalloc;
        };
        union {
            unsigned long counters;
            struct {
                union {
                    atomic_t _mapcount;
                    struct { /* SLUB */
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                    };
                    int units;
                };
                atomic_t _count;
            };
        };
    };
(Note that this piece has been somewhat simplified through the removal of
some #ifdefs and a fair number of comments). In the first union,
index is used with page-cache pages to hold the offset into the
associated file. If, instead, the page is managed by the SLUB or SLOB
allocators,
freelist points to a list of free objects. The slab allocator
does not use freelist, but Joonsoo's patch makes slab use it the
same way the other allocators do. The pfmemalloc member, instead,
acts like a page flag; it is set on a free page if memory is tight and the
page should only be used as part of an effort to free more pages.
In the second union, both counters and the innermost anonymous
struct are used by the SLUB allocator, while units is
used by the SLOB allocator. The _mapcount and _count
fields are both usage counts for the page; _mapcount is the number
of page-table entries pointing to the page, while _count is a
general reference count. There are a number of subtleties around the use
of these fields, though, especially _mapcount, which helps with
the management of compound pages as well. Here, Joonsoo adds another field
to the second union:
    unsigned int active;        /* SLAB */
It is the count of active objects, again taken from struct slab.
Next we have:
    union {
        struct list_head lru;
        struct {
            struct page *next;
            int pages;
            int pobjects;
        };
        struct list_head list;
        struct slab *slab_page;
    };
For anonymous and page-cache pages, lru holds the page's position
in one of the least-recently-used (LRU) lists. The anonymous structure is used
by SLUB, while list is used by SLOB. The slab allocator uses
slab_page to refer back to the containing slab
structure. Joonsoo's patch complicates things here in an interesting way:
he overlays an rcu_head structure over lru to manage the
freeing of the associated slab using read-copy-update. Arguably that
structure should be added to the containing union, but the current code
just uses lru and casts instead. This trick will also involve
moving slab_page to somewhere else in the structure, but the
current patch set does not contain that change.
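A hedged sketch of what that overlay amounts to follows; the function names are invented for the example, and the actual patch differs in detail.

    #include <linux/gfp.h>
    #include <linux/kernel.h>
    #include <linux/mm_types.h>
    #include <linux/rcupdate.h>

    /*
     * struct rcu_head is two pointers, the same size and alignment as
     * the list_head it temporarily replaces, so a slab page waiting for
     * its RCU grace period can reuse page->lru as the rcu_head.
     */
    static void slab_page_rcu_free(struct rcu_head *head)
    {
        /* Recover the page from the overlaid field, then really free it. */
        struct page *page = container_of((struct list_head *)head,
                                         struct page, lru);

        __free_pages(page, 0);      /* order-0 page, for illustration */
    }

    static void slab_destroy_by_rcu(struct page *page)
    {
        call_rcu((struct rcu_head *)&page->lru, slab_page_rcu_free);
    }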
The next piece is:
    union {
        unsigned long private;
#if USE_SPLIT_PTLOCKS
        spinlock_t ptl;
#endif
        struct kmem_cache *slab_cache;
        struct page *first_page;
    };
The private field essentially belongs to whatever kernel subsystem
has allocated the page; it sees a number of uses throughout the kernel.
Filesystems, in particular, make heavy use of it. The ptl field
is used if the page is used by the kernel to hold page tables; it allows
the page table lock to be split into multiple locks if the number of CPUs
justifies it. In most configurations, a system containing four or more
processors will split the locks in this way. slab_cache is used
as a back pointer by slab and SLUB, while first_page is used
within compound pages to point to the first page in the set.
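The split-lock choice boils down to selecting which spinlock protects a page of page-table entries; the hypothetical pte_lock_for() macro below illustrates the decision, though the kernel's real helpers are more involved.

    #include <linux/mm_types.h>

    /*
     * Hedged sketch: with split page-table locks, each page of PTEs is
     * protected by the spinlock embedded in its own struct page; without
     * them, every page table in the address space shares the single
     * mm->page_table_lock.
     */
    #if USE_SPLIT_PTLOCKS
    # define pte_lock_for(mm, pte_page)  (&(pte_page)->ptl)
    #else
    # define pte_lock_for(mm, pte_page)  (&(mm)->page_table_lock)
    #endif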
After this union, one finds:
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;
#endif /* WANT_PAGE_VIRTUAL */
This field, if it exists at all, contains the kernel virtual address for
the page. It is not useful in many situations because that address is
easily calculated when it is needed. For systems where high memory is in
use (generally 32-bit systems with 1GB or more of memory), virtual
holds the address of high-memory pages that have been temporarily mapped
into the kernel with kmap(). The structure closes with a couple of optional
fields used when various debugging options are turned on.
With the changes described above, Joonsoo's patch moves much of the
information previously kept in struct slab into the
page structure. The remaining fields are eliminated in other
ways, leaving struct slab with nothing to hold and, thus, no
further reason to exist. These structures are not huge, but, given that
there can be tens of thousands of them (or more) in a running system, the
memory savings from their elimination can be significant. Concentrating
activity on struct page may also have beneficial cache
effects, improving performance overall. So the patches may well be
worthwhile, even at the cost of complicating an already complex situation.
And the situation is indeed complex: struct page is a complicated
structure with a number
of subtle rules regarding its use. The saving grace, perhaps, is that it
is so heavily used that any kind of misunderstanding about the rules will
lead quickly to serious problems. Still, trying to put more information
into this structure is not a task for the faint of heart. Whether Joonsoo
will succeed remains to be seen, but he clearly is not the first to eye
struct page as a place to stash some useful memory management
information.
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.10.9-rt5. (August 23, 2013)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
- Lucas De Marchi: kmod 15. (August 22, 2013)