Brief items
The current development kernel is 2.6.38-rc4,
released on February 7. "There's nothing much that stands out here. Some arch updates (arm and
powerpc), the usual driver updates: dri (radeon/i915), network cards,
sound, media, scsi, some filesystem updates (cifs, btrfs), and some
random stuff to round it all out (networking, watchpoints,
tracepoints, etc)." The short-form changelog is in the
announcement, or see the full changelog for all the details.
Stable updates: the 2.6.35.11
long-term update was released on February 7 with a long list of
important fixes.
The 2.6.27.58 update was released on
February 9. This one contains a couple dozen important fixes.
Anyone who programs ATA controllers on the basis of common sense
rather than documentation, errata sheets and actually testing
rather than speculating is naïve.
--
Alan Cox
I'm invoking the anti-discrimination statutes here on behalf of
those of us who don't like beer.
--
James Bottomley pushes the limits of tolerance
Realize that 50% of today's professional programmers have never
written a line of code that had to be compiled.
--
Casey Schaufler
As you can see in these posts, Ralink is sending patches for the
upstream rt2x00 driver for their new chipsets, and not just dumping
a huge, stand-alone tarball driver on the community, as they have
done in the past. This shows a huge willingness to learn how to
deal with the kernel community, and they should be strongly
encouraged and praised for this major change in attitude.
--
Greg Kroah-Hartman
There are no "rules", things have to work and that is the only rule.
--
Markus Rechberger (context here)
ITWire has posted a lengthy interview with Linus Torvalds. "On the other hand, one
of the things I've always enjoyed in Linux development has been how it's
stayed interesting by evolving. So maybe it's less 'fun' in the
crazy-go-lucky sense, but on the other hand the much bigger development
team and the support brought in by all the companies around Linux has also
added its own very real fun. It's a lot more social, for example. So the
project may have lost something, but it gained something else to
compensate."
By Jonathan Corbet
February 9, 2011
The TCP slow start algorithm, initially developed by Van Jacobson, was one
of the crucial protocol tweaks which made TCP/IP actually work on the
Internet. Slow start works by limiting the amount of data which can be in
flight over a new connection and ramping the transmission speed up slowly
until the carrying capacity of the connection is found. In this way, TCP
is able to adapt to the actual conditions on the net and avoid overloading
routers with more data than can be accommodated. A key part of slow start
is the initial congestion window, which puts an upper bound on how much
data can be in flight at the very beginning of a conversation.
That window has been capped by RFC 3390 at four segments
(just over 4KB) for the better part of a decade. In the meantime,
connection speeds have increased and the amount of data sent over a given
connection has grown despite the fact that connections live
for shorter periods of time. As a result, many connections never ramp up
to their full speed before they are closed, so the four-segment limit is
now seen as a bottleneck which increases the latency of a typical
connection considerably. That is one reason why contemporary browsers use
many connections in parallel, despite the fact that the HTTP specification
says that a maximum of two connections should be used.
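To make the cost of a small initial window concrete, here is a minimal sketch of idealized slow start. It assumes no packet loss, an unlimited receive window, a congestion window that doubles every round trip, and a 1460-byte segment size; none of those numbers come from the discussion above, and the function names are invented for illustration. The first helper encodes the upper bound from RFC 3390, which for a 1460-byte segment works out to 4380 bytes.

```python
import math

MSS = 1460  # assumption: a typical Ethernet-sized TCP segment, in bytes


def rfc3390_initial_window(mss: int) -> int:
    """Upper bound on the initial congestion window, in bytes (RFC 3390)."""
    return min(4 * mss, max(2 * mss, 4380))


def round_trips_to_send(total_bytes: int, mss: int, initial_segments: int) -> int:
    """Round trips needed to deliver total_bytes under idealized slow start.

    Idealized means: no loss, no receive-window limit, and a congestion
    window that doubles after every round trip.
    """
    segments_left = math.ceil(total_bytes / mss)
    cwnd = initial_segments
    trips = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window this round trip
        cwnd *= 2              # slow start: window doubles per round trip
        trips += 1
    return trips


if __name__ == "__main__":
    response = 10 * MSS  # a hypothetical 14600-byte response
    for initial in (4, 10):
        trips = round_trips_to_send(response, MSS, initial)
        print(f"initial window {initial:2d} segments: {trips} round trip(s)")
```

Under these assumptions a ten-segment response needs two round trips with a four-segment initial window and one with a ten-segment window. Short connections feel that difference most, since they end before the window has had time to grow.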
Some developers at Google have been agitating for an
increase in the initial congestion window for a while; in July 2010 they
posted an
IETF draft pushing for this change and describing the motivation behind
it. Evidently Google has run some large-scale tests and found that, by
increasing the initial congestion window, user-visible latencies can be
reduced by 10% without creating congestion problems on the net. They thus
recommend that the window be increased to 10 segments; the draft suggests
that 16 might actually be a better value, but more testing is required.
David Miller has posted a patch increasing
the window to 10; that patch has not been merged into the mainline, so one
assumes it's meant for 2.6.39.
Interestingly, Google's tests worked with a number of operating systems,
but not with Linux, which uses a relatively small initial receive
window of 6KB. Most other systems, it seems, use 64KB instead. Without a
receive window at least as large as the congestion window, a larger initial
congestion window will have little effect. That problem will be fixed in
2.6.38, thanks to a
patch from Nandita Dukkipati raising the initial receive window to 10
segments.
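The interaction between the two windows can be sketched in a few lines: the sender may only have as much data in flight as the smaller of the congestion window and the receiver's advertised window. The 1460-byte segment size and the function name below are assumptions made for illustration.

```python
MSS = 1460  # assumption: typical segment size in bytes


def initial_flight_bytes(cwnd_segments: int, rwnd_bytes: int, mss: int = MSS) -> int:
    """Bytes the sender may have in flight: the smaller of the two windows."""
    return min(cwnd_segments * mss, rwnd_bytes)


if __name__ == "__main__":
    small_rwnd = 6 * 1024    # the 6KB initial receive window described above
    large_rwnd = 64 * 1024   # the 64KB used by most other systems
    for rwnd in (small_rwnd, large_rwnd):
        for cwnd in (4, 10):
            flight = initial_flight_bytes(cwnd, rwnd)
            print(f"rwnd={rwnd:6d} cwnd={cwnd:2d} segments -> {flight} bytes in flight")
```

With a 6KB receive window, moving from four to ten segments raises the usable initial flight only from 5840 to 6144 bytes in this model; with a 64KB receive window the full 14600 bytes can be sent.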
A quick look at any development conference will reveal that quite a few
Linux hackers are currently carrying phones made by HTC. They obviously
like the hardware, but kernel developers have been getting increasingly
annoyed by HTC's policy of delaying source
code releases for up to 120 days after a given handset ships. In response,
Matthew Garrett has suggested an addition
to the top-level COPYING file in the kernel source:
While this version of the GPL does not place an explicit timeframe
upon fulfilment of source distribution under section 3(b), it is
the consensus viewpoint of the Linux community that such
distribution take place as soon as is practical and certainly no
more than 14 days after a request is made.
About the only response so far has been from Alan Cox, who has suggested that getting a
lawyer's opinion on the matter might be useful. Linus, over whose name the
new text would appear, has not commented on it. So it's not clear if the
change will go in or whether it will inspire any changes in vendor
behavior if it is merged. But it does, at least, make the developers'
feelings on the matter known.
Kernel development news
By Jonathan Corbet
February 9, 2011
The ext4 filesystem has, at this point, moved far beyond its experimental
phase. It is now available in almost all distributions and is used by
default in many of them. Many users may soon be in danger of forgetting
that the ext2 and ext3 filesystems even exist in the kernel. But those
filesystems do
exist, and they require ongoing resources to maintain. Keeping older,
stable filesystems around makes sense when the newer code is stabilizing,
but somebody is bound to ask, sooner or later, whether it is time to say
"goodbye" to the older code.
The question, as it turns out, came sooner
- February 3, to be exact - when Jan Kara suggested that removing ext2
and ext3
could be discussed at the upcoming storage, filesystems, and memory
management summit. Jan asked:
Of course it costs some effort to maintain them all in a
reasonably good condition so once in a while someone comes and
proposes we should drop one of ext2, ext3 or both. So I'd like to
gather input what people think about this - should we ever drop
ext2 / ext3 codebases? If yes, under what condition do we deem it
is OK to drop it?
One might protest that there will be existing filesystems in the ext3 (and
even ext2) formats for the indefinite future. Removing support for those
formats is clearly not something that can be done. But removing the ext2
and/or ext3 code is not the same as removing support: ext4 has been very
carefully written to be able to work with the older formats without
breaking compatibility. One can mount an ext3 filesystem using the ext4
code and make changes; it will still be possible to mount that filesystem
with the ext3 code in the future.
So it is possible to remove ext2 and ext3 without breaking existing users
or preventing them from going back to older implementations. Beyond that,
mounting an ext2/3 filesystem under ext4 allows the system to use a number
of performance enhancing techniques - like delayed allocation - which do
not exist in the older implementations. In other words, ext4 can replace
ext2 and ext3, maintain compatibility, and make things faster at the same
time. Given that, one might wonder why removing the older code even
requires discussion.
There appear to be a couple of reasons not to hurry into this change, both
of which have to do with testing. As Eric Sandeen noted, some of the more ext3-like options are
not tested as heavily as the native modes of operation:
ext4's more, um ... unique option combinations probably get next to
no testing in the real world. So while we can say that noextent,
nodelalloc is mostly like ext3, in practice, does that ever really
get much testing?
There is also concern that ext4, which is still seeing much more change
than its predecessors, is more likely to introduce instabilities. That's a
bit of a disturbing idea; there are enough production users of ext4 now
that the introduction of serious bugs would not be pleasant. But, again,
the backward-compatible operating modes of ext4 may not be as heavily
tested as the native mode, so one might argue that operation with older
filesystems is more likely to break regardless of how careful the
developers are.
So, clearly, any move to get rid of ext2 and ext3 would have to be preceded
by the introduction of better testing for the less-exercised corners of
ext4. The developers involved understand that clearly, so there is no need
to be worried that the older code could be removed too quickly.
Meanwhile, there are also concerns that the older code, which is not seeing
much developer attention, could give birth to bugs of its own. As Jan put it:
The time I spend is enough to keep ext3 in a good shape I believe
but I have a feeling that ext2 is slowly bitrotting. Sometime when
I look at ext2 code I see stuff we simply do differently these days
and that's just a step away from the code getting broken... It
would not be too much work to clean things up and maintain but it's
a work with no clear gain (if you do the thankless job of
maintaining old code, you should at least have users who appreciate
that ;) so naturally no one does it.
Developers have also expressed concern that new filesystem authors might
copy code from ext2, which, at this point, does not serve as a good example
for how Linux filesystems should be written.
The end result is that, once the testing concerns have been addressed,
everybody involved might be made better off by the removal of ext2 and
ext3. Users with older filesystems would get better performance and a code
base which is seeing more active development and maintenance. Developers
would be able to shed an older maintenance burden and focus their efforts
on a single filesystem going forward. Thanks to the careful compatibility
work which has been done over the years, it may be possible to safely make
this move in the relatively near future.
By Jake Edge
February 9, 2011
With some regularity, the topic of allowing multiple Linux Security Modules
(LSMs) to all be active comes up in the Linux kernel community. There have
been some attempts at "stacking" or "chaining" LSMs in the past, but
nothing has ever made it into the mainline. On the other hand, though,
every time a developer comes up with some kind of security hardening patch
for the kernel, they are generally directed toward the LSM interface.
Because the "monolithic" security solutions (like SELinux, AppArmor, and
others) tend to have already taken the single existing LSM slot in many
distributions, these simpler, more targeted LSMs are generally unable to be
used. But a discussion on the linux-security-module mailing list
suggests that work is being done that just might solve this problem.
The existing implementation of LSMs uses a single set of function pointers
in a struct security_operations
for the "hooks" that get called when access decisions need to be made.
Once a security module gets registered (typically at boot time using the
security= flag), its implementation is stored in the structure and
any other LSM is out of luck. The idea behind LSM stacking would be to
keep multiple versions of the security_operations structure around
and to call each registered LSM's hooks for an access decision. While that
sounds fairly straightforward, there are some subtleties that need to be
addressed, especially if different LSMs give different answers for a
particular access.
This problem with the semantics of "composing" two (or more) LSMs has been
discussed at various points, without any real global solution for composing
arbitrary LSMs. As Serge E. Hallyn warned
over a year ago:
The problem is that composing any two security policies can quickly have
subtle, unforeseen, but dangerous effects. That's why so far we have
stuck with the status quo where only one LSM is 'active', but that LSM
can manually call hooks from other LSMs.
There is one example of stacking LSMs as Hallyn describes in the
kernel already; the capabilities LSM is called directly from
other
LSMs where necessary. That
particular approach is not very general, of course, as LSM maintainers are
likely to lose patience with adding calls for every other possible LSM. A
more easily expandable solution is required.
David Howells posted a set of patches that
would add that expansion mechanism. They do that by allowing multiple
calls to the register_security() initialization function, each
with its own set of security_operations. Instead of the current
situation, where each LSM manages its own data for each kind of object
(credentials, keys, files, inodes, superblocks, IPC, and sockets), Howells's
security framework will allocate and manage that data for the LSMs.
The security_operations structure gets new *_data_size
and *_data_offset fields for each kind of object, with the former
filled in by the LSM
before calling register_security() and the latter being managed by
the framework. The data size field tells the framework how much space is
needed for the LSM-specific data for that type of object, and the offset is
used by the framework to find each LSM's private data. For
struct cred, struct key,
struct file, and struct super_block, the extra
data for each registered LSM is tacked onto the end of the structure rather
than going through an intermediate pointer (as is required for the others).
Wrappers are defined that will allow an LSM to extract its data from an
object based on the new fields in the operations table.
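The size-and-offset bookkeeping described above can be modeled in user space. The following is only a rough Python sketch of the idea, not the kernel code: the class names, the dictionary-based fields, and the helper lsm_data() are all hypothetical stand-ins for the *_data_size and *_data_offset fields and the wrappers in the patches.

```python
from dataclasses import dataclass, field

OBJECT_KINDS = ("cred", "key", "file", "inode", "superblock", "ipc", "socket")


@dataclass
class SecurityOps:
    """Hypothetical stand-in for one LSM's security_operations table."""
    name: str
    data_size: dict                                   # filled in by the LSM
    data_offset: dict = field(default_factory=dict)   # filled in by the framework


class Framework:
    """Hypothetical framework that owns the per-object security data."""

    def __init__(self):
        self.lsms = []
        self.total_size = {kind: 0 for kind in OBJECT_KINDS}

    def register_security(self, ops: SecurityOps) -> None:
        # Give each LSM the next free offset for every kind of object.
        for kind in OBJECT_KINDS:
            ops.data_offset[kind] = self.total_size[kind]
            self.total_size[kind] += ops.data_size.get(kind, 0)
        self.lsms.append(ops)

    def alloc_blob(self, kind: str) -> bytearray:
        # One allocation holds the private data of every registered LSM.
        return bytearray(self.total_size[kind])


def lsm_data(ops: SecurityOps, kind: str, blob: bytearray) -> memoryview:
    """Wrapper: return the slice of the blob that belongs to this LSM."""
    start = ops.data_offset[kind]
    return memoryview(blob)[start:start + ops.data_size.get(kind, 0)]


if __name__ == "__main__":
    fw = Framework()
    first = SecurityOps("first", {"file": 8})
    second = SecurityOps("second", {"file": 16, "inode": 4})
    fw.register_security(first)
    fw.register_security(second)
    blob = fw.alloc_blob("file")
    lsm_data(second, "file", blob)[0] = 0xFF
    print(second.data_offset["file"], len(blob), blob[8])
```

The point of the design is that an LSM never computes where its data lives; it only declares how much it needs and asks the framework for its slice.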
The framework then maintains a list of registered LSMs and puts the
capabilities LSM in the first slot of the list. When one of the security
hooks is
called, the framework iterates over the list and calls the
corresponding hook for each registered LSM. Depending on the specific
hook, different kinds of iterators are used, but the usual iterator looks
for a non-zero response from an LSM's hook, which would indicate a denial
of some kind, and returns that to the framework. The other iterators are
used for specialized calls, for example when there is no return value or
when only the first hook found should be called. The upshot is that the
hooks for registered LSMs get called in order (with capabilities coming
first), and the first to deny the access "wins". Because the capabilities
calls are pulled out separately, that also means that the other LSMs no
longer have to make those calls themselves; instead the framework will
handle it for them.
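The "usual iterator" can be sketched as a loop that stops at the first non-zero return. This is a simplified Python model rather than the kernel implementation; the function name, the tuple-based LSM list, and the example hooks are assumptions made for illustration.

```python
EACCES = 13  # errno value used here to signal "permission denied"


def call_hooks_first_denial_wins(lsms, hook, *args):
    """Call `hook` on each LSM in registration order.

    `lsms` is a list of (name, {hook_name: callable}) pairs. The first
    non-zero return value is treated as a denial and returned at once,
    so LSMs later in the list are not consulted.
    """
    for _name, hooks in lsms:
        fn = hooks.get(hook)
        if fn is None:
            continue  # this LSM does not implement the hook
        rc = fn(*args)
        if rc != 0:
            return rc
    return 0


trace = []  # records which hooks actually ran, for demonstration only


def make_hook(label, rc):
    """Build a hypothetical hook that logs its call and returns rc."""
    def hook(path):
        trace.append(label)
        return rc
    return hook


registered = [
    ("capabilities", {"file_permission": make_hook("capabilities", 0)}),
    ("strict", {"file_permission": make_hook("strict", -EACCES)}),
    ("later", {"file_permission": make_hook("later", 0)}),
]

result = call_hooks_first_denial_wins(registered, "file_permission", "/etc/shadow")
```

Here the capabilities hook runs first and allows the access, the second LSM denies it, and the third is never called.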
But there are a handful of hooks that do not work very well in a multi-LSM
environment, in particular the secid (an LSM-specific security label
ID) handling routines (e.g. secid_to_secctx(),
task_getsecid(), etc.). Howells's current implementation just
calls the hook of the first LSM it finds that implements it, which
is not going to make it possible to use multiple LSMs that all implement
those hooks (currently just SELinux and Smack). Howells's solution is to explicitly ban that particular
combination:
I think the obvious thing is to reject any chosen module that implements any of
these interfaces if we've already selected a module that implements them. That
would mean you can choose either Smack or SELinux, but not both.
But Smack developer Casey Schaufler isn't convinced that is the right course:
"That kind of takes the wind out of the sails, doesn't it?" He
would rather see a more general solution that allows multiple
secids, and the related secctxs (security contexts), to
be handled by the framework:
It does mean that there needs to be a standard for a secctx that allows
for the presence of multiple concurrent LSMs. There will have to be an
interface whereby either the composer/stacker can break a secctx into its
constituent parts or with which an LSM can pull the bit it cares about
out. In either case the LSMs may need to be [updated] to accept a secctx
in a standardized format.
Another interesting part of Schaufler's message is that he has been working
on an "alternative approach" to the multi-LSM problem that he
calls "Glass". The code is, as yet, unreleased, but Schaufler describes
Glass as an LSM that composes other LSMs:
The Glass security blob is an array of
pointers, one for each available LSM, including commoncap, which
is always in the last slot. The Glass LSM is always registered first.
As subsequent LSMs register they are added to the glass LSM vector.
When a hook is invoked glass goes through its vector and if the
LSM provides a hook it gets called, and the return remembered.
If any other LSM provided a hook the commoncap hook is skipped,
but if no LSM was invoked commoncap is called.
Unlike Howells's proposal, Glass would leave the calls to the
capabilities LSM (aka commoncap) in the existing LSMs, and only call
commoncap if no LSM implemented a given hook. The idea is that the LSMs
already handle the capabilities calls in their hooks as needed, so a call
into commoncap is required only when none of those hooks get called. In
addition, Glass leaves the allocation and management of the security
"blobs" (LSM-specific data for objects) to the LSMs rather than
centralizing them in
the framework as Howells's patches do.
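Since the Glass code is unreleased, the following is only a guess at the dispatch logic based on Schaufler's description. It models each LSM as a dictionary of hooks, calls every LSM that provides the hook, and falls back to commoncap only when none did. How Glass combines the "remembered" return values is not stated; this sketch assumes the first non-zero value is kept.

```python
def glass_call_hook(lsms, commoncap, hook, *args):
    """Hypothetical Glass-style dispatch.

    Every LSM that provides `hook` is called, even after a denial.
    commoncap is consulted only if no other LSM provided the hook.
    """
    result = 0
    invoked = False
    for hooks in lsms:
        fn = hooks.get(hook)
        if fn is None:
            continue
        invoked = True
        rc = fn(*args)
        if result == 0:
            result = rc  # assumption: remember the first non-zero return
    if not invoked:
        fn = commoncap.get(hook)
        if fn is not None:
            result = fn(*args)
    return result


calls = []  # records which hooks ran, for demonstration only


def logged(label, rc):
    """Build a hypothetical hook that logs its call and returns rc."""
    def hook(*_args):
        calls.append(label)
        return rc
    return hook


commoncap = {"file_permission": logged("commoncap", 0),
             "capable": logged("commoncap", 0)}
lsms = [{"file_permission": logged("first", -13)},
        {"file_permission": logged("second", 0)}]

denied = glass_call_hook(lsms, commoncap, "file_permission", "/etc/shadow")
fallback = glass_call_hook(lsms, commoncap, "capable", "CAP_SYS_ADMIN")
```

In the first call both LSM hooks run and commoncap is skipped; in the second, no LSM provides the hook, so commoncap answers.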
In addition to various other differences, there is a more fundamental
difference in the way that the two solutions handle multiple LSMs that all have
hooks for a particular security operation. Glass purposely calls each hook
in each registered LSM, whereas Howells's proposal typically short-circuits
the chain
of hooks once one of them has denied the access. Schaufler's idea is that
an LSM should be able to maintain state, which means that skipping its
hooks could potentially skew the access decision:
My dreaded case is an LSM that bases controls on statistical frequency
of access to files. There is no way you could skip any of its hooks,
and I don't see off hand any file access hook it wouldn't use. I have
heard people (think credit card companies) suggest such things, so
although I don't have use for it I can't discount the potential for it.
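The concern quoted above can be illustrated with a toy stateful LSM. Everything in this sketch is hypothetical: a frequency-counting hook sits behind a simple deny rule, and the two dispatch styles are compared to show how short-circuiting leaves the counter blind to denied accesses.

```python
from collections import Counter


class FrequencyLsm:
    """Toy stateful LSM: counts every file access it is told about."""

    def __init__(self):
        self.seen = Counter()

    def file_permission(self, path):
        self.seen[path] += 1
        return 0  # never denies in this sketch; it only keeps statistics


def deny_secret(path):
    """Toy stateless LSM hook: denies anything under /secret."""
    return -13 if path.startswith("/secret") else 0


def short_circuit(hooks, path):
    """Stop at the first denial; later hooks never see the access."""
    for hook in hooks:
        rc = hook(path)
        if rc != 0:
            return rc
    return 0


def call_all(hooks, path):
    """Call every hook, returning the first non-zero result afterward."""
    result = 0
    for hook in hooks:
        rc = hook(path)
        if result == 0:
            result = rc
    return result


accesses = ["/secret/a", "/secret/a", "/secret/a", "/tmp/x"]

skipped = FrequencyLsm()
for path in accesses:
    short_circuit([deny_secret, skipped.file_permission], path)

complete = FrequencyLsm()
for path in accesses:
    call_all([deny_secret, complete.file_permission], path)
```

Both styles return the same verdict for each access, but only the call-everything style lets the counting LSM see the three denied attempts on /secret/a.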
There are plenty of other issues to resolve, including things like handling
/proc/self/attr/current (which contains the security ID for the
current process) because various user-space programs already parse the
output of that file, though it is different depending on which LSM is
active. A standardized format for that file, which takes multiple
LSMs into account, might be better, but it would break the kernel ABI and
is thus not likely to pass muster. Overall, though, Howells and Schaufler
were making some good
progress on defining the requirements for supporting multiple LSMs.
Schaufler is optimistic that the
collaboration will bear fruit: "I think that we may be
able to get past the problems that have held multiple LSMs back this
time around."
So far, there is only the code from Howells to look at, but Schaufler has
promised to make Glass available soon. With luck, that will lead to a
multi-LSM solution that the LSM developers can coalesce behind, whether it
comes from Howells, Schaufler, or a collaboration between them. There may
still be a fair amount of resistance from Linus Torvalds and other kernel
hackers, but the lack of any way to combine
LSMs comes up too often for it
to be ignored forever.
By Jonathan Corbet
February 8, 2011
Your editor has recently seen two keynote presentations on two continents
which, using two very different styles, conveyed the same message: the
centralization of the Internet and the services built on it has given
governments far too much control. Both speakers called for an urgent
effort to decentralize the net at all levels, including the transport
level. An Internet without centralized telecommunications infrastructure
can be hard to envision; when people try, the term that usually comes out is
"mesh networking." As it happens, the kernel has a mesh networking
implementation which made the move from the staging tree into the mainline
proper in 2.6.38.
Mesh networking, as its name implies, is meant to work via a large number
of short-haul connections without any sort of centralized control. A
proper mesh network should configure itself dynamically, responding to the
addition and removal of nodes and changes in connectivity. In a
well-functioning mesh, networking "just happens" without high-level
coordination; such a net should be quite hard to disrupt. What the kernel
offers now falls somewhat short of that ideal, but it is a good
demonstration of how hard mesh networking can be.
The "Better Approach To Mobile Ad-hoc Networking" (BATMAN) protocol is
described in this
draft RFC. A BATMAN mesh is made up of a set of "originators" which
communicate via network interfaces - normal wireless interfaces, for
example. Every so often, each originator sends out an "originator message"
(OGM) as a broadcast to all of its neighbors to tell the world that it
exists. Each neighbor is supposed to note the presence of the originator
and forward the message onward via a broadcast of its own. Thus, over
time, all nodes in the mesh should see the OGM, possibly via multiple
paths, and thus each node will know (1) that it can reach the
originator, and (2) which of its neighbors has the best path to that
originator. Each node maintains a routing table listing every other node
it has ever heard of and the best neighbor by which to reach each one.
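The flooding process can be modeled with a small simulation. This sketch makes several simplifying assumptions that are not part of the protocol description above: links are symmetric and lossless, each node rebroadcasts a given OGM once, and the "best" neighbor is simply the one through which the OGM arrived first (fewest hops), standing in for whatever path-quality measure the protocol actually applies. The function name and the example topology are invented.

```python
from collections import deque


def flood_ogms(links):
    """Simulate one round of OGM flooding.

    `links` maps each node to the set of its direct neighbors.
    Returns tables[node][originator] = neighbor to use as next hop.
    """
    tables = {node: {} for node in links}
    for originator in links:
        seen = {originator}
        queue = deque([originator])
        while queue:
            sender = queue.popleft()
            # The sender broadcasts the OGM to all of its neighbors.
            for receiver in sorted(links[sender]):
                if receiver in seen:
                    continue  # already heard this OGM via a shorter path
                seen.add(receiver)
                # The neighbor that delivered the OGM first becomes the
                # next hop toward the originator.
                tables[receiver][originator] = sender
                queue.append(receiver)
    return tables


if __name__ == "__main__":
    # A hypothetical four-node chain with one shortcut: A-B-C-D plus B-D.
    mesh = {
        "A": {"B"},
        "B": {"A", "C", "D"},
        "C": {"B", "D"},
        "D": {"B", "C"},
    }
    for node, table in sorted(flood_ogms(mesh).items()):
        print(node, dict(sorted(table.items())))
```

Each node ends up with one entry per other node, and no node ever learns the full topology; it only learns which neighbor to hand a packet to.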
This protocol has the advantage of building and maintaining the routing
tables on the fly; no central coordination is needed. It should also find
near-optimal routes to each node. If a node goes away, the routing tables will
reconfigure themselves to function in its absence. There is also no node
in the network which has a complete view of how the mesh is built; nodes
only know who is out there and the best next hop. This lack of knowledge
should add to the security and robustness of the mesh.
Nodes with a connection to the regular Internet can set a bit in their OGMs
to advertise that fact; that allows others without such a connection to
route packets to and from the rest of the world.
The original BATMAN protocol uses UDP for the OGM messages. That design
allows routing to be handled with the normal kernel routing tables, but it
also imposes a couple of unfortunate constraints: nodes must obtain an IP
address from somewhere before joining the mesh, and the protocol is tied to
IPv4. The BATMAN-adv protocol
found in the Linux kernel has changed a few things to get around these
problems, making it a rather more flexible solution. BATMAN-adv works
entirely at the link layer, exchanging non-UDP OGMs directly with
neighboring nodes. The routing table is maintained within a special
virtual network device, which makes all nodes on the mesh appear to be
directly connected via that virtual interface. Thus the system can join
the mesh before it has a
network address, and any protocol can be run over the mesh.
BATMAN-adv removes some of the limitations found in BATMAN, but readers who
have gotten this far will likely be thinking of the limitations that
remain. The flooding of broadcast OGMs through the net can only scale so
far before a significant amount of bandwidth is consumed by network
overhead. The protocol trims OGMs which are obviously not of interest -
those which describe a route which is known to be worse than others, for
example - but the OGM traffic will still be significant if the mesh gets
large. The routing tables will also grow, since every node must keep track
of every other node in existence. The overhead for these tables is
probably manageable for a mesh of 1,000 nodes; it is probably hopeless for
1,000,000 nodes. Mobile devices - which are targeted by this protocol -
are especially likely to suffer as the table gets larger.
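A back-of-the-envelope calculation shows why the table size matters. The 64-byte entry size below is purely an assumption chosen for illustration, as are the function names; the only property taken from the text is that every node keeps one entry for every other node.

```python
ENTRY_BYTES = 64  # hypothetical size of one routing-table entry


def table_bytes_per_node(nodes: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Memory one node needs to track every other node in the mesh."""
    return (nodes - 1) * entry_bytes


def total_mesh_entries(nodes: int) -> int:
    """Routing entries summed over the whole mesh: grows as nodes squared."""
    return nodes * (nodes - 1)


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(f"{n:>9} nodes: {table_bytes_per_node(n):>11} bytes per node, "
              f"{total_mesh_entries(n)} entries mesh-wide")
```

Under that assumption a 1,000-node mesh costs each node roughly 64KB, while a 1,000,000-node mesh costs roughly 64MB per node, before counting the OGM traffic needed to keep the entries current.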
Security is also a concern in this kind of network. Simple
bandwidth-consuming denial of service attacks would seem relatively
straightforward. Sending bogus OGMs could cause the size of routing tables
to explode or disrupt the routing within the mesh. A more clever attack
could force traffic to route through a hostile node, enabling
man-in-the-middle exploits. And so on. The draft RFC quickly mentions
some of these issues, but it seems clear that security has not been a major
design goal.
So it would seem clear that BATMAN-adv, while interesting, is not the
solution to the problem of an overly-centralized network. It could be a
useful way to extend connectivity through a building or small neighborhood,
but it is not meant to operate on a large scale or in an overtly hostile
environment. The bigger problem is a hard one to solve, to say the least.
The experience
gained with protocols like BATMAN-adv may well prove valuable in the search
for that solution, but there is clearly some work to be done still.
Page editor: Jonathan Corbet